[vox-tech] In Denial About These Hard Drive Problems

vox-tech@lists.lugod.org
Sat, 22 Jun 2002 16:16:09 -0400


On Thu, Jun 20, 2002 at 06:55:11PM -0700, Rick Moen wrote:
> Quoting Peter Jay Salzman (p@dirac.org):
> >   hda: IC35L040AVER07-0, ATA DISK drive

IBM Deskstar 60GXP, 40 Gig, 7200rpm.

> >   hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> >   hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=33873786,
> >     sector=2097216
> >   end_request: I/O error, dev 03:03 (hda), sector 2097216

  The drive has audible hardware errors... it makes the standard click-click-click
noise like my IBM drive did, just much quieter.

  I recommend that anyone with an IBM 60GXP, 75GXP, or 120GXP drive buy a 
replacement drive, transfer their data off the IBM drive, and then 
sell the IBM drive to some fool (or use it for data of *no* value until 
it fails).


> Richard, I second Mike Simons's suggestion that you drop everything and
> secure as good a backup as possible of all files on that hard drive you
> care about.  
> 
> There are underlying hardware errors occurring somewhere on your IDE
> chain, manifesting themselves as drive seek errors on your hard drive.

Summary
=======

  There were about 40 bad blocks in about 5 groups spread over the drive, 
affecting two of the five filesystems on the machine.  The two affected 
partitions were the Redhat root partition and what was a large auxiliary 
data area (containing bits of /usr/local and /var).  The /home, NTFS, and
vfat32 partitions were unaffected.

  Except for the 40 or so bad blocks, all of the data on his drive has
been extracted and transferred to one of the replacement drives he 
purchased.  Both NT 2000 and Redhat are booting correctly, lilo is 
controlling the boot up, and the system is now running an SMP kernel.

  The NT 2000 system survived being mirrored block by block from the failing 
hard drive.  

  There appear to have been some minor file losses on the Redhat system. 
In particular gpanel has lost its default config file that controls
the lower panel.  rpmverify appears to find about 100 discrepancies
on the filesystem, but I couldn't find a man page or other documentation
on its error report, so I don't know what needs to be done or how
bad some of the reported problems are.
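
  [Aside: if rpmverify is just a front end for "rpm -V" (an assumption on
my part), the report can be decoded from the attribute letters.  This is
from memory of the rpm man page, so double check it:]

  rpm -Va
  # a typical line looks like:
  #   S.5....T c /etc/some.conf
  # S = size, M = mode, 5 = MD5 sum, D = device, L = symlink,
  # U = user, G = group, T = mtime differ from the package database;
  # a "c" before the path marks a config file, where differences are
  # usually just local edits.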

  The only other noteworthy issue is that the sound card might not be working.
It appears that Redhat uses ALSA for his machine and I didn't download
or compile the ALSA drivers.


  Unfortunately I don't think this recovery process is very economical.
It took about 12 hours from start to finish.  Even if nothing had gone 
wrong it would have taken at least 6 hours or so to recover 40 Gigs of 
data.
  That's 6 hours of time where practically every time you interact
with the keyboard, a minor typo could be very, very bad...
  eg: (mkswap|mke2fs) /dev/hd[a-h][1-4]
It would be very expensive to have this thing done professionally.


  The rest of this is a bit long... 

Phase 0: Machine Details
========================

  The machine is about one year old: a dual ?850? MHz P3, with an Abit VP6
motherboard, a 300 Watt power supply, a single hard drive, and one CD burner.
The house is a little dusty but air conditioned, and I do not think
that the drive was close to an overheating situation.  The cabling is a
proper UDMA-100 cable.  The motherboard has two built-in IDE controllers,
one VIA 686 UDMA-100, and one HPT 370 UDMA-100 "raid" controller.
Only the VIA controller was in use.
  The machine had been working fine until recently...

  I suspect the slow filling of /home was due to some gnutella style 
client applications running on the machine... all used disk space was 
traceable to actual files.

  The distribution was Redhat 7.2, with ext3 filesystems which were configured
to *never* do a filesystem check.  Both mount-count and day-count based 
checks were disabled.  I've seen this on a few other Redhat systems,
so it must be an attribute of the installer... I personally think never-check
mode is a bad idea (1).
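
  (For what it's worth, the never-check settings are easy to inspect and
undo with tune2fs; the device name below is only an example:)

  tune2fs -l /dev/hda2              # look for "Maximum mount count" and
                                    #   "Check interval" in the output
  tune2fs -c 25 -i 30d /dev/hda2    # recheck every 25 mounts or 30 days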

  The hard drive was making slight noises when error messages appeared.


[Note: I first describe what I did, and later under "Adjustments" what I 
would do next time.  Don't repeat what I did.  The Adjustments route is 
much safer.]


Phase 1a: One down one to go
============================

  Richard had purchased two brand-new, retail-boxed Western Digital 40 Gig,
5400 RPM drives.

  The initial plan was to use dd_rescue(2) to pull all of the partitions
off the failing IBM drive, correct them, then push them onto the second
new disk drive.  This let us figure out where the damage was and
how bad the damage was.
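
  (For reference, the per-partition pulls were roughly like the following;
dd_rescue keeps going past read errors where plain dd would give up.  The
partition, log, and image names here are only examples:)

  dd_rescue -v -l hda3.log /dev/hda3 /mnt/scratch/hda3.img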

  The first replacement drive hooked up fine.  Using the Redhat install
CD in rescue mode and dd_rescue, the NTFS filesystem was mirrored to it
block by block without problems.  The Redhat root partition mirror
was running and a few bad blocks had been hit on the old drive when
DMA resets started on the new hard disk and transfer rates
got pretty bad.

  smartctl showed error log entries and some block reallocation events
happening on the new drive (which had only been powered on 2 hours).
Several reboots were done to check the cabling and swap between controllers.
The new hard drive partition table sector was unreadable most of these
reboots, and when it was readable the filesystem was unmountable.
  This new drive is now completely dead... no readable blocks.
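
  (The error log entries and reallocation counts above came from something
like the command below; exact smartctl options vary a bit between versions,
and /dev/hdb is just where the new drive happened to sit:)

  smartctl -a /dev/hdb      # dump the SMART attributes and error log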


  The initial DMA problems with this new disk seemed to appear when the 
CD drive was spinning up... there were two main power lines from the 
power supply, and as I write this I wonder if the CD was connected to
the same power connector as that new dead drive.
  The CD drive was extremely noisy during spin up, then was just 
noisy and produced a fair amount of vibration while spinning.

- Is it likely that the CD spin-up motor was lowering the power 
  available to the new hard drive enough to cause the almost instant 
  death?
- Could the operating vibration from a well-mounted CD kill a well-mounted 
  hard drive?

  The above is speculation.  The second replacement drive which completed
the rescue process was also plugged into the same power line as the
CD the whole time, and has not failed... yet.


Phase 1b: Replacement Number Two
================================

  The second new hard drive was pulled out of the box.  This time I knew 
that the large Windows partition had no bad blocks and we didn't have
much spare disk space, so a single large ext3 partition was made on the
new disk.  The plan changed (a rough sketch of steps 2 through 5 follows
the list):

1 - pull off Linux filesystems from old drive,
2 - e2fsck the disk images,
3 - loopback mount the images,
4 - tar and gzip the contents out of the images,
5 - unmount and nuke the image files.

6 - move the tar/gzip files onto a good portion of the IBM drive
7 - repartition the new good drive.
8 - mirror 2000 onto the new drive.
9 - transfer the tar'ed data onto the new drive.
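
  Steps 2 through 5 are all stock tools; for one partition the sequence
was roughly the following (image and mount point names are only examples):

  e2fsck -f -y hda3.img                            # 2: repair the image
  mkdir -p /mnt/img
  mount -o loop hda3.img /mnt/img                  # 3: loopback mount it
  tar cf - -C /mnt/img . | gzip -1 > hda3.tar.gz   # 4: archive, fast compress
  umount /mnt/img && rm hda3.img                   # 5: nuke the image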

  The purpose was to condense the files from the old drive, then move
that data back onto the failing IBM disk, so that the real partition 
structure could be set up on the new disk.  There would be a little risk:
if the IBM drive developed more bad blocks while preparing the final
partition layout on the new drive, then that would be bad.  I wouldn't do
it this way next time (see below).


  Step three is where it got interesting: the Redhat 7.2 "rescue" boot CD
doesn't appear to support loopback mounting, and the Debian woody rescue disks
don't support loopback mounts of files over 2 Gig.  So, ignoring some
concerns about polluting the existing Redhat system, I installed the
Debian base system on the new disk to allow access to the disk images.  :)

  While the files were transferring, in other windows I pulled
and built a new 2.4.18 kernel for the machine with SMP support...

  During the data pulls I found that two of the partitions had only good
sectors, so those old partitions were deleted and a new large ext3 one
was created in their place.  The "gzip -1" compressed tar files were 
transferred back onto the bad drive... without problems.

  The Redhat boot CD was used in rescue mode to wipe the Debian
system, prepare the new disk's partition table, and transfer the NTFS
partition and the tar archives back to the new drive on one of the partitions.

  Before doing anything else we removed the failing IBM drive, test booted 
2000, and used it to format the new vfat32 "share" drive between the two 
systems.

  Then we rebooted back into Redhat rescue.  The next adventure began while
extracting the tar files: we got a number of 
"Names longer than 100 chars not supported." errors, and the Debian rescue
disks did the same.


Phase 2: Intermission
=====================

  At this point it was early in the morning and I went home; since 
the 2000 system had been transferred completely and the Linux data files 
were also on the new drive, the real danger was over.

  The tar problem was tracked down to busybox not supporting the GNU tar 
extensions that provide support for long file names.  I built a statically
linked GNU tar binary to use.
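
  (Building a static GNU tar is nothing exotic; from an unpacked tar source
tree something along these lines does it, though the exact invocation here
is from memory:)

  ./configure
  make LDFLAGS=-static
  # the resulting src/tar binary gets copied onto the rescue media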


Phase 3: Extraction
===================

  I went back.

  First I tried the LinuxCare Bootable Business CD, 1.2 (which is very
old at this point).  For some reason it didn't recognize the partition
table of the new drive.  I need to get one of those updated BBC based 
CDs to carry around.

  So we switched back to the Redhat Rescue CD and extracted the Linux 
tar files.

  I set up lilo on a boot floppy to test the Redhat system, adjusted the
lilo floppy to boot 2000 and the 2.4.18 kernel I compiled under Debian, and
after testing wrote that to the master boot record.
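
  (The lilo floppy just needs boot= pointed at the floppy device; the
lilo.conf fragment was roughly the following, with the kernel and partition
names being examples rather than the exact ones used:)

  boot=/dev/fd0            # write the loader to the floppy, not the MBR
  prompt
  timeout=50
  image=/boot/vmlinuz-2.4.18
          label=linux
          root=/dev/hda3
          read-only
  other=/dev/hda1
          label=2000

  Running lilo with the floppy in the drive writes it out.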

  A new discovery is that you can write the NT master boot record to a floppy
disk and it can be used to boot the system just like it used to (using 
the NT boot loader directly :).
  dd if=/dev/hda of=/dev/fd0 bs=512 count=1


Phase 4: Final Surprise
=======================

  Once the complete system was operational we got Drive Fitness Test (DFT)
from IBM and reconnected the old drive.  DFT reported the same error code
that my drive did, and now it's Richard's turn to call the RMA department.

  When we rebooted (with both drives attached), Redhat reported filesystem
problems and ran fsck (we heard more of the clicky noises), and the boot up
failed to mount the filesystems.  After a brief moment of concern I 
remembered that Redhat uses fstab files that look like the following:

  LABEL=/        /        ext3     defaults,errors=remount-ro             0 1
  LABEL=/home   /home     ext3     defaults,errors=remount-ro             0 2

  Since we had two drives in the machine at that point, both with
/home and / LABELs, the system apparently tried to use the IBM drive
for bootup.  We disconnected the IBM drive and a final boot went smoothly.
I think I don't like LABELs.
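
  (If two labeled drives ever do need to coexist, the labels can be checked
and changed with e2label; the device names here are only examples:)

  e2label /dev/hdb2               # show the label on a partition
  e2label /dev/hda2 /old-root     # move the old drive's label out of the way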


Phase 5: Adjustments
====================

  If I were to do this process again, I would do a number of things 
differently.

If at all possible:
- Chop up the destination drive to look like it should once complete,
  which means that the smallest partition on the new drive needs to be
  slightly bigger than the largest on the old drive.
- Use whatever is to be the swap partition as a temporary Debian install
  area, work out of there until all the data has been transferred and
  all the OSes are boot tested, then wipe the Debian area and convert it
  to swap.
- Set up the largest partition as ext3, and create the disk images on that 
  partition.
- Use the other spare areas on the new disk for archive file storage, if
  at all possible.  Not the failing disk.
- Use cpio/bzip2 archives, not tar/gzip (3).

    Later,
      Mike

1) from "man tune2fs"
#             You  should  strongly  consider the consequences of
#             disabling mount-count-dependent checking  entirely.
#             Bad  disk  drives,  cables, memory, and kernel bugs
#             could all corrupt a filesystem without marking  the
#             filesystem  dirty  or  in  error.  If you are using
#             journaling on your filesystem, your filesystem will
#             never  be  marked dirty, so it will not normally be
#             checked.  A filesystem error detected by the kernel
#             will still force an fsck on the next reboot, but it
#             may already be too late to  prevent  data  loss  at
#             that point.

2) http://www.garloff.de/kurt/linux/ddrescue/

3) If the failing disk must be used due to partition sizing problems:

  Using tar/gzip for this was another bad idea; cpio/bzip2 should have 
been used instead.  This is because neither tar nor gzip handles bad
blocks inside its archives: if a bad block develops, *all* information 
after that point in the archive is lost.  cpio automatically skips
garbage, and bzip2 can be made to recover its archives, though the process
is more painful than it should be.
  So if the failing disk needs to be used to temporarily store archives
of the least important data, then use cpio/"bzip2 -1".

There are some catches:
- cpio is more complex to use.
  find dir -depth -print0 | cpio -o0H newc | bzip2 -1 > dir.cpio.bz2
  bzip2 -cd dir.cpio.bz2 | cpio -im
- bzip2recover, if you should need to use it, is a pain.