[vox-tech] In Denial About These Hard Drive Problems

vox-tech@lists.lugod.org
Sat, 22 Jun 2002 23:04:45 -0400


On Sat, Jun 22, 2002 at 03:21:57PM -0700, Rick Moen wrote:
> Quoting msimons@moria.simons-clan.com (msimons@moria.simons-clan.com):
> > IBM Deskstar 60GXP, 40 Gig, 7200rpm. [...] I recommend that anyone
> > with a IBM drive models 60, 75, or 120 GXP, buy a replacement
> > drive....
> 
> Yeah, that particular series has a bad reputation.  _Generally_ however,
> the IBM Deskstar and Ultrastar series have been very good.

  Past performance does not guarantee future results... ;)

  I should provide more of an explanation... I am not saying all IBM
drives are junk.  I do still recommend that people with Deskstar 75GXP,
60GXP, and 120GXP models find replacements and use them instead.

Now the reasons:
================
IBM produced three generations of drive that I recommend avoiding:
-  75GXP (15, 20, 30, 45, 60, 75 Gig) beginning 2000-05,
-  60GXP (20, 40, 60 Gig)             beginning 2001-01,
- 120GXP (40, 80, 120 Gig)            beginning 2001-11.

  Based on skimming various web sites (more below), I see a large number
of complaints about the 75GXP, which is now notorious for failing.
Many sites put the failure rate of these drives at over 10% after only
two years on the market.

  The 60GXP line is the next model IBM produced, so those drives are
younger.  The problem reports have not reached the volume of the 75GXP's,
but many are rolling in.

  The 120GXP is the latest Deskstar line.  There are not many complaints
yet, but these are very young drives, and given that the last two lines
have proven problematic, people should avoid the 120GXP for caution's
sake.

  Avoiding the current Deskstar should hurt IBM Storage financially,
but with a corporation, a lack of profit is the only real way to focus
their attention on their problems.

  I see that a class action suit has been started against IBM in
California (around May 24, 2002) for the sale of defective products;
its focus is the 75GXP line (1).

  There are many reports of failing GXP drives (2); to find more
collections, use Google with a search like "Deskstar GXP failure".

1: http://www.sheller.com/ibmclassaction.htm
2: http://www.tech-report.com/onearticle.x/3494
   http://www.viahardware.com/ibm120gxp_2.shtm
   http://www.geocities.com/dtla_update/


Now about first, second, and third hand knowledge...
  Mine was a 75 Gig 75GXP.
  Richard's was a 40 Gig 60GXP.
  I have personally heard a number of other second-hand reports of
problems with the GXP drives.


> >   Except for the 40 or so bad blocks all of the data on his drives have
> > been extracted and transfered to one of the replacement drives he 
> > purchased.  
> 
> Typically, the only thing you care about is data files and maybe
> dotfiles & configuration files.  Accordingly, you can ignore and blow 
> away everything else, including all program files.
[...]
> >   There appears to have been some minor file loses on the Redhat system. 
[...]
> But those would be classics in the "I don't care" department, right?  
> I mean, you're going to reinstall all program files onto a replacement
> drive, and those will come from installation media.
[...]
> >   Unfortunately I don't think this recovery process is very economical.
> 
> Well, but I'd expect that you only need _care_ about a tiny fraction of
> those files.

As easy as Redhat and 2000 may be to load and configure...
  If you consider either the time or the hassle trade-off,
    of reinstalling and reconfiguring 
      both operating systems and associated application suites, 
    and later reloading the data files 
     (those which were remembered at backup time)...
  then I am certain that the method used was far superior.
;)


  In this case both systems were bootable without reinstall.

  Not being able to correct the Redhat file corruption by reinstalling
the damaged packages has more to do with my unfamiliarity with Redhat
and rpmverify output than anything else.
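
  For the record, the sort of repair I was fumbling toward looks roughly
like this (the file and package names are made-up examples):

    rpm -Va > /root/rpm-verify.log       # list files differing from the rpm database
    rpm -qf /usr/bin/some-damaged-file   # find which package owns a damaged file
    rpm -Uvh --replacepkgs some-package.rpm   # reinstall that package from media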


  A major cause of delay was the failure of the first replacement drive
(almost two hours into the process).  More delay was caused by the
additional data-transfer steps (tar/gzip) made necessary by the lack of
available disk space.
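
  (The extra steps were along these lines; the paths are made-up
examples:)

    # stream a rescued tree into a compressed tarball on the disk with room
    tar -C /mnt/rescued -cf - . | gzip > /space/rescued.tar.gz
    # later, unpack it onto the replacement drive
    gzip -dc /space/rescued.tar.gz | tar -C /mnt/newdisk -xf -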


> >   The initial plan was to use dd_rescue(2) to pull off all the partitions
> > off the failing IBM drive...
> 
> You know, on a quiescent system (single-user), just "cp -ax" on modest
> sized directory trees, one at a time, will more than suffice.  No need
> to muck about with cpio, tar, gzip, bzip2, etc.

Two points:
- The filesystems were unstable:
  they were marked as unclean and e2fsck could not run over the partitions
  due to bad blocks.  Mounting them might have been possible with -fr, but
  ext3 tries to replay the journal even in readonly mode... and I didn't
  want to risk more corruption.  I didn't try to fight it (*).
- "cp" places more stress on the drive, and handles errors badly.
  When running "cp" the drive will seek all over the place trying to get
  data off... which I think places more stress on it than a single one-
  directional scan over the media.  Also, "cp" will abort the copy of a
  given file at the first bad block, so you lose any data beyond that
  block of that file.  With this method the unreadable data blocks of any
  affected file are null-filled, so you lose only 512-byte chunks, which
  might leave the rest of the file of some value.  (See the sketch after
  this list.)
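
  A minimal sketch, assuming these device names: the first line shows the
kind of read-only mount that avoids journal replay (only if your ext3
supports the "noload" option); the second is the error-tolerant,
one-directional copy discussed above.

    # peek at the sick filesystem read-only, without replaying the journal
    mount -t ext3 -o ro,noload /dev/hda1 /mnt/sick

    # one-directional image of the partition; unreadable blocks are
    # null-filled instead of aborting the copy
    dd_rescue /dev/hda1 /space/hda1.img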
  
  If there had been plenty of disk space we could have pulled the images,
fsck'ed them, then cp'ed out of the images to the destination drive... that
was plan 1a.
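
  Plan 1a would have looked roughly like this sketch (device and paths
are hypothetical):

    dd_rescue /dev/hda2 /space/hda2.img    # image the failing partition
    e2fsck -f /space/hda2.img              # e2fsck can repair a plain image file
    mount -o loop,ro /space/hda2.img /mnt/img
    cp -ax /mnt/img/. /mnt/newdisk/        # copy the repaired tree out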


*: 

I did discover but did not investigate what appears to be a dangerous problem...

  When I arrived at the machine it was booting into Linux, and a prior
unsuccessful shutdown had flagged the unclean bit on the root filesystem.
The e2fsck failed due to bad blocks... and it dumped us at the "root
password; please repair the filesystem" prompt.  The root filesystem had
been mounted in read-only mode...
  The machine was shut down (Ctrl-Alt-Del) from this prompt to attach the
new drive.  When the machine was brought back online the unclean bit on
the filesystem was clear, so the system attempted to boot up into
multi-user mode.
The network was functional but X was refusing to start.
  I am *certain* that no disk check operation had completed... the 
filesystem did have a very large number of errors once fsck was run
on the disk image.

  The fact that the unclean bit was cleared somehow is a serious bug.
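
  (For the curious: the state flag can be inspected with dumpe2fs.  A
sketch, assuming the root filesystem lives on /dev/hda1:)

    dumpe2fs -h /dev/hda1 | grep 'Filesystem state'
    # prints "clean" or "not clean"; after an fsck that never
    # completed, it should not have read clean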


> > I need to get one of those updated BBC based CDs to carry around.
> 
> Well worth the download:  http://www.lnx-bbc.org/download.html
> (The NTFS support is still problematic because the Linux kernel
> driver is likewise.  I have no experience with using it.)

  Thanks, I don't have a burner, but I'll keep that site in mind.