[vox-tech] Linux Block Layer is Lame (it retries too much)

Mike Simons vox-tech@lists.lugod.org
Tue, 27 May 2003 20:22:22 -0400



  Last week I was having problems with the IDE layer... basically it retries
way too many times.  I was trying to read 512 byte blocks from a dying
/dev/hda (using dd_rescue, which calls pread).  For each bad sector the
kernel would try 8 times; on attempts 4 and 8 it would reset the IDE bus
(turning off things like DMA mode), and on every other failed attempt it
would seek to track 0...
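
For context, the per-sector read loop that a tool like dd_rescue performs
can be sketched in C as below.  This is only a minimal sketch, not
dd_rescue's actual code: the function name copy_sectors and the choice to
zero-fill unreadable sectors are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 500       /* for pread() and pwrite() */
#define _FILE_OFFSET_BITS 64    /* disk offsets can exceed 2 GB */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512

/*
 * Copy `count` sectors starting at sector `first` from in_fd to out_fd,
 * using one 512 byte pread() per sector.  A sector that cannot be read
 * in full is written out as zeros so later sectors keep their offsets.
 * Returns the number of unreadable sectors, or -1 on a write error.
 * (Hypothetical helper, not part of dd_rescue.)
 */
long copy_sectors(int in_fd, int out_fd, long first, long count)
{
    unsigned char buf[SECTOR_SIZE];
    long bad = 0;

    assert(first >= 0 && count >= 0);

    for (long s = first; s < first + count; s++) {
        off_t off = (off_t)s * SECTOR_SIZE;
        ssize_t n = pread(in_fd, buf, SECTOR_SIZE, off);

        if (n != SECTOR_SIZE) {         /* I/O error or short read */
            memset(buf, 0, SECTOR_SIZE);
            bad++;
        }
        if (pwrite(out_fd, buf, SECTOR_SIZE, off) != SECTOR_SIZE)
            return -1;
    }
    return bad;
}
```

With in_fd opened read-only on the failing device and out_fd on the image
file, copy_sectors(in_fd, out_fd, 0, n) would walk the first n sectors.
Each pread() still blocks for as long as the kernel keeps retrying, so a
loop like this does nothing to shorten the delays described here.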

  If that wasn't bad enough, for some reason the kernel was often trying
to fetch 8 sectors worth of information for a single sector read.  The
8 sector "chunk" being fetched was somehow related to the modulus of
the actual sector being requested.  So if you had an 8 sector bad region
and requested any one of its 8 bad sectors, each of the 8 would have
8 read attempts made... 64 read attempts, all of which fail, before you
can even move to the next sector.  When you request the next bad sector
the process begins again... even if you wanted the last sector of the
8 sector chunk.
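
The arithmetic above can be written out as a small C sketch.  It assumes
that chunks start at sector numbers that are multiples of 8, which is one
reading of "related to the modulus" and is not confirmed here; the
function names are illustrative only.

```c
#include <assert.h>

enum { CHUNK_SECTORS = 8 };     /* chunk size reported above */

/* ASSUMPTION: chunks are aligned to multiples of CHUNK_SECTORS. */
long chunk_first_sector(long sector)
{
    assert(sector >= 0);
    return sector - sector % CHUNK_SECTORS;
}

/* Read attempts made for one request that lands in a fully bad chunk. */
long attempts_per_request(long retries_per_sector)
{
    return CHUNK_SECTORS * retries_per_sector;
}

/* Attempts made when each sector of a fully bad chunk is requested in turn. */
long attempts_per_chunk(long retries_per_sector)
{
    return CHUNK_SECTORS * attempts_per_request(retries_per_sector);
}
```

Under that assumption a request for sector 13 would pull in sectors 8
through 15.  With 8 retries per sector, one request costs 64 attempts and
stepping through all 8 sectors of a bad chunk costs 512.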

  Normally that is good... it's a best effort attempt to read a disk.
However I had hundreds of bad sectors on this drive and just wanted as
much of the filesystem as possible and didn't have days to wait.

  One bizarre thing is that this 8 sector chunk read didn't always happen;
it appears that if one of the 8 sectors was good, the other 7 would only
be tried one batch of times.


  I found that linux/include/linux/ide.h has the following three defines:
===
/*
 * Probably not wise to fiddle with these
 */
#define ERROR_MAX       8       /* Max read/write errors per sector */
#define ERROR_RESET     3       /* Reset controller every 4th retry */
#define ERROR_RECAL     1       /* Recalibrate every 2nd retry */
===

  These settings are not configurable via /proc or sysctl... so
I changed them and recompiled so that only 3 attempts would be made
on any given block, with no resets or recalibrations.  Still,
reading *each* sector in a bad 8 sector chunk was taking 100 seconds
(about 14 minutes to move to the next chunk of 8 sectors).
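
The retry pattern described earlier (a reset on failures 4 and 8, a seek
to track 0 on every other failure) is consistent with the reset and
recalibrate values being tested as bit masks against the running error
count.  The model below is a sketch under that assumption, not the
kernel's code; recovery_after and print_schedule are made-up names.

```c
#include <assert.h>
#include <stdio.h>

enum action { RETRY_PLAIN, RETRY_RECAL, RETRY_RESET };

/*
 * ASSUMPTION: the recovery step is chosen by comparing the number of
 * earlier failures against each value as a bit mask, with a reset
 * taking priority over a recalibrate.
 */
enum action recovery_after(unsigned errors, unsigned reset_mask,
                           unsigned recal_mask)
{
    /* A zero mask would match every error count. */
    assert(reset_mask != 0 && recal_mask != 0);

    if ((errors & reset_mask) == reset_mask)
        return RETRY_RESET;
    if ((errors & recal_mask) == recal_mask)
        return RETRY_RECAL;
    return RETRY_PLAIN;
}

/* Print the recovery step taken after each failed attempt, 1..error_max. */
void print_schedule(unsigned error_max, unsigned reset_mask,
                    unsigned recal_mask)
{
    static const char *names[] = { "plain retry", "recalibrate", "reset" };

    for (unsigned errors = 0; errors < error_max; errors++)
        printf("failure %u: %s\n", errors + 1,
               names[recovery_after(errors, reset_mask, recal_mask)]);
}
```

Under this model print_schedule(8, 3, 1) reports a reset after failures
4 and 8 and a recalibrate after failures 2 and 6.  Hypothetical values
such as a limit of 3 with both masks set to 7 would give three plain
attempts with no reset or recalibrate; the exact values used in the
recompiled kernel are not given above.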

  Even better, because the process is inside a system call it is not
killable, so there is no practical way to speed things up.

  I still do not know what is causing the 8 sector "chunk" to be read.
It seems that sys_pread calls mm/filemap.c: generic_file_read ->
do_generic_file_read, which might be expanding the request size based
on some read-ahead parameters; it figures out what "max_readahead" is
by calling get_max_readahead on the inode.

  I tried most everything with hdparm and fiddled with sysctl
(vm/max-readahead and vm/min-readahead)... but there was no change in
behavior.  I tried obvious things like "hdparm -m 0 -a 0 -A 0 -m 0 -P 0";
I also tried 1's, all with no noticeable effect.

I want to be more ready next time...

- How is a 1 sector read expanded to an 8 sector chunk?

- How can this chunk reading behavior be turned off
  (via command line or custom kernel patch)?

- Any other ideas on how to pull the disk blocks?


    Thanks,
      Mike Simons


  Basically I spent 20 hours trying to read from a failing drive, and I
got about halfway through the drive before time was up.  Of the 30 million
sectors I read, only about 500 were bad.  It is likely that the NTFS
filesystem on the drive would have been recoverable if the pull had
finished... because I could mount the filesystem read-only from the
drive itself, but what I have of the image has an error at mount time.
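
As a rough cross-check of those figures, the time lost to bad sectors can
be estimated in C.  This is a back-of-the-envelope sketch that assumes
every bad sector cost the 100 seconds measured with the recompiled
kernel, which may not hold for the whole run; the function names are
illustrative.

```c
#include <assert.h>

/* Seconds spent on bad sectors, assuming a fixed cost per bad sector. */
long bad_sector_seconds(long bad_sectors, long seconds_per_sector)
{
    assert(bad_sectors >= 0 && seconds_per_sector >= 0);
    return bad_sectors * seconds_per_sector;
}

/* Convert seconds to whole hours, rounding down. */
long whole_hours(long seconds)
{
    return seconds / 3600;
}
```

With 500 bad sectors at 100 seconds each this comes to 50,000 seconds,
a little under 14 hours, so under that assumption most of the 20 hours
would have gone to the bad sectors alone.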

  I was using a custom knoppix boot floppy and a standard knoppix CD to
boot a laptop with the bad drive, NFS mounting a local machine, to which
dd_rescue was sending the blocks that could be read.

-- 
GPG key: http://simons-clan.com/~msimons/gpg/msimons.asc
Fingerprint: 524D A726 77CB 62C9 4D56  8109 E10C 249F B7FA ACBE
