[vox-tech] Linux Block Layer is Lame (it retries too much)

Jeff Newmiller vox-tech@lists.lugod.org
Wed, 28 May 2003 11:31:56 -0700 (PDT)


On Tue, 27 May 2003, Mike Simons wrote:

>   Last week I was having problems with the IDE layer... basically it
> retries way too many times.  I was trying to read 512-byte blocks from a
> dying /dev/hda (using dd_rescue, which calls pread).  For each bad sector
> the kernel would try 8 times; on attempts 4 and 8 it would reset the IDE
> bus (turning off things like DMA mode), and on every other failed attempt
> it would seek to track 0...
> 
>   If that wasn't bad enough, for some reason the kernel was often trying
> to fetch 8 sectors' worth of data for a single-sector read.  The
> 8-sector "chunk" being fetched was somehow related to the requested
> sector number modulo 8, so if you had an 8-sector bad region...
>   If you requested any one of the 8 bad sectors from a chunk, each of
> the 8 would get 8 read attempts... 64 read attempts would all have to
> fail before you could even move to the next sector, and when you
> requested the next bad sector the process would begin again... even if
> you wanted the last sector of the 8-sector chunk.
> 
>   Normally that is good... it's a best-effort attempt to read a disk.
> However, I had hundreds of bad sectors on this drive, just wanted as
> much of the filesystem as possible, and didn't have days to wait.
> 
>   One bizarre thing is that this 8-sector chunk read didn't always
> happen; it appears that if one of the 8 sectors was good, the other 7
> would only be tried for one batch of retries.
> 
> 
>   I found that linux/include/linux/ide.h has the following three defines:
> ===
> /*
>  * Probably not wise to fiddle with these
>  */
> #define ERROR_MAX       8       /* Max read/write errors per sector */
> #define ERROR_RESET     3       /* Reset controller every 4th retry */
> #define ERROR_RECAL     1       /* Recalibrate every 2nd retry */
> ===
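
Those values make more sense read as bitmasks against a per-request error
count: a value of 3 matches every 4th count and 1 matches every 2nd, which
lines up with the comments.  Here is a user-space sketch of that reading
(an illustration only, not the kernel's verbatim retry code; the kernel's
own counter bookkeeping differs in detail, and Mike saw the resets land on
attempts 4 and 8):

===
#include <stdio.h>

/* Values quoted from linux/include/linux/ide.h above. */
#define ERROR_MAX   8   /* give up after 8 failed attempts per sector */
#define ERROR_RESET 3   /* mask: matches every 4th error count        */
#define ERROR_RECAL 1   /* mask: matches every 2nd error count        */

int main(void)
{
    int errors;   /* failed attempts so far on this one sector */

    for (errors = 1; errors <= ERROR_MAX; errors++) {
        printf("failure %d: ", errors);
        if (errors >= ERROR_MAX)
            printf("give up, return an I/O error\n");
        else if ((errors & ERROR_RESET) == ERROR_RESET)
            printf("reset the IDE bus (DMA mode is lost), retry\n");
        else if ((errors & ERROR_RECAL) == ERROR_RECAL)
            printf("recalibrate (seek to track 0), retry\n");
        else
            printf("plain retry\n");
    }
    return 0;
}
===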
> 
>   These settings are not configurable via /proc or sysctl... so I
> changed them and recompiled so that only 3 attempts would be made on
> any given block, with no resets or recalibrations.  Even so, reading
> *each* sector in a bad 8-sector region was taking 100 seconds (about
> 14 minutes to move past a chunk of 8 sectors).
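
The mail doesn't show the exact edit, but the 3-attempt part of it was
presumably just:

===
#define ERROR_MAX       3       /* give up after 3 attempts per sector */
===

(Hypothetical.  Note that under the bitmask reading above, setting
ERROR_RESET or ERROR_RECAL to 0 would make them match every attempt
rather than none, so suppressing the resets and recalibrations probably
meant removing those branches from the retry code itself.)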
> 
>   Even better: because the process is stuck inside a system call, it
> is not killable, so there is no practical way to speed things up.

It should be open to termination between the time the read system call
returns and the write system call starts.
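
A rescue tool can make that window count by checking for a pending signal
between the two calls.  Here is a minimal sketch of such a copy loop (file
names hypothetical; dd_rescue's real loop is more elaborate):

===
#define _XOPEN_SOURCE 500   /* for pread/pwrite */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t stop;

static void on_sigint(int sig) { (void)sig; stop = 1; }

int main(void)
{
    char buf[512];
    off_t off = 0;
    int in  = open("/dev/hda", O_RDONLY);                  /* hypothetical */
    int out = open("image.bin", O_WRONLY | O_CREAT, 0644); /* hypothetical */
    if (in < 0 || out < 0) { perror("open"); return 1; }

    signal(SIGINT, on_sigint);

    while (!stop) {
        ssize_t n = pread(in, buf, sizeof buf, off);
        if (n == 0)
            break;                /* end of device */
        if (n < 0) {              /* bad sector: skip it, leave a hole */
            off += sizeof buf;
            continue;
        }
        /* A signal that arrived during the unkillable pread takes
         * effect here, once the system call has returned. */
        if (stop || pwrite(out, buf, (size_t)n, off) != n)
            break;
        off += n;
    }
    close(in);
    close(out);
    return 0;
}
===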

>   I still do not know what is causing the 8-sector "chunk" to be read.
> It seems that sys_pread calls mm/filemap.c: generic_file_read ->
> do_generic_file_read, which seems like it might be expanding the
> request size based on some readahead parameters; it figures out
> "max_readahead" by calling get_max_readahead on the inode.
> 
>   I tried most everything with hdparm and fiddled with sysctl
> (vm/max-readahead and vm/min-readahead)... but there was no change in
> behavior.  I tried obvious things like "hdparm -m 0 -a 0 -A 0 -P 0",
> and also tried 1's, all with no noticeable effect.
> 
> I want to be more ready next time...
> 
> - How does a 1-sector read get expanded to an 8-sector chunk?

I don't know.  But I suspect it has to do with the "natural" way files are
read in... by "mmap"ing them to pages in RAM.  i386 memory managers
usually use 4k pages... ergo, 8 x 512B sectors.
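
That arithmetic also explains the modulus behavior Mike saw: with 4k pages
and 512-byte sectors, every page-cache page covers an aligned group of 8
sectors, and reading any byte of a page drags in all 8.  A quick check
(the sector number is made up):

===
#include <stdio.h>

#define SECTOR_BYTES 512
#define PAGE_BYTES   4096          /* typical i386 page size */

int main(void)
{
    long sector   = 1000003;                     /* hypothetical bad sector */
    long per_page = PAGE_BYTES / SECTOR_BYTES;   /* 8 sectors per page      */
    long first    = sector - (sector % per_page);

    /* A 512-byte pread of this sector fills the whole page, so the
     * kernel touches sectors first..first+7.  If all 8 are bad, that
     * is 8 sectors x 8 retries = 64 attempts before pread returns. */
    printf("sector %ld lives in the page spanning sectors %ld..%ld\n",
           sector, first, first + per_page - 1);
    return 0;
}
===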

Some of this behavior may be due to the algorithms in dd_rescue.

> - How this chunk reading behavior can be turned off 
>   (via command line or custom kernel patch)?

Dunno.

> - Any other ideas on how to pull the disk blocks?

Not easy ones. (Build your own device driver that doesn't use mmap.)
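
One middle ground, assuming a kernel with O_DIRECT support (it appeared
around 2.4.10): open the device with O_DIRECT so reads bypass the page
cache and can be issued one 512-byte sector at a time.  A sketch (path
hypothetical; O_DIRECT requires sector-aligned buffers, offsets, and
lengths):

===
#define _GNU_SOURCE         /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR 512

int main(void)
{
    void *buf;
    off_t off = 0;
    int fd = open("/dev/hda", O_RDONLY | O_DIRECT);   /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT demands an aligned buffer. */
    if (posix_memalign(&buf, SECTOR, SECTOR) != 0) {
        fputs("no memory\n", stderr);
        return 1;
    }

    for (;;) {
        ssize_t n = pread(fd, buf, SECTOR, off);
        if (n == 0)
            break;                              /* end of device */
        if (n < 0)
            fprintf(stderr, "sector %ld unreadable\n", (long)(off / SECTOR));
        /* ... write readable sectors to an image file here ... */
        off += SECTOR;
    }
    free(buf);
    close(fd);
    return 0;
}
===

This wouldn't change the per-sector retry count inside the driver, but it
would keep one bad sector from dragging its 7 page-mates through the
retry mill.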

> 
>     Thanks,
>       Mike Simons
> 
> 
>   Basically I spent 20 hours trying to read from a failing drive, and
> got about halfway through the drive before time was up.  Of the 30
> million sectors I read, only about 500 were bad.  It is likely that the
> NTFS filesystem on the drive would have been recoverable if the pull
> had finished... because I could mount the filesystem read-only from the
> drive itself, but what I have of the image has an error at mount time.

I had a similar experience a few weeks ago... dd would fail at certain
areas on the disk, so I used dd's skip option to pick up after the dead
spots.  (I didn't know about dd_rescue.)  Nevertheless, the process was
too slow, so I pulled the disk and simply replaced it.

>   I was using a custom Knoppix boot floppy and a standard Knoppix CD to
> boot a laptop with the bad drive, NFS-mounting a local machine to which
> dd_rescue was sending the blocks that could be read.

Slick.  I was using netcat.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...2k
---------------------------------------------------------------------------