[Fwd: [vox-tech] corrupted ext3 filesystem]

vox-tech@lists.lugod.org vox-tech@lists.lugod.org
Wed, 12 Feb 2003 13:48:58 -0500


short version:
  shutdown -F -r now
    to get all of your filesystems checked 

  tune2fs -l /dev/????
    to see what your maximum mount count and check interval are.  If
    they are set to "never check", change them so a full check runs
    periodically.

===
On Tue, Feb 11, 2003 at 04:20:03PM -0800, Jonathan Stickel wrote:
> I have managed to fix the filesystem with
> fsck (many, many errors) and reload the two affected programs with rpm.

  If there were "many, many errors", other programs may also have been
damaged without you noticing yet, and even if they weren't, fsck may
have caused more damage while trying to fix the filesystem...

> I would like to solicit some help and comments about what may have
> caused my system to become corrupt and how to prevent this from
> happening in the future.  Please know that I am relatively new to Linux.

> A while back I posted a message asking about some occasional umount
> errors during shutdown.  My post must have seemed uninteresting because
> no one responded, which was fine because it didn't seem to be causing
> any problems.  However, I now wonder if this is related to my filesystem
> corrupting on me.  Every time I shutdown or reboot, I get this message:
> 
> kjournald[150] exited with preempt count 1

  From a few minutes with Google, this appears to be relevant:

http://lwn.net/Articles/17846/
# Kernel preemption.
# ~~~~~~~~~~~~~~~~~
# - The much talked about preemption patches made it into 2.5.
#   With this included you should notice much lower latencies especially
#   in demanding multimedia applications. 
# - Note, there are still cases where preemption must be temporarily disabled
#   where we do not. These areas occur in places where per-CPU data is used.
# - If you get "xxx exited with preempt count=n" messages in syslog,
#   don't panic, these are non fatal, but are somewhat unclean.
#   (Something is taking a lock, and exiting without unlocking)
# - If you DO notice high latency with kernel preemption enabled in
#   a specific code path, please report that to Andrew Morton <akpm@digeo.com>
#   and Robert Love <rml@tech9.net>.
#   The report should be something like "the latency in my xyz application
#   hits xxx ms when I do foo but is normally yyy" where foo is an action
#   like "unlink a huge directory tree".

(While this document is talking about 2.5, Redhat normally applies a bunch
of custom patches to their production kernels, and I didn't check whether
this has made it into 2.4 mainline.)


> _Occasionally_  I get half a dozen umount and umount2 errors that fly by
> so quickly I can't write them down.  As I posted before, I am not able
> to find a system log file with these shutdown messages either.

  It is unlikely that you would see error messages from the unmounting
stage of shutdown in your log files.  Before a filesystem can be
unmounted, the processes using it have to be stopped (this includes the
process that manages writing to the log files), and once it is
unmounted nothing more can be logged to it.  Even if the logger
process were not stopped, the messages may happen after the filesystem
is no longer available.

> During
> bootup of Linux after these umount errors, it lets me know that there
> was a problem and does a filesystem check.  This was actually how I was
> able to run fsck today to fix the problem.

  for future reference a very good way to force a filesystem check is
===
shutdown -F -r now
===
  the -F asks for a forced filesystem check on bootup...  it needs
some support from the bootup scripts to happen, but I imagine that
support is standard on most Linux distributions.
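
  In case it helps, here is roughly what I understand the -F to do, as
a sketch rather than anything authoritative: on sysvinit systems,
shutdown only leaves an advisory flag file named /forcefsck, and the
bootup scripts run a full fsck if they find it.  The function name and
its optional root-directory argument are my own, made up for
illustration; only the /forcefsck file name comes from the shutdown
man page.

```shell
#!/bin/sh
# Sketch: ask for a full filesystem check on the next boot by creating
# the advisory flag file that sysvinit's "shutdown -F" would create.
# request_forced_fsck and its optional root-directory argument are
# hypothetical names used for illustration only.
request_forced_fsck() {
    root=${1:-}                 # empty means the real root directory
    touch "${root}/forcefsck"   # boot scripts may test for this file
}
```

  Run as root, "request_forced_fsck && shutdown -r now" should then
behave much like "shutdown -F -r now", assuming your bootup scripts
honor the flag file.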

> Does anyone know if these types of messages are common?  Could they be
> related to a filesystem corrupting?
> 
> I should mention that my machine is a laptop which does get jostled
> quite a bit during my bike commute.  In the old days of DOS, I remember
> "parking" hard drives on shutdown to prevent damage.  Is this at all
> related to unmounting filesystems on shutdown?  I am definitely showing
> my ignorance here.

  Modern hard drives automatically park their heads someplace safe
when they spin down.  You shouldn't need to do anything special.
  As long as the laptop hard drive is not spinning when you transport
your machine on your bike and you aren't involved in any traffic
circle pile ups, it should be fine.  If the drive is still spinning
when you transport your laptop, stop doing that (suspend it at
least).


  Kernel bugs, faulty hardware, new bad blocks, and not unmounting the
filesystem can all lead to filesystem damage...

  Unmounting a filesystem before shutting down causes all of the
pending buffers to be flushed to disk, and all the on-disk bookkeeping
information to be updated with any changes made.  Filesystems not
cleanly unmounted need to be checked before they are used again: even
simple changes to a filesystem often involve updating disk blocks in
multiple places, so after an interrupted update the important records
of what is where could disagree with each other.

  There are two main approaches to checking.  The old way is a
full filesystem check, where the important filesystem data is
read and cross-checked.  This can take a long while for larger
filesystems.
  The second, newer approach is to keep a journal on disk saying what
parts of the filesystem are undergoing changes, so if the system boots
and finds the filesystem wasn't cleanly unmounted, only those parts of
the filesystem which were being touched need to be checked.  This is
much, much faster... but if something is very wrong an old-style check
may be needed.
  ext filesystems also keep track of how many times they have
been mounted since the last check, and how long it has been since the
last check.  It is recommended that even if there are no unclean
unmounts a check should be performed occasionally...

  Your report says that you are using ext3, which is one of the newer
journaled filesystems... and that you use Redhat.

  A while ago I noticed that the Redhat installer created ext3 
filesystems that will never be checked periodically.  This can
lead to massive filesystem corruption later on if small errors
in the filesystem go undetected and the filesystem continues to
be used.
  This corruption happens because the kernel filesystem drivers
don't cross check the filesystem data on each use (it would be
slower), and since only the filesystem driver should change the
data it is trusted to be correct... if different records go out 
of sync very bizarre things can happen.

  You can use 'tune2fs -l' to check what the "Maximum mount count"
and "Check interval" are.  I would recommend having max mount be
something between 20 and 40, and check interval be something between
3 and 6 months.
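
  To make that concrete, here is a minimal sketch of how the check
could be scripted.  It reads "tune2fs -l" output on standard input and
looks only at the "Maximum mount count" and "Check interval" lines.
The function name is mine, and I am assuming that a mount count of 0 or
-1 and an interval of 0 are how "never check" shows up.

```shell
#!/bin/sh
# Sketch: read "tune2fs -l" output on stdin and report whether periodic
# checks are enabled.  periodic_check_status is a hypothetical name.
# Assumption: mount count <= 0 and interval 0 both mean "disabled".
periodic_check_status() {
    awk -F': *' '
        /^Maximum mount count:/ { count = $2 + 0;    seen_count = 1 }
        /^Check interval:/      { interval = $2 + 0; seen_interval = 1 }
        END {
            if (!seen_count || !seen_interval) { print "unknown"; exit }
            if (count <= 0 && interval == 0)
                print "never checked"
            else
                print "periodic checks enabled"
        }'
}
```

  As root, something like "tune2fs -l /dev/hda3 | periodic_check_status"
would show the current state, and "tune2fs -c 30 -i 3m /dev/hda3"
should set a check every 30 mounts or 3 months.  /dev/hda3 is only an
example device name, substitute your own.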

===
  The distribution was Redhat 7.2, with ext3 filesystems which were
configured to *never* do a file system check.  Mount count and day count
based checks were both disabled.  I've seen this on a few other Redhat
systems so it must be an attribute of the installer... I personally think
never check mode is a bad idea (1).
[...]
1) from "man tune2fs"
#             You  should  strongly  consider the consequences of
#             disabling mount-count-dependent checking  entirely.
#             Bad  disk  drives,  cables, memory, and kernel bugs
#             could all corrupt a filesystem without marking  the
#             filesystem  dirty  or  in  error.  If you are using
#             journaling on your filesystem, your filesystem will
#             never  be  marked dirty, so it will not normally be
#             checked.  A filesystem error detected by the kernel
#             will still force an fsck on the next reboot, but it
#             may already be too late to  prevent  data  loss  at
#             that point.
===

> I meant to also add these points.  Pete suggested I look in my system 
> logs for "diskseek" errors.  The only thing I thought was close was this 
> error:
> 
> Feb 11 09:07:24 chms-hp4 kernel: isofs_read_super: bread failed, 
> dev=0b:00, iso_blknum=16, block=16

  that appears to me to be an error from attempting to mount a cdrom 
drive ... "isofs" is the cdrom filesystem driver.

  I would recommend looking for things with 'hda' in the name.

One type of bad hard disk error looks like this:
===
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=33873786, sector=2097216
end_request: I/O error, dev 03:03 (hda), sector 2097216
===
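
  A rough way to hunt for lines like those is a small grep filter.
This is only a sketch: the function name is made up, the pattern just
covers the two message shapes shown above, and the drive name defaults
to hda on the assumption that your disk is the first IDE drive.

```shell
#!/bin/sh
# Sketch: filter kernel log text on stdin for IDE disk error lines.
# disk_errors is a hypothetical name; pass the drive name (default hda).
# The pattern only covers the two message shapes quoted above.
disk_errors() {
    dev=${1:-hda}
    grep -E "${dev}: .*[Ee]rror|I/O error.*\(${dev}\)"
}
```

  For example "dmesg | disk_errors hda", or feed it your system log
(on Redhat I believe that is /var/log/messages).  No output means
nothing matched, which is not a guarantee the disk is healthy.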

    Later,
      Mike