[vox-tech] emergency: please help. /lib keeps disappearing

Peter Jay Salzman vox-tech@lists.lugod.org
Wed, 26 May 2004 06:52:22 -0700


Last night, on one of my machines I ran:

   apt-get update && apt-get upgrade

During the update, most of my consoles began to read:

   "init spawning too fast.  Disabling getty for 5 minutes"

When I typed "ps aux", bash returned with:

   /bin/ps: File not found

In fact, everything from "bash" to "mount" to "shutdown" yielded "file
not found" errors.  I've got a baaad feeling about this, Chewie.

I used the magic SysRq keys to sync my hard drives, unmount them, and
then reboot the system.

Lilo came up healthy and I chose my 2.6.6 kernel.  Booting proceeded,
but towards the end of the bootup, I got:

   No init found.  Kernel panic.

Shit.  I reached for my Knoppix disk and booted it.  Mounted my root
partition under /mnt/root to take a look.  I found the problem
immediately.  /lib was gone.  Absolutely gone.  As in, the directory was
completely missing.

I've recovered from situations worse than this, so I rolled up my
sleeves and got to work.

First thing I did was "mkdir /mnt/root/lib".  After an "ifconfig" and a
"route", my network was working.

I used scp to copy the entire /lib directory from another machine to the
crippled machine.  I then ran:

   chroot /mnt/root /sbin/ldconfig

It gave me some warnings that a couple of files which ldconfig was
expecting to be symlinks were actual regular disk files.  That's OK (I
believe).  So at this point, I have a working libc and a lot of other
important libraries.  Next step is to uncripple my kernels.
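
The symlink warnings are probably a side effect of scp: with -r it
follows symbolic links and writes regular files in their place.  Below is
a minimal sketch of a copy that keeps the links intact, using a tar pipe.
The function name copy_tree and the host name "goodhost" in the comment
are made up for illustration:

```shell
#!/bin/sh
# copy_tree SRC_PARENT NAME DEST_PARENT
# Copies SRC_PARENT/NAME to DEST_PARENT/NAME through a tar pipe.
# tar stores symlinks as symlinks (it does not follow them by default),
# and -p on extraction keeps the recorded permissions.
#
# Across the network the same pipe can run through ssh, e.g.
# (goodhost is a hypothetical name for the healthy machine):
#   ssh goodhost 'tar -C / -cf - lib' | tar -C /mnt/root -xpf -
copy_tree() {
    src_parent=$1
    name=$2
    dest_parent=$3
    tar -C "$src_parent" -cf - "$name" | tar -C "$dest_parent" -xpf -
}
```

With the links preserved, ldconfig should have nothing to complain about
on that score.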

Since I have a working libc, I do:

   chroot /mnt/root

Then, for each kernel source (2.4.26, 2.6.0, 2.6.5, 2.6.6), I do:

   make modules && make modules_install

OK.  Kernels should be OK.  libc is taken care of.  In theory, this
machine should boot, and except for packages on the crippled machine
that weren't on my other workstation, everything should be more or less
OK.

At this point, I'm aiming at full recovery, not "bringing back to life".

The machine boots fine.  First things first.  Just to make absolutely
sure:

   apt-get install --reinstall libc6 libc6-dev

Good.  Now I begin to recover packages.  I don't know how to get a list
of all packages that install files in /lib, so I do the only thing I
know how to do:

   COLUMNS=255 dpkg -l "*" | grep -v 'description' \
      | awk '{ print $2 }' > recover.sh

which gives me a list of all installed packages.  BTW, if anybody knows
how to generate a list of all packages which have files in /lib, I'd
like to know.
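
One possible answer, as a sketch: dpkg keeps a file list for every
installed package in /var/lib/dpkg/info/<package>.list, so those lists
can be searched for paths under /lib.  The function name pkgs_owning is
made up here, and the optional second argument exists only so the
function can be pointed at a different info directory (for instance
/mnt/root/var/lib/dpkg/info from Knoppix):

```shell
#!/bin/sh
# pkgs_owning DIR [INFO_DIR]
# Prints the name of every package whose dpkg file list contains at
# least one path strictly below DIR (e.g. /lib/libc.so.6 for DIR=/lib).
# On multiarch systems the printed name may carry an architecture
# suffix such as ":amd64", because that is how the .list file is named.
pkgs_owning() {
    dir=${1%/}
    info_dir=${2:-/var/lib/dpkg/info}
    for list in "$info_dir"/*.list; do
        [ -e "$list" ] || continue
        # index() == 1 is a literal prefix match, so no regex escaping
        # of DIR is needed.
        if awk -v p="$dir/" 'index($0, p) == 1 { found = 1; exit }
                             END { exit !found }' "$list"; then
            basename "$list" .list
        fi
    done
}
```

Running "pkgs_owning /lib" would then print one package name per line,
which is a much shorter list to reinstall than every installed package.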

The dpkg pipeline above gives me a list that looks like:

   ...
   exim4-base
   exim4-config
   exim4-doc-html
   exim4-doc-info
   exuberant-ctag
   fakeroot
   fdflush
   fdutils
   ...

Using vim, I prepend each line with "apt-get install --reinstall --yes
--ignore-missing" so now recover.sh looks like:

   apt-get install --reinstall --yes --ignore-missing exim4-base
   apt-get install --reinstall --yes --ignore-missing exim4-config
   apt-get install --reinstall --yes --ignore-missing exim4-doc-html
   apt-get install --reinstall --yes --ignore-missing exim4-doc-info
   apt-get install --reinstall --yes --ignore-missing exuberant-ctag
   apt-get install --reinstall --yes --ignore-missing fakeroot
   apt-get install --reinstall --yes --ignore-missing fdflush
   apt-get install --reinstall --yes --ignore-missing fdutils

And I run it with:

   sh recover.sh

It started to work fine.  Everything seemed to be going OK.

About 5 or 10 minutes into it, I started to see on my console:

   "init spawning too fast.  Disabling getty for 5 minutes"

After doing a ps aux, I got:

   /bin/ps: File not found

Uh oh.  I've got a baaad feeling about this, Chewie.  I pressed Ctrl-C to
get out of apt-get, and did:

   sync
   ctl-d  (I'm now out of chroot)
   umount /dev/hdb1
   mount /dev/hdb1 /mnt/root
   chroot /mnt/root

It told me that "/bin/bash" was not found.  An "ls /mnt/root/" revealed:

   /lib was missing again.

I was outside of chroot, so I did a "ps aux" again.  There was a dd
process trying to access /dev/fd0.  I killed it.  Could that be a clue?
Does apt-get ever access the floppy?
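
One way to chase the dd clue, sketched under the assumption that the dd
was started by a package's maintainer script (it may not have been):
dpkg stores those scripts in /var/lib/dpkg/info as <package>.preinst,
.postinst, .prerm and .postrm, so they can be searched for a string
such as /dev/fd0.  The function name scripts_mentioning is made up for
illustration:

```shell
#!/bin/sh
# scripts_mentioning STRING [INFO_DIR]
# Prints the path of every dpkg maintainer script that contains STRING
# as a literal (non-regex) match.
scripts_mentioning() {
    pattern=$1
    info_dir=${2:-/var/lib/dpkg/info}
    for script in "$info_dir"/*.preinst "$info_dir"/*.postinst \
                  "$info_dir"/*.prerm "$info_dir"/*.postrm; do
        [ -e "$script" ] || continue
        if grep -q -F -e "$pattern" "$script"; then
            echo "$script"
        fi
    done
}
```

For example, "scripts_mentioning /dev/fd0 /mnt/root/var/lib/dpkg/info"
from outside the chroot.  An empty result would not rule a package out,
since a script can call another program that does the actual work.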



My hypothesis is that there must be a Debian package that's doing this.
I checked debian-user but there's no mention of this, and this would be
BIG.  I'm sure I wouldn't be the first to find this.

If nobody has better ideas, I was going to try all these steps over, but
do something like:


   ...
   apt-get install --reinstall --yes --ignore-missing exim4-base
   echo "I did exim4-base" >> /root/log
   apt-get install --reinstall --yes --ignore-missing exim4-config
   echo "I did exim4-config" >> /root/log
   apt-get install --reinstall --yes --ignore-missing exim4-doc-info
   echo "I did exim4-doc-info" >> /root/log
   apt-get install --reinstall --yes --ignore-missing exuberant-ctag
   echo "I did exuberant-ctag" >> /root/log
   apt-get install --reinstall --yes --ignore-missing fakeroot
   echo "I did fakeroot" >> /root/log
   ...
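
The same idea as a loop, so the script does not have to be built by
hand.  This is only a sketch: the function name reinstall_until_gone and
the file name "pkglist" (the bare package list, one name per line) are
made up, and it adds one thing to the plan above, a check after every
package that the watched directory still exists:

```shell
#!/bin/sh
# reinstall_until_gone LIST LOG WATCHED_DIR [COMMAND...]
# Runs COMMAND once per package named in LIST, appends a line to LOG
# after each one, and stops as soon as WATCHED_DIR no longer exists.
# COMMAND defaults to the apt-get invocation used in recover.sh.
# Returns 1 if WATCHED_DIR vanished, 0 if the whole list was processed.
reinstall_until_gone() {
    list=$1
    log=$2
    watched=$3
    shift 3
    [ $# -gt 0 ] || set -- apt-get install --reinstall --yes --ignore-missing
    while read -r pkg; do
        [ -n "$pkg" ] || continue
        # stdin comes from /dev/null so the command cannot eat the list
        "$@" "$pkg" < /dev/null
        # echo and [ are builtins in bash and dash, so these lines keep
        # working in the already-running shell even if /lib is gone
        echo "I did $pkg" >> "$log"
        if [ ! -d "$watched" ]; then
            echo "$watched vanished after $pkg" >> "$log"
            return 1
        fi
    done < "$list"
}
```

Inside the chroot this would be run as
"reinstall_until_gone pkglist /root/log /lib".  The last line of
/root/log then names the package after which /lib was found missing.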


That's about the only thing I can think of to try (other than installing
Gentoo which I've been thinking about for a few weeks now).

I could sure use some help!

Thanks to all who actually made it down here.

Pete

-- 
GPG Instructions: http://www.dirac.org/linux/gpg
GPG Fingerprint: B9F1 6CF3 47C4 7CD8 D33E 70A9 A3B9 1945 67EA 951D

Good, Fast, Cheap.  Pick any two (you can't have all three)
   --- RFC 1925: The Twelve Networking Truths (The 7th truth)