On Fri, Feb 02, 2007 at 02:03:16PM +0100, Johannes Wiedersich wrote:
> First the good news: after some repairs, the system appears to be
> 'clean' again, running as usual.
>
> The box has three hard disks: /dev/hda with the root partition and
> /dev/hdb and /dev/hdd with a raid1 for data.
>
> Yesterday, /dev/hdb suddenly died:
[snip]
> At first the system was still working (except for the raid). It was
> possible to ssh to the box and to diagnose via mdadm and looking at
> syslog. After few minutes the system was frozen, it was unpingable,
> impossible to locally switch to a console and/or log in, so I had to
> switch it off the hard way.
>
> I replaced the defective disk and rebooted. Rebuilding and syncing the
> raid device took a few hours, but worked fine.
>
> To check for sure that everything is ok I 'shutdown -rF now'ed the box.
> On fscking / there were a lot of errors, involving files in /var/ .
>
> After another boot and fsck of all partitions everything seemed fine.
> Only after trying to install some program via aptitude, I got the
> following:
>
> /---
> dpkg: serious warning: files list file for package `libident' missing,
> assuming package has no files currently installed.
[snip]
> This was very strange to me so I reinstalled all the mentioned packages.
> That worked without problems, and now all the warnings are gone.
>
> I would just like to _be sure_, that everything is ok now, and that
> there are no more missing or damaged files around.
>
> Here are my questions:
>
> Is it save to leave the system as it is, or should I do a reinstall in
> order to be sure that the system is 'clean'? How could I check, that no
> other files are affected except those 'reinstalled'?
>
> Is it common, that a failure of a raid disk leads to a system freeze,
> even though the affected drive is _NOT_ part of / or any FSH directory?
>
> Is it normal that an ext3 fs with journal gets corrupted in the process?
>
> Is there anything I could do to try to avoid this for the future?

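On the "how could I check" question, here is a rough sketch of one approach, not a definitive tool. It assumes the installed packages ship dpkg-style checksum lists under /var/lib/dpkg/info/*.md5sums (not every package does), that each line reads "<md5>  <path relative to />", and that those lists themselves survived the fsck of /var. The function names are made up for illustration; the debsums package does this job properly.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest, root="/"):
    """Compare files under root against one md5sums-style manifest.

    Assumed line format: '<md5hex>  <path relative to root>'.
    Returns a list of (relative_path, problem) tuples.
    """
    problems = []
    for line in Path(manifest).read_text(errors="replace").splitlines():
        expected, sep, rel = line.partition("  ")
        if not sep or not rel:
            continue  # blank or unparseable line, skipped in this sketch
        # Strip a leading slash so the path stays under root.
        target = Path(root) / rel.lstrip("/")
        if not target.is_file():
            problems.append((rel, "missing"))
        elif md5_of(target) != expected:
            problems.append((rel, "checksum mismatch"))
    return problems


def check_all(info_dir="/var/lib/dpkg/info", root="/"):
    """Run check_manifest over every *.md5sums file in info_dir.

    Returns a dict mapping package name to its list of problems;
    packages with no problems are left out.
    """
    report = {}
    for manifest in sorted(Path(info_dir).glob("*.md5sums")):
        problems = check_manifest(manifest, root)
        if problems:
            report[manifest.stem] = problems
    return report
```

Calling check_all() as root and printing the result lists whatever no longer matches. Changed config files may show up as mismatches, and anything dpkg did not install (your own data, /home, locally built stuff) is not covered at all, which is where a baseline from something like samhain or a comparison against backups comes in.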
Personally, I run something like samhain, not so much to check for
intrusion as to monitor data integrity.

I wonder if the failed /dev/hdb took out the controller (ide0) and so
/dev/hda got corrupted. It's too bad that your system (as opposed to
data) wasn't also protected by raid1.

If it were me and I had solid backups and could afford the downtime to
reinstall, I'd reinstall. Etch. LVM on Raid1 for the system at least.
And I'd only have one drive on each controller channel. E.g. no hdb or
hdd unless it's for CD/DVD or something.

Before a total reinstall, I'd really stress-test the ide0 controller to
ensure that it wasn't damaged.

Then again, I'm paranoid. I've only had one drive failure (bearing
seize, hard head crash). Pulled the plug within 10 seconds. No other
damage.

I really hope you have good backups.

Doug.