on Fri, Jul 09, 2004 at 02:17:04AM -0700, Karsten M. Self ([EMAIL PROTECTED]) wrote:
> on Thu, Jul 08, 2004 at 05:59:57PM -0400, Silvan ([EMAIL PROTECTED]) wrote:
> > On Thursday 08 July 2004 06:59 am, Karsten M. Self wrote:
> >
> > > - Bad drive?
> >
> That's pretty much the conclusion I'm coming to.  No more results to
> publish right now, but some more poking around with SMART tells me the
> drive's risking imminent failure.  I'm backing data off of it now (very
> slowly), hope to replace it tomorrow.

...and now the (mostly) conclusion.

I still haven't run final conclusive tests on the old drive, but with a
new Maxtor 80 GB 7200 RPM 8 MB cache mumble disk in the box, I'm getting
hdparm disk read results in the 48 - 51 MiB/sec range.  *Vastly* better.

I'm also no longer hearing the clicks which were coming from the box
previously and which I wasn't sure were the (pretty much completely
unused) Zip drive, or the hard drive.

A few notes about S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting
Technology), which I think a lot of folks know _about_ but few actually
know how to use.

  - It's on pretty much every IDE hard drive sold in the past few years,
    and all of 'em sold currently.  SCSI III and better, ditto.

  - Here's 98.3% of what you need to know about SMART:

    - There are two drive self tests (DSTs) which can be run.  Short
      (DST) and long (eNhanced DST or NDST).  Nominally 2 and 27 minutes
      each.  Install the smartmontools package, and run them with:

          # smartctl -t short <device>
          # smartctl -t long <device>

      ...and access the results with:

          # smartctl -a <device>

      ...where "<device>" is /dev/hd[a-h]

    - The short test can rule *in* a bad drive, but is *not* definitive
      in determining a drive is *good*.  That is, it's got a relatively
      high false negative rate in catching bad drives.  This is good for
      the manufacturers (fewer spurious returns) but means you have to
      run additional diagnostics if the short test comes up clean but
      you've got concerns about the disk.  The short test reads at least
      the first 1.5 GB from the disk.  Overall accuracy of this test is
      60-70%.

    - The long test is far better at correctly identifying _bad_ drives,
      with few false positives.  Overall accuracy of this test is 95%.
  - The other 2.7% is:

    - smartmontools installs a daemon, smartd, which can run these tests
      on a regular schedule (for instance DST daily and NDST weekly).
      There's some impact on disk performance, though these tests *can*
      be run on a live system.

    - There are additional attributes monitored, and you should go
      through the 'smartctl -a' output to see what your drive has to say
      about its current status and any logged errors.

    - Drive lifetime is *highly* temperature sensitive.  A 5°C
      temperature rise from 25°C to 30°C reduces MTBF by 25%.  Blow
      those cases!  Failure rate also _falls_ dramatically with drive
      life.  Your 2 year old drive is statistically far more reliable
      than the one fresh out of the box.

    - As before:  the smartmontools Sourceforge homepage has some really
      good information, which is where I've got most of what I'm
      discussing here.  Read the PDFs, particularly the Seagate
      references.

          http://smartmontools.sourceforge.net/

The main problem I encountered was that for my drive's failure mode, the
long test wasn't completing in anything short of geological time.
smartmontools diagnostics indicated a "pass" on the short test, but with
some marginal attributes.  However:

  - The disk was replaced.
  - No data were lost.
  - We had relatively minimal downtime (one day) while the problem was
    identified, repaired, and the system rebuilt.

I'd still like to drop the disk into another system and run Maxtor's
test utility (which I strongly suspect is an NDST).

I thought I'd also detail the system backup and restore process.

Short version:

  - Back up critical data.
  - Swap drives.
  - Booting Knoppix, partition disk.
  - Booting Knoppix, run debootstrap install.
  - Point sources to local apt-cache proxy.
  - Read in package list from old system:
    dpkg --set-selections < file.
  - Reinstall packages with 'apt-get dselect-upgrade'
  - Restore backed-up data.
  - Modify /etc/fstab appropriately.
  - Modify /boot and /boot/grub/menu.lst appropriately.
  - Reboot to recovered system.
  - Test drive performance.
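
A minimal sketch for sanity-checking the saved package list before it is
fed to 'dpkg --set-selections', assuming the file has the usual
'dpkg --get-selections' layout (package name, whitespace, state).  The
function name count_install and the four sample lines are hypothetical,
made up for illustration; they are not taken from the real system.

```shell
#!/bin/sh
# Count the packages marked "install" in a selections list read on stdin.
# Assumed input layout: <package name> <whitespace> <state>, one per line.
count_install() {
    awk '$2 == "install" { n++ } END { print n + 0 }'
}

# Illustrative sample only: three packages to install, one to remove.
printf 'apt\t\tinstall\nbash\t\tinstall\nexim\t\tdeinstall\ngrub\t\tinstall\n' |
    count_install    # prints 3
```

Against a real saved list this would be something like
'zcat packages.gz | count_install', and the figure should roughly match
the number of packages apt then offers to install.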
Long description:

  - When it became clear that there _was_ a problem with the drive, even
    though full diagnostics were not available, I switched from "find
    out what's wrong" to "salvage all data" mode.  Fortunately, although
    reads were slow, they were reliable.  I got ~2 GiB transferred over
    the LAN in about five hours (running unattended overnight).

  - I backed up the following trees:

      - /home
      - /etc
      - /root
      - /boot
      - /usr/local
      - /var/www
      - /var/log
      - /var/backups

    I later found that there was some data in /var/spool (my uptimed
    records) which I didn't have.  Should probably add that.  I'd also
    created an archive of /var/lib but decided against restoring it.
    Most of the state-related data here are created on install anyway.

  - Backups were run from the root directory (to keep the full relative
    path) with:

        # tar czvf - <tree> | ssh [EMAIL PROTECTED] 'cat > <path>/<tree>.tar.gz'

    This prompts for a password as the connection is made, then runs to
    completion.

  - I tested the integrity of the archives both on the remote host and
    after copying them back to the damaged system with:

        for f in *.tar.gz
        do
            echo -e "Testing $f ... \c"
            tar tzvf "$f" >/dev/null && echo OK || echo Wups
        done

  - I saved the package selection status with:

        # dpkg --get-selections | ssh [EMAIL PROTECTED] 'gzip > <path>/packages.gz'

  - At this point, the damaged system was powered down and the drive
    replaced.

  - The damaged system with new drive was booted with Knoppix.

  - The drive was partitioned (I prefer fdisk).

  - Reboot after running fdisk (recommended by some utilities, not sure
    this is required).  Still running Knoppix.

  - Filesystems / swap partitions created.

  - Create /mnt/target.  Mount the new intended root partition here.

  - Run debootstrap.  I'd intended to run installations from my
    apt-cache proxy but had to use http://ftp.us.debian.org/debian
    instead.  If anyone could straighten me out here, I'd prefer local
    fetches:

        # debootstrap woody . http://ftp.us.debian.org/

    ...generally, that's:

        # debootstrap --arch <arch> <dist> <mountpoint> <archive>

    This takes a while.  Chat on #debian at irc.debian.org....

  - With the base system installed, copy in files from /etc/apt/ on the
    old system (effectively pointing my apt archive now to my apt-proxy
    cache), read in the package list, and install about 580 packages.  I
    may have had to install apt first, not positive:

        # zcat packages.gz | dpkg --set-selections
        # apt-get -dy dselect-upgrade

    This was some 510 MiB of data, and took a few hours.  Got to look at
    my LAN performance.  50 kB/s on a 100 Mbps hubbed LAN is far too
    slow.

    Note that I ran a download-only install.  After this, commit with:

        # apt-get dselect-upgrade

  - While the above processes were running, I was also copying the
    archived filesystems back over, testing integrity of the archived
    data, and restoring /home and /usr/local.  The /root, /boot, /etc,
    and /var trees are recovered later.

  - I also ran some preliminary diagnostics on the new hard drive and
    found performance was in the expected range -- 48 - 51 MiB/sec for
    buffered disk reads.  Knoppix's SMART utilities (using the older
    UCSC SMART package) indicated normal behavior, and both short and
    long disk tests were run and passed.

  - With the system's prior packages (finally) installed, I restored the
    /etc, /boot, /root, and /var data.  For /etc, /boot, and /root, I
    moved the existing data to /etc.bak, /boot.bak, /root.bak, etc.  If
    there are any discrepancies, I can go through and clean this up
    later.

  - The 'Knoppix' hostname worked its way into a few config files.

        # find /etc -type f -print0 | xargs -0 grep -il knoppix

    ...and clean those up.

  - Edit /etc/fstab to reflect current partitioning.

  - Run 'grub-install /dev/hda'

  - Modify kernels and re-run 'update-grub' to get a working GRUB
    configuration.

  - Reboot into new system and test services.  Samba, Apache, SSH,
    XDMCP, etc., working well.

Finished.  Oh hell, dawn on the horizon.  Mmmuuusssttt sssllleeeeeeppp....
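
The archive integrity check described above can be sketched as a small
function that also returns a non-zero status when any archive fails, so
it can gate a restore script.  This is a sketch under assumptions, not
the exact commands run during the recovery: the name verify_archives and
the scratch-directory demo data are hypothetical, and the demo builds
its own tiny archive rather than touching any real tree.

```shell
#!/bin/sh
# List every gzip-compressed tar archive in a directory to check that it
# can be read end to end; report OK or Wups per file.
# Returns 0 only if every archive found is readable.
verify_archives() {
    dir=$1
    status=0
    for f in "$dir"/*.tar.gz; do
        [ -e "$f" ] || continue          # no archives matched the glob
        if tar tzf "$f" >/dev/null 2>&1; then
            echo "OK   $f"
        else
            echo "Wups $f"
            status=1
        fi
    done
    return $status
}

# Demonstration in a scratch directory (made-up data for illustration).
work=$(mktemp -d)
mkdir -p "$work/src/home/user" "$work/archives"
echo "hello" > "$work/src/home/user/note.txt"
# -z so the contents really are gzip, matching the .tar.gz name;
# -C so member paths are stored relative to the top of the tree.
tar czf "$work/archives/home.tar.gz" -C "$work/src" home
verify_archives "$work/archives"
rm -rf "$work"
```

Listing an archive only shows that it is readable; it does not compare
the contents against the original files.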

Peace.

-- 
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    Backgrounder on the Caldera/SCO vs. IBM and Linux dispute.
      http://sco.iwethey.org/