Resolution & recovery: Re: Slow disk - hdparm, S.M.A.R.T, badblocks, what else?

Karsten M. Self Sat, 10 Jul 2004 17:43:50 -0700

on Fri, Jul 09, 2004 at 02:17:04AM -0700, Karsten M. Self ([EMAIL PROTECTED]) wrote:
> on Thu, Jul 08, 2004 at 05:59:57PM -0400, Silvan ([EMAIL PROTECTED]) wrote:
> > On Thursday 08 July 2004 06:59 am, Karsten M. Self wrote:
> > 
> > >   - Bad drive?
> >


> That's pretty much the conclusion I'm coming to.  No more results to
> publish right now, but some more poking around with SMART tells me the
> drive's risking imminent failure.  I'm backing data off of it now (very
> slowly), hope to replace it tomorrow.

...and now the (mostly) conclusion.

I still haven't run final conclusive tests on the old drive, but with a
new Maxtor 80 GB 7200 RPM 8 MB cache mumble disk in the box, I'm getting
hdparm disk read results in the 48 - 51 MiB/sec range.  *Vastly* better.
I'm also no longer hearing the clicks which were coming from the box
previously and which I wasn't sure were the (pretty much completely
unused) Zip drive, or the hard drive.
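
For anyone wanting to compare numbers, here is a minimal sketch of pulling
the throughput figure out of an hdparm run.  It assumes the usual
"Timing buffered disk reads: ... = NN.NN MB/sec" output line; the helper
name read_rate is made up for illustration.

```shell
# Sketch: extract the MB/sec figure from `hdparm -t` output on stdin.
# Assumes the line looks like:
#   Timing buffered disk reads:  146 MB in  3.01 seconds =  48.50 MB/sec
read_rate() {
    awk -F'= *' '/Timing buffered disk reads/ { split($2, a, " "); print a[1] }'
}

# Typical use, as root (the device name is an assumption):
#   hdparm -t /dev/hda | read_rate
```

Running it a few times on an otherwise idle system and comparing the
figures gives a steadier picture than any single run.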

A few notes about S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting
Technology), which I think a lot of folks know _about_ but few actually
know how to use.

  - It's on pretty much every IDE hard drive sold in the past few years,
    and all of 'em sold currently.  SCSI III and better, ditto.

  - Here's 98.3% of what you need to know about SMART:

    - There are two drive self tests (DSTs) which can be run.  Short
      (DST) and long (eNhanced DST or NDST).  Nominally 2 and 27 minutes,
      respectively.  Install the smartmontools package, and run them with:

        # smartctl -t short <device>
        # smartctl -t long <device>

      ...and access the results with:

        # smartctl -a <device>

      ...where "<device>" is /dev/hd[a-h]

    - The short test can rule *in* a bad drive, but is *not* definitive
      in determining a drive is *good*.  That is, it's got a relatively
      high false negative rate in catching bad drives.  This is good for
      the manufacturers (fewer spurious returns) but means you have to
      run additional diagnostics if the short test comes back clean but
      you've got concerns about the disk.  The short test reads at least
      the first 1.5 GB from the disk.  Overall accuracy of this test is
      60-70%.

    - The long test is far better at correctly identifying _bad_ drives,
      with few false positives.  Overall accuracy of this test is 95%.

  - The other 1.7% is:

     - smartmontools installs a daemon, smartd, which runs these tests
       regularly:  DST daily and NDST weekly.  There's some impact on
       disk performance, though these tests *can* be run on a live
       system.

     - There are additional attributes monitored, and you should go
       through the 'smartctl -a' output to see what your drive has to
       say about its current status and any logged errors.

     - Drive lifetime is *highly* temperature sensitive.  A 5°C
       temperature rise from 25°C to 30°C reduces MTBF by 25%.  Blow
       those cases!  Failure rate also _falls_ dramatically with drive
       age.  Your 2 year old drive is statistically far more reliable
       than the one fresh out of the box.

     - As before:  the smartmontools Sourceforge homepage has some
       really good information, which is where I've got most of what I'm
       discussing here.  Read the PDFs, particularly the Seagate
       references.

         http://smartmontools.sourceforge.net/
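
As a rough sketch of the SMART notes above: the two helpers below check
the overall health verdict and print a smartd.conf line matching the
"short daily, long weekly" schedule.  The helper names are made up for
illustration, the PASSED string is what smartctl prints for ATA drives,
and the 02:00 / Saturday 03:00 times are an assumption, not necessarily
what any particular package ships.

```shell
# Sketch only: helper names are hypothetical.

# Succeeds if `smartctl -H` output (on stdin) reports an overall PASSED.
# Assumes the ATA wording "... test result: PASSED".
smart_health_ok() {
    grep -q 'test result: PASSED'
}

# Prints a smartd.conf line for the given device: monitor everything
# (-a), short test daily at 02:00, long test Saturdays at 03:00.
smartd_conf_line() {
    printf '%s -a -s (S/../.././02|L/../../6/03)\n' "$1"
}

# Typical use, as root:
#   smartctl -H /dev/hda | smart_health_ok || echo "back up now"
#   smartctl -l selftest /dev/hda      # self-test log only
#   smartctl -A /dev/hda               # vendor attributes only
#   smartd_conf_line /dev/hda >> /etc/smartd.conf
```

Keep in mind that a PASSED verdict only rules out the obvious; as noted
above, a drive can pass and still be on its way out, so the attribute
and error-log output is worth reading as well.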

The main problem I encountered was that for my drive's failure mode, the
long test was only running in geological time.  smartmontools diagnostics
indicated a "pass" on the short test, but with some marginal attributes.


However:

  - The disk was replaced.

  - No data were lost.

  - We had relatively minimal downtime (one day) while the problem was
    identified, repaired, and the system rebuilt.

I'd still like to drop the disk into another system and run Maxtor's
test utility (which I strongly suspect is an NDST).



I thought I'd also detail the system backup and restore process.

Short version.

  - Back up critical data.
  - Swap drives.
  - Booting Knoppix, partition disk.
  - Booting Knoppix, run debootstrap install.
  - Point sources to local apt-cache proxy.
  - Read in package list from old system:  dpkg --set-selections < file.
  - Reinstall packages with 'apt-get dselect-upgrade'
  - Restore backed-up data.
  - Modify /etc/fstab appropriately.
  - Modify /boot and /boot/grub/menu.lst appropriately.
  - Reboot to recovered system.
  - Test drive performance.
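
Before committing to the reinstall step in the list above, it can help
to know how many packages the saved selections will pull in.  A minimal
sketch, assuming the usual "package<TAB>state" format of
'dpkg --get-selections'; the helper name count_selected is made up.

```shell
# Sketch: count packages marked "install" in saved dpkg selections.
# Reads `dpkg --get-selections` output (package, whitespace, state)
# on stdin and prints a single number.
count_selected() {
    awk '$2 == "install" { n++ } END { print n + 0 }'
}

# Typical use (the file name is an assumption):
#   zcat packages.gz | count_selected
```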


Long description:

  - When it became clear that there _was_ a problem with the drive, even
    though full diagnostics were not available, I switched from "find
    out what's wrong" to "salvage all data" mode.  Fortunately, although
    reads were slow, they were reliable.  I got ~2 GiB transferred over
    the LAN in about five hours (running unattended overnight).

  - I backed up the following trees:

    - /home
    - /etc
    - /root
    - /boot
    - /usr/local
    - /var/www
    - /var/log
    - /var/backups

    I later found that there was some data in /var/spool (my uptimed
    records) which I didn't have.  Should probably add that.

    I'd also created a backup of /var/lib but decided against restoring
    it.  Most of the state-related data here are created on install anyway.

  - Backups were run from the root directory (to keep the full relative
    path) with:

      # tar czvf - <tree> | ssh [EMAIL PROTECTED] 'cat > <path>/<tree>.tar.gz'

    This prompts for a password as the connection is made, then runs to
    completion.

  - I tested the integrity of the archives both on the remote host and
    after copying them back to the damaged system with:

       for f in *.tar.gz
       do 
           echo -e "Testing $f ... \c"
           tar tzvf "$f" >/dev/null && echo OK || echo Wups
       done

  - I saved the package selection status with:

       # dpkg --get-selections | ssh [EMAIL PROTECTED] 'gzip > <path>/packages.gz'

  - At this point, the damaged system was powered down and the drive
    replaced.

  - The damaged system with new drive was booted with Knoppix.  
  
  - The drive was partitioned (I prefer fdisk)

  - Reboot after running fdisk (recommended by some utilities, not sure
    this is required).  Still running Knoppix.
  
  - Filesystems / swap partitions created.

  - Create /mnt/target.  Mount the new intended root partition here.

  - Run debootstrap.  I'd intended to run installations from my
    apt-cache proxy but had to use http://ftp.us.debian.org/debian
    instead.  If anyone could straighten me out here, I'd prefer local
    fetches:

      # debootstrap woody . http://ftp.us.debian.org/debian

    ...generally, that's:

      # debootstrap --arch <arch> <dist> <mountpoint> <archive>

    This takes a while.  Chat on #debian at irc.debian.org....

  - With the base system installed, copy in files from /etc/apt/ on the
    old system (effectively pointing my apt archive now to my apt-proxy
    cache), read in the package list, and install about 580 packages.  I
    may have had to install apt first, not positive:

      # zcat packages.gz | dpkg --set-selections
      # apt-get -dy dselect-upgrade

    This was some 510 MiB of data, and took a few hours.  Got to look at
    my LAN performance.  50 kB/s on a 100 Mbps hubbed LAN is far too
    slow.

    Note that I ran a download-only install.  After this, commit with:

      # apt-get dselect-upgrade 

  - While the above processes were running, I was also copying the
    archived filesystems back over, testing integrity of the archived
    data, and restoring /home and /usr/local.  The /root, /boot, /etc,
    and /var trees are recovered later.

  - I also ran some preliminary diagnostics on the new hard drive and
    found performance was in the expected range -- 48 - 51 MiB/sec for
    buffered disk reads.  Knoppix's SMART utilities (using the older
    UCSC SMART package) indicated normal behavior, and both short and
    long disk tests were run and passed.

  - With the system's prior packages (finally) installed, I restored the
    /etc, /boot, /root, and /var data.  For /etc, /boot, and /root, I
    moved the existing data to /etc.bak, /boot.bak, /root.bak, etc.  If
    there are any discrepancies, I can go through and clean this up
    later.

  - The 'Knoppix' hostname worked its way into a few config files.
  
      # find /etc -type f -print0 | xargs -0 grep -il knoppix

    ...and clean those up.

  - Edit /etc/fstab to reflect current partitioning.

  - Run 'grub-install /dev/hda'

  - Modify kernels and re-run 'update-grub' to get a working GRUB
    configuration.

  - Reboot into new system and test services.  Samba, Apache, SSH,
    XDMCP, etc., working well.
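
For the backup and integrity-check steps described above, here is a
minimal sketch that wraps them in two functions so they can be looped
over every tree.  The function names, the host "backuphost", the
/srv/backup path, and the trick of flattening "/" to "_" in archive
names are all assumptions for illustration, not what I actually typed.

```shell
# Sketch only: names, host, and paths below are placeholders.

# Archive one tree (given relative to /) as a gzip'd tar on stdout.
archive_tree() {
    tar czf - -C / "$1"
}

# List every .tar.gz in a directory to check it is readable; prints one
# line per archive and returns non-zero if any archive fails.
verify_archives() {
    status=0
    for f in "$1"/*.tar.gz; do
        [ -e "$f" ] || continue
        if tar tzf "$f" >/dev/null 2>&1; then
            echo "OK   $f"
        else
            echo "BAD  $f"
            status=1
        fi
    done
    return $status
}

# Typical use (backuphost and /srv/backup are hypothetical):
#   for t in home etc root boot usr/local var/www var/log var/backups; do
#       archive_tree "$t" | ssh backuphost "cat > /srv/backup/$(echo "$t" | tr / _).tar.gz"
#   done
#   verify_archives /srv/backup     # run on the backup host
```

Listing an archive only proves it can be read end to end; it does not
compare contents against the source disk.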

Finished.

Oh hell, dawn on the horizon.   Mmmuuusssttt   sssllleeeeeeppp....


Peace.

-- 
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
  Backgrounder on the Caldera/SCO vs. IBM and Linux dispute.
      http://sco.iwethey.org/
