Re: Short SMART check causes disk op timeouts

Miroslav Lachman Mon, 27 Oct 2008 12:39:13 -0700

Jeremy Chadwick wrote:

On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:

Jeremy Chadwick wrote:

On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
Second, your short offline test runs at 0300, but the errors you're
seeing are at 0454 in the morning.  A short offline test does not
take 2 hours to run -- they take between 2-10 minutes -- unless the
system is also in the middle of doing a lot of I/O, in which case the
short test will be suspended.

There are cronjobs (specifically periodic jobs) that run starting at
0301 in the morning ("periodic daily"), and many of those are I/O bound.
This could possibly extend the length of the short test until 0454.

Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
perform a lot of disk I/O, so it's possible that on Sunday specifically
the short SMART test gets pushed back quite some time.

Third, the DMA timeouts you're seeing are possibly caused by the drive
taking too long when internally suspending the SMART test.

In most cases, it's safe for SMART tests (short and long) to be run
while the machine is operational, and disk I/O requests are being
performed.  When an I/O request comes and the disk is in the middle of
performing a SMART test, the drive has to stop the SMART test (i.e.
"suspend" it), complete the I/O request, then resume the SMART test.

The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
doesn't receive an acknowledgement back from the controller (disk)
within 5 seconds, it'll report a timeout on whatever operation it was
performing.  I'm thinking the disk gets stuck in a "do the offline
test, no wait stop there's an I/O request, okay it's done continue the
test, no wait stop there's another I/O" loop.


Can I make the timeout higher? For the sake of elimination.



You will have to make modifications to the ata(4) driver code, and
rebuild+reinstall your kernel.

There is a patch from the FreeNAS folks which turns the command timeout
value into a sysctl for tuning, but that patch has not been brought into
FreeBSD (any version) at this time.  You can find it referenced below
(see one of the "Workarounds" sections).  You will probably have to
apply the patch "by hand" rather than blindly using patch < patchfile,
because the ATA code has changed since the patch was created.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

Another possibility is that your drive really *does* have a bad block at
LBA 836986454, and that one of those cron/periodic jobs is what's
noticing it, and that upon noticing a bad block, the drive more or less
aborts the SMART test to perform internal remapping of the block.

To confirm this, you would need to boot the SeaTools utilities from DOS
or from a CD (see Seagate's site) and run a full sector scan (NOT the
"quick" test).  This takes a few hours.  Assuming it comes back clean,
then my above claim of the offline test taking too long to suspend is
probably the case.

Possibly this is a firmware bug in the drive -- you might consider
mailing Seagate about this problem, although I'm doubting their Tier 1
support will understand what the issue is.

Is the block number always the same?  Do you only see this error on
Sundays?  These are two questions which might help narrow things down.


Nope, the LBA is always different and I see it in the logs once every day.



Okay, so that greatly diminishes the possibility of it being a bad
block.  I'd still advocate running SeaTools on the disk to ensure
everything is 100% okay (re: "sake of elimination"); chances are it will
pass with flying colours.

This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
kernel.

Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?


Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).


I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.



Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.


It looks like a slightly modified example from smartd.conf.sample:

# First (primary) ATA/IDE hard disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)

I am using a similar config without problems:

/dev/ad4 -a -o on -S on -m root -M test -M diminishing -s(S/../.././01|L/../../(3|6)/05) -t -I 194
/dev/ad6 -a -o on -S on -m root -M test -M diminishing -s(S/../.././01|L/../../(3|6)/04) -t -I 194
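
For anyone unsure when a given -s expression actually fires: as far as I
understand smartd.conf(5), smartd builds a 12-character string of the form
T/MM/DD/d/HH (test type, month, day of month, day of week with Monday=1,
hour) and starts a test when the whole string matches the regex.  Below is
a rough Python sketch of that check.  The function name scheduled_tests is
made up for illustration, Python's re module stands in for the POSIX
extended regex matching smartd uses, and only the test types L, S, C and O
are tried, so treat it as an approximation and not as smartd's actual code.

```python
import re
from datetime import datetime


def scheduled_tests(regex, when):
    """Return the test type letters whose T/MM/DD/d/HH string matches regex.

    Hypothetical helper: approximates smartd's -s matching with Python's
    re module instead of POSIX extended regular expressions.
    """
    pattern = re.compile(regex)
    hits = []
    for test_type in "LSCO":  # long, short, conveyance, offline immediate
        # isoweekday() gives Monday=1 ... Sunday=7, like the 'd' field.
        candidate = (
            f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        )
        # The whole 12-character string has to match, not just a prefix.
        if pattern.fullmatch(candidate):
            hits.append(test_type)
    return hits


if __name__ == "__main__":
    sample = "(S/../.././02|L/../../6/03)"      # from smartd.conf.sample
    ad4 = "(S/../.././01|L/../../(3|6)/05)"     # the /dev/ad4 line above

    # Monday 27 Oct 2008, in the 02:00 hour
    print(scheduled_tests(sample, datetime(2008, 10, 27, 2)))
    # Saturday 25 Oct 2008, in the 03:00 hour
    print(scheduled_tests(sample, datetime(2008, 10, 25, 3)))
    # Wednesday 29 Oct 2008, in the 05:00 hour
    print(scheduled_tests(ad4, datetime(2008, 10, 29, 5)))
```

Feeding it the hours when "periodic daily" and "periodic weekly" run is a
quick way to see whether a scheduled self-test can overlap with them.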


Miroslav Lachman
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
