At 11:34 PM 7/18/2010, Jeremy Chadwick wrote:
> > yes, da0 is a RAID volume with 4 disks behind the scenes.
> Okay, so can you get full SMART statistics for all 4 of those disks?
> The adjusted/calculated values for SMART thresholds won't be helpful
> here; one will need the actual raw SMART data. I hope the Areca CLI can
> provide that.
I thought there was, but I can't seem to get the current smartctl to
work with the card.
-d TYPE, --device=TYPE
        Specifies the type of the device. The valid arguments to this
        option are ata, scsi, sat, marvell, 3ware,N, areca,N, usbcypress,
        usbjmicron, usbsunplus, cciss,N, hpt,L/M (or hpt,L/M/N), and test.
# smartctl -a -d areca,0 /dev/arcmsr0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
/dev/arcmsr0: Unknown device type 'areca,0'
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE],
usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N,
cciss,N, atacam, test <=======
Use smartctl -h to get a usage summary
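For what it's worth, if I can find a smartmontools build that actually
lists areca,N as a valid device type, I gather the per-slot query would
look roughly like this (slots 1-4 are just my guess for the 4-disk set,
and /dev/arcmsr0 is the same node as above):

    for n in 1 2 3 4; do smartctl -a -d areca,$n /dev/arcmsr0; done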
The latest CLI tool only gives this info:
CLI> disk info drv=1
Drive Information
===============================================================
IDE Channel : 1
Model Name : ST31000340AS
Serial Number : 3QJ07F1N
Firmware Rev. : SD15
Disk Capacity : 1000.2GB
Device State : NORMAL
Timeout Count : 0
Media Error Count : 0
Device Temperature : 29 C
SMART Read Error Rate : 108(6)
SMART Spinup Time : 91(0)
SMART Reallocation Count : 100(36)
SMART Seek Error Rate : 81(30)
SMART Spinup Retries : 100(97)
SMART Calibration Retries : N.A.(N.A.)
===============================================================
GuiErrMsg<0x00>: Success.
CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
  #   Attribute Items                             Flag   Value   Thres   State
===============================================================================
   1  Raw Read Error Rate                         0x0f   108     6       OK
   3  Spin Up Time                                0x03   91      0       OK
   4  Start/Stop Count                            0x32   100     20      OK
   5  Reallocated Sector Count                    0x33   100     36      OK
   7  Seek Error Rate                             0x0f   81      30      OK
   9  Power-on Hours Count                        0x32   79      0       OK
  10  Spin Retry Count                            0x13   100     97      OK
  12  Device Power Cycle Count                    0x32   100     20      OK
 194  Temperature                                 0x22   29      0       OK
 197  Current Pending Sector Count                0x12   100     0       OK
 198  Off-line Scan Uncorrectable Sector Count    0x10   100     0       OK
 199  Ultra DMA CRC Error Count                   0x3e   200     0       OK
===============================================================================
GuiErrMsg<0x00>: Success.
CLI>
The obvious ones (timeout, media error, etc.) are all zero.
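Note the CLI only prints the normalized value(threshold) pairs, not the
raw counters Jeremy asked about. If I ever get the areca,N device type
working in smartctl, I assume something along these lines would pull
just the raw reallocated/pending/CRC numbers (attribute names are the
usual ATA ones, so the grep pattern is a guess):

    smartctl -A -d areca,1 /dev/arcmsr0 | \
        egrep 'Reallocated|Pending|Uncorrectable|UDMA_CRC'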
> Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
> actually share the same physical disks on the Areca controller. Is that
> correct?
Yes
> If so, think about what would happen if heavy I/O happened on
> both da0 and da1 at the same time. I talk about this a bit more below.
No different than any other single disk being heavily worked. Again,
this particular hardware configuration has been beaten about for a
couple of years, so I am not sure why it would suddenly stop being
able to handle the same load.
> > Prior to someone rebooting it, it had been stuck in this state for a
> > good 90min. Apart from upgrading to a later RELENG_8 to get the
> > security patches, the machine had been running a few versions of
> > RELENG_8 doing the same workloads every week without issue.
> Then I would say you'd need to roll back kernel+world to a previous date
> and try to figure out when the issue began, if that is indeed the case.
Possibly. The box only gets a heavy workout periodically when it
does an rsync to our DR site.
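If I do end up bisecting it, I believe pinning the source date back in
the supfile is the way to do that; something like the following should
check RELENG_8 out as of an earlier date (the host and date here are
only examples):

    *default host=cvsup.FreeBSD.org base=/var/db prefix=/usr
    *default release=cvs tag=RELENG_8 date=2010.06.01.00.00.00
    *default delete use-rel-suffix compress
    src-all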
> It would also help if you could provide timestamps of those messages;
> are they all happening at once, or gradual over time? If over time, do
> they all happen around the same time every day, etc.? You see where I'm
> going with this.
Every couple of seconds I think. If it happens again, I will time it.
> situation (since you'd then be dedicating an entire disk to just swap).
> Others may have other advice. You mention in a later mail that the
> ada[0-3] disks make up a ZFS pool of some sort. You might try splitting
> ada0 into two slices, one for swap and the other used as a pool member.
That seems like it would just move the problem you are trying to get
me to avoid onto a different set of disks. If putting swap on a RAID
array is a bad thing, I am not sure how moving it to a ZFS RAID array
will help.
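If I did try the split anyway, I assume the layout would be roughly
this (GPT shown; the 8G swap size is a placeholder, and ada0 would
obviously have to come out of the pool first):

    gpart create -s gpt ada0
    gpart add -t freebsd-swap -s 8G ada0   # swap -> ada0p1 (use a sector
                                           # count if -s won't take the G suffix)
    gpart add -t freebsd-zfs ada0          # rest of disk -> ada0p2 for the pool
    swapon /dev/ada0p1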
> Again: I don't think this is necessarily a bad disk problem. The only
> way you'd be able to determine that would be to monitor on a per-disk
> basis the I/O response time of each disk member on the Areca. If the
> CLI tools provide this, awesome. Otherwise you'll probably need to
> involve Areca Support.
In the past, when I have had bad disks on the Areca, it did catch and
flag device timeouts. There were no such alerts leading up to this situation.
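About the only host-side visibility I have is at the volume level, not
per member disk, but watching service times on da0/da1 when it happens
should at least show whether the controller itself is stalling:

    iostat -x -w 1 da0 da1      # per-volume service time / queue depth
    gstat -f '^da[01]'          # same idea at the GEOM level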
---Mike
--------------------------------------------------------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, m...@sentex.net
Providing Internet since 1994 www.sentex.net
Cambridge, Ontario Canada www.sentex.net/mike