At 11:34 PM 7/18/2010, Jeremy Chadwick wrote:
>>
>> yes, da0 is a RAID volume with 4 disks behind the scenes.

> Okay, so can you get full SMART statistics for all 4 of those disks?
> The adjusted/calculated values for SMART thresholds won't be helpful
> here, one will need the actual raw SMART data.  I hope the Areca CLI can
> provide that.

I thought there was a way, but I can't seem to get the current smartctl to work with the card. The man page lists areca,N as a valid device type:

-d TYPE, --device=TYPE
              Specifies the type of the device.  The valid arguments to this
              option are ata, scsi, sat, marvell, 3ware,N, areca,N,
              usbcypress, usbjmicron, usbsunplus, cciss,N, hpt,L/M (or
              hpt,L/M/N), and test.

# smartctl -a -d areca,0 /dev/arcmsr0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

/dev/arcmsr0: Unknown device type 'areca,0'
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE], usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N, cciss,N, atacam, test <=======

Use smartctl -h to get a usage summary
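
Just as a shot in the dark, I may loop over the device types this build does accept and see whether any of them happen to pass identify/SMART commands through to the arcmsr device. Untested sketch only; I have no idea whether any of these actually work against /dev/arcmsr0:

#!/bin/sh
# Untested: try each plausible -d type this smartctl build accepts
# against the Areca's control device and see if anything comes back.
for t in ata scsi sat atacam; do
        echo "==> smartctl -i -d $t /dev/arcmsr0"
        smartctl -i -d "$t" /dev/arcmsr0
done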

The latest CLI tool only gives this info:

CLI> disk info drv=1
Drive Information
===============================================================
IDE Channel                        : 1
Model Name                         : ST31000340AS
Serial Number                      : 3QJ07F1N
Firmware Rev.                      : SD15
Disk Capacity                      : 1000.2GB
Device State                       : NORMAL
Timeout Count                      : 0
Media Error Count                  : 0
Device Temperature                 : 29 C
SMART Read Error Rate              : 108(6)
SMART Spinup Time                  : 91(0)
SMART Reallocation Count           : 100(36)
SMART Seek Error Rate              : 81(30)
SMART Spinup Retries               : 100(97)
SMART Calibration Retries          : N.A.(N.A.)
===============================================================
GuiErrMsg<0x00>: Success.

CLI>  disk smart drv=1
S.M.A.R.T Information For Drive[#01]
  # Attribute Items                           Flag   Value  Thres  State
===============================================================================
  1 Raw Read Error Rate                       0x0f     108      6  OK
  3 Spin Up Time                              0x03      91      0  OK
  4 Start/Stop Count                          0x32     100     20  OK
  5 Reallocated Sector Count                  0x33     100     36  OK
  7 Seek Error Rate                           0x0f      81     30  OK
  9 Power-on Hours Count                      0x32      79      0  OK
 10 Spin Retry Count                          0x13     100     97  OK
 12 Device Power Cycle Count                  0x32     100     20  OK
194 Temperature                               0x22      29      0  OK
197 Current Pending Sector Count              0x12     100      0  OK
198 Off-line Scan Uncorrectable Sector Count  0x10     100      0  OK
199 Ultra DMA CRC Error Count                 0x3e     200      0  OK
===============================================================================
GuiErrMsg<0x00>: Success.

CLI>

The obvious ones (timeout, media error, etc.) are all zero.
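
If it would help, I could cron something rough like this between the weekly runs to catch those counters moving. Sketch only; I'm assuming the CLI binary is installed as cli64 and takes the same commands on its command line as it does at the CLI> prompt:

#!/bin/sh
# Rough sketch: poll each of the 4 drives and complain if the Timeout
# or Media Error counters ever go nonzero.  Assumes "cli64" is the
# Areca CLI and that "disk info drv=N" prints the format shown above.
for drv in 1 2 3 4; do
        cli64 disk info drv=$drv | \
            awk -v d="$drv" '/Timeout Count|Media Error Count/ && $NF != 0 {
                printf "drive %d: %s\n", d, $0
            }'
done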


> Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
> actually share the same physical disks on the Areca controller.  Is that
> correct?

Yes


> If so, think about what would happen if heavy I/O happened on
> both da0 and da1 at the same time.  I talk about this a bit more below.

No different than any other single disk being heavily worked. Again, this particular hardware configuration has been beaten about for a couple of years, so I am not sure why it would suddenly become a problem.


>>
>> Prior to someone rebooting it, it had been stuck in this state for a
>> good 90min.  Apart from upgrading to a later RELENG_8 to get the
>> security patches, the machine had been running a few versions of
>> RELENG_8 doing the same workloads every week without issue.

> Then I would say you'd need to roll back kernel+world to a previous date
> and try to figure out when the issue began, if that is indeed the case.

Possibly. The box only gets a heavy workout periodically when it does an rsync to our DR site.



> It would also help if you could provide timestamps of those messages;
> are they all happening at once, or gradual over time?  If over time, do
> they all happen around the same time every day, etc.?  You see where I'm
> going with this.

Every couple of seconds, I think.  If it happens again, I will time it.
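
If/when it starts up again, something along these lines should capture the spacing between messages without me sitting on the console. Sketch only, assuming the errors land in /var/log/messages and mention arcmsr or da0:

#!/bin/sh
# Sketch: follow the system log and print how many seconds elapse
# between successive arcmsr/da0 messages.  Assumes the errors are
# logged to /var/log/messages like ordinary kernel output.
last=0
tail -F /var/log/messages | while read -r line; do
        case "$line" in
        *arcmsr*|*da0*)
                now=$(date +%s)
                [ "$last" -gt 0 ] && echo "+$((now - last))s  $line"
                last=$now
                ;;
        esac
done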




> situation (since you'd then be dedicating an entire disk to just swap).
> Others may have other advice.  You mention in a later mail that the
> ada[0-3] disks make up a ZFS pool of some sort.  You might try splitting
> ada0 into two slices, one for swap and the other used as a pool member.

That seems like it would just move the problem you are trying to get me to avoid onto a different set of disks. If putting swap on a RAID array is a bad thing, I am not sure how moving it to a ZFS RAID array will help.
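
Just so we are talking about the same thing, I read the suggestion as roughly the following (hypothetical sizes and labels, and it would mean pulling ada0 out of the pool and re-adding a smaller partition):

# Hypothetical repartition of ada0: a swap slice up front, the rest
# handed back to the pool.  Sizes and labels are made up for illustration.
gpart create -s gpt ada0
gpart add -t freebsd-swap -s 8G -l swap0 ada0
gpart add -t freebsd-zfs -l disk0 ada0
swapon /dev/gpt/swap0

That still leaves swap sharing spindles with a busy pool, which is the part I am not following.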



> Again: I don't think this is necessarily a bad disk problem.  The only
> way you'd be able to determine that would be to monitor on a per-disk
> basis the I/O response time of each disk member on the Areca.  If the
> CLI tools provide this, awesome.  Otherwise you'll probably need to
> involve Areca Support.

In the past, when I have had bad disks on the Areca, it did catch and flag device timeouts. There were no such alerts leading up to this situation.
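
From the host side, about the best I can do during the next rsync window is watch the two volumes and see whether their service times blow up, e.g.:

# Watch just the two Areca volumes; ms/r and ms/w are the per-request
# service times.  This only sees the volumes, not the member disks;
# the per-member view would still have to come from the Areca firmware.
gstat -f '^da[01]$'
# or, sampled every 5 seconds:
iostat -x -w 5 da0 da1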

        ---Mike




--------------------------------------------------------------------
Mike Tancsa,                                      tel +1 519 651 3400
Sentex Communications,                            m...@sentex.net
Providing Internet since 1994                    www.sentex.net
Cambridge, Ontario Canada                         www.sentex.net/mike

