Hi Claus,

Claus Guttesen wrote:
> Hi.
> 
> I installed solaris express developer edition (b79) on a supermicro
> quad-core harpertown E5405 with 8 GB ram and two internal sata-drives.
> I installed solaris onto one of the internal drives. I added an areca
> arc-1680 sas-controller and configured it in jbod-mode. I attached an
> external sas-cabinet with 16 sas-drives 1 TB (931 binary GB). I
> created a raidz2-pool with ten disks and one spare. I then copied some
> 400 GB of small files each approx. 1 MB. To simulate a disk-crash I
> pulled one disk out of the cabinet and zfs faulted the drive and used
> the spare and started a resilver.

I'm not convinced that this is a valid test; yanking a disk out
will have physical-layer effects apart from removing the device
from your system. I think relling or roch would have something
to say on this also.
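
If you want to exercise ZFS's fault handling without also testing how
the HBA and its driver cope with surprise removal, you can take a device
away in software instead. Something along these lines (the device name
is just illustrative):

# zpool offline ef1 c3t0d3p0
  (run your workload against the degraded pool)
# zpool online ef1 c3t0d3p0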

> During the resilver-process one of the remaining disks had a
> checksum-error and was marked as degraded. The zpool is now
> unavailable. I first tried to add another spare but got I/O-error. I
> then tried to replace the degraded disk by adding a new one:
> 
> # zpool add ef1 c3t1d3p0
> cannot open '/dev/dsk/c3t1d3p0': I/O error
> 
> Partial dmesg:
> 
> Jul 25 13:14:00 malene arcmsr: [ID 419778 kern.notice] arcmsr0: scsi
> id=1 lun=3 ccb='0xffffff02e0ca0800' outstanding command timeout
> Jul 25 13:14:00 malene arcmsr: [ID 610198 kern.notice] arcmsr0: scsi
> id=1 lun=3 fatal error on target, device was gone
> Jul 25 13:14:00 malene arcmsr: [ID 658202 kern.warning] WARNING:
> arcmsr0: tran reset level=1

tran reset with level=1 is a bus reset.

> Jul 25 13:14:00 malene arcmsr: [ID 658202 kern.warning] WARNING:
> arcmsr0: tran reset level=0

tran reset with level=0 is a target-specific reset, which arcmsr
doesn't support.

...

> Jul 25 13:15:00 malene arcmsr: [ID 419778 kern.notice] arcmsr0: scsi
> id=1 lun=3 ccb='0xffffff02e0ca0800' outstanding command timeout
> Jul 25 13:15:00 malene arcmsr: [ID 610198 kern.notice] arcmsr0: scsi
> id=1 lun=3 fatal error on target, device was gone

The command timed out because your system configuration was unexpectedly
changed in a manner which arcmsr doesn't support.

....

> /usr/sbin/zpool status
>   pool: ef1
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>  scrub: resilver in progress, 0.02% done, 5606h29m to go
> config:
> 
>         NAME            STATE     READ WRITE CKSUM
>         ef1             DEGRADED     0     0     0
>           raidz2        DEGRADED     0     0     0
>             spare       ONLINE       0     0     0
>               c3t0d0p0  ONLINE       0     0     0
>               c3t1d2p0  ONLINE       0     0     0
>             c3t0d1p0    ONLINE       0     0     0
>             c3t0d2p0    ONLINE       0     0     0
>             c3t0d0p0    FAULTED     35 1.61K     0  too many errors
>             c3t0d4p0    ONLINE       0     0     0
>             c3t0d5p0    DEGRADED     0     0    34  too many errors
>             c3t0d6p0    ONLINE       0     0     0
>             c3t0d7p0    ONLINE       0     0     0
>             c3t1d0p0    ONLINE       0     0     0
>             c3t1d1p0    ONLINE       0     0     0
>         spares
>           c3t1d2p0      INUSE     currently in use
> 
> errors: No known data errors

A double disk failure while resilvering - not a good state for your
pool to be in.

Can you wait for the resilver to complete? Every minute that goes
by tends to shrink the estimate of how much time remains.
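
You can keep an eye on it with a plain status query, something like

# zpool status ef1

and watch whether the percent-done figure and the time estimate improve.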

In addition, why are you using p0 devices rather than GPT-labelled
disks (or whole-disk s0 slices)?
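
When the resilver does finish, I'd expect zpool replace (rather than
zpool add) to be the way to deal with the faulted device, and if you
give ZFS a whole disk - no p0 or slice suffix - it will write an EFI
label for you. Roughly (the new disk name here is just for
illustration):

# zpool replace ef1 c3t0d0p0 c3t1d3

zpool add, by contrast, tries to grow the pool with a new top-level
vdev rather than repairing the existing raidz2.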

> When I try to start cli64 to access the arc-1680-card it hangs as well.
> Is this a deficiency in the arcmsr-driver?

I'll quibble - "this" can mean several things.

Yes, there seems to be an issue with arcmsr's handling of uncoordinated
device removal. I advise against removing devices that way.

I don't know how cli64 works, and you haven't provided any output from
the system at the time when "it hangs" - what hangs, exactly: the cli64
utility, the system, or your zpool?
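
If it's the cli64 process itself that's wedged, a stack snapshot of it
would help next time, something like

# pstack `pgrep cli64`
# truss -p `pgrep cli64`

truss -p will at least show which system call (if any) it is sitting in.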

For interest - which version of arcmsr are you running?
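
If you're not sure, modinfo will tell you:

# modinfo | grep -i arcmsr

The description field normally carries the driver's version string.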



James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog