2012-05-11 14:22, Jim Klimov wrote:
What conditions can cause the reset of the resilvering
process? My lost-and-found disk can't get back into the
pool because of resilvers restarting...

FOLLOW-UP AND NEW QUESTIONS

Here is a new piece of evidence - I've finally got something
out of fmdump: a series of five retries ending with a fail,
dated about 75 seconds before each resilver restart (more
below). Not a squeak in zpool status, dmesg or /dev/console.
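
To make the "five retries, then a fail" pattern easy to spot, the
`fmdump -eV` text can be grouped by its `ena` value. The following is
only a rough sketch: the `parse_ereports` and `summarize` helpers are
hypothetical names, and `SAMPLE` is a trimmed copy of the records quoted
under DETAILS (header, class, ena and driver-assessment lines only), not
a complete ereport.

```python
import re
from collections import Counter

# (time, ena, driver-assessment) copied from the records under DETAILS.
RECORDS = [
    ("10:42:19.559305872", "0x928e32c9d1700401", "fail"),
    ("14:11:27.754940954", "0x492896f3a8500401", "retry"),
    ("14:11:27.754905021", "0x492896f3a8500401", "retry"),
    ("14:11:27.754866050", "0x492896f3a8500401", "retry"),
    ("14:11:27.754793613", "0x492896f3a8500401", "retry"),
    ("14:11:27.754757103", "0x492896f3a8500401", "retry"),
    ("14:11:27.754721778", "0x492896f3a8500401", "fail"),
]

# Rebuild trimmed fmdump-style text so the parser has something to read.
SAMPLE = "\n".join(
    "May 12 2012 {} ereport.io.scsi.cmd.disk.tran\n"
    "nvlist version: 0\n"
    "        class = ereport.io.scsi.cmd.disk.tran\n"
    "        ena = {}\n"
    "        driver-assessment = {}\n".format(t, ena, verdict)
    for t, ena, verdict in RECORDS
)

HEADER = re.compile(
    r"^(\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d+) (ereport\.\S+)$"
)
FIELD = re.compile(r"^\s*([\w-]+) = (.+?)\s*$")


def parse_ereports(text):
    """Split fmdump -eV text into one dict per ereport."""
    records = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"time": header.group(1), "class": header.group(2)}
            records.append(current)
            continue
        field = FIELD.match(line)
        if field and current is not None:
            # Keep the first value seen for each key in a record.
            current.setdefault(field.group(1), field.group(2))
    return records


def summarize(records):
    """Count driver-assessment values per ena."""
    summary = {}
    for record in records:
        counts = summary.setdefault(record.get("ena"), Counter())
        counts[record.get("driver-assessment")] += 1
    return summary


if __name__ == "__main__":
    for ena, counts in summarize(parse_ereports(SAMPLE)).items():
        print(ena, dict(counts))
```

On real output one would feed the saved text of `fmdump -eV` to
`parse_ereports` instead of `SAMPLE`; lines the two patterns do not
recognize are simply skipped.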

I guess I must assume that the disk is indeed dying, losing
its connection or something like that after a random time
(my resilvers restart after anywhere from 15 minutes to 5
hours). At the very least a SMART long self-test is in
order, while the pool rebuilds onto another disk (the hot
spare) instead of trying to update this one, which was
already in the pool.

Anyhow, the data on the ex-pool disk is likely unused
anyway - from iostat I see that it is only written to,
with few to zero reads for minutes at a time - so I'd lose
nothing by replacing it with a blank drive (that's strange
though)...

I also guess that the disk gets found again after something
like an unlogged bus reset, and that this event causes the
resilvering to restart from scratch.
Q: Would this be the same in OI_151a, or would it continue
resilvering from where it left off? I think I had the pool
exported once during a resilver, and it resumed from the
same percentage counter, so it is possible ;)

The best course of action would be to get those people to
fully replace the untrustworthy disk... Or at least pull it
and push it back in - maybe its contacts just got plain
dirty/oxidized and the disk needs re-seating in the
enclosure...

I'd like someone to please confirm or deny my hypotheses
and guesses :)

DETAILS

According to format, the disk named in the fmdump reports
tailed below is indeed the one I'm trying to resilver onto:

# format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0

From the pool history we see the resilvering restart
timestamps... they closely match the retry-fail cycles,
following them by some 75 seconds:

# zpool history -il pond | tail; date
....
2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user root on thumper]
2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user root on thumper]
2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
Sat May 12 15:45:50 MSK 2012
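
The "some 75 seconds" figure can be checked by subtracting each failed
ereport's time from the following "scrub done ... complete=0" entry in
the history. This is a small sketch, assuming both tools print the same
local time zone and that month names parse in an English/C locale; the
helper names are made up for illustration.

```python
from datetime import datetime


def history_time(stamp):
    """Parse a zpool history timestamp such as 2012-05-12.10:43:35."""
    return datetime.strptime(stamp, "%Y-%m-%d.%H:%M:%S")


def ereport_time(stamp):
    """Parse an fmdump header time, dropping the nanosecond part."""
    whole, _, _nanos = stamp.partition(".")
    return datetime.strptime(whole, "%b %d %Y %H:%M:%S")


def gaps_in_seconds(fail_stamps, restart_stamps):
    """Pair each fail with the restart at the same index (assumed order)."""
    return [
        int((history_time(r) - ereport_time(f)).total_seconds())
        for f, r in zip(fail_stamps, restart_stamps)
    ]


# Times of the two "driver-assessment = fail" ereports in this message.
FAILS = [
    "May 12 2012 10:42:19.559305872",
    "May 12 2012 14:11:27.754721778",
]

# Times of the two "scrub done ... complete=0" history entries.
RESTARTS = [
    "2012-05-12.10:43:35",
    "2012-05-12.14:12:44",
]

if __name__ == "__main__":
    print(gaps_in_seconds(FAILS, RESTARTS))  # [76, 77]
```

Whole-second resolution is enough here, since the history log carries
no sub-second part.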

And last but not least - the FMDUMP messages...

# fmdump -eV | tail -150

...
May 12 2012 10:42:19.559305872 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x928e32c9d1700401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xd5 0x53 0x0 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae064b 0x21565490

May 12 2012 14:11:27.754940954 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cff7c1a

May 12 2012 14:11:27.754905021 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfeefbd

May 12 2012 14:11:27.754866050 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfe5782

May 12 2012 14:11:27.754793613 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfd3c8d

May 12 2012 14:11:27.754757103 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfcadef

May 12 2012 14:11:27.754721778 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfc23f2
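
The raw `cdb` and `__tod` fields above can be decoded by hand. Op-code
0x28 is the SCSI READ(10) command, whose 10-byte CDB carries a
big-endian LBA in bytes 2-5 and a transfer length in bytes 7-8. The
`__tod` pair looks like seconds since the Unix epoch plus nanoseconds.
Here is a sketch under those assumptions: the function names are
hypothetical, and the UTC+4 offset is my assumption for MSK in May 2012.

```python
from datetime import datetime, timedelta, timezone


def decode_read10(cdb_text):
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = [int(byte, 16) for byte in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")      # bytes 2..5
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")   # bytes 7..8
    return lba, blocks


def decode_tod(tod_text, utc_offset_hours=4):
    """Return (datetime, nanoseconds) from a '__tod = secs nanos' value.

    utc_offset_hours=4 is an assumption (MSK in May 2012).
    """
    secs, nanos = (int(part, 16) for part in tod_text.split())
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(secs, tz), nanos


if __name__ == "__main__":
    # The failed read from the 10:42 ereport.
    lba, blocks = decode_read10("0x28 0x0 0x14 0xd5 0x53 0x0 0x0 0x0 0x80 0x0")
    print(hex(lba), blocks)          # 0x14d55300 128

    # The read that was retried and then failed at 14:11.
    lba, blocks = decode_read10("0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0")
    print(hex(lba), blocks)          # 0x14c87780 128

    when, nanos = decode_tod("0x4fae064b 0x21565490")
    print(when.strftime("%Y-%m-%d %H:%M:%S"), nanos)
```

Decoded this way, both failures are 128-block reads (64 KiB if the
sectors are 512 bytes, which I have not verified) at two different
LBAs, and the first `__tod` value agrees with the 10:42:19.559305872
header time of its record.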


//Jim