Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

Steven Hartland Sat, 20 Apr 2019 08:54:33 -0700

Have you eliminated geli as a possible source?

I've just set up an old server which has an LSI 2008 running an old FW (11.0), so I was going to have a go at reproducing this.

Apart from the disconnect steps below, is there anything else needed, e.g. a read/write workload during the disconnect?

mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3

mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd

mps0: IOCCapabilities:185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>


    Regards
    Steve

On 20/04/2019 15:39, Karl Denninger wrote:

I can confirm that 20.00.07.00 does *not* stop this.
The previous write/scrub on this device was on 20.00.07.00. It was
swapped back in from the vault yesterday, resilvered without incident,
but a scrub says....

root@NewFS:/home/karl # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0    47
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors
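
For anyone trying to reproduce this, the CKSUM column in output like the above can be checked mechanically after each scrub. The following is only a rough sketch, not something taken from the original report: the function name cksum_errors is made up for illustration, and it assumes the usual five-column NAME/STATE/READ/WRITE/CKSUM layout of the config table.

```shell
#!/bin/sh
# Print "<vdev> <count>" for every row of `zpool status` output whose
# CKSUM column is non-zero. Reads the status text on stdin so it can be
# run against saved output as well as a live pool.
# NOTE: cksum_errors is a hypothetical helper name, not a ZFS tool.
cksum_errors() {
    awk '
        # The config table starts after the header row (NAME ... CKSUM).
        $1 == "NAME" && $NF == "CKSUM" { in_config = 1; next }
        # The table ends at the "errors:" line.
        $1 == "errors:" { in_config = 0 }
        # Data rows carry name, state, read, write, cksum in fields 1-5.
        # The count may be abbreviated (e.g. 1.2K), so compare as text.
        in_config && NF >= 5 && $5 ~ /^[0-9]/ && $5 != "0" { print $1, $5 }
    '
}
```

Usage would be along the lines of `zpool status backup | cksum_errors`, which prints nothing when every CKSUM counter is zero.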

So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.

Again, in my instance these devices are never removed "unsolicited" so
there can't be (or at least shouldn't be able to be) unflushed data in
the device or kernel cache. The procedure is and remains:

zpool offline .....
geli detach .....
camcontrol standby ...

Wait a few seconds for the spindle to spin down.

Remove disk.

Then of course on the other side after insertion and the kernel has
reported "finding" the device:

geli attach ...
zpool online ....

Wait...
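
For reproduction purposes, the sequence above could be wrapped up roughly as follows. This is a sketch under stated assumptions rather than the exact commands used: the arguments are elided in the description, so the pool, provider and disk parameters are placeholders, the function names swap_out and swap_in are hypothetical, the 10-second pause is an arbitrary stand-in for "a few seconds", and any key file or passphrase options that geli attach needs on a given system are left out. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the swap-out / swap-in sequence described above.
# All arguments are placeholders; the original message elides them.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# Either echo the command (dry run) or run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: swap_out POOL PROVIDER DISK
#   PROVIDER is the unencrypted provider name; ".eli" is appended here.
swap_out() {
    pool=$1; provider=$2; disk=$3
    run zpool offline "$pool" "${provider}.eli" || return 1
    run geli detach "${provider}.eli" || return 1
    run camcontrol standby "$disk" || return 1
    # Give the spindle time to spin down before the disk is pulled.
    # (10 seconds is an assumed value, skipped in dry-run mode.)
    [ "$DRY_RUN" = "1" ] || sleep 10
}

# usage: swap_in POOL PROVIDER
#   Run only after the kernel has reported the re-inserted device.
swap_in() {
    pool=$1; provider=$2
    run geli attach "$provider" || return 1
    run zpool online "$pool" "${provider}.eli" || return 1
}
```

With the default DRY_RUN=1, calling swap_out with a pool, a provider label and a disk device just echoes the three commands in order, which makes it easy to confirm the ordering before running anything against a real pool.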

If this is a boogered TXG that's held in the metadata for the
"offline"'d device (maybe "off by one"?) that's potentially bad in that
if there is an unknown failure in the other mirror component the
resilver will complete but data has been irrevocably destroyed.

Granted, this is a very low probability scenario (the area where the bad
checksums are has to be where the corruption hits, and it has to happen
between the resilver and access to that data.) Those are long odds but
nonetheless a window of "you're hosed" does appear to exist.


