Have you eliminated geli as a possible source?
I've just set up an old server which has an LSI 2008 running an old FW
(11.0), so I was going to have a go at reproducing this.
Apart from the disconnect steps below, is there anything else needed, e.g.
a read / write workload during the disconnect?
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3ffff,0xfaf40000-0xfaf7ffff irq 26 at device 0.0 on pci3
mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
Regards
Steve
On 20/04/2019 15:39, Karl Denninger wrote:
I can confirm that 20.00.07.00 does *not* stop this.
The previous write/scrub on this device was on 20.00.07.00. It was
swapped back in from the vault yesterday, resilvered without incident,
but a scrub says....
root@NewFS:/home/karl # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0    47
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors
So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.
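
For anyone trying to reproduce this, one way to spot the symptom after each scrub is to pull the non-zero CKSUM counters out of the status output. This is only a minimal sketch, assuming the usual five-column config table that zpool status prints; the cksum_errors helper name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints
# "<vdev> <count>" for every config row whose CKSUM column is not 0.
cksum_errors() {
    awk '
        # The config table starts at the header row.
        $1 == "NAME" && $NF == "CKSUM" { in_config = 1; next }
        # The trailing "errors:" line ends the table.
        $1 == "errors:" { in_config = 0 }
        # Column 5 is CKSUM; accept abbreviated counts too.
        in_config && NF >= 5 && $5 ~ /^[0-9]/ && $5 != "0" { print $1, $5 }
    '
}
```

Usage would be something like `zpool status backup | cksum_errors`, which prints nothing when every counter is zero.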
Again, in my instance these devices are never removed "unsolicited", so
there can't be (or at least shouldn't be able to be) unflushed data in the
device or kernel cache. The procedure is and remains:
zpool offline .....
geli detach .....
camcontrol standby ...
Wait a few seconds for the spindle to spin down.
Remove disk.
Then of course on the other side after insertion and the kernel has
reported "finding" the device:
geli attach ...
zpool online ....
Wait...
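
The two sequences above could be scripted roughly as follows. This is a sketch under stated assumptions, not the exact commands used here: the function names, the DRY_RUN switch, the 5-second pause and every argument are placeholders supplied by the caller, and any key-file or passphrase options that geli attach needs are left out.

```shell
#!/bin/sh
# Sketch only: all names and arguments below are hypothetical placeholders.

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# offline_and_detach POOL VDEV ELI_DEVICE DISK
# Take the vdev offline, detach the geli layer, then spin the disk down.
offline_and_detach() {
    pool=$1; vdev=$2; eli=$3; disk=$4
    run zpool offline "$pool" "$vdev" || return 1
    run geli detach "$eli" || return 1
    run camcontrol standby "$disk" || return 1
    run sleep 5   # assumed pause for the spindle to stop
}

# attach_and_online POOL VDEV PROVIDER
# Run after the kernel has reported the re-inserted disk.
attach_and_online() {
    pool=$1; vdev=$2; provider=$3
    run geli attach "$provider" || return 1
    run zpool online "$pool" "$vdev"
}
```

Running with DRY_RUN=1 first prints each command without touching the pool, which makes it easy to check the arguments before pulling a disk.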
If this is a boogered TXG that's held in the metadata for the
"offline"'d device (maybe "off by one"?), that's potentially bad: if
there is an unknown failure in the other mirror component, the
resilver will complete but data will have been irrevocably destroyed.
Granted, this is a very low probability scenario (the area where the bad
checksums are has to be where the corruption hits, and it has to happen
between the resilver and access to that data.) Those are long odds but
nonetheless a window of "you're hosed" does appear to exist.
_______________________________________________
freebsd-stable@freebsd.org mailing list