Re: CFT for FreeBSD + ZoL
On 19/04/2019 12:46, k...@ixsystems.com wrote:
> FreeBSD Developers,
>
> We're pleased to make available images allowing testing of FreeBSD using
> ZFS on Linux (ZoL).  During this development cycle, the ZoL code has been
> made portable and is available in the ports tree as sysutils/zol and
> sysutils/zol-kmod, for the userland and kernel bits respectively.  While
> some have used these for testing, we felt it necessary to generate
> installation images as an easier way of getting up and running with ZoL.
> These images are built against FreeBSD 12-STABLE and 13-HEAD and will
> install a world / kernel with the base system ZFS disabled and the
> sysutils/zol ports pre-installed.

Ah, this is excellent; thank you for all the work on this.

A question though - is the intent to keep these as ports, or will the ZoL code be merged back into the base, replacing the existing ZFS implementation?

cheers, -pete.  [who will give this a test next week if he can]
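(For anyone who would rather build from ports than use the images, the install comes down to roughly the following sketch; the loader.conf knob shown is an assumption on my part, so check the ports' pkg-message for the exact post-install steps:)

    # Build and install the ZoL kernel module and userland from an up-to-date ports tree.
    make -C /usr/ports/sysutils/zol-kmod install clean
    make -C /usr/ports/sysutils/zol install clean
    # Assumed knob (verify against the port's pkg-message): load the ZoL module
    # at boot in place of the base system ZFS.
    # echo 'openzfs_load="YES"' >> /boot/loader.conf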
Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 4/13/2019 06:00, Karl Denninger wrote:
> On 4/11/2019 13:57, Karl Denninger wrote:
>> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger wrote:
>>>
>>>> In this specific case the adapter in question is...
>>>>
>>>> mps0: port 0xc000-0xc0ff mem 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
>>>> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
>>>> mps0: IOCCapabilities: 1285c
>>>>
>>>> Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
>>>> his drives via dumb on-MoBo direct SATA connections.
>>>
>>> Maybe I'm in good company.  My current setup has 8 of the disks connected to:
>>>
>>> mps0: port 0xb000-0xb0ff mem 0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities: 5a85c
>>>
>>> ... just with a cable that breaks out each of the 2 connectors into 4
>>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>>> cache/log) connected to ports on...
>>>
>>> - ahci0: port 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem 0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
>>> - ahci2: port 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem 0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
>>> - ahci3: port 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>>
>>> ... each drive connected to a single port.
>>>
>>> I can actually reproduce this at will.  Because I have 16 drives, when one
>>> fails, I need to find it.  I pull the SATA cable for a drive, determine if
>>> it's the drive in question; if not, reconnect, "ONLINE" it and wait for the
>>> resilver to stop... usually only a minute or two.
>>>
>>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>>> whether a drive is on the SAS controller or the SATA controllers... so
>>> I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
>>> More often than not, a scrub will then find a few problems.  In fact, it
>>> appears that the most recent scrub is an example:
>>>
>>> [1:7:306]dgilbert@vr:~> zpool status
>>>   pool: vr1
>>>  state: ONLINE
>>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03 2019
>>> config:
>>>
>>>         NAME            STATE     READ WRITE CKSUM
>>>         vr1             ONLINE       0     0     0
>>>           raidz2-0      ONLINE       0     0     0
>>>             gpt/v1-d0   ONLINE       0     0     0
>>>             gpt/v1-d1   ONLINE       0     0     0
>>>             gpt/v1-d2   ONLINE       0     0     0
>>>             gpt/v1-d3   ONLINE       0     0     0
>>>             gpt/v1-d4   ONLINE       0     0     0
>>>             gpt/v1-d5   ONLINE       0     0     0
>>>             gpt/v1-d6   ONLINE       0     0     0
>>>             gpt/v1-d7   ONLINE       0     0     0
>>>           raidz2-2      ONLINE       0     0     0
>>>             gpt/v1-e0c  ONLINE       0     0     0
>>>             gpt/v1-e1b  ONLINE       0     0     0
>>>             gpt/v1-e2b  ONLINE       0     0     0
>>>             gpt/v1-e3b  ONLINE       0     0     0
>>>             gpt/v1-e4b  ONLINE       0     0     0
>>>             gpt/v1-e5a  ONLINE       0     0     0
>>>             gpt/v1-e6a  ONLINE       0     0     0
>>>             gpt/v1-e7c  ONLINE       0     0     0
>>>         logs
>>>           gpt/vr1log    ONLINE       0     0     0
>>>         cache
>>>           gpt/vr1cache  ONLINE       0     0     0
>>>
>>> errors: No known data errors
>>>
>>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>>> drives that I had trial-removed (and not on the one replaced).
>>
>> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
>> after a scrub, comes up with the checksum errors.  It does *not* flag
>> any errors during the resilver, and the drives *not* taken offline do not
>> (ever) show checksum errors either.
>>
>> Interestingly enough you have 19.00.00.00 firmware on your card as well
>> -- which is what was on mine.
>>
>> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
>> does it when I do the next swap of the backup set.
>
> Very interesting.
>
> This drive was last written/read under 19.00.00.00.  Yesterday I swapped
> it back in.  Note that right now I am running:
>
> mps0: port 0xc000-0xc0ff mem 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities: 1285c
>
> And, after the scrub completed overnight:
>
> [karl@NewFS ~]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
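(For reference, the trial-removal cycle described above comes down to roughly the following; the pool and vdev names are illustrative, taken from the status output above:)

    zpool offline vr1 gpt/v1-e3b      # or simply pull the suspect drive's cable
    # ...inspect the physical drive; if it is not the failed one, reconnect it...
    zpool online vr1 gpt/v1-e3b       # a short resilver catches the member back up
    zpool status vr1                  # wait for the resilver to finish
    # once the actually-failed drive is identified, replace it:
    zpool replace vr1 gpt/v1-d6 gpt/v1-d6new   # the new label name is hypothetical
    zpool scrub vr1                   # this is the scrub that later reports CKSUM
                                      # errors on a trial-removed (not replaced) drive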
Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Have you eliminated geli as a possible source?

I've just set up an old server which has an LSI 2008 running old FW (11.0), so I was going to have a go at reproducing this.  Apart from the disconnect steps below, is there anything else needed, e.g. a read / write workload during the disconnect?

mps0: port 0xe000-0xe0ff mem 0xfaf3c000-0xfaf3,0xfaf4-0xfaf7 irq 26 at device 0.0 on pci3
mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c

Regards
Steve

On 20/04/2019 15:39, Karl Denninger wrote:
> I can confirm that 20.00.07.00 does *not* stop this.
>
> The previous write/scrub on this device was on 20.00.07.00.  It was
> swapped back in from the vault yesterday, resilvered without incident,
> but a scrub says:
>
> root@NewFS:/home/karl # zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://illumos.org/msg/ZFS-8000-9P
>   scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
> config:
>
>         NAME                      STATE     READ WRITE CKSUM
>         backup                    DEGRADED     0     0     0
>           mirror-0                DEGRADED     0     0     0
>             gpt/backup61.eli      ONLINE       0     0     0
>             gpt/backup62-1.eli    ONLINE       0     0    47
>             13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli
>
> errors: No known data errors
>
> So this is firmware-invariant (at least between 19.00.00.00 and
> 20.00.07.00); the issue persists.
>
> Again, in my instance these devices are never removed "unsolicited", so
> there can't be (or at least shouldn't be able to be) unflushed data in
> the device or kernel cache.  The procedure is and remains:
>
> zpool offline ....
> geli detach ....
> camcontrol standby ...
>
> Wait a few seconds for the spindle to spin down.
>
> Remove disk.
>
> Then of course on the other side, after insertion and the kernel has
> reported "finding" the device:
>
> geli attach ...
> zpool online ....
>
> Wait...
>
> If this is a boogered TXG that's held in the metadata for the
> "offline"'d device (maybe "off by one"?), that's potentially bad, in that
> if there is an unknown failure in the other mirror component the resilver
> will complete but data has been irrevocably destroyed.
>
> Granted, this is a very low-probability scenario (the area where the bad
> checksums are has to be where the corruption hits, and it has to happen
> between the resilver and access to that data.)  Those are long odds, but
> nonetheless a window of "you're hosed" does appear to exist.
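(To make the quoted swap procedure concrete, it amounts to roughly the following sketch; the pool name and GPT labels are taken from the status output above, and the da6 device name is hypothetical:)

    zpool offline backup gpt/backup62-2.eli    # take one mirror half offline
    geli detach gpt/backup62-2                 # tear down the geli layer
    camcontrol standby da6                     # spin the drive down (da6 is hypothetical)
    # ...wait for spin-down, pull the disk, insert the other disk, and wait
    #    for the kernel to report the new device...
    geli attach gpt/backup62-2                 # supply the key material as usual
    zpool online backup gpt/backup62-2.eli     # resilver runs automatically
    zpool status backup                        # the *later* scrub is where the CKSUM
                                               # errors show up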
Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
On 4/20/2019 10:50, Steven Hartland wrote:
> Have you eliminated geli as a possible source?

No; I could conceivably do so by re-creating another backup volume set without geli-encrypting the drives, but I do not have an extra set of drives of the required capacity lying around to do that.  I would have to do it with lower-capacity disks, which I can attempt if you think it would help.  I *do* have open slots in the drive backplane to set up a second "test" unit of this sort.  For reasons below it will take at least a couple of weeks to get good data on whether the problem exists without geli, however.

> I've just set up an old server which has an LSI 2008 running old FW
> (11.0), so I was going to have a go at reproducing this.
>
> Apart from the disconnect steps below, is there anything else needed,
> e.g. a read / write workload during the disconnect?

Yes.  An attempt to recreate this on my sandbox machine using smaller disks (WD RE-320s) and a decent amount of read/write activity (tens to ~100 gigabytes) on a root mirror of three disks with one taken offline did not succeed.  It *reliably* appears, however, on my backup volumes with every drive swap.  The sandbox machine is physically identical other than the physical disks; both are Xeons with ECC RAM in them.

The only operational difference is that the backup volume sets have a *lot* of data written to them via zfs send | zfs recv over the intervening period, whereas with "ordinary" I/O activity (which was the case on my sandbox) the I/O pattern is materially different.  The root pool on the sandbox where I tried to reproduce it synthetically *is* using geli (in fact it boots native-encrypted.)

The "ordinary" resilver on a disk swap typically covers ~2-3TB and is a ~6-8 hour process.

The usual process for the backup pool looks like this:

Have 2 of the 3 physical disks mounted; the third is in the bank vault.

Over the space of a week, the backup script is run daily.  It first imports the pool and then, for each zfs filesystem it is backing up (which is not all of them; I have a few volatile ones that I don't care if I lose, such as object directories for builds and such, plus some that are R/O data sets that are backed up separately), it does:

    If there is no "...@zfs-base":
        zfs snapshot -r ...@zfs-base
        zfs send -R ...@zfs-base | zfs receive -Fuvd $BACKUP
    else
        zfs rename -r ...@zfs-base ...@zfs-old
        zfs snapshot -r ...@zfs-base
        zfs send -RI ...@zfs-old ...@zfs-base | zfs recv -Fudv $BACKUP
        if ok then zfs destroy -vr ...@zfs-old
        otherwise print a complaint and stop.

When all are complete it then does a "zpool export backup" to detach the pool, in order to reduce the risk of "stupid root user" (me) accidents.

In short, I send an incremental of the changes since the last backup, which in many cases includes a bunch of automatic snapshots that are taken on a frequent basis out of cron.  Typically there are a week's worth of these that accumulate between swaps of the disk to the vault, and the offline'd disk remains that way for a week.  I also wait for the zpool destroy on each of the targets to drain before continuing, as not doing so back in the 9 and 10.x days was a good way to stimulate an instant panic on re-import the next day due to kernel stack page exhaustion if the previous operation destroyed hundreds of gigabytes of snapshots (which does routinely happen, as part of the backed-up data is Macrium images from PCs, so when a new month comes around the PC's backup routine removes a huge amount of old data from the filesystem.)
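(The backup loop quoted above, written out as a rough sh sketch for reference; the dataset list, the $BACKUP pool name, and the error handling are placeholders, not the actual script:)

    #!/bin/sh
    # Rough sketch only -- dataset names and the pool name are placeholders.
    BACKUP=backup
    zpool import "$BACKUP" || exit 1
    for fs in pool/home pool/data; do      # hypothetical list of filesystems to back up
        if ! zfs list -t snapshot "${fs}@zfs-base" >/dev/null 2>&1; then
            # First run for this filesystem: full replication stream.
            zfs snapshot -r "${fs}@zfs-base"
            zfs send -R "${fs}@zfs-base" | zfs receive -Fuvd "$BACKUP"
        else
            # Subsequent runs: incremental from the previous base snapshot.
            zfs rename -r "${fs}@zfs-base" "${fs}@zfs-old"
            zfs snapshot -r "${fs}@zfs-base"
            if zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | zfs recv -Fudv "$BACKUP"; then
                zfs destroy -vr "${fs}@zfs-old"
            else
                echo "incremental send of ${fs} failed" >&2
                exit 1
            fi
        fi
    done
    zpool export "$BACKUP"                 # detach the pool between runs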
Trying to simulate the checksum errors in a few hours' time has thus far failed.  But every time I swap the disks on a weekly basis I get a handful of checksum errors on the scrub.  If I export and re-import the backup mirror after that, the counters are zeroed -- the checksum error count does *not* remain across an export/import cycle, although the "scrub repaired" line remains.

For example, after the scrub completed this morning I exported the pool (the script expects the pool to be exported before it begins) and ran the backup.  When it was complete:

root@NewFS:~/backup-zfs # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0     0
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

It knows it fix
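(A minimal illustration of the counter behaviour described above, using the pool name from the output; this just restates what the outputs show, it is not a fix:)

    zpool scrub backup       # scrub flags e.g. 47 CKSUM errors on the re-onlined member
    zpool status backup      # non-zero CKSUM column plus the "scrub repaired ..." scan line
    zpool export backup
    zpool import backup
    zpool status backup      # CKSUM counters are back to 0; the "scrub repaired" line remains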
Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
Thanks for the extra info; the next question would be: have you ruled out the possibility that the corruption already exists before the disk is removed?

It would be interesting to add a zpool scrub to confirm this isn't the case before the disk removal is attempted.

Regards
Steve

On 20/04/2019 18:35, Karl Denninger wrote:
> On 4/20/2019 10:50, Steven Hartland wrote:
>> Have you eliminated geli as a possible source?
> No; I could conceivably do so by re-creating another backup volume set
> without geli-encrypting the drives, but I do not have an extra set of
> drives of the required capacity lying around to do that.
> [...]
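(Concretely, the suggested check would be something along these lines ahead of each swap; pool and label names are illustrative, as in the earlier outputs:)

    zpool scrub backup                  # establish that the pool is clean *before* removal
    zpool status backup                 # wait for the scrub to finish; expect 0 CKSUM everywhere
    # only then proceed with the usual swap procedure:
    zpool offline backup gpt/backup62-2.eli
    geli detach gpt/backup62-2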
Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)
No; I can, but of course that's another ~8-hour (overnight) delay between swaps.  That's not a bad idea, however.

On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for the extra info; the next question would be: have you ruled out
> the possibility that the corruption already exists before the disk is
> removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the case
> before the disk removal is attempted.
>
> Regards
> Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as a possible source?
>> No; I could conceivably do so by re-creating another backup volume set
>> without geli-encrypting the drives, but I do not have an extra set of
>> drives of the required capacity lying around to do that.
>> [...]