Re: CFT for FreeBSD + ZoL

2019-04-20 Thread Pete French



On 19/04/2019 12:46, k...@ixsystems.com wrote:

FreeBSD Developers,



We're pleased to make available images allowing testing of FreeBSD using ZFS
on Linux (ZoL).  During this development cycle, the ZoL code has been made
portable and is available in the ports tree as sysutils/zol and
sysutils/zol-kmod for the userland and kernel bits, respectively. While some
have used these for testing, we felt it necessary to generate installation
images that are an easier way to get up and running with ZoL. These
images are built against FreeBSD 12-stable and 13-HEAD and will install a
world/kernel with the base-system ZFS disabled and the sysutils/zol ports
pre-installed.
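
For reference, building the two ports by hand from a stock ports tree would
look roughly like this (standard ports workflow; any loader.conf or rc.conf
steps the ports require come from their pkg-message and are omitted here):

  # kernel module (sysutils/zol-kmod) and userland (sysutils/zol)
  cd /usr/ports/sysutils/zol-kmod && make install clean
  cd /usr/ports/sysutils/zol && make install clean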


Ah, this is excellent, thank you for all the work on this. A question
though: is the intent to keep these as ports, or will the ZoL code
be merged back into the base system, replacing the existing ZFS implementation?

cheers,

-pete. [who will give this a test next week if he can]



Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Karl Denninger

On 4/13/2019 06:00, Karl Denninger wrote:
> On 4/11/2019 13:57, Karl Denninger wrote:
>> On 4/11/2019 13:52, Zaphod Beeblebrox wrote:
>>> On Wed, Apr 10, 2019 at 10:41 AM Karl Denninger  wrote:
>>>
>>>
 In this specific case the adapter in question is...

 mps0:  port 0xc000-0xc0ff mem
 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
 mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
 mps0: IOCCapabilities:
 1285c

 Which is indeed a "dumb" HBA (in IT mode), and Zaphod says he connects
 his drives via dumb on-MoBo direct SATA connections.

>>> Maybe I'm in good company.  My current setup has 8 of the disks connected
>>> to:
>>>
>>> mps0:  port 0xb000-0xb0ff mem
>>> 0xfe24-0xfe24,0xfe20-0xfe23 irq 32 at device 0.0 on pci6
>>> mps0: Firmware: 19.00.00.00, Driver: 21.02.00.00-fbsd
>>> mps0: IOCCapabilities:
>>> 5a85c
>>>
>>> ... just with a cable that breaks out each of the 2 connectors into 4
>>> SATA-style connectors, and the other 8 disks (plus boot disks and SSD
>>> cache/log) connected to ports on...
>>>
>>> - ahci0:  port
>>> 0xd050-0xd057,0xd040-0xd043,0xd030-0xd037,0xd020-0xd023,0xd000-0xd01f mem
>>> 0xfe90-0xfe9001ff irq 44 at device 0.0 on pci2
>>> - ahci2:  port
>>> 0xa050-0xa057,0xa040-0xa043,0xa030-0xa037,0xa020-0xa023,0xa000-0xa01f mem
>>> 0xfe61-0xfe6107ff irq 40 at device 0.0 on pci7
>>> - ahci3:  port
>>> 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem
>>> 0xfea07000-0xfea073ff irq 19 at device 17.0 on pci0
>>>
>>> ... each drive connected to a single port.
>>>
>>> I can actually reproduce this at will.  Because I have 16 drives, when one
>>> fails, I need to find it.  I pull the SATA cable for a drive, determine if
>>> it's the drive in question; if not, I reconnect it, "ONLINE" it and wait for
>>> the resilver to stop... usually only a minute or two.
>>>
>>> ... if I do this 4 to 6 odd times to find a drive (I can tell, in general,
>>> whether a drive is on the SAS controller or the SATA controllers... so
>>> I'm only looking among 8, ever) ... then I "REPLACE" the problem drive.
>>> More often than not, a scrub will find a few problems.  In fact, it
>>> appears that the most recent scrub is an example:
>>>
>>> [1:7:306]dgilbert@vr:~> zpool status
>>>   pool: vr1
>>>  state: ONLINE
>>>   scan: scrub repaired 32K in 47h16m with 0 errors on Mon Apr  1 23:12:03
>>> 2019
>>> config:
>>>
>>> NAMESTATE READ WRITE CKSUM
>>> vr1 ONLINE   0 0 0
>>>   raidz2-0  ONLINE   0 0 0
>>> gpt/v1-d0   ONLINE   0 0 0
>>> gpt/v1-d1   ONLINE   0 0 0
>>> gpt/v1-d2   ONLINE   0 0 0
>>> gpt/v1-d3   ONLINE   0 0 0
>>> gpt/v1-d4   ONLINE   0 0 0
>>> gpt/v1-d5   ONLINE   0 0 0
>>> gpt/v1-d6   ONLINE   0 0 0
>>> gpt/v1-d7   ONLINE   0 0 0
>>>   raidz2-2  ONLINE   0 0 0
>>> gpt/v1-e0c  ONLINE   0 0 0
>>> gpt/v1-e1b  ONLINE   0 0 0
>>> gpt/v1-e2b  ONLINE   0 0 0
>>> gpt/v1-e3b  ONLINE   0 0 0
>>> gpt/v1-e4b  ONLINE   0 0 0
>>> gpt/v1-e5a  ONLINE   0 0 0
>>> gpt/v1-e6a  ONLINE   0 0 0
>>> gpt/v1-e7c  ONLINE   0 0 0
>>> logs
>>>   gpt/vr1logONLINE   0 0 0
>>> cache
>>>   gpt/vr1cache  ONLINE   0 0 0
>>>
>>> errors: No known data errors
>>>
>>> ... it doesn't say it now, but there were 5 CKSUM errors on one of the
>>> drives that I had trial-removed (and not on the one replaced).
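
In shell terms that cycle is roughly the following (pool vr1 from the status
above; the device labels and the replacement disk are placeholders):

  # reconnect the trial-removed drive, then:
  zpool online vr1 gpt/v1-d3
  # wait for the short resilver to finish
  while zpool status vr1 | grep -q "resilver in progress"; do sleep 30; done
  # once the truly failed drive is found, replace it and scrub
  zpool replace vr1 gpt/v1-dN /dev/gpt/new-disk
  zpool scrub vr1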
>> That is EXACTLY what I'm seeing; the "OFFLINE'd" drive is the one that,
>> after a scrub, comes up with the checksum errors.  It does *not* flag
>> any errors during the resilver and the drives *not* taken offline do not
>> (ever) show checksum errors either.
>>
>> Interestingly enough you have 19.00.00.00 firmware on your card as well
>> -- which is what was on mine.
>>
>> I have flashed my card forward to 20.00.07.00 -- we'll see if it still
>> does it when I do the next swap of the backup set.
> Very interesting.
>
> This drive was last written/read under 19.00.00.00.  Yesterday I swapped
> it back in.  Note that right now I am running:
>
> mps0:  port 0xc000-0xc0ff mem
> 0xfbb3c000-0xfbb3,0xfbb4-0xfbb7 irq 30 at device 0.0 on pci3
> mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
> mps0: IOCCapabilities:
> 1285c
>
> And, after the scrub completed overnight
>
> [karl@NewFS ~]$ zpool status backup
>   pool: backup
>  state: DEGRADED
> status: One or more devices has experienced an unrecoverable error.  An

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Steven Hartland

Have you eliminated geli as a possible source?

I've just set up an old server which has an LSI 2008 running an old FW 
(11.0), so I was going to have a go at reproducing this.


Apart from the disconnect steps below, is there anything else needed, e.g. 
a read/write workload during the disconnect?


mps0:  port 0xe000-0xe0ff mem 
0xfaf3c000-0xfaf3,0xfaf4-0xfaf7 irq 26 at device 0.0 on pci3

mps0: Firmware: 11.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 
185c


    Regards
    Steve

On 20/04/2019 15:39, Karl Denninger wrote:

I can confirm that 20.00.07.00 does *not* stop this.
The previous write/scrub on this device was on 20.00.07.00.  It was
swapped back in from the vault yesterday, resilvered without incident,
but a scrub says

root@NewFS:/home/karl # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0    47
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors

So this is firmware-invariant (at least between 19.00.00.00 and
20.00.07.00); the issue persists.

Again, in my instance these devices are never removed "unsolicited", so
there can't be (or at least shouldn't be) unflushed data in the
device or kernel cache.  The procedure is and remains:

zpool offline .
geli detach .
camcontrol standby ...

Wait a few seconds for the spindle to spin down.

Remove disk.

Then of course on the other side after insertion and the kernel has
reported "finding" the device:

geli attach ...
zpool online 

Wait...
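
With concrete (placeholder) names -- pool backup, label gpt/backup62-2,
underlying disk da4 -- the two sides of the swap look roughly like:

  # swap out
  zpool offline backup gpt/backup62-2.eli
  geli detach gpt/backup62-2.eli       # drops /dev/gpt/backup62-2.eli
  camcontrol standby da4               # da4 stands in for the raw device
  # wait for spin-down, then pull the disk

  # swap in, once the kernel reports the inserted disk
  geli attach gpt/backup62-2           # keyfile/passphrase options as configured
  zpool online backup gpt/backup62-2.eli
  # resilver starts automatically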

If this is a boogered TXG held in the metadata for the
"offline"'d device (maybe "off by one"?), that's potentially bad: if
there is an unknown failure in the other mirror component, the
resilver will complete but data will have been irrevocably destroyed.

Granted, this is a very low-probability scenario (the corruption has to
hit the same area where the bad checksums are, and it has to happen
between the resilver and access to that data).  Those are long odds, but
a window of "you're hosed" does nonetheless appear to exist.





Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Karl Denninger

On 4/20/2019 10:50, Steven Hartland wrote:
> Have you eliminated geli as possible source?
No; I could conceivably do so by re-creating another backup volume set
without geli-encrypting the drives, but I do not have an extra set of
drives of the required capacity lying around to do that.  I would have
to do it with lower-capacity disks, which I can attempt if you think it
would help.  I *do* have open slots in the drive backplane to set up a
second "test" unit of this sort.  For the reasons below it will take at
least a couple of weeks to get good data on whether the problem exists
without geli, however.
>
> I've just setup an old server which has a LSI 2008 running and old FW
> (11.0) so was going to have a go at reproducing this.
>
> Apart from the disconnect steps below is there anything else needed
> e.g. read / write workload during disconnect?

Yes.  An attempt to recreate this on my sandbox machine using smaller
disks (WD RE-320s) and a decent amount of read/write activity (tens to
~100 gigabytes) on a root mirror of three disks with one taken offline
did not succeed.  It *reliably* appears, however, on my backup volumes
with every drive swap.  The sandbox machine is physically identical
other than the physical disks; both are Xeons with ECC RAM in them.

The only operational difference is that the backup volume sets have a
*lot* of data written to them via zfs send|zfs recv over the intervening
period, whereas with "ordinary" I/O activity (which was the case on my
sandbox) the I/O pattern is materially different.  The root pool on the
sandbox where I tried to reproduce it synthetically *is* using geli (in
fact it boots native-encrypted.)

The "ordinary" resilver on a disk swap typically covers ~2-3 TB and is a
~6-8 hour process.

The usual process for the backup pool looks like this:

Have 2 of the 3 physical disks mounted; the third is in the bank vault.

Over the space of a week, the backup script is run daily.  It first
imports the pool and then for each zfs filesystem it is backing up
(which is not all of them; I have a few volatile ones that I don't care
if I lose, such as object directories for builds and such, plus some
that are R/O data sets that are backed up separately) it does:

If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send -R
...@zfs-base | zfs receive -Fuvd $BACKUP

else

zfs rename -r ...@zfs-base ...@zfs-old
zfs snapshot -r ...@zfs-base

zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP

 if ok then zfs destroy -vr ...@zfs-old otherwise print a complaint
and stop.

When all are complete it then does a "zpool export backup" to detach the
pool in order to reduce the risk of "stupid root user" (me) accidents.
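
As a shell sketch of that loop (the dataset list and $BACKUP are placeholders;
error handling is reduced to the bare "complain and stop" described above):

  #!/bin/sh
  BACKUP=backup
  zpool import ${BACKUP} || exit 1
  for fs in zroot/home zroot/var; do               # placeholder dataset list
      if ! zfs list -t snapshot "${fs}@zfs-base" > /dev/null 2>&1; then
          # first run for this dataset: full send
          zfs snapshot -r "${fs}@zfs-base"
          zfs send -R "${fs}@zfs-base" | zfs receive -Fuvd "${BACKUP}"
      else
          # subsequent runs: incremental from the previous base
          zfs rename -r "${fs}@zfs-base" "${fs}@zfs-old"
          zfs snapshot -r "${fs}@zfs-base"
          if zfs send -RI "${fs}@zfs-old" "${fs}@zfs-base" | zfs receive -Fudv "${BACKUP}"; then
              zfs destroy -vr "${fs}@zfs-old"
          else
              echo "incremental send of ${fs} failed" >&2
              exit 1
          fi
      fi
  done
  zpool export ${BACKUP}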

In short I send an incremental of the changes since the last backup,
which in many cases includes a bunch of automatic snapshots that are
taken on a frequent basis out of cron.  Typically there are a week's
worth of these that accumulate between swaps of the disk to the vault,
and the offline'd disk remains that way for a week.  I also wait for the
zfs destroy on each of the targets to drain before continuing, as not
doing so back in the 9 and 10.x days was a good way to stimulate an
instant panic on re-import the next day due to kernel stack page
exhaustion if the previous operation destroyed hundreds of gigabytes of
snapshots (which does routinely happen, since part of the backed-up data
is Macrium images from PCs, so when a new month comes around the PC's
backup routine removes a huge amount of old data from the filesystem.)

Trying to simulate the checksum errors in a few hours' time thus far has
failed.  But every time I swap the disks on a weekly basis I get a
handful of checksum errors on the scrub.  If I export and re-import the
backup mirror after that the counters are zeroed -- the checksum error
count does *not* remain across an export/import cycle although the
"scrub repaired" line remains.

For example after the scrub completed this morning I exported the pool
(the script expects the pool exported before it begins) and ran the
backup.  When it was complete:

root@NewFS:~/backup-zfs # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat Apr 20 08:45:09 2019
config:

        NAME                      STATE     READ WRITE CKSUM
        backup                    DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            gpt/backup61.eli      ONLINE       0     0     0
            gpt/backup62-1.eli    ONLINE       0     0     0
            13282812295755460479  OFFLINE      0     0     0  was /dev/gpt/backup62-2.eli

errors: No known data errors
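
In other words, the sequence that shows the reset is roughly (pool name as
above, run before the export that the backup script performs):

  zpool status backup        # CKSUM column shows the errors found by the scrub
  zpool export backup
  zpool import backup
  zpool status backup        # CKSUM counters read 0 again; the "scrub repaired" line persists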

It knows it fix

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Steven Hartland
Thanks for the extra info; the next question would be: have you eliminated
the possibility that corruption exists before the disk is removed?


It would be interesting to add a zpool scrub to confirm this isn't the case
before the disk removal is attempted.
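
Something along these lines, run immediately before the offline/detach step,
would show whether any errors pre-date the removal (pool name taken from the
earlier output):

  zpool scrub backup
  while zpool status backup | grep -q "scrub in progress"; do sleep 60; done
  zpool status -v backup     # any CKSUM errors reported here existed before the pull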


    Regards
    Steve

On 20/04/2019 18:35, Karl Denninger wrote:


On 4/20/2019 10:50, Steven Hartland wrote:

Have you eliminated geli as possible source?
No; I could conceivably do so by re-creating another backup volume set 
without geli-encrypting the drives, but I do not have an extra set of 
drives of the capacity required laying around to do that. I would have 
to do it with lower-capacity disks, which I can attempt if you think 
it would help.  I *do* have open slots in the drive backplane to set 
up a second "test" unit of this sort.  For reasons below it will take 
at least a couple of weeks to get good data on whether the problem 
exists without geli, however.


I've just setup an old server which has a LSI 2008 running and old FW 
(11.0) so was going to have a go at reproducing this.


Apart from the disconnect steps below is there anything else needed 
e.g. read / write workload during disconnect?


Yes.  An attempt to recreate this on my sandbox machine using smaller 
disks (WD RE-320s) and a decent amount of read/write activity (tens to 
~100 gigabytes) on a root mirror of three disks with one taken offline 
did not succeed.  It *reliably* appears, however, on my backup volumes 
with every drive swap. The sandbox machine is physically identical 
other than the physical disks; both are Xeons with ECC RAM in them.


The only operational difference is that the backup volume sets have a 
*lot* of data written to them via zfs send|zfs recv over the 
intervening period where with "ordinary" activity from I/O (which was 
the case on my sandbox) the I/O pattern is materially different.  The 
root pool on the sandbox where I tried to reproduce it synthetically 
*is* using geli (in fact it boots native-encrypted.)


The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is 
a ~6-8 hour process.


The usual process for the backup pool looks like this:

Have 2 of the 3 physical disks mounted; the third is in the bank vault.

Over the space of a week, the backup script is run daily.  It first 
imports the pool and then for each zfs filesystem it is backing up 
(which is not all of them; I have a few volatile ones that I don't 
care if I lose, such as object directories for builds and such, plus 
some that are R/O data sets that are backed up separately) it does:


If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send 
-R ...@zfs-base | zfs receive -Fuvd $BACKUP


else

zfs rename -r ...@zfs-base ...@zfs-old
zfs snapshot -r ...@zfs-base

zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP

 if ok then zfs destroy -vr ...@zfs-old otherwise print a 
complaint and stop.


When all are complete it then does a "zpool export backup" to detach 
the pool in order to reduce the risk of "stupid root user" (me) accidents.


In short I send an incremental of the changes since the last backup, 
which in many cases includes a bunch of automatic snapshots that are 
taken on frequent basis out of the cron. Typically there are a week's 
worth of these that accumulate between swaps of the disk to the vault, 
and the offline'd disk remains that way for a week.  I also wait for 
the zpool destroy on each of the targets to drain before continuing, 
as not doing so back in the 9 and 10.x days was a good way to 
stimulate an instant panic on re-import the next day due to kernel 
stack page exhaustion if the previous operation destroyed hundreds of 
gigabytes of snapshots (which does routinely happen as part of the 
backed up data is Macrium images from PCs, so when a new month comes 
around the PC's backup routine removes a huge amount of old data from 
the filesystem.)


Trying to simulate the checksum errors in a few hours' time thus far 
has failed.  But every time I swap the disks on a weekly basis I get a 
handful of checksum errors on the scrub. If I export and re-import the 
backup mirror after that the counters are zeroed -- the checksum error 
count does *not* remain across an export/import cycle although the 
"scrub repaired" line remains.


For example after the scrub completed this morning I exported the pool 
(the script expects the pool exported before it begins) and ran the 
backup.  When it was complete:


root@NewFS:~/backup-zfs # zpool status backup
  pool: backup
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning 
in a

    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 188K in 0 days 09:40:18 with 0 errors on Sat 
Apr 20 08:45:09 2019

config:

    NAME  STATE READ WRITE CKSUM
    backup 

Re: Concern: ZFS Mirror issues (12.STABLE and firmware 19 .v. 20)

2019-04-20 Thread Karl Denninger
No; I can, but of course that's another ~8-hour (overnight) delay
between swaps.

That's not a bad idea, however.

On 4/20/2019 15:56, Steven Hartland wrote:
> Thanks for extra info, the next question would be have you eliminated
> that corruption exists before the disk is removed?
>
> Would be interesting to add a zpool scrub to confirm this isn't the
> case before the disk removal is attempted.
>
>     Regards
>     Steve
>
> On 20/04/2019 18:35, Karl Denninger wrote:
>>
>> On 4/20/2019 10:50, Steven Hartland wrote:
>>> Have you eliminated geli as possible source?
>> No; I could conceivably do so by re-creating another backup volume
>> set without geli-encrypting the drives, but I do not have an extra
>> set of drives of the capacity required laying around to do that. I
>> would have to do it with lower-capacity disks, which I can attempt if
>> you think it would help.  I *do* have open slots in the drive
>> backplane to set up a second "test" unit of this sort.  For reasons
>> below it will take at least a couple of weeks to get good data on
>> whether the problem exists without geli, however.
>>>
>>> I've just setup an old server which has a LSI 2008 running and old
>>> FW (11.0) so was going to have a go at reproducing this.
>>>
>>> Apart from the disconnect steps below is there anything else needed
>>> e.g. read / write workload during disconnect?
>>
>> Yes.  An attempt to recreate this on my sandbox machine using smaller
>> disks (WD RE-320s) and a decent amount of read/write activity (tens
>> to ~100 gigabytes) on a root mirror of three disks with one taken
>> offline did not succeed.  It *reliably* appears, however, on my
>> backup volumes with every drive swap. The sandbox machine is
>> physically identical other than the physical disks; both are Xeons
>> with ECC RAM in them.
>>
>> The only operational difference is that the backup volume sets have a
>> *lot* of data written to them via zfs send|zfs recv over the
>> intervening period where with "ordinary" activity from I/O (which was
>> the case on my sandbox) the I/O pattern is materially different.  The
>> root pool on the sandbox where I tried to reproduce it synthetically
>> *is* using geli (in fact it boots native-encrypted.)
>>
>> The "ordinary" resilver on a disk swap typically covers ~2-3Tb and is
>> a ~6-8 hour process.
>>
>> The usual process for the backup pool looks like this:
>>
>> Have 2 of the 3 physical disks mounted; the third is in the bank vault.
>>
>> Over the space of a week, the backup script is run daily.  It first
>> imports the pool and then for each zfs filesystem it is backing up
>> (which is not all of them; I have a few volatile ones that I don't
>> care if I lose, such as object directories for builds and such, plus
>> some that are R/O data sets that are backed up separately) it does:
>>
>> If there is no "...@zfs-base": zfs snapshot -r ...@zfs-base; zfs send
>> -R ...@zfs-base | zfs receive -Fuvd $BACKUP
>>
>> else
>>
>> zfs rename -r ...@zfs-base ...@zfs-old
>> zfs snapshot -r ...@zfs-base
>>
>> zfs send -RI ...@zfs-old ...@zfs-base |zfs recv -Fudv $BACKUP
>>
>>  if ok then zfs destroy -vr ...@zfs-old otherwise print a
>> complaint and stop.
>>
>> When all are complete it then does a "zpool export backup" to detach
>> the pool in order to reduce the risk of "stupid root user" (me)
>> accidents.
>>
>> In short I send an incremental of the changes since the last backup,
>> which in many cases includes a bunch of automatic snapshots that are
>> taken on frequent basis out of the cron. Typically there are a week's
>> worth of these that accumulate between swaps of the disk to the
>> vault, and the offline'd disk remains that way for a week.  I also
>> wait for the zpool destroy on each of the targets to drain before
>> continuing, as not doing so back in the 9 and 10.x days was a good
>> way to stimulate an instant panic on re-import the next day due to
>> kernel stack page exhaustion if the previous operation destroyed
>> hundreds of gigabytes of snapshots (which does routinely happen as
>> part of the backed up data is Macrium images from PCs, so when a new
>> month comes around the PC's backup routine removes a huge amount of
>> old data from the filesystem.)
>>
>> Trying to simulate the checksum errors in a few hours' time thus far
>> has failed.  But every time I swap the disks on a weekly basis I get
>> a handful of checksum errors on the scrub. If I export and re-import
>> the backup mirror after that the counters are zeroed -- the checksum
>> error count does *not* remain across an export/import cycle although
>> the "scrub repaired" line remains.
>>
>> For example after the scrub completed this morning I exported the
>> pool (the script expects the pool exported before it begins) and ran
>> the backup.  When it was complete:
>>
>> root@NewFS:~/backup-zfs # zpool status backup
>>   pool: backup
>>  state: DEGRADED
>> status: One or more devices has been taken offline by the administrator.
>>     Su