On 7/18/2019 15:35, Karl Denninger wrote:
> On 7/18/2019 15:19, Eugene Grosbein wrote:
>> 19.07.2019 3:13, Karl Denninger wrote:
>>
>>> FreeBSD 12.0-STABLE #2 r349024M: Thu Jun 13 18:01:16 CDT 2019
>>> k...@newfs.denninger.net:/usr/obj/usr/src/amd64.amd64/sys/KSD-SMP
>>>
>>> Note -- no patches of any sort in the ZFS code; I am NOT running any of
>>> my former patch set.
>>>
>>> NewFS.denninger.net dumped core - see /var/crash/vmcore.8
>>>
>>> Thu Jul 18 15:02:54 CDT 2019
>>>
>>> FreeBSD NewFS.denninger.net 12.0-STABLE FreeBSD 12.0-STABLE #2 r349024M:
>>> Thu Jun 13 18:01:16 CDT 2019
>>> k...@newfs.denninger.net:/usr/obj/usr/src/amd64.amd64/sys/KSD-SMP amd64
>>>
>>> panic: double fault
>> [skip]
>>
>>> #283 0xffffffff82748d91 in zio_vdev_io_done (zio=0xfffff8000b8b8000)
>>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3376
>>> #284 0xffffffff82744eac in zio_execute (zio=0xfffff8000b8b8000)
>>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1786
>>> #285 0xffffffff80c3b7f4 in taskqueue_run_locked (queue=0xfffff801a8b35100)
>>>     at /usr/src/sys/kern/subr_taskqueue.c:467
>>> #286 0xffffffff80c3cb28 in taskqueue_thread_loop (arg=<value optimized out>)
>>>     at /usr/src/sys/kern/subr_taskqueue.c:773
>>> #287 0xffffffff80b9ab23 in fork_exit (
>>>     callout=0xffffffff80c3ca90 <taskqueue_thread_loop>,
>>>     arg=0xfffff801a0577520, frame=0xfffffe009d4edc00)
>>>     at /usr/src/sys/kern/kern_fork.c:1063
>>> #288 0xffffffff810b367e in fork_trampoline ()
>>>     at /usr/src/sys/amd64/amd64/exception.S:996
>>> #289 0x0000000000000000 in ?? ()
>>> Current language: auto; currently minimal
>>> (kgdb)
>> You have "double fault" and a completely insane number of stack frames
>> in the trace. This is obviously infinite recursion resulting in kernel
>> stack overflow and panic.
> Yes, but.... why and how?
>
> What's executing at the time is this command:
>
> zfs send -RI $i@zfs-old $i@zfs-base | zfs receive -Fudv $BACKUP
>
> Which in turn results in the old snapshots on the target that are no
> longer on the source being deleted, then the new ones being sent. It
> never gets to the sending part; it blows up during the delete of the
> OLD snapshots.
>
> The one(s) it deletes, however, it DOES delete. When the box is
> rebooted those two snapshots on the target are indeed gone.
>
> That is, it is NOT getting "stuck" on one (which would imply there's an
> un-detected fault in the target filesystem's metadata for that
> snapshot, resulting in a recursive call that blows up the stack), and
> it never gets to send the new snapshot, so whatever is going on is NOT
> on the source filesystem. Neither the source nor the destination shows
> any errors; both pools are healthy with zero error counts.
>
> Therefore the question -- is the system queueing enough work to blow
> the stack *BUT* the work it queues is all legitimate? If so, there's a
> serious problem in the way the code now functions, in that an
> "ordinary" operation can result in what amounts to kernel stack
> exhaustion.
>
> One note -- I haven't run this backup for the last five days, as I do
> it manually and I've been out of town. Previously, running it on a
> daily basis completed without trouble.
> This smells like a backlog of "things to do" when the send runs that
> results in the allegedly-infinite recursion (which isn't really
> infinite) running the stack out of space -- and THAT implies that the
> system is trying to queue a crazy amount of work on a recursive basis
> for what is a perfectly-legitimate operation -- which it should *NOT*
> do.
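For reference, the send/receive above runs once per dataset from my
backup script, in a loop shaped roughly like the sketch below (the
dataset names and the value of $BACKUP here are placeholders, not the
actual script):

#!/bin/sh
# Illustrative backup loop (names are placeholders).
BACKUP=backup                  # pool that receives the replicas

for i in zsr/root zsr/home dbms/ticker; do
        # -R/-I send a recursive, incremental stream of every snapshot
        # between @zfs-old and @zfs-base; -F on the receive rolls the
        # target back and destroys snapshots that no longer exist on
        # the source before the new data is written.
        zfs send -RI $i@zfs-old $i@zfs-base | zfs receive -Fudv $BACKUP
done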
Update: This looks like an OLD bug that came back.

Previously the system would go absolutely insane on the first few
accesses to spinning rust during a snapshot delete and ATTEMPT to send
thousands of TRIM requests -- which spinning rust does not support. On a
system with mixed vdevs, where some pools are rust and some are SSD,
this was a problem, since you can't simply turn TRIM off -- you REALLY
want it on the SSDs.

The FIX for this was to do the following on the import of said pool
comprised of spinning rust:

#
# Now try to trigger TRIM so that we don't have a storm of them
#
#
echo "Attempting to disable TRIM on spinning rust...."
mount -t zfs $BACKUP/no-trim /mnt
dd if=/dev/random of=/mnt/kill-trim bs=128k count=2
echo "Performed 2 writes"
sleep 2
rm /mnt/kill-trim
echo "Performed delete of written file; wait"
sleep 35
umount /mnt
echo "Unmounted temporary filesystem"
sleep 2
echo "TRIM disable theoretically done...."

This would cause ZFS to "figure out" that TRIM doesn't work on the vdevs
in that pool and not send the requests; you were then ok. When the
problem went away I commented those lines out. I just un-commented them
-- and the backup is now running and appears to be ok. It's well past
where it blew up before, and no panic.

So it APPEARS that ZFS (still) doesn't figure out quickly enough on pool
import that TRIM doesn't work on those disks, and this leads to a CRAZY
number of queued requests which blow up the stack. But if you perform an
operation that *should* result in a small number of TRIMs being issued
and then wait a few seconds, the ZFS code does correctly realize that
these are spinning drives that don't do TRIM, doesn't stack the requests
up, and doesn't panic.

This wouldn't bite you on an import followed by MOST ordinary I/O
sequences, because the first few blocks you delete or overwrite will
generate a TRIM attempt, it will fail, and the vdevs get properly
marked. But if the first thing you do after the pool import is a "zfs
recv" *and* that is an incremental receive -- that is, it first deletes
the snapshots on the target that no longer exist on the source -- you
get screwed.

I don't know why this was "fixed" and now once again is not, or whether
it was just luck that it didn't bite me earlier after moving to 12.0 --
but going through the backtrace I found a crap-ton of TRIMs in those
frames, which of course flat-out don't work on spinning drives.

IMHO ZFS should figure this out when the pool is imported and set the
flags appropriately, but it appears that it does not do so until you
actually attempt to TRIM, and the "attempt" in the case of a send/recv
pair results in a very large number of TRIM requests all being queued up
at once -- enough to cause a panic -- before the code figures out that
it can't do them at all.

--
Karl Denninger
k...@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/
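P.S. For anyone who wants to sanity-check whether TRIM is actually being
attempted (and rejected) on a given pool, the legacy FreeBSD ZFS TRIM
code exports counters through sysctl. A rough sketch -- the sysctl names
below are from memory for 12.x, so verify them with "sysctl -a | grep
trim" on your own box:

#!/bin/sh
# Cumulative TRIM statistics; on spinning rust you'd expect the
# "unsupported" counter to climb while "success" stays flat once the
# vdevs have been marked.
sysctl kstat.zfs.misc.zio_trim

# The global kill switch (a loader tunable: vfs.zfs.trim.enabled=0 in
# /boot/loader.conf). Turning it off globally also loses TRIM on the
# SSD pools, which is exactly the problem described above.
sysctl vfs.zfs.trim.enabled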