2012-02-04 18:27, Jim Klimov wrote:
panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff0010a5e920
addr=30 occurred in module "zfs" due to a NULL pointer dereference
panicstack = unix:die+dd () | unix:trap+1799 () | unix:cmntrap+e6
() | zfs:ddt_phys_decref+c () | zfs:zio_ddt_free+5c () |
zfs:zio_execute+8d () | genunix:taskq_thread+285 () |
unix:thread_start+8 () | crashtime = 1328314064
panic-time = February 4, 2012 04:07:44 AM MSK MSK
(end fault-list[0])

I've finally looked into the code, and the problem seems
weird indeed. With "properly" wrong on-disk data the code
can take such a path that it tries to dereference a NULL
pointer. This is a bug, methinks:

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zio.c#zio_ddt_free

   2100 static int
   2101 zio_ddt_free(zio_t *zio)
   2102 {
   2103         spa_t *spa = zio->io_spa;
   2104         blkptr_t *bp = zio->io_bp;
   2105         ddt_t *ddt = ddt_select(spa, bp);
   2106         ddt_entry_t *dde;
   2107         ddt_phys_t *ddp;
   2108
   2109         ASSERT(BP_GET_DEDUP(bp));
   2110         ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
   2111
   2112         ddt_enter(ddt);
   2113         freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
   2114         ddp = ddt_phys_select(dde, bp);
   2115         ddt_phys_decref(ddp);
   2116         ddt_exit(ddt);
   2117
   2118         return (ZIO_PIPELINE_CONTINUE);
   2119 }

The way I see it, this function:
2109) asserts that the dedup bit is set on the blockpointer,

2113) selects a DDT entry matching the BP with ddt_lookup()
      Strangely for me, it requests creation of a new DDT
      entry if a matching one doesn't exist (B_TRUE) -
      even though this is a DDT-free routine.
      I doubt this autocreated entry gets a DVA assigned,
      but I did not trace the code that far to be certain.

2114) then it tries to find the "ddp" pointer to the DDT
      entry with ddt_phys_select(), which can validly return
      NULL if no DDT entries, with same DVA[0] and same
      phys_birth as the bp, exist. I speculated above that
      an autocreated DDT entry might have no DVA yet, maybe.
      I also wonder about cases where phys_birth TXGs would
      differ for some reason, even if an otherwise matching
      DDT entry exists (BP rewrite in future, some lag in
      committing-to-disk today)?
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ddt.c#ddt_phys_select

2115) ddt_phys_decref() just tries to use ddp as a pointer
      to structure without checking if it is NULL itself:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ddt.c#ddt_phys_decref

    309 void
    310 ddt_phys_decref(ddt_phys_t *ddp)
    311 {
    312         ASSERT((int64_t)ddp->ddp_refcnt > 0);
    313         ddp->ddp_refcnt--;
    314 }
    315

According to my stacktrace in the kernel panic message,
this should be where the system panics and fails.

Seemingly this should work if on-disk data is consistent.

Since I've had some "strangeness" regarding corrupt data
blocks pointed to by DDT, with some hickups and reboots
during or soon-after repair attempts, I can't vouch that
my on-disk data is consistent, or that the whole block
tree is readable and accessible to the OS now.

But the code allows for failure is something goes wrong.
I'd add a lot more NULL checks for values (especially
pointers) returned by functions called from zio_ddt_free()
and/or sanity checks for values fed into such functions.

Actually, I'd rewrite this part of zio_ddt_free() like
this:

-       freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
+       freedde = dde = ddt_lookup(ddt, bp, B_FALSE);
+       if (dde) {
            ddp = ddt_phys_select(dde, bp);
+           if (ddp) {
                ddt_phys_decref(ddp);
+           }
+       }
        ddt_exit(ddt);

If ddt_lookup() finds no matching DDE, fine - we're freed.
If ddt_phys_select() finds no DDP, not so fine, but that's
not a good reason to kernel-panic. Perhaps log a message
and maybe leave a leaked block? Or relax the phys_birth check?
Am I mistaken anywhere? :)

Thanks,
//Jim

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Reply via email to