-----Original message-----
To: ZFS Discussions <zfs-discuss@opensolaris.org>
From: Paul Kraus <p...@kraus-haus.org>
Sent: Tue 27-03-2012 15:05
Subject: Re: [zfs-discuss] kernel panic during zfs import

> On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cj...@mpi-bremen.de> wrote:
> > Hallo everybody,
> >
> > I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic
> > during the import of a zpool (some 30TB) containing ~500 zfs filesystems
> > after reboot. This causes a reboot loop, until booted single user and
> > removed /etc/zfs/zpool.cache.
> >
> > From /var/adm/messages:
> >
> > savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf
> > Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a
> > NULL pointer dereference
> > savecore: [ID 882351 auth.error] Saving compressed system crash dump in
> > /var/crash/vmdump.2
>
> I ran into a very similar problem with Solaris 10U9 and the
> replica (zfs send | zfs recv destination) of a zpool of about 25 TB of
> data. The problem was an incomplete snapshot (the zfs send | zfs recv
> had been interrupted). On boot the system was trying to import the
> zpool and as part of that it was trying to destroy the offending
> (incomplete) snapshot. This was zpool version 22 and destruction of
> snapshots is handled as a single TXG. The problem was that the
> operation was running the system out of RAM (32 GB worth). There is a
> fix for this and it is in zpool 26 (or newer), but any snapshots
> created while the zpool is at a version prior to 26 will have the
> problem on-disk. We have support with Oracle and were able to get a
> loaner system with 128 GB RAM to clean up the zpool (it took about 75
> GB RAM to do so).
>
> If you are at zpool 26 or later this is not your problem. If you
> are at zpool < 26, then test for an incomplete snapshot by importing
> the pool read only, then `zdb -d <zpool> | grep '%'` as the incomplete
> snapshot will have a '%' instead of a '@' as the dataset / snapshot
> separator. You can also run the zdb against the _un_imported_ zpool
> using the -e option to zdb.
>
> See the following Oracle Bugs for more information.
>
> CR# 6876953
> CR# 6910767
> CR# 7082249
>
> CR# 7082249 has been marked as a duplicate of CR# 6948890
>
> P.S. I have a suspect that the incomplete snapshot was also corrupt in
> some strange way, but could never make a solid determination of that.
> We think what caused the zfs send | zfs recv to be interrupted was
> hitting an e1000g Ethernet device driver bug.
>
> --
> {--------1---------2---------3---------4---------5---------6---------7---------}
> Paul Kraus
> -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
> -> Sound Coordinator, Schenectady Light Opera Company (
>    http://www.sloctheater.org/ )
> -> Technical Advisor, Troy Civic Theatre Company
> -> Technical Advisor, RPI Players
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Hi,

this scenario seems to fit. The machine that was sending the snapshot is on
OpenSolaris build 111b (which runs zpool version 14). I rebooted the receiving
machine because of a hanging "zfs receive" that could not be killed.

zdb -d -e <pool> does not give any useful information:

  zdb -d -e san_pool
  Dataset san_pool [ZPL], ID 18, cr_txg 1, 36.0K, 11 objects

When importing the pool read-only, I get an error about two datasets:

  zpool import -o readonly=on san_pool
  cannot set property for 'san_pool/home/someuser': dataset is read-only
  cannot set property for 'san_pool/home/someotheruser': dataset is read-only

As this is a mirror machine, I still have the option to destroy the pool and
copy everything over again via send/receive from the primary. But nobody knows
how long that will last until I'm hit again...

If an interrupted send/receive can screw up a 30 TB target pool, then
send/receive isn't an option for replicating data at all; at the very least it
should be flagged as "don't use this if your target pool might contain any
valuable data".

I will reproduce the crash once more and try to file a bug report for S11, as
recommended by Deepak (not so easy these days...).

thanks

Carsten
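P.S. For anybody else who runs into this: the check Paul describes boils down
to roughly the following (just a sketch; "san_pool" is our pool name here,
substitute your own):

  # import the pool read-only
  zpool import -o readonly=on san_pool

  # per Paul: pools at version 26 or later are not affected
  zpool get version san_pool

  # an incomplete snapshot shows '%' instead of '@' as the
  # dataset/snapshot separator
  zdb -d san_pool | grep '%'

  # alternatively, run the same check against the un-imported pool
  zpool export san_pool
  zdb -e -d san_pool | grep '%'

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss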