On Tue, Mar 27, 2012 at 3:14 AM, Carsten John <cj...@mpi-bremen.de> wrote:

> Hello everybody,
>
> I have a Solaris 11 box here (Sun X4270) that crashes with a kernel panic
> during the import of a zpool (some 30 TB) containing ~500 zfs filesystems
> after reboot. This causes a reboot loop until the box is booted single
> user and /etc/zfs/zpool.cache is removed.
>
> From /var/adm/messages:
>
> savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf
> Page fault) rp=ffffff002f9cec50 addr=20 occurred in module "zfs" due to a
> NULL pointer dereference
> savecore: [ID 882351 auth.error] Saving compressed system crash dump in
> /var/crash/vmdump.2
I ran into a very similar problem with Solaris 10U9 on the replica (the
zfs send | zfs recv destination) of a zpool holding about 25 TB of data.
The problem was an incomplete snapshot (the zfs send | zfs recv had been
interrupted). On boot the system was trying to import the zpool, and as
part of that it was trying to destroy the offending (incomplete) snapshot.
This was zpool version 22, where the destruction of a snapshot is handled
as a single TXG, and that operation was running the system out of RAM
(32 GB worth).

There is a fix for this in zpool version 26 (or newer), but any snapshot
created while the zpool is at a version prior to 26 will still have the
problem on disk. We have support with Oracle and were able to get a
loaner system with 128 GB of RAM to clean up the zpool (it took about
75 GB of RAM to do so).

If you are at zpool version 26 or later, this is not your problem. If you
are at zpool version < 26, test for an incomplete snapshot by importing
the pool read only and then running `zdb -d <zpool> | grep '%'`; an
incomplete snapshot will have a '%' instead of a '@' as the dataset /
snapshot separator. You can also run zdb against the _un_imported_ zpool
using the -e option (example commands below my signature).

See the following Oracle bugs for more information:

CR# 6876953
CR# 6910767
CR# 7082249 (marked as a duplicate of CR# 6948890)

P.S. I suspect that the incomplete snapshot was also corrupt in some
strange way, but I could never make a solid determination of that. We
think the zfs send | zfs recv was interrupted by an e1000g Ethernet
device driver bug.

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
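
For reference, a minimal sketch of the check described above. The pool
name 'tank' and the dataset names are placeholders, and read-only import
support varies by release, so treat this as a starting point rather than
an exact recipe:

    # Import the pool read only so the import does not try to destroy
    # the snapshot, then look for a dataset name using '%' (instead of
    # '@') as the dataset / snapshot separator:
    zpool import -o readonly=on tank
    zdb -d tank | grep '%'

    # Or inspect the un-imported (exported) pool directly with -e:
    zdb -e -d tank | grep '%'

Any line grep prints (a hypothetical name such as tank/data%nightly,
where a completed snapshot would read tank/data@nightly) points at the
interrupted receive.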