An additional data point: when I try to do a zdb -e -d and find the incomplete zfs recv snapshot, I get an error as follows:
# sudo zdb -e -d xxx-yy-01 | grep "%"
Could not open xxx-yy-01/aaa-bb-01/aaa-bb-01-01/%1309906801, error 16
#

Does anyone know what error 16 means from zdb, and how it might impact importing this zpool?

On Wed, Aug 3, 2011 at 9:19 AM, Paul Kraus <p...@kraus-haus.org> wrote:
> I am having a very odd problem, and so far the folks at Oracle
> Support have not provided a working solution, so I am asking the crowd
> here while still pursuing it via Oracle Support.
>
> The system is a T2000 running 10U9 with CPU-2010-01 and two J4400
> loaded with 1 TB SATA drives. There is one zpool on the J4400 (3 x
> 15-disk vdevs + 3 hot spares). This system is the target for zfs send /
> recv replication from our production server. The OS is UFS on local
> disk.
>
> While I was on vacation this T2000 hung with "out of resource"
> errors. Other staff tried rebooting, which hung the box. Then they
> rebooted off of an old BE (10U9 without CPU-2010-01). Oracle Support
> had them apply a couple of patches and an IDR to address zfs "stability
> and reliability problems", as well as set the following in /etc/system:
>
> set zfs:zfs_arc_max = 0x700000000 (which is 28 GB)
> set zfs:arc_meta_limit = 0x700000000 (which is 28 GB)
>
> The system has 32 GB RAM and 32 (virtual) CPUs. They then tried
> importing the zpool, and the system hung (after many hours) with the
> same "out of resource" error. At this point they left the problem for
> me :-(
>
> I removed the zpool.cache from the 10U9 + CPU 2010-10 BE and booted
> from that. I then applied the IDR (IDR146118-12) and the zfs patch it
> depended on (145788-03). I did not include the zfs arc and zfs arc
> meta limits as I did not think they were relevant. A zpool import shows
> the pool is OK, and a sampling of the drives with zdb -l shows good
> labels. I started importing the zpool, and after many hours it hung the
> system with "out of resource" errors. I had a number of tools running
> to see what was going on. The only thing this system is doing is
> importing the zpool.
>
> The ARC had climbed to about 8 GB and then declined to 3 GB by the
> time the system hung. This tells me that something else is consuming
> RAM and the ARC is releasing memory to it.
>
> The hung TOP screen showed the largest user process had only 148 MB
> allocated (and much less resident).
>
> VMSTAT showed a scan rate of over 900,000 (NOT a typo) and almost 8 GB
> of free swap (so whatever is using memory cannot be paged out).
>
> So my guess is that there is a kernel module consuming all (and more)
> of the RAM in the box. I am looking for a way to query how much RAM
> each kernel module is using and to script that in a loop (which will
> hang when the box runs out of RAM next). I am very open to
> suggestions here.
>
> Since this is the recv end of the replication, I assume there was a
> zfs recv going on at the time the system initially hung. I know there
> was a 3+ TB snapshot replicating (via a 100 Mbps WAN link) when I left
> for vacation, and that may have still been running. I also assume that
> any partial snapshots (% instead of @) are being removed when the pool
> is imported. But what could be causing a partial snapshot removal, even
> of a very large snapshot, to run the system out of RAM? What caused
> the initial hang of the system (I assume it was due to running out of
> RAM)? I did not think there was a limit to the size of either a
> snapshot or a zfs recv.
>
> Hung TOP screen:
>
> load averages: 91.43, 33.48, 18.989    xxx-xxx1    18:45:34
> 84 processes: 69 sleeping, 12 running, 1 zombie, 2 on cpu
> CPU states: 95.2% idle, 0.5% user, 4.4% kernel, 0.0% iowait, 0.0% swap
> Memory: 31.9G real, 199M free, 267M swap in use, 7.7G swap free
>
>   PID USERNAME THR PR NCE  SIZE   RES STATE    TIME FLTS   CPU COMMAND
>   533 root      51 59   0  148M 30.6M run    520:21    0 9.77% java
>  1210 yyyyyy     1  0   0 5248K 1048K cpu25    2:08    0 2.23% xload
> 14720 yyyyyy     1 59   0 3248K 1256K cpu24    1:56    0 0.03% top
>   154 root       1 59   0 4024K 1328K sleep    1:17    0 0.02% vmstat
>  1268 yyyyyy     1 59   0 4248K 1568K sleep    1:26    0 0.01% iostat
> ...
>
> VMSTAT:
>
>  kthr      memory            page            disk          faults      cpu
>  r b w   swap    free   re mf pi po  fr de     sr m0 m1 m2 m3  in  sy  cs us sy id
>  0 0 112 8117096 211888 55 46  0  0 425  0 912684  0  0  0  0 976 166 836  0  2 98
>  0 0 112 8117096 211936 53 51  6  0 394  0 926702  0  0  0  0 976 167 833  0  2 98
>
> ARC size (B): 4065882656

--
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer: Frankenstein, A New Musical
   (http://www.facebook.com/event.php?eid=123170297765140)
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
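
For what it is worth, if the "error 16" printed by zdb is a raw errno, then 16 on Solaris is EBUSY (see /usr/include/sys/errno.h). That would be consistent with the %-named in-progress receive dataset still being busy or held when zdb tried to open it, but that is an assumption rather than anything confirmed in this thread.

On the question of querying kernel memory usage in a loop, below is a minimal sketch of the sort of script being asked for. It assumes a stock Solaris 10 system with mdb(1) available; the log path and the 60-second interval are arbitrary choices, ::kmastat reports per-kmem-cache rather than strictly per-module usage (a runaway cache usually still names the guilty subsystem), and ::memstat can itself take a while to run on a 32 GB box.

#!/bin/sh
# Sketch: sample kernel memory consumers until the box hangs, so the
# last few samples survive on disk afterwards.
LOG=/var/tmp/kmem-watch.log                     # hypothetical log location
while true; do
    date >> $LOG
    echo "::memstat" | mdb -k >> $LOG 2>&1      # kernel vs anon vs page cache vs free
    echo "::kmastat" | mdb -k >> $LOG 2>&1      # per-cache kernel allocator usage
    kstat -p zfs:0:arcstats:size >> $LOG 2>&1   # ARC size, for comparison
    sync                                        # push the log to disk before a possible hang
    sleep 60
done

Driving mdb -k this way only reads kernel state; the explicit sync is there so the tail of the log reaches disk before the next hang.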