Greetings, all! I've recently jumped into OpenSolaris after years of using Gentoo for my primary server OS, and I've run into a bit of trouble on my main storage zpool. Reading through the archives, the symptoms I'm seeing look fairly common, though the causes seem to vary a bit. It also looks like many of the related issues were fixed at or around snv_99, while I'm running dev build 134.
The trouble started when I created and subsequently tried to destroy a 2TB zvol. The zpool hosting it has compression and dedup enabled on the root dataset. I've run into zvol destruction taking a long time before, so I was expecting this to take a while, but alas it managed to hang and lock the system up tight before it completed. Immediately after starting the `zfs destroy`, all I/O to the zpool stopped dead: no NFS, no Xen zvol access, no local POSIX access. Any attempt to run `zpool <anything>` hung indefinitely and couldn't be Ctrl-C'd or kill -9'd. For about two hours there was disk activity on the pool (blinken lights), but then everything stopped: no more lights, no response on the network, and the console's monitor was stuck in powersave with no keyboard or mouse activity able to wake it. I let it sit that way for another hour or so before giving up and hitting the BRS.

Upon reboot, the OS hung at "Reading ZFS Config: -", again with blinken lights. That ran for about an hour, then locked up as above. To get back into the system, I pulled the four Samsung HD203WI drives as well as the two OCZ Vertex SSDs (split between ZIL, L2ARC, and swap for the Linux xVM guest) and was able to boot up with just the rpool.

Knowing that my system is RAM constrained (4GB, with 1.5GB of that dedicated to a Linux guest under xVM), I thought that perhaps the ZFS ARC was exhausting system memory and ultimately killing the system. Turning to the Evil Tuning Guide, I added the following to /etc/system, followed by a `bootadm update-archive` and a reboot:

    set zfs:zfs_arc_max = 0x20000000

Once the system rebooted (still without the six devices for the pool), I hot-added the six devices, ran `cfgadm -al` followed by `cfgadm -c configure <the devices>`, and had everything connected and spinning again. `zpool status`, of course, was livid about the state of disrepair the pool was in:

  pool: tank
 state: UNAVAIL
status: One or more devices could not be opened.
        There are insufficient replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        tank          UNAVAIL      0     0     0  insufficient replicas
          raidz1-0    UNAVAIL      0     0     0  insufficient replicas
            c4t1d0    UNAVAIL      0     0     0  cannot open
            c4t0d0    UNAVAIL      0     0     0  cannot open
            c5t0d0    UNAVAIL      0     0     0  cannot open
            c5t1d0    UNAVAIL      0     0     0  cannot open
        logs
          mirror-1    UNAVAIL      0     0     0  insufficient replicas
            c7t4d0s0  UNAVAIL      0     0     0  cannot open
            c7t5d0s0  UNAVAIL      0     0     0  cannot open

So I crossed my fingers, did a `zpool clear tank`, and... nothing. The command hung indefinitely, but I had blinken lights again, so it was definitely doing *something*. I headed off for $DAY_JOB, fingers still crossed (which makes driving difficult, believe me). Through the magic of open WiFi on an otherwise closed corporate network, I was able to VPN back home and keep an eye on things.

At this point, `zpool <anything>` still hung immortally (I tried everything short of a stake through the heart, but it wouldn't be killed). Existing SSH sessions continued to work (accessed via VNC to my desktop where they'd been left open), but any new SSH attempt hung before getting a shell:

    fawn:~ pendor$ ssh pin
    Last login: Fri Aug 6 00:08:30 2010 from fawn.thebedells

(That's the end -- nothing ever comes after that line.) Peeking at `ps` showed that each new SSH attempt tried to run /usr/sbin/quota and seemingly died there.
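For what it's worth, I suspect that quota call is coming from the stock /etc/profile rather than from sshd or PAM itself. Quoting from memory (and trimmed), so treat this as a sketch that may not match b134 exactly:

    # excerpt from the stock Solaris /etc/profile (from memory; b134's may differ)
    case "$0" in
    -sh | -ksh | -jsh | -bash)
            if [ ! -f .hushlogin ]
            then
                    /usr/sbin/quota    # hangs here while the pool is wedged
                    trap "trap '' 2" 2
                    /bin/cat -s /etc/motd
                    trap "" 2
            fi
            ;;
    esac

If that's right, then touching ~/.hushlogin (or commenting out the quota line) should let new logins skip the check entirely.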
Either way, hacking around the quota check is more or less moot. Curiously, console interactive login to X as well as remote VNC sessions to the server both worked fine and gave me a functional Gnome environment.

While this was ongoing, `zpool iostat` was off the menu, but plain old `iostat -dex 10` gave:

                    extended device statistics                ---- errors ---
    device   r/s   w/s   kr/s   kw/s wait actv svc_t  %w  %b s/w h/w trn tot
    sd0      0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0
    sd1      0.1   5.6    6.4   30.8  0.0  0.0   0.9   0   0   0   0   0   0
    sd2      0.0   5.6    0.0   30.8  0.0  0.0   0.6   0   0   0   0   0   0
    sd7     69.3   0.0   71.3    0.0  0.0  0.6   8.3   0  55   0   0   0   0
    sd8     69.0   0.0   71.8    0.0  0.0  0.5   7.5   0  49   0   0   0   0
    sd9     67.5   0.0   73.5    0.0  0.0  0.6   8.3   0  52   0   0   0   0
    sd10    68.3   0.0   68.5    0.0  0.0  0.6   8.6   0  55   0   0   0   0
    sd11     0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0
    sd12     0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0
    sd15     0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0
    sd16     0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0
    sd17     0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0
    sd18     0.0   0.0    0.0    0.0  0.0  0.0   0.0   0   0   0   0   0   0

There's a *little* bit of throughput coming from the drives and an intermittent trickle of writes going to the SSDs. The read rate is pretty much constant (between 70 and 140 per 10-second interval), but the SSD writes only show up sporadically, maybe once or twice a minute. Alas, about four hours after starting the process this time, everything went dark and the system stopped responding. Upon returning home, I found the blinken lights had ceased, and no attempt to access the console would wake the monitor. I pulled the drives and hit the BRS yet again.

At this point, I'm still operating on the assumption that "this too shall pass," and that given enough RAM and patience, the pool might eventually import. Since a goodly chunk of the server's measly 4GB was consumed by a now-inaccessible xVM guest, I hacked my grub menu.lst to boot a non-PV (bare-metal) kernel and came back up with the full 4GB at Solaris' disposal. Repeating the above dance with cfgadm and `zpool clear` has started the import process again, and I anxiously await its outcome. That said, I'm hoping someone with a bit more knowledge of ZFS than I (which would be roughly the entire population of the planet) might have a suggestion for a shortcut to get this pool running again, or at the very least a way to delete zvols in the future without having to perform major surgery. So far I've been running about 90 minutes with the full 4GB available, and the blinken lights persist.

I'll give what vitals I can on the system in the hope they might help. I'm not sure how much detail I'll be able to provide, as I'm not yet qualified to be even a complete Solaris n00b (some parts are missing). That said, I'm as comfortable as can be on Unix-like systems (Linux and Darwin by day) and would love to help diagnose this issue in detail; SSH tunneling, etc., is an option if anyone's up for it. For all my willingness to help, though, the data on this pool is rather on the important side, so I'd like to avoid any debugging likely to lead to data loss if at all possible.

The system is OpenSolaris dev b134 running on x64: a Supermicro MBD-X7SBE motherboard with an Intel E6400 Wolfdale 2.8GHz dual-core CPU and 4GB of DDR2-667 RAM. The data disks are connected to two Supermicro AOC-SAT2-MV8 SATA controllers (staggered across both of them), with the ZIL and L2ARC SSDs on two of the motherboard's built-in SATA ports. Two WD Caviar Blue drives, also on the motherboard SATA, are mirrored to make the rpool.
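One more note on vitals, since memory pressure is my leading theory: while the import grinds along, I've been keeping half an eye on the ARC from a working shell. Roughly the following (I'm new enough to the Solaris tooling that there may well be a better way):

    # one-shot: current ARC size plus its target and hard cap, in bytes
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max

    # same reading every 10 seconds, to watch the trend
    kstat -p zfs:0:arcstats:size 10

If the /etc/system tuning took effect, c_max should read 536870912 (the 0x20000000 = 512MB cap).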
As for the enclosure, the whole mess is housed in a NORCO RPC-4220 case with 20 hot-swap (allegedly) SATA bays. The top shelf of four bays goes to the motherboard SATA, while the remaining 16 are staggered across the two MV8s via SFF-8087-to-4x-SATA splitter cables. My zpools are at version 22. The system usually runs as an xVM dom0, but for the time being I've switched to a vanilla kernel so as to give ZFS maximal resources. As mentioned above, the pool in question has dedup and compression active. Given my reading over the last several days, I've concluded that enabling those was probably a BadIdea(tm), but c'est la vie, at least for the time being.

Best case, if anyone can tell me how to escape import heck and get back to a running system, I'll be your friend forever. Failing that, if there were some way to get a progress report on how the thing is doing, that would at least satisfy my OCD need to know what's going on. If any additional info would help diagnose the issue, just name it.

Thanks in advance for any assistance!

Best regards,
Zac Bedell
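P.S. If kernel-level detail would help pinpoint where the destroy/import is getting stuck, I can try to grab stack traces on the next attempt. My (possibly naive) plan is something along these lines, assuming mdb -k still responds once the hang sets in:

    # kernel stacks of threads in the zfs module, grouped by unique stack
    echo "::stacks -m zfs" | mdb -k

    # ARC summary, to see whether memory really is the squeeze
    echo "::arc" | mdb -k

If those aren't the right incantations, or there's a safer way to get the same information, please say so before I poke the bear again.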