Re: [zfs-discuss] ZFS hangs/freezes after disk failure, resumes when disk is replaced

Tim Sat, 23 Aug 2008 21:14:07 -0700

On Sat, Aug 23, 2008 at 11:06 PM, Todd H. Poole <[EMAIL PROTECTED]> wrote:


> Howdy y'all,
>
> Earlier this month I downloaded and installed the latest copy of
> OpenSolaris (2008.05) so that I could test out some of the newer features
> I've heard so much about, primarily ZFS.
>
> My goal was to replace our aging Linux-based (SuSE 10.1) file and media
> server with a new machine running Sun's OpenSolaris and ZFS. Our old server
> ran your typical RAID5 setup with 4 500GB disks (3 data, 1 parity), used
> lvm, mdadm, and xfs to help keep things in order, and relied on NFS to
> export users' shares. It was solid, stable, and worked wonderfully well.
>
> I would like to replicate this experience using the tools OpenSolaris has
> to offer, taking advantage of ZFS. However, there are enough differences
> between the two OSes - especially with respect to the filesystems and (for
> lack of a better phrase) "RAID managers" - to cause me to consult (on
> numerous occasions) the likes of Google, these forums, and other places for
> help.
>
> I've been successful in troubleshooting all problems up until now.
>
> On our old media server (the SuSE 10.1 one), when a disk failed, the
> machine would send out an e-mail detailing the type of failure, and
> gracefully fall into a degraded state, but would otherwise continue to
> operate using the remaining 3 disks in the system. After the faulty disk was
> replaced, all of the data from the old disk would be replicated onto the new
> one (I think the term is "resilvered" around here?), and after a few hours,
> the RAID5 array would be seamlessly promoted from "degraded" back up to a
> healthy "clean" (or "online") state.
>
> Throughout the entire process, there would be no interruptions to the end
> user: all NFS shares remained mounted; there were no noticeable drops in
> I/O; files, directories, and any other user-created data remained
> available; and if everything went smoothly, no one would notice a failure
> had even occurred.
>
> I've tried my best to recreate something similar in OpenSolaris, but I'm
> stuck on making it all happen seamlessly.
>
> For example, I have a standard beige box machine running OpenSolaris
> 2008.05 with a zpool that contains 4 disks, similar to what the old SuSE
> 10.1 server had.
> However, whenever I unplug the SATA cable from one of the drives (to
> simulate a catastrophic drive failure) while doing moderate reading from the
> zpool (such as streaming HD video), not only does the video hang on the
> remote machine (which is accessing the zpool via NFS), but the server
> running OpenSolaris seems to either hang, or become incredibly unresponsive.
>
> And when I write unresponsive, I mean that when I type the command "zpool
> status" to see what's going on, the command hangs, followed by a frozen
> Terminal a few seconds later. After just a few more seconds, the entire GUI
> - mouse included - locks up or freezes, and all NFS shares become
> unavailable from the perspective of the remote machines. The whole machine
> locks up hard.
>
> The machine then stays in this frozen state until I plug the hard disk back
> in, at which point everything, quite literally, pops back into existence all
> at once: the output of the "zpool status" command flies by (with all disks
> listed as "ONLINE" and all "READ," "WRITE," and "CKSUM" fields listed as
> "0"), the mouse jumps to a different part of the screen, the NFS share
> becomes available again, and the movie resumes right where it had left off.
>
> While such a quick resume is encouraging, I'd like to avoid the freeze in
> the first place.
>
> How can I keep any hardware failures like the above transparent to my
> users?
>
> -Todd
>
> PS: I've done some research, and while my problem is similar to the
> following:
>
> http://opensolaris.org/jive/thread.jspa?messageID=151719&#151719
> http://opensolaris.org/jive/thread.jspa?messageID=240481&#240481
>
> most of these posts are quite old, and do not offer any solutions.
>
> PPS: I know I haven't provided any details on hardware, but I feel like
> this is more likely a higher-level issue (like some sort of configuration
> file or setting is needed) rather than a lower-level one (like faulty
> hardware). However, if someone were to give me a command to run, I'd gladly
> do it... I'm just not sure which ones would be helpful, or if I even know
> which ones to run. It took me half an hour of searching just to find out how
> to list the disks installed in this system (it's "format") so that I could
> build my zpool in the first place. It's not quite as simple as writing out
> /dev/hda, /dev/hdb, /dev/hdc, /dev/hdd. ;)
>
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>



It's a lower-level one. What hardware are you running?

