On Sat, Aug 23, 2008 at 11:06 PM, Todd H. Poole <[EMAIL PROTECTED]> wrote:
> Howdy y'all,
>
> Earlier this month I downloaded and installed the latest copy of
> OpenSolaris (2008.05) so that I could test out some of the newer
> features I've heard so much about, primarily ZFS.
>
> My goal was to replace our aging Linux-based (SuSE 10.1) file and media
> server with a new machine running Sun's OpenSolaris and ZFS. Our old
> server ran your typical RAID5 setup with 4 500GB disks (3 data, 1
> parity), used lvm, mdadm, and xfs to help keep things in order, and
> relied on NFS to export users' shares. It was solid, stable, and worked
> wonderfully well.
>
> I would like to replicate this experience using the tools OpenSolaris
> has to offer, taking advantage of ZFS. However, there are enough
> differences between the two OSes - especially with respect to the
> filesystems and (for lack of a better phrase) "RAID managers" - that I
> have had to consult Google, these forums, and other places for help on
> numerous occasions.
>
> I've been successful in troubleshooting all problems up until now.
>
> On our old media server (the SuSE 10.1 one), when a disk failed, the
> machine would send out an e-mail detailing the type of failure and
> gracefully fall into a degraded state, but would otherwise continue to
> operate using the remaining 3 disks in the system. After the faulty
> disk was replaced, all of the data from the old disk would be
> replicated onto the new one (I think the term is "resilvered" around
> here?), and after a few hours, the RAID5 array would be seamlessly
> promoted from "degraded" back up to a healthy "clean" (or "online")
> state.
>
> Throughout the entire process, there would be no interruptions to the
> end user: all NFS shares remained mounted; there were no noticeable
> drops in I/O; files, directories, and any other user-created data
> remained available; and if everything went smoothly, no one would
> notice a failure had even occurred.
>
> I've tried my best to recreate something similar in OpenSolaris, but
> I'm stuck on making it all happen seamlessly.
>
> For example, I have a standard beige box machine running OS 2008.05
> with a zpool that contains 4 disks, similar to what the old SuSE 10.1
> server had. However, whenever I unplug the SATA cable from one of the
> drives (to simulate a catastrophic drive failure) while doing moderate
> reading from the zpool (such as streaming HD video), not only does the
> video hang on the remote machine (which is accessing the zpool via
> NFS), but the server running OpenSolaris seems to either hang or
> become incredibly unresponsive.
>
> And when I write "unresponsive," I mean that when I type the command
> "zpool status" to see what's going on, the command hangs, followed by a
> frozen Terminal a few seconds later. After just a few more seconds, the
> entire GUI - mouse included - locks up or freezes, and all NFS shares
> become unavailable from the perspective of the remote machines. The
> whole machine locks up hard.
>
> The machine then stays in this frozen state until I plug the hard disk
> back in, at which point everything, quite literally, pops back into
> existence all at once: the output of the "zpool status" command flies
> by (with all disks listed as "ONLINE" and all "READ," "WRITE," and
> "CKSUM" fields listed as "0"), the mouse jumps to a different part of
> the screen, the NFS share becomes available again, and the movie
> resumes right where it had left off.
>
> While such a quick resume is encouraging, I'd like to avoid the freeze
> in the first place.
>
> How can I keep hardware failures like the above transparent to my
> users?
>
> -Todd
>
> PS: I've done some research, and while my problem is similar to the
> following:
>
> http://opensolaris.org/jive/thread.jspa?messageID=151719
> http://opensolaris.org/jive/thread.jspa?messageID=240481
>
> most of these posts are quite old, and do not offer any solutions.
>
> PPS: I know I haven't provided any details on hardware, but I feel like
> this is more likely a higher-level issue (like some sort of
> configuration file or setting is needed) rather than a lower-level one
> (like faulty hardware). However, if someone were to give me a command
> to run, I'd gladly do it... I'm just not sure which ones would be
> helpful, or if I even know which ones to run. It took me half an hour
> of searching just to find out how to list the disks installed in this
> system (it's "format") so that I could build my zpool in the first
> place. It's not quite as simple as writing out /dev/hda, /dev/hdb,
> /dev/hdc, /dev/hdd. ;)
>
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

It's a lower-level one. What hardware are you running?