Todd H. Poole wrote:
> Hmm... I'm leaning away a bit from the hardware, but just in case you've
> got an idea, the machine is as follows:
> 
> CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model
> ADH4850DOBOX
> (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)
> 
> Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid
> Capacitor AMD Motherboard
> (http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)


..
> The reason why I don't think there's a hardware issue is because before I
> got OpenSolaris up and running, I had a fully functional install of
> openSuSE 11.0 running (with everything similar to the original server) to
> make sure that none of the components were damaged during shipping from
> Newegg. Everything worked as expected.

Yes, but you're running a new operating system, new filesystem...
that's a mountain of difference right in front of you.


A few commands whose output you could provide include:


(these two show any FMA-related telemetry)
fmadm faulty
fmdump -v

(this shows your storage controllers and what's connected to them)
cfgadm -lav

You'll also find messages in /var/adm/messages which might prove
useful to review.
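If it's easier to bundle all of that into one file to post back to
the list, something like the following sketch could work. It assumes
a POSIX shell such as ksh93 or bash. The function name `collect_diags`
and the default output path are hypothetical. fmadm and fmdump
generally need root, and any command that isn't installed is noted in
the file rather than treated as fatal.

```shell
#!/bin/sh
# Hypothetical helper: collect the FMA, cfgadm and syslog output
# suggested above into a single text file.
collect_diags() {
    # Output file; the default path is an arbitrary, hypothetical choice.
    out="${1:-/tmp/zfs-diag.txt}"
    : > "$out"    # truncate/create the file
    for cmd in "fmadm faulty" "fmdump -v" "cfgadm -lav"; do
        echo "=== $cmd ===" >> "$out"
        # Split e.g. "fmdump -v" into command name ($1) and arguments.
        set -- $cmd
        if command -v "$1" >/dev/null 2>&1; then
            "$@" >> "$out" 2>&1
        else
            echo "($1 not found on this system)" >> "$out"
        fi
    done
    echo "=== /var/adm/messages (last 200 lines) ===" >> "$out"
    if [ -r /var/adm/messages ]; then
        tail -n 200 /var/adm/messages >> "$out"
    else
        echo "(/var/adm/messages not readable)" >> "$out"
    fi
}
```

Run it as root, e.g. `collect_diags /var/tmp/diag.txt`, and attach
the resulting file.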


Apart from that, your description of what you're doing to simulate
failure is:

"however, whenever I unplug the SATA cable from one of the drives (to 
simulate a catastrophic drive failure) while doing moderate reading from the 
zpool (such as streaming HD video), not only does the video hang on the 
remote machine (which is accessing the zpool via NFS), but the server 
running OpenSolaris seems to either hang, or become incredibly unresponsive."


First and foremost, in my view, this is a stupid thing to do. You've
got common-or-garden PC hardware which almost *definitely* does not
support hot-plugging devices, which is exactly what you're telling us
you're doing. Would you try this with the PCI/PCI-e cards in this
system? I think not.


If you absolutely must do something like this, then please use
what's known as "coordinated hotswap" via the cfgadm(1M) command.


Viz:

(detect fault in disk c2t3d0, in some way)

# cfgadm -c unconfigure c2::dsk/c2t3d0
# cfgadm -c disconnect c2::dsk/c2t3d0

(go and swap the drive, plug in the new drive using the same cable)

# zpool replace -f poolname c2t3d0


This tells the kernel to do things in the right order and,
for the zpool, tells it to do an in-place replacement of
device c2t3d0 in your pool.
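As a dry run, the sequence above can be sketched as a small function
that only prints the commands for a given pool and disk, so you can
check them before typing them as root. The function name
`hotswap_plan` is hypothetical, and it assumes cNtNdN disk names and
the `cN::dsk/<disk>` attachment-point form used in the example.
Attachment-point names vary by controller, so confirm yours with
`cfgadm -al` first. Depending on the controller, a
`cfgadm -c configure` of the new disk may also be needed before the
`zpool replace`; the ZFS admin guide covers this.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints (never executes) the coordinated
# hotswap commands for one disk in one pool.
hotswap_plan() {
    pool="$1"             # e.g. "poolname"
    disk="$2"             # e.g. "c2t3d0" (assumed cNtNdN naming)
    ctrl="${disk%%t*}"    # strip from the first "t": "c2t3d0" -> "c2"
    ap="${ctrl}::dsk/${disk}"   # attachment point, as in the example
    echo "cfgadm -c unconfigure $ap"
    echo "cfgadm -c disconnect $ap"
    echo "# swap the drive, plug in the new drive using the same cable"
    echo "zpool replace -f $pool $disk"
}
```

For example, `hotswap_plan poolname c2t3d0` prints the same commands
as the example above.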


There are also manpages and admin guides you could look
through:

http://docs.sun.com/app/docs/coll/40.17 (manpages)
http://docs.sun.com/app/docs/coll/47.23 (system admin collection)
http://docs.sun.com/app/docs/doc/817-2271 (ZFS admin guide)
http://docs.sun.com/app/docs/doc/819-2723 (devices and filesystems guide)



James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp       http://www.jmcp.homeunix.com/blog
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
