On Mon, Mar 8, 2010 at 2:00 PM, Chris Dunbar <cdun...@earthside.net> wrote:
> Hello,
>
> I just found this list and am very excited that you all are here! I have a
> homemade ZFS server that serves as our poor man's Thumper (we named it
> thumpthis) and provides primarily NFS shares for our VMware environment. As
> is often the case, the server has developed a hardware problem mere days
> before I am ready to go live with a new replacement server (thumpthat). At
> first the problem appeared to be a bad drive, but now I am not so sure. I
> would like to sanity check my thought process with this list and see if
> anybody has some different ideas. Here is a quick timeline of the trouble:
>
> 1. I noticed the following when running a routine zpool status:
>
> <snip>
>           mirror    DEGRADED     0     0     0
>             c3t2d0  ONLINE       0     0     0
>             c3t3d0  REMOVED      0  368K     0
> </snip>
>
> 2. I determined which drive appeared to be offline by watching drive lights
> and then rebooted the server.
>
> 3. Initially the drive appeared to be fine and ZFS picked it back up and
> resilvered the mirror. About 30 minutes later I noticed that the same drive
> was again marked REMOVED.
>
> 4. I shut the server down and replaced the drive with a new, larger disk.
>
> 5. I ran zpool replace tank c3t3d0 and it happily went to work on the
> replacement drive. A few hours later the resilver was complete and all
> seemed well.
>
> 6. The next day, about 12 hours after installing the new drive, I found the
> same error message (here's the whole pool):
>
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        DEGRADED     0     0     0
>           mirror    ONLINE       0     0     0
>             c3t0d0  ONLINE       0     0     0
>             c3t1d0  ONLINE       0     0     0
>           mirror    DEGRADED     0     0     0
>             c3t2d0  ONLINE       0     0     0
>             c3t3d0  REMOVED      0  370K     0
>           mirror    ONLINE       0     0     0
>             c4t0d0  ONLINE       0     0     0
>             c4t1d0  ONLINE       0     0     0
>           mirror    ONLINE       0     0     0
>             c4t2d0  ONLINE       0     0     0
>             c4t3d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> This is where I am now. Either my new hard drive is bad (not impossible) or
> I am looking at some other hardware failure, possibly the AOC-SAT2-MV8
> controller card. I have a spare controller card (same make and model
> purchased at the same time we built the server) and plan to replace that
> tonight. Does that seem like the correct course of action? Are there any
> steps I can take beforehand to zero in on the problem? Any words of
> encouragement or wisdom?
>

What does `iostat -En` say?

My suggestion is to replace the cable that's connecting the c3t3d0 disk.
IMHO, the cable is much more likely to be faulty than a single port on the
disk controller.

-- 
Giovanni Tirloni
sysdroid.com
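If it helps to watch for the fault coming back after a cable or controller swap, here is a minimal Python sketch that scans the config table of saved `zpool status` output and lists every entry that is not ONLINE or that has a non-zero error counter. It is only an illustration: the names `parse_count` and `suspect_devices` are hypothetical, the sample text is the pool listing quoted above, and treating the `K` suffix as a factor of 1000 is an assumption, so the counts should be read as approximate.

```python
import re

# Pool listing copied from the zpool status output quoted in this thread.
SAMPLE = """\
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
          mirror    DEGRADED     0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  REMOVED      0  370K     0
          mirror    ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0
            c4t3d0  ONLINE       0     0     0

errors: No known data errors
"""

# Assumption: abbreviated counters use decimal multipliers (approximate).
SUFFIX = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

# A counter is digits, optionally with a fraction and a size suffix.
COUNT = r"(\d+(?:\.\d+)?[KMGT]?)"

# name, state, then the READ / WRITE / CKSUM counters.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+" + r"\s+".join([COUNT] * 3) + r"\s*$")


def parse_count(text):
    """Convert a counter such as '370K' into an approximate integer."""
    if text[-1] in SUFFIX:
        return int(float(text[:-1]) * SUFFIX[text[-1]])
    return int(text)


def suspect_devices(status_text):
    """Return (name, state, read, write, cksum) for every unhealthy row."""
    suspects = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header, blank line, or free text
        name, state = match.group(1), match.group(2)
        read, write, cksum = (parse_count(match.group(i)) for i in (3, 4, 5))
        if state != "ONLINE" or read or write or cksum:
            suspects.append((name, state, read, write, cksum))
    return suspects


if __name__ == "__main__":
    for row in suspect_devices(SAMPLE):
        print(row)
```

Run against the listing above, this flags the pool, the degraded mirror, and c3t3d0, which is the only leaf device with write errors. The same function could be pointed at fresh `zpool status` output captured to a file after each hardware change.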
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss