Thank you, that did the trick.  That's not terribly obvious from the
man page, though.  The man page says it detaches a device from a
mirror, and I had a raidz2.  Since I'm messing with production data,
I wasn't going to chance it based on the man page alone.
You might consider expanding the man page to explain a little more
about what the command actually does, and maybe even what the
circumstances look like where you might use it.
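
For anyone who finds this thread later: in my case (a raidz2 whose failed
disk c8t7d0 had already been spared out to c8t11d0), what did the trick
was simply

# zpool detach tank c8t7d0

which, as I understand it, drops the failed disk and promotes c8t11d0
from hot spare to a permanent member of the raidz2 vdev.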

Actually, an official and easily searchable "What to do when you have
a ZFS disk failure" guide with lots of examples would be great.  There are
a lot of attempts out there, but nothing I've found is comprehensive.
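
For what it's worth, piecing together the steps from this thread, the rest
of my cleanup once the dead drive in slot 7 is physically replaced should
look roughly like this (not yet tested here, and the pool and device names
are of course specific to my setup):

1. Add the new disk back to the pool as the hot spare (assuming I'm reading
the zpool add syntax right).

# zpool add tank spare c8t7d0

2. Clear the old error counts on the pool.

# zpool clear tank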

Jason

On Wed, Oct 14, 2009 at 4:23 PM, Eric Schrock <eric.schr...@sun.com> wrote:
> On 10/14/09 14:17, Cindy Swearingen wrote:
>>
>> Hi Jason,
>>
>> I think you are asking how to tell ZFS that you want to replace the
>> failed disk c8t7d0 with the spare, c8t11d0.
>>
>> I just tried to do this on my Nevada build 124 lab system, simulating a
>> disk failure and using zpool replace to replace the failed disk with
>> the spare. The replace fails because the spare is busy. This has to be a bug.
>
> You need to 'zpool detach' the original (c8t7d0).
>
> - Eric
>
>>
>> Another way to recover, if you have a replacement disk for c8t7d0, is
>> like this:
>>
>> 1. Physically replace c8t7d0.
>>
>> You might have to unconfigure the disk first. It depends
>> on the hardware.
>>
>> 2. Tell ZFS that you replaced it.
>>
>> # zpool replace tank c8t7d0
>>
>> 3. Detach the spare.
>>
>> # zpool detach tank c8t11d0
>>
>> 4. Clear the pool or the device specifically.
>>
>> # zpool clear tank c8t7d0
>>
>> Cindy
>>
>> On 10/14/09 14:44, Jason Frank wrote:
>>>
>>> So, my Areca controller has been complaining via email of read errors for
>>> a couple days on SATA channel 8.  The disk finally gave up last night at
>>> 17:40.  I've got to say, I really appreciate the Areca controller taking such
>>> good care of me.
>>>
>>> For some reason, I wasn't able to log into the server last night or in
>>> the morning, probably because my home dir was on the zpool with the failed
>>> disk (although it's a raidz2, so I don't know why that was a problem).  So,
>>> I went ahead and rebooted it the hard way this morning.
>>>
>>> The reboot went OK, and I was able to get access to my home directory by
>>> waiting about 5 minutes after authenticating.  I checked my zpool, and it
>>> was resilvering.  But, it had only been running for a few minutes.
>>>  Evidently, it didn't start resilvering until I rebooted it.  I would have
>>> expected it to do that when the disk failed last night (I had set up a hot
>>> spare disk already).
>>>
>>> All of the zpool commands were taking minutes to complete while c8t7d0
>>> was UNAVAIL, so I offline'd it.  When I say all, that includes iostat,
>>> status, upgrade, just about anything non-destructive that I could try.  That
>>> was a little odd.  Once I offlined the drive, my resilver restarted, which
>>> surprised me.  After all, I simply changed an UNAVAIL drive to OFFLINE; in
>>> either case, you can't use it for operations.  But no big deal there.  That
>>> fixed the login slowness and the zpool command slowness.
>>>
>>> The resilver completed, and now I'm left with the following zpool config.
>>>  I'm not sure how to get things back to normal though, and I hate to do
>>> something stupid...
>>>
>>> r...@datasrv1:~# zpool status tank
>>>  pool: tank
>>>  state: DEGRADED
>>>  scrub: scrub stopped after 0h10m with 0 errors on Wed Oct 14 15:23:06 2009
>>> config:
>>>
>>>        NAME           STATE     READ WRITE CKSUM
>>>        tank           DEGRADED     0     0     0
>>>          raidz2       DEGRADED     0     0     0
>>>            c8t0d0     ONLINE       0     0     0
>>>            c8t1d0     ONLINE       0     0     0
>>>            c8t2d0     ONLINE       0     0     0
>>>            c8t3d0     ONLINE       0     0     0
>>>            c8t4d0     ONLINE       0     0     0
>>>            c8t5d0     ONLINE       0     0     0
>>>            c8t6d0     ONLINE       0     0     0
>>>            spare      DEGRADED     0     0     0
>>>              c8t7d0   REMOVED      0     0     0
>>>              c8t11d0  ONLINE       0     0     0
>>>            c8t8d0     ONLINE       0     0     0
>>>            c8t9d0     ONLINE       0     0     0
>>>            c8t10d0    ONLINE       0     0     0
>>>        spares
>>>          c8t11d0      INUSE     currently in use
>>>
>>> Since it's not obvious, the spare line had both t7 and t11 indented under
>>> it.
>>> When the resilver completed, I yanked the hard drive on target 7.
>>>
>>> I'm assuming that t11 has the same content as t7, but that's not
>>> necessarily clear from the output above.
>>>
>>> So, now I'm left with the config above.  I can't zpool remove t7,
>>> because it's not a hot spare or a cache disk.  I can't zpool replace t7 with
>>> t11; I'm told that t11 is busy.  And I didn't see any other zpool
>>> subcommands that looked likely to fix the problem.
>>>
>>> Here are my system details:
>>> SunOS datasrv1 5.11 snv_118 i86pc i386 i86xpv Solaris
>>>
>>> This system is currently running ZFS pool version 16.
>>>
>>> Pool 'tank' is already formatted using the current version.
>>>
>>> How do I tell the system that t11 is the replacement for t7, and how do I
>>> then add t7 as the hot spare (after I replace the disk)?
>>>
>>> Thanks
>>
>
>
> --
> Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
>
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
