On Sat, Jun 11, 2011 at 08:26:34PM +0400, Jim Klimov wrote:
> 2011-06-11 19:15, Pasi Kärkkäinen wrote:
>> On Sat, Jun 11, 2011 at 08:35:19AM -0500, Edmund White wrote:
>>>     I've had two incidents where performance tanked suddenly, leaving the VM
>>>     guests and Nexenta SSH/Web consoles inaccessible and requiring a full
>>>     reboot of the array to restore functionality. In both cases, it was the
>>>     Intel X-25M L2ARC SSD that failed or was "offlined". NexentaStor failed
>>>     to alert me on the cache failure; however, the general ZFS FMA alert was
>>>     visible on the (unresponsive) console screen.
>>>
>>>     The "zpool status" output showed:
>>>
>>>   cache
>>>   c6t5001517959467B45d0     FAULTED      2   542     0  too many errors
>>>
>>>     This did not trigger any alerts from within Nexenta.
>>>
>>>     I was under the impression that an L2ARC failure would not impact the
>>>     system. But in this case, it was the culprit. I've never seen any
>>>     recommendations to RAID L2ARC for resiliency. Removing the bad SSD
>>>     entirely from the server got me back running, but I'm concerned about
>>>     the impact of the device failure and the lack of notification from
>>>     NexentaStor.
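
(As an aside: on a responsive box the usual way out of that situation would
be to confirm the fault and drop the dead cache device online, without a
reboot. A rough sketch, assuming the pool is called "tank" and using the
device name from the status output above:

    # list pools that ZFS considers unhealthy
    zpool status -x

    # see whether FMA actually diagnosed the SSD as faulty
    fmadm faulty

    # L2ARC (cache) vdevs can be removed from a live pool
    zpool remove tank c6t5001517959467B45d0

Which of course only helps if the box is still responsive, and here it
wasn't; that's the real problem.)
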
>> IIRC there was recently discussion on this list about a firmware bug
>> on the Intel X25 SSDs causing them to fail under high disk IO with
>> "reset storms".
> Even if so, this does not forgive ZFS hanging - especially
> if it detected the drive failure, and especially if this drive
> is not required for redundant operation.
>
> I've seen similar bad behaviour on my oi_148a box when
> I tested USB flash devices as L2ARC caches and
> occasionally they died by slipping slightly out of the
> USB socket due to vibration or whatever reason ;)
>
> Similarly, this oi_148a box hung upon loss of SATA
> connection to a drive in the raidz2 disk set due to
> unreliable cable connectors, while it should have
> stalled IOs to that pool but otherwise the system
> should have remained responsive (tested
> failmode=continue and failmode=wait on different
> occasions).
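
(For reference, the failmode behaviour mentioned above is a per-pool
property. A rough sketch of checking and changing it, assuming a pool
named "tank":

    # show the current failure-mode policy ("wait" is the default)
    zpool get failmode tank

    # "continue" returns EIO to new writes on catastrophic pool failure
    # instead of blocking I/O until the device comes back
    zpool set failmode=continue tank

Note that failmode is only supposed to kick in when the pool itself fails
beyond its redundancy; a faulted L2ARC device should never get that far,
which is why a hang in that case looks like a bug.)
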
>
> So I can relate - these things happen, they do annoy,
> and I hope they will be fixed sometime soon so that
> ZFS matches its docs and promises ;)
>

True, definitely sounds like a bug in ZFS as well.

-- Pasi
