Re: [zfs-discuss] Impact of L2ARC device failure and SSD recommendations

Richard Elling Sun, 12 Jun 2011 12:54:44 -0700

On Jun 11, 2011, at 6:35 AM, Edmund White wrote:

> Posted in greater detail at Server Fault - 
> http://serverfault.com/q/277966/13325
> 
Replied in greater detail at same.


> I have an HP ProLiant DL380 G7 system running NexentaStor. The server has 
> 36GB RAM, 2 LSI 9211-8i SAS controllers (no SAS expanders), 2 SAS system 
> drives, 12 SAS data drives, a hot-spare disk, an Intel X25-M L2ARC cache and 
> a DDRdrive PCI ZIL accelerator. This system serves NFS to multiple VMware 
> hosts. I also have about 90-100GB of deduplicated data on the array.
> 
> I've had two incidents where performance tanked suddenly, leaving the VM 
> guests and Nexenta SSH/Web consoles inaccessible and requiring a full reboot 
> of the array to restore functionality.
> 
The reboot is your decision; the software will eventually recover.

> In both cases, it was the Intel X25-M L2ARC SSD that failed or was 
> "offlined". NexentaStor failed to alert me on the cache failure; however, the 
> general ZFS FMA alert was visible on the (unresponsive) console screen.
> 
> 

NexentaStor fault triggers run in addition to the existing FMA and syslog 
services.

> The "zpool status" output showed:
> 
> 
> cache
> c6t5001517959467B45d0     FAULTED      2   542     0  too many errors
> 
> This did not trigger any alerts from within Nexenta.
> 
> 

The NexentaStor volume-check runner looks for zpool status error messages. 
Check your configuration for the runner schedule; by default it runs hourly.
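
For illustration only, here is a minimal Python sketch of the kind of check
such a runner might perform. It is not the NexentaStor volume-check
implementation; the function name find_unhealthy_devices and the list of
states are assumptions made for this example. It scans "zpool status" text
for device lines whose state is not healthy, using the cache line quoted
above as sample input.

```python
# Hypothetical sketch, not NexentaStor code: flag unhealthy vdev lines
# in "zpool status" output.

# Device states that zpool status reports for vdevs that are not healthy.
BAD_STATES = {"FAULTED", "DEGRADED", "UNAVAIL", "OFFLINE", "REMOVED"}

def find_unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) tuples for unhealthy devices."""
    hits = []
    for line in status_text.splitlines():
        fields = line.split()
        # A device line has at least: NAME STATE READ WRITE CKSUM
        if len(fields) >= 5 and fields[1] in BAD_STATES:
            hits.append(tuple(fields[:5]))
    return hits

# Sample input copied from the zpool status excerpt in this thread.
sample = """
cache
c6t5001517959467B45d0     FAULTED      2   542     0  too many errors
"""

for name, state, read, write, cksum in find_unhealthy_devices(sample):
    print(f"{name}: {state} (read={read} write={write} cksum={cksum})")
```

A check like this only sees what zpool status reports at the moment it runs,
so with an hourly schedule a fault can go unreported for up to an hour.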


> I was under the impression that an L2ARC failure would not impact the system.
> 
With all due respect, that is a naive assumption. Any component failure can 
impact the system. The worst kinds of failures are those that impact 
performance. In this case, the broken SSD firmware causes very slow responses 
to I/O requests. It does not return an error code that says "I'm broken"; it 
just responds very slowly, perhaps after other parts of the system ask it to 
reset and retry a few times.
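
Because a device in this state returns no error, one way to spot it is to
compare per-device service times and look for an outlier. The Python sketch
below shows the idea under stated assumptions: the function name
slow_devices, the factor and floor thresholds, and the sample numbers are
all made up for illustration. In practice the inputs might come from
something like the asvc_t column of "iostat -xn", which is not parsed here.

```python
# Hypothetical sketch: flag devices whose average service time is far
# above the median of their peers. Thresholds are illustrative guesses.
from statistics import median

def slow_devices(service_ms, factor=10.0, floor_ms=50.0):
    """Return names of devices slower than max(floor_ms, factor * median)."""
    med = median(service_ms.values())
    threshold = max(floor_ms, factor * med)
    return sorted(dev for dev, ms in service_ms.items() if ms > threshold)

# Made-up sample values in milliseconds; only the last device name
# comes from this thread.
samples = {
    "c1t0d0": 8.2,
    "c1t1d0": 7.9,
    "c1t2d0": 9.1,
    "c6t5001517959467B45d0": 2400.0,
}

print(slow_devices(samples))
```

The floor keeps a uniformly fast pool from flagging a device that is merely
a few milliseconds slower than its neighbors.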

> But in this case, it was the culprit. I've never seen any recommendations to 
> RAID L2ARC for resiliency. Removing the bad SSD entirely from the server got 
> me back running, but I'm concerned about the impact of the device failure and 
> the lack of notification from NexentaStor.
> 
> 

We have made some improvements in notification for this type of failure in the 
3.1 release. Why? Because we have seen a large number of these errors from 
various disk and SSD manufacturers recently. You will notice that Nexenta does 
not support these SSDs behind SAS expanders for this very reason. At the end of 
the day, the resolution is to get the device fixed or replaced. Contact your 
hardware provider for details.

> What's the current best-choice SSD for L2ARC cache applications these days? 
> It seems as though the Intel units are no longer well-regarded. 
> 
> 

No device is perfect. Some have better firmware, components, or design than 
others. YMMV.
 -- richard

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
