On 6/15/2010 4:42 AM, Arve Paalsrud wrote:
Hi,

We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
Our hardware specifications are as follows:

Quad-socket AMD G34, 12-core 2.3 GHz (~110 GHz aggregate)
10 Crucial RealSSD (6Gb/s)
42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
LSI SAS2008 (two x4 ports)
Mellanox InfiniBand 40 Gbit NICs
128 GB RAM

This setup gives us about 40TB of storage after mirroring (two disks as 
spares), 2.5TB of L2ARC and 64GB of ZIL, all fitting into a single 5U box.
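
For reference, that capacity figure works out as follows (a rough Python 
sketch, assuming two-way mirrors with the two spares excluded):

    disks, spares, disk_tb = 42, 2, 2
    mirror_vdevs = (disks - spares) // 2    # 20 two-way mirror vdevs
    usable_tb = mirror_vdevs * disk_tb      # 20 * 2TB = 40TB usable
    print(usable_tb)                        # -> 40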

Both L2ARC and ZIL share the same disks (striped) due to bandwidth 
requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k 
read/write scenario with a 70/30 distribution. Now, I know that you should 
have a mirrored ZIL for safety, but the entire box is synchronized with an 
active standby at a different site (18km away - round trip of 0.16ms + 
equipment latency). So in case the ZIL in Site A takes a fall, or the 
motherboard/disk group dies - we still have safety.

DDT requirements for dedupe on 16k blocks should be about 640GB when the main 
pool is full (at capacity).
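
That figure is consistent with roughly 256 bytes per unique block, which is 
a ballpark assumption rather than a spec - a quick Python sketch:

    pool_bytes  = 40 * 10**12    # ~40TB of usable pool
    block_bytes = 16 * 1024      # 16k dedupe block size
    entry_bytes = 256            # assumed per-entry DDT cost (ballpark)
    entries = pool_bytes / block_bytes       # ~2.4e9 unique blocks
    print(entries * entry_bytes / 10**9)     # ~625GB, near the 640GB above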

Without going into details about chipsets and such, do any of you on this list 
have experience with a similar setup? If so, can you share your thoughts, do's 
and don'ts, and any other information that could help while building and 
configuring this?

What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also 
InfiniBand-based), with both dedupe and compression enabled in ZFS.

Let's talk moon landings.

Regards,
Arve


Given that for the ZIL random write IOPS is paramount, the RealSSD isn't a good choice. SLC SSDs still spank any MLC device, and random IOPS for something like an Intel X25-E or OCZ Vertex EX are over twice those of the RealSSD. I don't know where they managed to get the 40k+ IOPS number for the RealSSD (I know it's in the specs, but how did they get it?), but that's not what others are reporting:

http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7

Sadly, none of the current crop of SSDs support a capacitor or battery to back up their local (on-SSD) cache, so they're all subject to data loss on a power interruption.

Likewise, random read dominates L2ARC usage. Here, the most cost-effective solutions tend to be MLC-based SSDs with more moderate IOPS performance - the Intel X25-M and OCZ Vertex series likely offer considerably better price/performance than the RealSSD.


Also, given the limitations of an x4 port connection to the rest of the system, I'd consider using a couple more SAS controllers and fewer expanders. The SSDs together are likely able to overwhelm an x4 PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for them. For the 42 disks, it depends more on what your workload looks like: if it's mostly small or random I/O, you can get away with fewer HBAs, while large, sequential I/O will require more. Remember, a modern 7200RPM SATA drive can pump out well over 100MB/s sequential, but well under 10MB/s random. Do the math to see how quickly 42 of them will overwhelm an x4 PCI-E 2.0 connection, which maxes out at about 2GB/s.
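
Doing that math (a rough Python sketch; the 100MB/s per-disk sequential 
figure is an assumption, not a measurement):

    disks = 42
    seq_mb_per_disk = 100                       # assumed 7200RPM sequential rate
    total_gb = disks * seq_mb_per_disk / 1000   # ~4.2GB/s aggregate
    pcie_x4_gb = 2.0                            # x4 PCI-E 2.0, approx. usable
    print(total_gb / pcie_x4_gb)                # ~2.1x one x4 slot's bandwidth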


I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping isn't really going to buy you much here (so far as I can tell). 6Gb/s SAS is wasted on hard drives, so don't pay for it if you can avoid doing so. Really, I suspect that paying for 6Gb/s SAS isn't worth it at all, as only the read performance of the L2ARC SSDs might possibly exceed 3Gb/s SAS.


I'm going to say something sacrilegious here: 128GB of RAM may be overkill. You have the SSDs for L2ARC - much of which will be the DDT - but, if I'm reading this correctly, even if you switch to the 160GB Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only half is in use by the DDT. The rest is file cache. You'll need lots of RAM if you plan on storing lots of small files in the L2ARC (that is, if your workload is lots of small files): figure about 200 bytes of RAM per L2ARC record.

I.e.

if you have a 1kB average record size, then for 600GB of L2ARC you'll need 600GB / 1kB * 200B = 120GB of RAM.

if you have a more manageable 8kB record size, then 600GB / 8kB * 200B = 15GB.
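
The same rule of thumb as a small Python helper (a sketch; decimal units to 
match the figures above, and the 200B/record cost is the estimate given 
earlier):

    def l2arc_ram_gb(l2arc_gb, record_bytes, header_bytes=200):
        # ~200 bytes of RAM per cached L2ARC record (rule of thumb)
        return l2arc_gb * 10**9 / record_bytes * header_bytes / 10**9

    print(l2arc_ram_gb(600, 1000))    # 120.0 - 1kB records
    print(l2arc_ram_gb(600, 8000))    # 15.0  - 8kB records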


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
