On 6/15/2010 4:42 AM, Arve Paalsrud wrote:
Hi,
We are currently building a storage box based on OpenSolaris/Nexenta using ZFS.
Our hardware specifications are as follows:
Quad AMD G34 12-core 2.3 GHz (~110 GHz aggregate)
10 Crucial RealSSD (6Gb/s)
42 WD RAID Ed. 4 2TB disks + 6Gb/s SAS expanders
LSI2008SAS (two 4x ports)
Mellanox InfiniBand 40 Gbit NICs
128 GB RAM
This setup gives us about 40TB storage after mirroring (two disks as spares), 2.5TB
L2ARC and 64GB ZIL, all in a single 5U box.
Both L2ARC and ZIL share the same SSDs (striped) due to bandwidth
requirements. Each SSD has a theoretical performance of 40-50k IOPS in a 4k
read/write scenario with a 70/30 distribution. Now, I know that you should have
a mirrored ZIL for safety, but the entire box is synchronized with an active
standby at a different site (18km distance - round trip of 0.16ms +
equipment latency). So in case the ZIL in Site A takes a fall, or the
motherboard/disk group dies - we still have safety.
DDT requirements for dedupe on 16k blocks should be about 640GB when the main pool
is full (at capacity).
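That 640GB figure can be reproduced with quick back-of-envelope arithmetic: one DDT entry per unique block, times the number of 16k blocks the pool can hold. The ~256 bytes/entry used here is an assumption for illustration (the actual ZFS DDT entry size varies by version and context):

```python
# Rough DDT size estimate for dedupe on 16k blocks (binary units).
TIB = 2**40
GIB = 2**30
KIB = 2**10

pool_capacity = 40 * TIB   # ~40TB usable pool after mirroring, from the post
block_size = 16 * KIB      # 16k dedup block size
bytes_per_entry = 256      # assumed per-entry DDT cost; the real figure varies

entries = pool_capacity // block_size      # one DDT entry per unique block
ddt_bytes = entries * bytes_per_entry

print(f"DDT entries: {entries:,}")
print(f"DDT size:    {ddt_bytes // GIB} GiB")  # -> 640 GiB
```

This is a worst case that assumes every block is unique; a well-deduplicating workload needs fewer entries.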
Without going into details about chipsets and such, do any of you on this list
have any experience with a similar setup and can share with us your thoughts,
do's and dont's, and any other information that could be of help while building
and configuring this?
What I want to achieve is 2 GB/s+ NFS traffic against our ESX clusters (also
InfiniBand-based), with both dedupe and compression enabled in ZFS.
Let's talk moon landings.
Regards,
Arve
Given that for ZIL, random write IOPS is paramount, the RealSSD isn't a
good choice. SLC SSDs still spank any MLC device, and random IOPS for
something like an Intel X25-E or OCZ Vertex EX are over twice that of
the RealSSD. I don't know where they got the 40k+ IOPS number for
the RealSSD (I know it's in the specs, but how did they measure it?) - it's
not what others are reporting:
http://benchmarkreviews.com/index.php?option=com_content&task=view&id=454&Itemid=60&limit=1&limitstart=7
Sadly, none of the current crop of SSDs support a capacitor or battery
to back up their local (on-SSD) cache, so they're all subject to data
loss on a power interruption.
Likewise, random Read dominates L2ARC usage. Here, the most
cost-effective solutions tend to be MLC-based SSDs with more moderate
IOPS performance - the Intel X25-M and OCZ Vertex series are likely much
more cost-effective than a RealSSD, especially considering
price/performance.
Also, given the limitations of a x4 port connection to the rest of the
system, I'd consider using a couple more SAS controllers, and fewer
Expanders. The SSDs together are likely to be able to overwhelm a x4
PCI-E connection, so I'd want at least one dedicated x4 SAS HBA just for
them. For the 42 disks, it depends more on what your workload looks
like. If it is mostly small or random I/O to the disks, you can get away
with fewer HBAs. Large, sequential I/O to the disks is going to require
more HBAs. Remember, a modern 7200RPM SATA drive can pump out well over
100MB/s sequential, but well under 10MB/s random. Do the math to see
how quickly that overwhelms a x4 PCI-E 2.0 connection, which maxes out
at about 2GB/s.
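Doing that math makes the point concrete. A sketch, using the conservative per-drive sequential figure from above:

```python
# Aggregate sequential throughput of the disk pool vs. one x4 PCIe 2.0 HBA link.
drives = 42
seq_mb_per_drive = 100        # conservative sequential MB/s per 7200RPM SATA drive
pcie_x4_gen2_gb = 2.0         # ~2 GB/s usable on a x4 PCI-E 2.0 link

aggregate_gb = drives * seq_mb_per_drive / 1000
print(f"Aggregate sequential: {aggregate_gb} GB/s")   # -> 4.2 GB/s
print(f"Saturates one HBA link: {aggregate_gb > pcie_x4_gen2_gb}")
```

Even at a modest 100MB/s per drive, 42 drives streaming sequentially deliver more than double what a single x4 link can carry - hence the suggestion of more HBAs for sequential-heavy workloads.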
I'd go with 2 Intel X25-E 32GB models for the ZIL. Mirror them - striping
isn't really going to buy you much here (so far as I can tell). 6Gb/s
SAS is wasted on hard drives, so don't pay for it if you can avoid it.
In fact, I suspect paying for 6Gb/s SAS isn't worth it at all, since
only the read performance of the L2ARC SSDs might possibly exceed
3Gb/s SAS.
I'm going to say something sacrilegious here: 128GB of RAM may be
overkill. You have the SSDs for L2ARC - much of which will be the DDT,
but, if I'm reading this correctly, even if you switch to the 160GB
Intel X25-M, that gives you 8 x 160GB = 1280GB of L2ARC, of which only
half is in use by the DDT. The rest is file cache. You'll need lots of
RAM if you plan on storing lots of small files in the L2ARC (that is, if
your workload is lots of small files). Figure about 200 bytes of RAM
per L2ARC entry.
For example:
with a 1kB average record size, 600GB of L2ARC needs
600GB / 1kB * 200B = 120GB of RAM;
with a more manageable 8kB record size, 600GB / 8kB * 200B =
15GB.
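The same estimate as a small helper, reproducing both figures above (the ~200 bytes/entry is the rule of thumb from the post, not an exact header size):

```python
GB = 10**9

def l2arc_ram_bytes(l2arc_bytes, avg_record_bytes, per_entry=200):
    """RAM needed to index the L2ARC: ~200 bytes per cached record."""
    return l2arc_bytes // avg_record_bytes * per_entry

# 600GB of L2ARC at different average record sizes:
print(l2arc_ram_bytes(600 * GB, 1_000) / GB)   # 1kB records -> 120.0 GB of RAM
print(l2arc_ram_bytes(600 * GB, 8_000) / GB)   # 8kB records -> 15.0 GB of RAM
```

The takeaway: RAM cost scales inversely with record size, which is why a small-file workload can make 128GB of RAM necessary after all.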
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss