I figured the following ZFS 'success story' may interest some readers here.
I was interested to see how much sequential read/write performance it is possible to obtain from ZFS running on commodity hardware with modern features such as PCI Express buses, SATA disks, and well-designed SATA controllers (AHCI, SiI3132/SiI3124). So I ran an experiment: build a fileserver by picking each component to be as cheap as possible while not sacrificing too much performance. I ended up spending $270 on the server itself and $1050 on seven 750GB SATA disks. After installing snv_82, a 7-disk raidz pool on this $1320 box is capable of:

- 220-250 MByte/s sequential write throughput (dd if=/dev/zero of=file bs=1024k)
- 430-440 MByte/s sequential read throughput (dd if=file of=/dev/null bs=1024k)

I did a quick test with a 7-disk striped pool too:

- 330-390 MByte/s seq. writes
- 560-570 MByte/s seq. reads

(What's really interesting here is that the bottleneck is the platter speed of the slowest disk at 81 MB/s: 81*7 = 567, so ZFS truly "runs at platter speed", as advertised. Wow.)

I used disks with 250GB platters (Samsung HD753LJ; they have even higher-density 640GB and 1TB models with 334GB platters, but those are respectively impossible to find or too expensive). I put 4 disks on the motherboard's integrated AHCI controller (SB600 chipset), 2 disks on a 2-port $20 PCI-E 1x SiI3132 controller, and the 7th disk on a $65 4-port PCI-X SiI3124 controller that I scavenged from another server (it sits in a plain PCI slot, what a waste for a PCI-X card...). The rest is also dirt cheap: a $65 Asus M2A-VM motherboard, a $60 dual-core Athlon 64 X2 4000+, 1GB of DDR2-800, and a 400W PSU.
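For anyone who wants to reproduce the measurement, the dd tests above can be wrapped in a tiny script. The file path and size here are assumptions (the real runs used a file on the pool and much more data); note that GNU dd prints a throughput summary on completion, while older Solaris dd needs time(1) plus a little division.

```shell
#!/bin/sh
# Minimal sketch of the sequential-throughput test described above.
# Assumption: point F at a file on the ZFS pool (e.g. /tank/seqtest.bin)
# and raise COUNT well past RAM size to defeat caching.
F=/tmp/seqtest.bin
COUNT=256   # 256 MiB with bs=1024k

# Sequential write (GNU dd prints MB/s on its last line)
dd if=/dev/zero of="$F" bs=1024k count="$COUNT" 2>&1 | tail -1

# Sequential read back
dd if="$F" of=/dev/null bs=1024k 2>&1 | tail -1

rm -f "$F"
```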
When testing the read throughput of individual disks with dd (roughly 81 to 97 MB/s at the beginning of the platter; I don't know why it varies so much between different units of the same model, perhaps additional seeks caused by reallocated sectors), I found out that an important factor influencing the maximum bandwidth of a PCI Express device such as the SiI3132 is the Max_Payload_Size setting. It can be set from 128 to 4096 bytes by writing bits 7:5 of the Device Control Register (offset 08h in the PCI Express Capability Structure, which starts at offset 70h on the SiI3132):

  $ /usr/X11/bin/pcitweak -r 2:0 -h 0x78 # read the register
  0x2007

Bits 7:5 of 0x2007 are 000, which indicates a 128-byte max payload size (000=128B, 001=256B, ..., 101=4096B; 110 and 111 are reserved). All OSes and drivers seem to leave it at this default value of 128 bytes. However, in my tests this payload size only allowed a practical unidirectional bandwidth of about 147 MB/s (59% of the theoretical 250 MB/s peak of PCI-E 1x). I changed it to 256 bytes:

  $ /usr/X11/bin/pcitweak -w 2:0 -h 0x78 0x2027

This increased the bandwidth to 175 MB/s. Better. At 512 bytes or above, something strange happens: the bandwidth drops ridiculously, to below 5 or 50 MB/s depending on which PCI-E slot I use... Weird, I have no idea why. Anyway, 175 MB/s or even 147 MB/s is good enough for this 2-port SATA controller, because the I/O bandwidth consumed by ZFS in my case never exceeds 62-63 MB/s per disk.

I wanted to share this Max_Payload_Size tidbit here because I couldn't find any mention on the Net of anybody manually tuning this parameter. So in case some of you wonder why PCI-E devices seem limited to 60% of their theoretical peak bandwidth, now you know why. Speaking of bottlenecks, my SiI3124 tops out at 87 MB/s per SATA port.
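The value written back (0x2027) can be derived mechanically from the value read (0x2007): clear bits 7:5, then OR in the encoding of the desired payload size (001 = 256 bytes). A small shell sketch of that bit arithmetic:

```shell
#!/bin/sh
# Compute the Device Control Register value for a given Max_Payload_Size.
# Bits 7:5 hold the encoding: 000=128B, 001=256B, 010=512B, ... 101=4096B.
OLD=0x2007   # value read with pcitweak above
ENC=1        # 001 = 256-byte max payload

NEW=$(( (OLD & ~0xE0) | (ENC << 5) ))
printf '0x%04x\n' "$NEW"   # prints 0x2027, the value written back above
```

This is just the arithmetic; the actual write still goes through pcitweak (or your platform's equivalent), and you should only pick encodings the device advertises as supported.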
Back on the main topic, here are some system stats during 430-440 MB/s sequential reads from the ZFS raidz pool with dd (c0 is the AHCI controller, c1 the SiI3124, c2 the SiI3132).

"zpool iostat -v 2":

                   capacity     operations    bandwidth
  pool           used  avail   read  write   read  write
  ------------  -----  -----  -----  -----  -----  -----
  tank          2.54T  2.17T  3.38K      0   433M      0
    raidz1      2.54T  2.17T  3.38K      0   433M      0
      c0t0d0s7      -      -  1.02K      0  61.9M      0
      c0t1d0s7      -      -  1.02K      0  61.9M      0
      c0t2d0s7      -      -  1.02K      0  62.0M      0
      c0t3d0s7      -      -  1.02K      0  62.0M      0
      c1t0d0s7      -      -  1.01K      0  61.9M      0
      c2t0d0s7      -      -  1.02K      0  62.0M      0
      c2t1d0s7      -      -  1.02K      0  61.9M      0
  ------------  -----  -----  -----  -----  -----  -----

"iostat -Mnx 2":

                       extended device statistics
      r/s    w/s   Mr/s   Mw/s  wait  actv wsvc_t asvc_t  %w  %b device
   1044.8    0.5   61.7    0.0   0.1  14.3    0.1   13.6   4  81 c0t0d0
   1043.3    0.0   61.7    0.0   0.1  15.4    0.1   14.7   5  84 c0t1d0
   1043.3    0.0   61.7    0.0   0.1  14.7    0.1   14.1   5  82 c0t2d0
   1044.8    0.0   61.8    0.0   0.1  13.0    0.1   12.5   4  76 c0t3d0
   1042.3    0.0   61.7    0.0  13.9   0.8   13.3    0.8  83  83 c1t0d0
   1041.8    0.0   61.7    0.0  11.5   0.7   11.1    0.7  73  73 c2t0d0
   1041.8    0.0   61.7    0.0  12.8   0.8   12.3    0.8  79  79 c2t1d0

(actv is less than 1 on c1 and c2 because the si3124 driver does not support NCQ. I wonder what's preventing ZFS from keeping the disks busy more than ~80% of the time. Maybe the CPU, see below.)

"mpstat 2":

  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    0    0   0    0  7275 7075  691   99  146  354    0   321    0  89   0  10
    1    0   0    0   221    2  844   43  150  244    0   241    0  88   0  11

(The dual-core CPU is almost(?) a bottleneck at ~90% utilization on both cores. For this reason I doubt such a read throughput could have been reached pre-snv_79, because checksum verification/calculation only became multithreaded in snv_79 and after.)

-marc
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
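As a sanity check, the pool-level figure should be the sum of the per-disk figures, and the striped pool's ceiling should be the slowest platter speed times the disk count. A quick sketch of that arithmetic, using the numbers from the stats above:

```shell
#!/bin/sh
# Cross-check the iostat figures: 7 disks at ~61.9 MB/s each should add
# up to the ~433 MB/s the pool reports, and the striped-pool ceiling is
# the slowest disk's platter speed (81 MB/s) times 7 disks.
DISKS=7
PER_DISK_X10=619          # 61.9 MB/s, scaled by 10 for integer math
PLATTER=81                # MB/s, slowest disk

echo "raidz aggregate: $(( DISKS * PER_DISK_X10 / 10 )) MB/s"   # ~433
echo "stripe ceiling:  $(( DISKS * PLATTER )) MB/s"             # 567
```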