I figured the following ZFS 'success story' may interest some readers here.

I was interested to see how much sequential read/write performance it would be 
possible to obtain from ZFS running on commodity hardware with modern features 
such as PCI-E buses, SATA disks, and well-designed SATA controllers (AHCI, 
SiI3132/SiI3124). So I ran an experiment: build a fileserver while picking 
each component to be as cheap as possible without sacrificing too much 
performance.

I ended up spending $270 on the server itself and $1050 on seven 750GB SATA 
disks. With snv_82 installed, a 7-disk raidz pool on this $1320 box is 
capable of:

- 220-250 MByte/s sequential write throughput (dd if=/dev/zero of=file 
bs=1024k)
- 430-440 MByte/s sequential read throughput (dd if=file of=/dev/null 
bs=1024k)

I did a quick test with a 7-disk striped pool too:

- 330-390 MByte/s seq. writes
- 560-570 MByte/s seq. reads (what's really interesting here is that the 
bottleneck is the platter speed of the slowest disk, 81 MB/s: 81*7 = 567, so 
ZFS truly "runs at platter speed", as advertised -- wow)

I used disks with 250GB platters (Samsung HD753LJ; Samsung also sells 
higher-density 640GB and 1TB models with 334GB platters, but those are 
respectively impossible to find and too expensive). I put 4 disks on the 
motherboard's integrated AHCI controller (SB600 chipset), 2 disks on a 2-port 
$20 PCI-E x1 SiI3132 controller, and the 7th disk on a $65 4-port PCI-X 
SiI3124 controller that I scavenged from another server (it sits in a plain 
PCI slot -- what a waste for a PCI-X card...). The rest is also dirt cheap: a 
$65 Asus M2A-VM motherboard, a $60 dual-core Athlon 64 X2 4000+, 1GB of DDR2 
800, and a 400W PSU.

When testing the read throughput of individual disks with dd (roughly 81 to 
97 MB/s at the beginning of the platter -- I don't know why it varies so much 
between different units of the same model; perhaps additional seeks caused by 
reallocated sectors), I found out that an important factor influencing the 
maximum bandwidth of a PCI Express device such as the SiI3132 is the 
Max_Payload_Size setting. It can be set from 128 to 4096 bytes by writing to 
bits 7:5 of the Device Control Register, at offset 08h in the PCI Express 
Capability Structure; the structure starts at offset 70h on the SiI3132, so 
the register lives at offset 78h:

  $ /usr/X11/bin/pcitweak -r 2:0 -h 0x78         # read the register
  0x2007

Bits 7:5 of 0x2007 are 000, which indicates a 128-byte max payload size 
(000=128B, 001=256B, 010=512B, 011=1024B, 100=2048B, 101=4096B; 110 and 111 
are reserved). All OSes and drivers seem to leave it at this default value of 
128 bytes. However, in my tests this payload size only allowed a practical 
unidirectional bandwidth of about 147 MB/s (59% of the 250 MB/s theoretical 
peak of PCI-E x1). I changed it to 256 bytes by setting bits 7:5 to 001, 
which turns 0x2007 into 0x2027:

  $ /usr/X11/bin/pcitweak -w 2:0 -h 0x78 0x2027

This increased the bandwidth to 175 MB/s. Better. At 512 bytes or above, 
something strange happens: the bandwidth drops ridiculously, to below 5 or 50 
MB/s depending on which PCI-E slot I use... Weird -- I have no idea why. 
Anyway, 175 MB/s or even 147 MB/s is good enough for this 2-port SATA 
controller, because the I/O bandwidth consumed by ZFS in my case never 
exceeds 62-63 MB/s per disk.
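
To save others the bit-twiddling, here is how the read-modify-write above 
could be scripted. This is only a sketch: it assumes a POSIX shell with 
C-style arithmetic and that pcitweak prints just the register value; the 
device address (2:0) and register offset (0x78) are specific to my SiI3132.

  #!/bin/sh
  # Sketch: bump Max_Payload_Size to 256 bytes on a PCI-E device.
  # DEV and REG are what they are on *my* SiI3132 -- adjust for your
  # card. Use at your own risk, and expect to reapply after a reboot.
  DEV=2:0               # bus:device of the SiI3132
  REG=0x78              # Device Control Register (cap base 70h + 08h)
  val=$(/usr/X11/bin/pcitweak -r $DEV -h $REG)    # e.g. 0x2007
  cur=$(( (val >> 5) & 7 ))                       # extract bits 7:5
  echo "current max payload size: $(( 128 << cur )) bytes"
  new=$(( (val & ~(7 << 5)) | (1 << 5) ))         # 001 = 256 bytes
  /usr/X11/bin/pcitweak -w $DEV -h $REG "$(printf '0x%x' $new)"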

I wanted to share this Max_Payload_Size tidbit here because I couldn't find 
any mention on the Net of anybody manually tuning this parameter. So in case 
some of you have wondered why a PCI-E device seemed limited to about 60% of 
its theoretical peak bandwidth, now you know why.

Speaking of bottlenecks: my SiI3124 is limited to 87 MB/s per SATA port.
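
(These per-disk and per-port numbers come from reading the raw devices 
directly with dd, along these lines -- the device path is just an example:)

  $ # read 1 GB from the beginning of the disk through the raw device
  $ dd if=/dev/rdsk/c1t0d0p0 of=/dev/null bs=1024k count=1024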

Back on the main topic, here are some system stats during 430-440 MB/s 
sequential reads from the ZFS raidz pool with dd (c0 is the AHCI controller, 
c1 = SiI3124, c2 = SiI3132).

"zpool iostat -v 2"
                 capacity     operations    bandwidth
pool           used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
tank          2.54T  2.17T  3.38K      0   433M      0
  raidz1      2.54T  2.17T  3.38K      0   433M      0
    c0t0d0s7      -      -  1.02K      0  61.9M      0
    c0t1d0s7      -      -  1.02K      0  61.9M      0
    c0t2d0s7      -      -  1.02K      0  62.0M      0
    c0t3d0s7      -      -  1.02K      0  62.0M      0
    c1t0d0s7      -      -  1.01K      0  61.9M      0
    c2t0d0s7      -      -  1.02K      0  62.0M      0
    c2t1d0s7      -      -  1.02K      0  61.9M      0
------------  -----  -----  -----  -----  -----  -----

"iostat -Mnx 2"
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
 1044.8    0.5   61.7    0.0  0.1 14.3    0.1   13.6   4  81 c0t0d0
 1043.3    0.0   61.7    0.0  0.1 15.4    0.1   14.7   5  84 c0t1d0
 1043.3    0.0   61.7    0.0  0.1 14.7    0.1   14.1   5  82 c0t2d0
 1044.8    0.0   61.8    0.0  0.1 13.0    0.1   12.5   4  76 c0t3d0
 1042.3    0.0   61.7    0.0 13.9  0.8   13.3    0.8  83  83 c1t0d0
 1041.8    0.0   61.7    0.0 11.5  0.7   11.1    0.7  73  73 c2t0d0
 1041.8    0.0   61.7    0.0 12.8  0.8   12.3    0.8  79  79 c2t1d0
(actv is less than 1 on c1 & c2 because the si3124 driver does not support 
NCQ. I wonder what's preventing ZFS from keeping the disks busy more than 
~80% of the time. Maybe the CPU; see below.)

"mpstat 2"
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0  7275 7075  691   99  146  354    0   321    0  89   0  10
  1    0   0    0   221    2  844   43  150  244    0   241    0  88   0  11
(The dual-core CPU is almost(?) a bottleneck at ~90% utilization on both 
cores. For this reason I doubt such a read throughput could have been reached 
pre-snv_79: checksum verification/calculation only became multithreaded as of 
snv_79.)

-marc
