comments below...

On Mar 21, 2012, at 10:40 AM, Marion Hakanson wrote:
> p...@kraus-haus.org said:
>> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless.
>> Achieving 500 MB/sec. with 8KB files and lots of random accesses is
>> really hard, even with 20 HDDs. Achieving 500 MB/sec. of sequential
>> streaming of 100MB+ files is much easier.
>> . . .
>> For ZFS, performance is proportional to the number of vdevs NOT the
>> number of drives or the number of drives per vdev. See
>> https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc
>> for some testing I did a while back. I did not test sequential read
>> as that is not part of our workload.

Actually, few people have sequential workloads. Many think they do, but
I say prove it with iopattern.

>> . . .
>> I understand why the read performance scales with the number of
>> vdevs, but I have never really understood _why_ it does not also
>> scale with the number of drives in each vdev. When I did my testing
>> with 40 drives, I expected similar READ performance regardless of the
>> layout, but that was NOT the case.
>
> In your first paragraph you make the important point that
> "performance" is too ambiguous in this discussion. Yet in the 2nd &
> 3rd paragraphs above, you go back to using "performance" in its
> ambiguous form. I assume that by "performance" you are mostly
> focussing on random-read performance....
>
> My experience is that sequential read performance _does_ scale with
> the number of drives in each vdev. Both sequential and random write
> performance also scales in this manner (note that ZFS tends to save up
> small, random writes and flush them out in a sequential batch).

Yes. I wrote a small, random read performance model that considers the
various caches. It is described here:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf
The spreadsheet shown in figure 3 is available for the asking (and it
works on your iPhone or iPad :-)

> Small, random read performance does not scale with the number of
> drives in each raidz[123] vdev because of the dynamic striping. In
> order to read a single logical block, ZFS has to read all the segments
> of that logical block, which have been spread out across multiple
> drives, in order to validate the checksum before returning that
> logical block to the application. This is why a single vdev's
> random-read performance is equivalent to the random-read performance
> of a single drive.

It is not as bad as that. The actual worst case number for a HDD with
zfs_vdev_max_pending of one is:

    average IOPS * ((D+P) / D)

where,
    D = number of data disks
    P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
    total disks per set = D + P

We did many studies that verified this. More recent studies show
zfs_vdev_max_pending has a huge impact on the average latency of HDDs,
which I also described in my talk at OpenStorage Summit last fall.
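To make that arithmetic concrete, here is a rough back-of-envelope
sketch of the worst-case expression above. The ~150 IOPS per 7200rpm
HDD and the example layouts are my own assumptions for illustration,
not measurements:

#!/usr/bin/env python3
# Back-of-envelope for the worst-case small, random read numbers above.
# Assumes zfs_vdev_max_pending = 1 and no cache hits.

DISK_IOPS = 150.0          # assumed average random IOPS of one HDD

def raidz_vdev_read_iops(d, p, disk_iops=DISK_IOPS):
    # worst-case small, random read IOPS of one raidz vdev:
    #   average IOPS * ((D + P) / D)
    return disk_iops * (d + p) / d

def pool_read_iops(n_vdevs, d, p, disk_iops=DISK_IOPS):
    # random reads scale with the number of top-level vdevs
    return n_vdevs * raidz_vdev_read_iops(d, p, disk_iops)

if __name__ == "__main__":
    # three ways to lay out the same 40 drives
    layouts = [("4 x raidz2 (8+2)",  4,  8, 2),
               ("2 x raidz2 (18+2)", 2, 18, 2),
               ("1 x raidz2 (38+2)", 1, 38, 2)]
    for name, n, d, p in layouts:
        print("%-20s ~%4.0f random-read IOPS"
              % (name, pool_read_iops(n, d, p)))

Same 40 spindles, roughly a 5x spread in small, random read throughput,
which is why the number of top-level vdevs matters far more than the
raw drive count for that workload.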
> p...@kraus-haus.org said:
>> The recommendation is to not go over 8 or so drives per vdev, but
>> that is a performance issue NOT a reliability one. I have also not
>> been able to duplicate others' observations that 2^N drives per vdev
>> is a magic number (4, 8, 16, etc). As you can see from the above,
>> even a 40 drive vdev works and is reliable, just (relatively) slow :-)

Paul, I have a considerable amount of data that refutes your findings.
Can we agree that YMMV, and varies dramatically depending on your
workload?

> Again, the "performance issue" you describe above is for the
> random-read case, not sequential. If you rarely experience
> small-random-read workloads, then raidz* will perform just fine. We
> often see 2000 MBytes/sec sequential read (and write) performance on a
> raidz3 pool consisting of three 12-disk vdevs (using 2TB drives).

Yes, this is relatively easy to see. I've seen 6 GBytes/sec for large
configs, but that begins to push the system boundaries in many ways.

> However, when a disk fails and must be resilvered, that's when you
> will run into the slow performance of the small, random read workload.
> This is why I use raidz2 or raidz3 on vdevs consisting of more than
> 6-7 drives, especially of the 1TB+ size. That way if it takes 200
> hours to resilver, you've still got a lot of redundancy in place.
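For anyone who wants to sanity-check those figures, here is a rough
sketch. The per-disk streaming rate, pool fill level, average block
sizes, and rebuild IOPS are assumptions I picked for illustration; only
the three 12-disk raidz3 vdevs and the 2TB drives come from Marion's
description:

#!/usr/bin/env python3
# Two quick sanity checks on the figures above.

TB = 1e12

def streaming_mb_s(n_vdevs, data_disks_per_vdev, per_disk_mb_s=80.0):
    # sequential bandwidth scales with the number of data disks, since
    # a full-stripe read or write touches all of them
    return n_vdevs * data_disks_per_vdev * per_disk_mb_s

def resilver_hours(allocated_bytes, avg_block_bytes, rebuild_iops):
    # if resilver is bounded by small, random reads (fragmented or busy
    # pool), time is roughly blocks-to-rebuild / blocks-repaired-per-sec
    blocks = allocated_bytes / avg_block_bytes
    return blocks / rebuild_iops / 3600.0

if __name__ == "__main__":
    # 3 vdevs of 12-disk raidz3 -> 9 data disks per vdev
    print("streaming estimate: ~%.0f MBytes/sec" % streaming_mb_s(3, 9))

    # resilvering a mostly full 2TB drive, bounded by random I/O
    for blk_kb in (8, 32, 128):
        for iops in (100, 300):
            h = resilver_hours(1.5 * TB, blk_kb * 1024, iops)
            print("avg block %3d KB at %3d IOPS -> ~%4.0f hours"
                  % (blk_kb, iops, h))

The streaming estimate lands near the 2000 MBytes/sec Marion reports,
while the resilver estimates show how the rebuild stretches into
hundreds of hours once the average block size is small -- hence raidz2
or raidz3 on wide vdevs.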
> Regards,
> Marion

-- 
DTrace Conference, April 3, 2012,
http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422