The imminent demise of rotating media storage has been predicted for many years now, and here is a device that aims to help turn that fantasy into fact:

   http://www.violin-memory.com/products/violin1010.html

Well. At least if you have the money, and are not bothered by the fact that the big box of RAM in the picture (2U) only stores the same amount of data as a typical $100 SATA disk. But if you are the proud owner of a disk-bound application that cannot be scaled any other way, a device like this is going to be interesting to you right now. Maybe in ten years' time everybody will have something like this in their laptop.

The characteristics of a solid state disk such as this one are very interesting, and needless to say, somewhat different from rotating media. Here is the big one:

   Seek time = 0

And this one is not so bad:

   Transfer latency = a handful of microseconds

And to round it out:

   Write bandwidth = about 1,000 MB/sec
   Read bandwidth = about 1,500 MB/sec

Violin gets these numbers because their device is essentially a huge ramdisk connected to a host server by a PCI-e 8x external bus. Well, this is not just a box of DRAM, it is a box of RAIDed (RAID 3!) hot-swappable DRAM, and it runs Linux as a high level supervisor. So to put it mildly, this is an interesting piece of equipment.

We had the good fortune to be able to run some preliminary benchmarks on one of the first of these machines off the production line. As the Zumastor team (http://zumastor.org), naturally we were interested in the performance effect on our NAS application, which consists of knfsd running on top of a ddsnap virtual block device.

Using a run of the mill rackmount server connected to the Violin box, we found Ext3 capable of roughly 500 MB/sec write speed and 650 MB/sec read, for large linear transfers. So some bandwidth got lost somewhere, but unfortunately we did not have time to go hunting and see where it went. Still, hundreds of megabytes per second of read and write bandwidth are hardly something to sneeze at.

We went on to some higher level tests.
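To put a number on the bandwidth that "got lost somewhere", here is a small sketch that compares the rough Ext3 figures with the approximate device figures quoted above. The dictionary and function names are made up for illustration; the inputs are only the round numbers from this post, so the output is no more precise than they are.

```python
# Round figures quoted in this post, in MB/sec.
# DEVICE_MB_S: approximate bandwidth of the Violin device itself.
# EXT3_MB_S: rough Ext3 results for large linear transfers.
DEVICE_MB_S = {"write": 1000, "read": 1500}
EXT3_MB_S = {"write": 500, "read": 650}


def fraction_achieved(measured_mb_s, device_mb_s):
    """Return the share of the device bandwidth that the filesystem reached."""
    return measured_mb_s / device_mb_s


def summarize():
    """Map each operation to the fraction of device bandwidth Ext3 reached."""
    return {
        op: fraction_achieved(EXT3_MB_S[op], DEVICE_MB_S[op])
        for op in DEVICE_MB_S
    }


if __name__ == "__main__":
    for op, frac in summarize().items():
        print(f"{op}: {frac:.0%} of device bandwidth reached, {1 - frac:.0%} lost")
```

By these rough figures, Ext3 reached about half of the device's write bandwidth and somewhat less than half of its read bandwidth.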
The thing that interested us most was the bottom line effect on NFS serving performance, with and without volume snapshotting. We planned to compare this to the same machine serving data off a single SATA disk. It would have been nice to compare to, say, a 5 disk array as well, but unfortunately such a configuration could not be set up in the time available. Later.

Ddsnap is a seek intensive storage application because it has to maintain a fairly complex data structure on disk, which may have to be updated on every write. Those updates have to be durable, so add in the cost of a journal or moral equivalent. This creates a scary amount of seek activity under heavy write loads. So what happens on a solid state disk? Obviously, things improve a lot.

Now a word about how one measures NFS performance. Total throughput with lots of simulated clients? One would think. But actually, the fashion is to measure transaction latency for some given number of transactions per second. Which is logical when you think about it: total throughput may well increase when you throw more traffic at a server, but what use is that if latency goes through the roof? To know more about the esoterica of NFS benchmarking, see here:

   http://www.spec.org/sfs97r1/docs/sfs-3.0v11.html

The fstress test we use is an open source effort kindly contributed by Jeff Chase and Darrell Anderson of Duke University. Thank you very much, Jeff and Darrell.

   http://www.cs.duke.edu/ari/fstress/

I cannot attest to any particular relationship between Spec SFS and fstress results. So please do not compare our fstress numbers to commercially published Spec SFS results. Though they attempt to measure much the same thing, the algorithms are not precisely the same, and that might cause results of fstress to be quite different from Spec SFS on the same hardware. So, did I say, please do not compare these results to commercially published Spec SFS results? Thank you :-)

We ran three fstress tests:

   1. NFS served directly from an Ext3 volume
   2. NFS served from a ddsnap virtual volume with no snapshots
   3. NFS served from a ddsnap virtual volume holding one snapshot

Server hardware:

   HP DL-385 with 8GB of RAM
   2 x dual-core Opteron 2220
   Single SAS disk
   Violin 1010 connected via PCI-e 8x
   Chelsio 10GigE directly connected to client

Client hardware:

   Dell Precision 380 with 10GigE

Test results using the Violin SSD device:

   http://zumastor.org/graphs/fstress.violin.jpg

We see that at 20,000 NFS operations per second, latency is only 6 milliseconds per NFS operation on raw Ext3 and 9 milliseconds on a snapshotted virtual volume. Unfortunately, we were unable to test higher transaction rates this time because of instability with the 10GigE network connection that could not be tracked down in the time that we had. Until we have a chance to perform further tests, we can only guess how high the performance scales before latency goes vertical.

For comparison, we ran the same tests using a single SATA disk in place of the Violin SSD:

   http://zumastor.org/graphs/fstress.sata.jpg

Here, network interface instability cut short our fstress runs, so we only got a few data points for snapshotted volumes. However, as far as it goes, the relationship between raw, virtual and snapshotted latency looks similar to the results on the Violin SSD. The raw results were obtained up to 2,000 operations per second, and there we already see 160 ms latency. That is more than 20 times the latency at one tenth the operations per second. So roughly speaking, the Violin SSD versus the SATA disk speeds the whole system up by a factor of 200. Pretty cool, hmm?

We learned a lot from these tests. The first and most important news from our point of view is that snapshotting via ddsnap does not have a particularly horrible effect on NFS serving performance, either at the low end or the high end of the hardware performance spectrum.
This was a big relief for us, because we always worried that the copy-on-write strategy adopted for ddsnap could exhibit rather large write performance artifacts, over 10 times worse performance in some cases than a raw disk. In practice, the awful write performance of NFS in general hides a lot of that write slowness, disk cache hides some more, and the relatively low proportion of writes in the fstress algorithm hides yet more. The gap between snapshotted and unsnapshotted performance seems to get wider on the SSD if anything, a counterintuitive result that is possibly explained by data bandwidth considerations as discussed below.

The next thing we learned is that a solid state disk makes NFS serving go an awful lot faster, other things being equal. The news here is that all of the following turned out to scale very well: Ext3, the block layer, knfsd, networking, device mapper and ddsnap.

Personally, I was quite surprised at how well knfsd scales. It seems to me that the popular wisdom has always been that our knfsd is something of a turtle, but when I went to read the code to find out why, I just could not find any glaring reason why that should be so. And lo, we now see that it just ain't so. Big kudos to all those who have worked on the code over the years to turn it into a great performer. Very thrifty with CPU too.

A third thing we learned is that we can run at dizzying speeds under stupefying load for as long as we had time to test, without deadlocking. This was only possible due to our fixing instabilities in the core kernel, which ties into another thread we have seen here on lkml recently.

Incidentally, we ran our tests with 128 knfsd threads. The default of 8 threads produces miserable performance on the SSD, which gave us a good scare on our initial test run. It would be very nice to implement an algorithm to scale the knfsd thread pool automatically, in order to eliminate this class of thing that can go wrong. If somebody became inspired to take on that little project that would be great, otherwise it is in our pipeline for, hmm, Christmas delivery. (Exactly which Christmas is left unspecified.)

Now, it is really nice that just throwing an SSD at our software makes it run really fast, but naturally we want our software to go even faster. Up till now, filesystem (and fancy virtual block device) optimization has been mainly about reducing seeks, because seeking is the big performance killer. Seek reduction is irrelevant with an SSD, because there is no seek time. That brings the second biggest eater of performance to the top of the list: total data transferred per operation.

So to get even more performance out of the SSD, we must cut down the total data transferred. In the case of ddsnap, that is not very hard, mainly because my initial design was pretty lazy about writing out lots of metadata blocks on each snapshot update. I knew those writes would mostly go to nearby places, thus not adding a lot of extra seeks. Now, with an eye to making SSD work better, we intend to amortize some of that traffic using a logical journaling strategy. This and other ideas waiting in the wings should cut metadata traffic by half or more.

So what does the arrival of SSD mean for filesystem development in general? Fortunately, reducing total data transfers is also a good thing for rotating media, so long as the reduction is not obtained by adding more seeks. The flip side is that reducing total data transfers suddenly becomes a lot more important.

Hence, the Zumastor team will concentrate on optimizing ddsnap for both SSD and rotating media. For pragmatic reasons, most optimization work will continue to be directed at the latter, but experience with this hardware has certainly changed our thinking about where we are headed in the long run.

Warm thanks to all who read this far and thanks to Violin Memory for providing us access to this very interesting hardware.

Daniel