> I would love to understand how people got to this conclusion however and try
> to find out why we seem to see differences!
I won't make any claims about Cassandra because I have never bothered benchmarking the difference in CPU usage, since all my use-cases have been more focused on I/O efficiency. But I will say, without having benchmarked that either, that *generally*, if you're doing small reads of data that is in page cache using mmap(), something would have to be seriously wrong for that not to be significantly faster than regular I/O. There's just *no way* there is no performance penalty involved in making the context switch to kernel space, validating syscall parameters, etc. (not to mention the indirect effects on e.g. process scheduling) compared to simply *touching some virtual memory*. It's easy to benchmark the maximum number of syscalls you can do per second, and I'll eat my left foot if you're able to do more of that than touching a piece of memory ;) (A rough sketch of the kind of micro-benchmark I mean is at the end of this mail.)

Obviously this does *not* mean that mmap()ed I/O will actually be faster in any particular application. But I do want to make the point that the idea that mmap()ed I/O is good for performance (in terms of CPU) is definitely not arbitrary and unfounded.

Now, HERE is the kicker: with all the hoopla over mmap()ed I/O and the benchmarks you see, as usual there are lies, damned lies and benchmarks. It's pretty easy to come up with I/O patterns where mmap() will be significantly slower than regular I/O (certainly on platters, and I'm guessing even with modern SSDs) because the method used to communicate with the operating system (touching a page of memory) is vastly different.

In the most obvious and simple case, consider an application that needs to read exactly 50 MB of data, and knows it. Suppose the data is not in page cache. Submitting a read() of exactly those 50 MB clearly has at least the potential to be significantly more efficient (assuming nothing is outright wrong) than touching pages in a sequential fashion, (1) taking multiple, potentially quite a few, page faults in the kernel, and (2) being reliant on read-ahead/pre-fetching, which will never have enough knowledge to predict your 50 MB read. You will invariably take more seeks (at least potentially, with concurrent I/O) and probably read more than necessary (since pre-fetching algorithms won't know when you'll be "done") than if you simply state to the kernel your exact intent of reading exactly 50*1024*1024 bytes at a particular position in a file/device.

To some extent issues like these may affect Cassandra, but it's difficult to measure. For example, if you're I/O bound and doing a lot of range slices that are bigger than a single page, perhaps the default 64 KB read size with standard I/O is eliminating unnecessary seeks that you would otherwise take when doing I/O by paging? It's a hypothesis that is certainly plausible under some circumstances, but difficult to validate or falsify. One can probably construct a benchmark where there's no difference, yet see a significant difference in a real-world scenario where your benchmarked I/O is intermixed with other I/O. Not to mention subtle differences in the behavior of kernels, RAID controllers, disk drive controllers, etc...

--
/ Peter Schuller (@scode on twitter)
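
Here is the minimal sketch of the micro-benchmark I was referring to: small page-cache-hot pread() calls versus touching the same data through an mmap() mapping. The chunk size and iteration count are arbitrary illustrative choices, and it assumes a Linux-like system plus a test file (passed as argv[1]) that is larger than a page and already resident in page cache. The point is only the shape of the comparison: one syscall per small read versus a plain memory access once the mapping is warm.

/*
 * Sketch: compare pread() of small page-cache-hot chunks against
 * touching the same data via an mmap()ed mapping.
 * Assumes POSIX/Linux and a test file already resident in page cache.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const size_t chunk = 4096;        /* one page per "small read" (arbitrary) */
    const long iterations = 1000000;  /* arbitrary iteration count */
    if (st.st_size <= (off_t)chunk) {
        fprintf(stderr, "file too small\n");
        return 1;
    }
    size_t span = (size_t)st.st_size - chunk;
    char buf[4096];
    volatile char sink = 0;

    /* Regular I/O: one syscall per small read of cached data. */
    double t0 = now_sec();
    for (long i = 0; i < iterations; i++) {
        off_t off = (off_t)(((size_t)i * chunk) % span);
        if (pread(fd, buf, chunk, off) != (ssize_t)chunk) { perror("pread"); return 1; }
        sink ^= buf[0];
    }
    double t1 = now_sec();

    /* mmap()ed I/O: no syscall per access, just touch virtual memory. */
    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    double t2 = now_sec();
    for (long i = 0; i < iterations; i++) {
        size_t off = ((size_t)i * chunk) % span;
        sink ^= map[off];
    }
    double t3 = now_sec();

    printf("pread(): %.2f s   mmap touch: %.2f s   (sink=%d)\n",
           t1 - t0, t3 - t2, (int)sink);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}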