> I would love to understand how people got to this conclusion however and try 
> to find out why we seem to see differences!

I won't make any claims about Cassandra because I have never bothered
benchmarking the difference in CPU usage, since all my use-cases have
been more focused on I/O efficiency. But I will say, without having
benchmarked that either, that *generally*, if you're doing small reads
of data that is in page cache using mmap(), something would have to
be seriously wrong for that not to be significantly faster than
regular I/O.

There's just *no way* that making the switch to kernel space,
validating syscall parameters and so on comes without a performance
penalty (not to mention the indirect effects on e.g. process
scheduling) compared to simply *touching some virtual memory*.
It's easy to benchmark the maximum number of syscalls you can do per
second, and I'll eat my left foot if you're able to do more of that
than touching a piece of memory ;)
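
A rough way to try this comparison, as a minimal sketch only: it assumes
a POSIX system (for os.pread) and a small scratch file that is already in
page cache, and the function name and parameters are made up for
illustration. Python's interpreter overhead is large relative to both
paths, so the absolute numbers mean little; a C version would likely
show the gap far more clearly.

```python
import mmap
import os
import tempfile
import time


def compare_small_reads(file_size=1 << 20, read_size=64, iterations=20000):
    """Time small reads of cached data via pread() and via an mmap() slice.

    Returns (pread_seconds, mmap_seconds, data_matches).
    All names and defaults here are illustrative assumptions.
    """
    with tempfile.TemporaryFile() as f:
        f.write(os.urandom(file_size))
        f.flush()
        fd = f.fileno()
        # Offsets are spread over the file; each read stays within bounds.
        offsets = [(i * 4099) % (file_size - read_size) for i in range(iterations)]

        # Path 1: one pread() system call per small read.
        start = time.perf_counter()
        via_pread = [os.pread(fd, read_size, off) for off in offsets]
        pread_seconds = time.perf_counter() - start

        # Path 2: map the file once; each read is then a copy out of memory.
        with mmap.mmap(fd, file_size, access=mmap.ACCESS_READ) as m:
            start = time.perf_counter()
            via_mmap = [m[off:off + read_size] for off in offsets]
            mmap_seconds = time.perf_counter() - start

        return pread_seconds, mmap_seconds, via_pread == via_mmap


if __name__ == "__main__":
    p, m, same = compare_small_reads()
    print(f"pread: {p:.4f}s  mmap: {m:.4f}s  same data: {same}")
```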

Obviously this does *not* mean that mmap():ed I/O will actually be
faster in some particular application. But I do want to make the point
that the idea that mmap():ed I/O is good for performance (in terms of
CPU) is definitely not arbitrary and unfounded.

Now, and HERE is the kicker: With all the hoopla over mmap():ed I/O
and benchmarks you see, as usual there are lies, damned lies and
benchmarks. It's pretty easy to come up with I/O patterns where mmap()
will be significantly slower (certainly on platters, I'm guessing even
with modern SSD:s) than regular I/O because the method used to
communicate with the operating system (touching a page of memory) is
vastly different.

In the most obvious and simple case, consider an application that
needs to read exactly 50 MB of data, and knows it. Suppose the data is
not in page cache. Submitting a read() of exactly those 50 MB clearly
has at least the potential to be significantly more efficient
(assuming nothing is outright wrong) than touching pages in a
sequential fashion and (1) taking multiple, potentially quite a few,
page faults in the kernel, and (2) being reliant on
read-ahead/pre-fetching, which will never have enough knowledge to
predict your 50 MB read. With the latter you'll invariably take more
seeks (at least potentially, with concurrent I/O) and probably read
more than necessary (since pre-fetching algorithms won't know when
you'll be "done") than if you simply state to the kernel your exact
intent of reading exactly 50*1024*1024 bytes at a particular position
in a file/device.
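
The two ways of communicating with the kernel can be put side by side
in a small sketch. It assumes a POSIX system (os.pread), and both
function names are hypothetical. It only illustrates the difference in
interface; how many page faults and seeks the second variant really
takes depends on the kernel's read-ahead and fault handling, which this
code does not measure.

```python
import mmap
import os
import tempfile


def read_with_one_request(fd, offset, length):
    """State the exact intent up front: `length` bytes starting at `offset`."""
    chunks = []
    remaining = length
    while remaining > 0:
        # pread() may return fewer bytes than requested, so loop until done.
        chunk = os.pread(fd, remaining, offset + (length - remaining))
        if not chunk:
            break  # reached end of file
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_by_touching_pages(fd, offset, length):
    """Walk a mapping page by page; the kernel learns of each page only
    when it is touched and has to guess the rest via read-ahead."""
    page = mmap.PAGESIZE
    out = bytearray()
    # Length 0 maps the whole file (the file must not be empty).
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        pos = offset
        end = offset + length
        while pos < end:
            # Copy up to the next page boundary, or to the end of the range.
            step = min(page - (pos % page), end - pos)
            out += m[pos:pos + step]
            pos += step
    return bytes(out)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(os.urandom(1 << 20))  # 1 MB stand-in for the 50 MB case
        f.flush()
        a = read_with_one_request(f.fileno(), 4096, 512 * 1024)
        b = read_by_touching_pages(f.fileno(), 4096, 512 * 1024)
        print(a == b)
```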

To some extent issues like these may affect Cassandra, but it's
difficult to measure. For example, if you're I/O bound and doing a lot
of range slices that are bigger than a single page - perhaps the
default 64 KB read size with standard I/O is eliminating unnecessary
seeks for you that you're otherwise taking when doing I/O by paging?
It's a hypothesis that is certainly plausible under some
circumstances, but difficult to validate or falsify. One can probably
construct a benchmark where there's no difference, yet see a
significant difference in a real-world scenario when your benchmarked
I/O is intermixed with other I/O. Not to mention subtle differences in
behaviors of kernels, RAID controllers, disk drive controllers, etc...
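
To put rough numbers on that hypothesis, here is a back-of-the-envelope
sketch. The 4 KB page size and the 200 KB slice are assumptions, the
function name is made up, and the result is only an upper bound on
separate requests: read-ahead will usually merge many of the page
touches, and not every request implies a seek.

```python
def requests_needed(offset, length, unit):
    """Number of `unit`-sized, unit-aligned blocks that a byte range touches."""
    if length <= 0:
        return 0
    first = offset // unit
    last = (offset + length - 1) // unit
    return last - first + 1


slice_bytes = 200 * 1024  # hypothetical range slice size

# 64 KB buffered reads with standard I/O
print(requests_needed(0, slice_bytes, 64 * 1024))  # -> 4

# 4 KB pages when paging through mmap() (assumed page size, no read-ahead)
print(requests_needed(0, slice_bytes, 4 * 1024))   # -> 50
```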

-- 
/ Peter Schuller (@scode on twitter)
