On Tue, Jul 13, 2010 at 10:26 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> I'm totally fine with saying "Here's a JNI library for Linux [or even
> Linux version >= 2.6.X]" since that makes up 99% of our production
> deployments, and leaving the remaining 1% with the status quo.

You really need to say Linux > 2.6 and filesystem xyz. That probably reduces the percentage a bit, but probably not critically.

It is quite a while since I have written code for direct I/O (I really try to avoid using it anymore), but from memory, as long as there is a framework which is somewhat extendable and can be used as a basis for new platforms, it should be reasonably trivial for a somewhat experienced person to add a new Unix-like platform in a couple of days.

No idea for Windows. I have never written code for this there.

> > O_DIRECT also bypasses the cache completely
>
> Right, that's the idea. :)

Hm... I would have thought it was clear that my idea is that you do want to interact with the cache if you can! :)

Under high load, you might reduce performance by 10-30% by throwing out the scheduling benefits you get from the OS (yes, that is based on real-life experience). Of course, that is given that you can somehow avoid the worst-case scenarios without direct I/O. As always, things will differ from use case to use case.

A well-performing HW RAID card with sufficient write-back cache might also help reduce the negative impact of direct I/O.

Funnily enough, it is often the systems with light read load that are hardest hit. Systems with heavy read load have more pressure on the cache on the read side, so the writes will not push content out of the cache (or applications out of physical memory) as easily.

To make things more annoying, OSes (not just Linux) have a tendency to behave differently from release to release. What is a problem on one Linux release is not necessarily a problem on another.

I have not seen huge I/O problems when compacting on Cassandra myself, but I am currently working on HW with loads of memory, so I might not see the problems others see. I am more concerned with other performance issues at the moment.

One nifty effect which may, or may not, be worth looking into is what happens when you flip over to the new compacted SSTable: the last thing you wrote to the new compacted table will be sitting in the cache, ready to be read once you start using it. It can as such be worth ordering the compaction so that the most performance-critical parts are written last, and written without direct I/O or similar settings, so they will be in the cache when needed.

I am not sure to what extent parts of the SSTables have structures of importance like this for Cassandra. Haven't really thought about it until now.

Might also be worth looking at I/O scheduler settings in the Linux kernel. Some of the I/O schedulers also support ionice/I/O priorities.

I have never used it on single threads, but I have read that ioprio_set() accepts thread IDs (not just process IDs like the man page indicates). While not super effective, in my experience, at preventing mostly idle data from being flushed out of the cache, it should help if the compaction I/O occurs in isolated threads so that ionice can be applied to just those threads.

> Exactly: it the fadvise mode that would actually be useful to us, is a
> no-op and not likely to change soon. A bit of history:

Interesting, I had not seen that before. Thanks!

Terje
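
P.S. A rough sketch of the per-thread ionice idea, in case it helps the discussion. It is untested against Cassandra and written in Python only to keep it short; from Java the same call would have to go through JNI or similar. It assumes that the kernel really does accept a thread ID where the man page says process ID, that the I/O scheduler in use honours the idle class, and that the machine is x86_64 or aarch64 (ioprio_set has no glibc wrapper, so the raw syscall number is needed). The helper names are made up for illustration.

```python
import ctypes
import os
import platform
import threading

# Constants from the Linux kernel's ioprio interface.
IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_BE = 2     # best-effort, the default class
IOPRIO_CLASS_IDLE = 3   # only served when the disk is otherwise idle
IOPRIO_WHO_PROCESS = 1  # "who" is a process ID or, per the assumption above, a thread ID

# Raw syscall numbers for ioprio_set; only these architectures are covered here.
_SYS_IOPRIO_SET = {"x86_64": 251, "aarch64": 30}


def ioprio_value(io_class, level=0):
    """Pack an I/O class and a priority level (0-7) into one ioprio value."""
    return (io_class << IOPRIO_CLASS_SHIFT) | level


def set_current_thread_ioprio(io_class, level=0):
    """Apply an I/O priority to the calling thread only (Linux only)."""
    nr = _SYS_IOPRIO_SET[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    tid = threading.get_native_id()  # kernel thread ID, needs Python 3.8+
    if libc.syscall(nr, IOPRIO_WHO_PROCESS, tid, ioprio_value(io_class, level)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def start_low_priority_io_thread(work):
    """Run work() in a new thread after dropping that thread to the idle class."""
    def runner():
        set_current_thread_ioprio(IOPRIO_CLASS_IDLE)
        work()
    thread = threading.Thread(target=runner, name="compaction-io")
    thread.start()
    return thread
```

Calling start_low_priority_io_thread(some_compaction_function) would then leave the I/O priority of every other thread in the process untouched, which is the point of isolating the compaction I/O in its own threads.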