Hi, I'm curious whether anyone has attempted to reduce the read IOPS generated by sequential I/O operations (largely compactions) while still maintaining read performance for small-row, random-access client reads?
Our use case has a very high write-to-read ratio, with small rows (< 1-2KB). We've taken many of the usual steps to make random-access client reads fast, reducing read_ahead on the data drive and even using a smaller-than-default LZ4 chunk size (the exact commands are appended below for reference). So far performance has been great, with p95 read times under 10ms.

However, we've noticed that total read IOPS to the Cassandra data drive are extremely high compared to our write IOPS, almost 15x the write IOPS to the same drive. We even set up a ring that took the same write load with zero client reads and observed that the high read IOPS were driven by compaction operations. During large (>50GB) compactions, read and write volume (bytes) were nearly identical, which matched our assumptions, while read IOPS were 15x write IOPS. When we plotted the average read and write op sizes, we saw an average read op of just under 5KB and an average write op of 120KB. Given we are using the default disk access mode of mmap, this aligns with our assumption that compaction reads page in a single 4KB page at a time while write flushes are coalesced into large ops.

We wanted to test this, so we switched a single node to `disk_access_mode: standard`, which should issue reads at compression-chunk granularity, and found that the average read op size increased to ~7.5KB: https://imgur.com/okbfFby

We don't want to sacrifice our read performance, but we also must scale/size our disk performance based on peak IOPS. If we could cut read IOPS during compaction by a quarter or even half, that would mean a large cost savings. We are also limited on drive throughput, so there's a theoretical maximum op size we'd want to use to stay under that throughput limit. Alternatively, we could tune compaction throughput to maintain that limit too (back-of-envelope numbers below).

Has any work been done to optimize sequential I/O operations in Cassandra? Naively, it seems that sequential operations could use a standard-disk-access-mode reader with a configurable block size, while normal read operations stick to the mmap'd segments; a rough sketch of what I'm picturing is appended below. Being unfamiliar with the code: are compaction/sequential SSTable reads done through any single interface, or do they use the same path as normal read ops?

Thoughts?

-Mike

--
Mike Heffner <m...@librato.com>
Librato, Inc.
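For reference, the random-access tuning mentioned above looks roughly like the following. The device, keyspace, and table names are placeholders, and the CQL uses the 3.x compression syntax (older versions spell these options differently):

```
# Shrink block-device readahead so a point read pulls in fewer extra pages
# (readahead is in 512B sectors, so 8 = 4KB; the device name is a placeholder)
sudo blockdev --setra 8 /dev/nvme0n1
```

```
-- Smaller LZ4 chunks so a small-row read decompresses less per request
-- (ks.tbl is a placeholder table)
ALTER TABLE ks.tbl
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};
```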
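On the sizing math: at a fixed compaction byte rate, read IOPS is just bytes per second divided by op size. For a hypothetical 120MB/s of compaction reads, ~5KB ops work out to roughly 24k read IOPS, while ~64KB ops would be under 2k, which is the kind of cut we're after. The knobs we've been experimenting with look like this (48 is an arbitrary example, not a recommendation):

```
# cassandra.yaml on the experiment node: buffered reads instead of mmap
disk_access_mode: standard

# Cap compaction so larger ops can't push us over the drive's MB/s ceiling
# (the default is 16)
compaction_throughput_mb_per_sec: 48
```

The same cap can also be adjusted at runtime with `nodetool setcompactionthroughput 48`.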
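And here's a rough sketch, in plain NIO rather than Cassandra's actual reader classes, of the split I'm naively picturing: sequential scans go through a buffered FileChannel with a configurable block size, while point reads keep using mmap'd segments. All of the names here are made up:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SplitModeReaderSketch {
    // Hypothetical knob: block size for sequential (compaction) reads.
    // 64KB ops would mean ~16x fewer read ops than 4KB page faults at the
    // same byte rate.
    static final int SCAN_BLOCK_SIZE = 64 * 1024;

    // Sequential path: read SCAN_BLOCK_SIZE bytes per disk op instead of
    // faulting in one 4KB page at a time.
    static long sequentialScan(Path sstable) throws IOException {
        long bytes = 0;
        try (FileChannel ch = FileChannel.open(sstable, StandardOpenOption.READ)) {
            ByteBuffer block = ByteBuffer.allocateDirect(SCAN_BLOCK_SIZE);
            while (ch.read(block) != -1) {
                block.flip();
                bytes += block.remaining(); // compaction would merge rows here
                block.clear();
            }
        }
        return bytes;
    }

    // Random-access path: unchanged, map the segment and touch only the
    // pages the point read actually needs.
    static byte readAt(Path sstable, long offset) throws IOException {
        try (FileChannel ch = FileChannel.open(sstable, StandardOpenOption.READ)) {
            MappedByteBuffer seg = ch.map(FileChannel.MapMode.READ_ONLY, offset, 1);
            return seg.get(0);
        }
    }
}
```

The point is only that the two paths could pick their op sizes independently: the scan path trades a little memory for fewer, larger requests, while the point-read path keeps faulting in just the pages it touches.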