> I've tested again, recording LiveSSTableCount and MemtableDataSize
> via JMX. I guess this result supports my suspicion about memtable
> performance, because I cannot find a Full GC this time.
> This is a result with a smaller data size (160 million records in
> Cassandra) on a different disk configuration from my previous post,
> but the general picture doesn't change.
>
> The attached files:
> - graph-read-throughput-diskT.png: read throughput in my client program.
> - graph-diskT-stat-with-jmx.png: graph of CPU load, LiveSSTableCount,
>   and the logarithm of MemtableDataSize.
> - log-gc.20101122-12:41.160M.log.gz: GC log with -XX:+PrintGC
>   -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>
> As you can see from the second graph, the logarithm of MemtableDataSize
> and the CPU load have a clear correlation. When a memtable is flushed
> and a new SSTable is created (LiveSSTableCount is incremented), read
> performance recovers, but it soon degrades again.
> I couldn't find a Full GC in the GC log in this test, so I guess this
> performance is not a result of GC activity.
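For reference, recording those two attributes over time needs very
little code; something along these lines should work. This is an
untested sketch: the JMX port and the MBean ObjectName pattern vary
between Cassandra versions (verify with e.g. jconsole), and the
keyspace/column family names here are made up.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CfStatPoller {
    public static void main(String[] args) throws Exception {
        // JMX port: 8080 on the 0.6 line, 7199 later -- adjust as needed.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();

        // ObjectName pattern varies between versions; check with jconsole.
        // "Keyspace1"/"Standard1" are hypothetical names.
        ObjectName cfStore = new ObjectName(
                "org.apache.cassandra.db:type=ColumnFamilyStores,"
                        + "keyspace=Keyspace1,columnfamily=Standard1");

        while (true) {
            long liveSSTables = ((Number)
                    mbs.getAttribute(cfStore, "LiveSSTableCount")).longValue();
            long memtableBytes = ((Number)
                    mbs.getAttribute(cfStore, "MemtableDataSize")).longValue();
            // Tab-separated: timestamp, LiveSSTableCount, MemtableDataSize.
            System.out.printf("%d\t%d\t%d%n",
                    System.currentTimeMillis(), liveSSTables, memtableBytes);
            Thread.sleep(10000L); // poll every 10 seconds
        }
    }
}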
Hmmm. As Edward correctly points out, memtable performance *is*
expected to decrease with size, simply because that is the nature of
the underlying data structures. However:

(1) It really doesn't make sense to me that this would be so
significant that CPU-bound access becomes slower than going to disk,
as your original graphs would seem to indicate (as Terje points out).

(2) Assuming your average record size of roughly 1 KB is the average
size of each column, you're really not writing a huge number of tiny
pieces of data. 1 KB per column is larger than most use cases, I would
presume. So your use case should not be triggering anything unusual
with respect to the data structures degenerating with large numbers of
entries.

In addition, you say in your original post that your access is random
(in terms of row key, I presume, and I presume also in terms of
columns, or else static column sets per row?). That should hopefully
mean that you're not accidentally triggering some degenerate case in
the memtable data structures themselves, such that the skip list
becomes unbalanced or some such.

So that really leaves me wondering what's going on.

You are doing range slices. Can you try to confirm whether you see the
same drop in read performance if you perform individual column reads
rather than slicing over a range? I'm not necessarily suggesting
individual RPC calls for each column, but, say, providing a list of
column names rather than a range, assuming your data is such that you
can do this. (A sketch of what I mean is at the end of this mail.)

Also, could you perhaps try attaching to one of the nodes with e.g.
VisualVM and do some sample-based profiling (not the
instrumentation-based kind), and see whether there is a consistent
difference between the periods right after an sstable got flushed and
the periods just before flushing? If we're lucky we might see
something obvious there, in terms of where time is being spent.

Another question (I'm not sure whether it is actually relevant, but at
least if the answer is "yes" we can drop it): your data access is
random across rows, right? You're not reading/writing random column
names within the same large row? (Sorry if this was already stated; I
did check quickly but didn't see it.)

And finally, how independent of your (presumably non-public) code/data
is this test? Would it be possible to publish the test so that others
can reproduce and experiment?
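To make the column-list suggestion concrete, here is a rough sketch of
the two read variants against the 0.7-era Thrift interface. It is
untested; the host/port, keyspace, column family, and column names are
made up, and on 0.6 the keyspace is passed per call instead of via
set_keyspace().

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class SliceVsNames {
    private static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes());
    }

    public static void main(String[] args) throws Exception {
        TTransport transport =
                new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client =
                new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("Keyspace1"); // made-up keyspace

        ColumnParent parent = new ColumnParent("Standard1"); // made-up CF
        ByteBuffer rowKey = bytes("some-row-key");

        // Variant 1: slice over a contiguous range of column names.
        SlicePredicate byRange = new SlicePredicate();
        byRange.setSlice_range(
                new SliceRange(bytes("col000"), bytes("col009"), false, 10));
        List<ColumnOrSuperColumn> viaRange =
                client.get_slice(rowKey, parent, byRange, ConsistencyLevel.ONE);

        // Variant 2: the same columns requested as an explicit name list.
        SlicePredicate byNames = new SlicePredicate();
        for (int i = 0; i < 10; i++) {
            byNames.addToColumn_names(bytes(String.format("col%03d", i)));
        }
        List<ColumnOrSuperColumn> viaNames =
                client.get_slice(rowKey, parent, byNames, ConsistencyLevel.ONE);

        System.out.println(viaRange.size() + " / " + viaNames.size());
        transport.close();
    }
}

-- 
/ Peter Schuller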