Thanks, Peter. I missed an important point: each time the performance resets, there is a memtable flush. The memtable is flushed every hour and is about 128MB at that point. Each time the memtable is flushed, there is a GC spike.
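For context, the hourly/128MB pattern is just the memtable flush thresholds at work. A rough sketch of the relevant knobs, using the 0.6.x storage-conf.xml element names (an assumption about the version; in 0.7 these become the per-column-family memtable_throughput_in_mb / memtable_flush_after_mins options), with illustrative values rather than our actual settings:

    <!-- flush once the memtable holds roughly this much data... -->
    <MemtableThroughputInMB>256</MemtableThroughputInMB>
    <!-- ...or after this many minutes, whichever comes first -->
    <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

With roughly 128MB written per hour, the 60-minute timer fires before the size threshold does, which would explain why each flush happens hourly at about 128MB.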
Right after a memtable flush, we see minimum CPU usage and maximum read throughput, both in terms of disk reads and Cassandra records read. As the memtable grows, CPU usage goes up and read throughput drops. Whether this is caused by the memtable itself or by a GC performance issue is the big question. Since each memtable is only about 128MB when flushed, I don't really expect GC problems or caching issues.

Regards,
Shotaro

On Fri, Nov 12, 2010 at 9:24 AM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> - About 300 million records are stored in single cassandra node. (16
>> cores, 32GB memory, SSD disk)
>> - One record is avg. 1KB.
>
> So about 300 GB of disk space in total?
>
>> - Multi-thread client on different server does read and write:
>> - 120 reader threads continuously read random records from cassandra
>> and 1/10 of read data are written back to cassandra with 20 writer
>> threads.
>
> They are definitely (pseudo-)random, uniformly across the data? I.e.,
> there is no particular hot active set?
>
>> - Use JNA library to suppress GC for cassandra program.
>
> I assume you mean dropping JNA in there to enable cassandra to do
> mlockall(). Does it succeed? Have you confirmed there is no swapping
> going on?
>
> (Just to be sure; the long GC times sound a bit suspicious)
>
>> It didn't help. More frequent fluctuation occurs and zero disk read
>> happens twice in one hour. The disk read throughput is lower than
>> previous one.
>
> Hrmpf. How about compaction; is compaction correlated somehow with the
> trends you see in graphs?
>
> For the long pauses, do you have GCInspector log entries indicating
> the before/after heap sizes for the CMS GC passes?
>
> Regarding the graph-ssd-stat.jna.operations1M.png:
>
> The behavior seems very consistent with having low cache locality at
> the start, being disk-bound. As disk activity goes down because of hot
> reads, CPU usage instead goes up. However I would expect this to be
> correlated with a higher operations/second rate. Are the different
> graphs over the exact same time period, such that they show that the
> overall 'reads per second' rate actually goes *down* as CPU usage goes
> up and disk usage goes down?
>
> If so, that indicates there is in fact excessive activity going on
> wasting CPU. Since it is periodic, I'm thinking maybe the heap simply
> grows too full, causing GC pressure to decrease overall efficiency
> (without OOM:ing) until the time based criteria triggers memtable
> flush. Once that happens and once the memtables gets GC:ed, suddenly
> the application (cassandra) can make progress again.
>
>> The last image Screenshot.png is GC log graph created by gcview. The
>> GC spikes match to the performance drop timing (blue line means used
>> heap size and green line height shows GC time).
>> We believe that some strange thing is going on after memtable flush.
>> Does anyone has a fix or advice?
>
> Hmmm. I'm not sure how to correlate the GC graphs, especially because
> of the resolution. Can you find some GCInspector log entries and
> report back what the heap size is before/after a CMS sweep (1) during
> the "good" period of high throughput and (2) during the "bad" period
> of low throughput. I.e., find ConcurrentMarkSweep GCInspector lines
> from those periods; each line should contain the heap size and the
> amount freed.
>
> In general, I think the GC times are pretty surprising unless it is
> because CMS is failing to start early enough such that the excessive
> times are due to fallback to full GC.
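(For anyone following along: the GCInspector lines requested above can be pulled out with something like the commands below. The log path is an assumption; use whatever your log4j appender actually writes to.)

    # CMS lines with heap size and amount reclaimed, as logged by GCInspector
    grep ConcurrentMarkSweep /var/log/cassandra/system.log | tail -n 20
    # quick check that the box is not swapping during the run (watch si/so)
    vmstat 5
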
>
> You may want to modify the VM args to run with -Xloggc:path/to/gclog
> -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. That should
> make any concurrent mode failures and full GC:s visible and in general
> give some more information about what is going on at each GC. One can
> then look at the log surrounding the instances of prolonged GC pauses.
> If you could run it with those options and trigger the behavior, and
> perhaps post the GC log?
>
> I think the most important thing to figure out is what CPU time is
> spent on. If it is GC it should almost certainly be fixable as long as
> one knows *why* (it should be due to heap usage, although I don't see
> anything in your numbers that would indicate to me why the heap would
> have so much live data as to cause problems given your 16 gig heap
> size).
>
> --
> / Peter Schuller
>

--
Shotaro Kamio