Hi, I am seeing some strange behavior that I am not able to understand.
I have 6 nodes running cassandra-1.0.12. Each node has 8G of RAM, and the replication factor is 3.

---------------
My story is maybe too long, so I am trying a shorter version here, while keeping the full write-up below in case someone has the patience to read my bad english ;)

I got into a situation where my cluster was generating a lot of timeouts on our frontend, whereas I could not see any major trouble in the internal stats. Actually the cpu and the read & write counts on the column families were quite low. It was a mess until I switched from java7 to java6 and forced the use of jamm. After the switch, the cpu and the read & write counts went up again and the timeouts were gone. I have seen the same behavior when reducing the xmx too.

What could be blocking cassandra from using the full resources of the machines? Are there metrics I did not look at which could explain this?

---------------
Here is the long story.

When I first set the cluster up, I blindly gave 6G of heap to the cassandra nodes, thinking that the more memory a java process has, the smoother it runs, while still keeping some RAM for the disk cache. Then a new feature was deployed and things went to hell, with some machines going up to 60% wa. I give credit to cassandra, because there were not that many timeouts received on the web frontend; it was slow, but it was kind of working. With some optimizations we reduced the pressure of the new feature, but the nodes were still at 40% wa. At that time I did not have much monitoring, just heap and cpu.

I read some articles on tuning and learned that the disk cache is quite important, because cassandra relies on it as its read cache. So I tried many xmx values, and 3G seemed to be about the lowest possible. On 2 of the 6 nodes I set the xmx to 3.3G and, amazingly, I saw the wa go down to 10%. Quite happy with that, I changed the xmx to 3.3G on every node. But then things really went to hell, with a lot of timeouts on the frontend; it was not working at all. So I rolled back.

After some time, probably because the data of the new feature grew to its nominal size, the nodes went again to very high %wa and cassandra was not able to keep up. So we more or less reverted the feature; the column family is still used, but only by one thread on the frontend. The wa was reduced to 20%, but things continued to not work properly: from time to time a bunch of timeouts is raised on our frontend.

In the meantime I took the time to set up some proper monitoring of cassandra: column family read & write counts, latencies, memtable sizes, but also the dropped messages, the pending tasks and the timeouts between nodes (I paste a bit further down the small JMX probe I use to collect these). It is just a start, but it gives me a first nice view of what is actually going on.

I tried again reducing the xmx on one node. Cassandra does not complain about not having enough heap, the memtables are not flushed insanely every second, the number of reads and writes is lower than on the other nodes, the cpu is lower too, there are not many pending tasks, and no more than 1 or 2 messages are dropped from time to time. Everything indicates that there is probably room for more work, but the node does not take it. Even its read and write latencies are lower than on the other nodes. But if I keep that xmx long enough, timeouts start to be raised on the frontends.

After some of these individual node experiments, the cluster was starting to be quite "sick". Even with 6G, the %wa was going down, the read and write counts too, on pretty much every node, and more and more timeouts were raised on the frontend.
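By the way, here is more or less the small JMX probe mentioned above. It is only a minimal sketch: "MyKeyspace" / "MyCF" are placeholders, 7199 is the default JMX port, and the MBean and attribute names are what I remember from 1.0.x, so double-check them against your version (the DroppedMessages attribute on MessagingService in particular is an assumption on my side).

import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal poller for the per-column-family counters I graph.
// "MyKeyspace" / "MyCF" are placeholders; the MBean/attribute names are
// the ones I see on 1.0.x and may differ on other versions.
public class CfStatsProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // One ColumnFamilyStore MBean per column family.
            ObjectName cf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,"
                    + "keyspace=MyKeyspace,columnfamily=MyCF");
            System.out.println("ReadCount  = " + mbs.getAttribute(cf, "ReadCount"));
            System.out.println("WriteCount = " + mbs.getAttribute(cf, "WriteCount"));
            System.out.println("RecentReadLatencyMicros  = "
                    + mbs.getAttribute(cf, "RecentReadLatencyMicros"));
            System.out.println("RecentWriteLatencyMicros = "
                    + mbs.getAttribute(cf, "RecentWriteLatencyMicros"));
            System.out.println("MemtableDataSize = "
                    + mbs.getAttribute(cf, "MemtableDataSize"));

            // Dropped messages per verb -- I assume the MessagingService MBean
            // exposes this map on 1.0 the same way nodetool tpstats reports it.
            ObjectName ms = new ObjectName(
                    "org.apache.cassandra.net:type=MessagingService");
            @SuppressWarnings("unchecked")
            Map<String, Integer> dropped =
                    (Map<String, Integer>) mbs.getAttribute(ms, "DroppedMessages");
            System.out.println("DroppedMessages = " + dropped);
        } finally {
            connector.close();
        }
    }
}

I run something like this periodically against each node and graph the deltas, which is enough to see the read & write counts stall.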
The only thing I could see that looked worrying was the heap climbing slowly above the 75% threshold and, from time to time, suddenly dropping from 95% to 70%. I looked at the full gc counter: not much pressure there. The other thing was some "Timed out replaying hints to /10.0.0.56; aborting further deliveries" messages in the log, but they are logged as INFO, so I guess they are not that important.

After a lot of useless staring at the monitoring graphs, I gave a try to openjdk 6b24 rather than openjdk 7u9, and forced cassandra to load jamm, since in 1.0 the init script blacklists openjdk for it. Node after node, I saw the heap behaving more like what I am used to seeing on jamm-based apps: some nice ups and downs rather than one long, slow climb. But the read and write counts were still low on every node, and timeouts were still bursting on our frontend. It remained a mess until I restarted the "first" node of the cluster. There was still one node left to switch to java6 + jamm, but as soon as I restarted that "first" node, every node started working more: %wa climbing, read & write counts climbing, no more timeouts on the frontend, the frontend then being fast as hell.

I understand that my cluster is probably under capacity. But I do not understand how, since there seems to be something within cassandra which blocks the full use of the machine resources. It looks somehow related to the heap, but I do not know how. Any idea? I intend to start monitoring more metrics, but do you have any hint on which ones could explain this behavior?

Nicolas
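PS: for the heap and gc graphs, I use an equally small probe that only touches the standard java.lang platform MBeans, so nothing cassandra specific. Again just a sketch, assuming the node's JMX port is the default 7199:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Polls heap usage and per-collector GC counters of a remote node through
// the standard java.lang platform MBeans (nothing Cassandra-specific).
public class HeapGcProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Heap usage (used / committed / max), the same numbers as nodetool info.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    mbs, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            System.out.println("Heap: " + memory.getHeapMemoryUsage());

            // One MXBean per collector (ParNew and ConcurrentMarkSweep with the
            // default cassandra-env.sh settings). Count and time only ever go up,
            // so the delta between two polls gives the GC pressure per interval.
            Set<ObjectName> gcNames = mbs.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,name=*"), null);
            for (ObjectName name : gcNames) {
                GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
                        mbs, name.toString(), GarbageCollectorMXBean.class);
                System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                        + " timeMs=" + gc.getCollectionTime());
            }
        } finally {
            connector.close();
        }
    }
}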