> There's some words on the 'Net - the recent pages on Riptano's
> site, in fact - that strongly encourage scaling left and right,
> rather than beefing up the boxes - and certainly we're seeing far
> less bother from GC using a much smaller heap - previously we'd
> been going up to 16GB, or even higher. This is based on my
> previous positive experiences of getting better performance from
> memory hog apps (e.g. Java) by giving them more memory. In any
> case, it seems that using large amounts of memory on EC2 is just
> asking for trouble.
Keep in mind that while GC tends to be more efficient with larger heap sizes, that does not always translate into better overall performance once other factors are taken into account. In particular, in the case of Cassandra, if you "waste" 10-15 gigs of RAM on the JVM heap for a Cassandra instance which could live with e.g. 1 GB, you're actively taking those 10-15 gigs away from the operating system, which could otherwise use them for the buffer cache. Particularly if you're I/O bound on reads, this can have very detrimental effects (assuming the data set is sufficiently small and locality is such that 15 GB of extra buffer cache makes a difference; usually, but not always, this is the case).

So with Cassandra, in the general case, you definitely want to keep your heap size reasonable in relation to the actual live set (the amount of actually reachable data), rather than just cranking it up as much as possible. The main issue is keeping it high enough not to OOM, given that exact memory demands are hard to predict; it would be great if the JVM were better at maintaining a reasonable heap-size-to-live-set ratio so that much less tweaking of heap sizes was necessary, but that is not the case.
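As a rough illustration (the numbers are made up for the sake of example, not a recommendation): on a 16 GB node whose live set is on the order of a gigabyte or two, you might cap the heap at a few gigs via the standard -Xms/-Xmx JVM flags, wherever your version sets them (bin/cassandra.in.sh on older releases, conf/cassandra-env.sh on newer ones):

    # Cap the heap well below total RAM so the remainder is left to
    # the OS buffer cache; size it for the live set plus headroom,
    # not "as much as possible".
    JVM_OPTS="$JVM_OPTS -Xms4G -Xmx4G"

With a 4 GB ceiling you keep roughly 12 GB available for caching SSTables; with -Xmx15G you'd have almost nothing. Setting -Xms equal to -Xmx just avoids heap resizing; the ceiling is the part that matters here.

--
/ Peter Schuller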