> But to be honest I'm pretty disappointed that Cassandra doesn't really
> scale linearly (or "semi-linearly" :)) when adding new machines. I
It really should scale linearly for this workload unless I have missed something important (in which case I hope someone will chime in). But note that you added more nodes and increased the replication factor at the same time, so the discrepancy you're seeing is smaller than it might first appear.

I.e., you got 200/sec with one machine @ RF = 1 and 450/sec with 8 machines @ RF = 2. Given an 8x increase in machine count and a 2x RF increase, the expectation would be 4x the read rate, i.e. about 800/sec. Why you're seeing 450 rather than something like 800 I'm not sure (with disk access and caching, though, beware of the difficulty of normalizing the environment when benchmarking). But whatever is going on, I don't believe you can conclude that this is due to Cassandra scaling that poorly for simple, randomly distributed small reads.

For a read, the nodes involved in servicing your request are going to be limited to the node you're talking to for RPC plus RF nodes (assuming read-repair is turned on, and that the RPC node did not happen to be one of the nodes holding the data). This really should imply linear scaling (with respect to disk I/O, in the absence of other bottlenecks).

Also, you can turn read-repair off (in 0.6) or partially off (in 0.7, by percentage) if you are concerned about scaling with higher RFs and a small number of nodes.

> expected that 8-machines cluster will easily beat single MySQL when
> there is much more data than RAM.

The relative performance characteristics in this case will depend significantly on the type of data; it is not just about the total amount. In particular, the average row size is likely to be very relevant. Access pattern also matters; for example, "random access" within rows to different columns or column ranges has the potential to be much more efficient, while random access between rows doing only a single read for each row is probably the least flattering case for Cassandra when disk bound.
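For what it's worth, the back-of-the-envelope expectation from earlier in this mail (8x the machines, 2x the RF, hence roughly 4x the read rate) can be written down as a small Python sketch. The function name and the model are purely illustrative assumptions: every node is taken to sustain the same disk-bound read rate, reads are spread uniformly, and each client read costs one disk read on each of the RF replicas when read-repair is on (and on a single replica when it is off). Real clusters will deviate because of caching, network overhead and other bottlenecks.

```python
def expected_read_rate(single_node_rate, nodes, rf, read_repair=True):
    """Idealized aggregate reads/sec for a disk-bound, uniformly random workload.

    Assumptions (illustrative only, not measured behaviour):
      - every node sustains `single_node_rate` disk reads/sec;
      - with read-repair on, each client read hits all `rf` replicas;
      - with read-repair off, each client read hits a single replica.
    """
    replicas_touched = rf if read_repair else 1
    return single_node_rate * nodes / replicas_touched


# Numbers from this thread: 200/sec on one machine at RF = 1.
baseline = expected_read_rate(200, nodes=1, rf=1)
scaled = expected_read_rate(200, nodes=8, rf=2)
print(baseline, scaled, scaled / baseline)
```

Under these assumptions the 8-node, RF = 2 cluster comes out at about 800/sec, which is the figure the observed 450/sec should be compared against, rather than 8 x 200 = 1600/sec.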
Without knowing more details it's probably difficult to offer specific explanations for this particular case.

--
/ Peter Schuller