> - performance (it should be not as much less than shard of MySQL and
> scale linearly, we want to have not more that 10K inserts per second
> of writes, and probably not more than 1K/s reads which will be mostly
> random)
> - ability to store big amounts of data (now it looks that we will
> have about 50GB of uncompressed data per day)
I'm curious what the writes look like, if you expect to do 10k inserts per second with MySQL on large data sets that do not fit in RAM. Are they of such a nature that the writes turn mostly sequential even in MySQL? In particular, consider that in the MySQL case, if you're inserting values randomly with respect to the index order of indexed columns, you can expect to take seeks for your writes even if your database is logically "append only" in the sense of not having deletes. Unless the write activity happens to be so unusually structured that you avoid seeking in a MySQL/Tokyo Cabinet environment, it seems likely that write performance will be the most significant issue for you (and Cassandra should shine here), given the size of the data set, given that you say you expect 10x as many writes as reads, and given that the reads are expected to be random.

With regards to reads, I re-read your original post and am trying to figure out the most likely cause of the discrepancy. Even with zero cache locality, my reading of the index read path is that, in Cassandra, you would expect exactly one disk seek into the index file, followed by streaming (though this may be turned into multiple seeks depending on OS readahead behavior). Then you would expect a single seek in the data file, followed by streaming. With low cache hit ratios, I would not expect a btree index lookup + data file seek in MySQL/Cabinet/PostgreSQL to be significantly faster in terms of disk seeks, assuming (1) read-ahead is in effect and avoids the streaming turning into multiple seeks, and (2) the data size is large enough that btree index leaves are not cached most of the time.

Some random thoughts:

* Is MySQL configured such that most or all of the index fits in MySQL's internal key cache? If so, the index is not competing with the data file for cache space, which may account for MySQL effectively not needing to go down to disk for index lookups where Cassandra may have to. However, if my understanding is correct, this should at most double the number of seeks required in Cassandra relative to MySQL.

* For rows that span multiple sstables, you may see a magnifying effect with respect to seeks across those sstables. My reading of py_stress is that this should not be what is hitting you, assuming you populated the database once and then did the reads, since each row key seems to be populated only once per insertion run. To make sure this is not an issue, you can confirm that read performance does not significantly improve after triggering a major compaction (and waiting for it to complete).

As to why you saw only 200 -> 450 instead of 200 -> 800 when moving to 8 nodes and doubling RF, I don't know, given that you say you were still fully disk bound. Going from 450 down to 300 when going from RF 2 to 3 seems very much expected though (almost exactly), since you're increasing the reads by 50%: for an underlying read capacity of n seeks per second and replication factor f, the ideal case would be a read rate of r = n/f. With r = 450 and f = 2 we get n = 900, and with n = 900 and f = 3 we get r = 900/3 = 300.

Anyways... on to the numbers comparison between Cassandra and MySQL:

With MySQL you got 340 reads/second @ 21 GB on a single machine. The poorest number from Cassandra (discounting the expected decrease of going from RF 2 to 3) is 450/second on 8 nodes with RF 2.
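Just to spell out the back-of-envelope arithmetic I'm using (the simple r = n/f model above and the per-node normalization that follows) - a purely illustrative sketch, assuming one seek per replica per read; the names are mine and not from py_stress or anything else:

    # Rough model only: cluster read rate r = n / f, where n is the
    # aggregate seek capacity and f the replication factor.
    def expected_read_rate(n, f):
        return n / f

    # Observed 450 reads/s at RF 2 implies n = 450 * 2 = 900 seeks/s.
    n = 450 * 2
    print(expected_read_rate(n, 3))   # ~300 reads/s expected at RF 3

    # Normalizing the RF 2 result to one node: 8 nodes, 2 replicas
    # touched per read, so per-node rate = 450 * 2 / 8.
    print(450 * 2 / 8)                # ~112 reads/s per node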
This is roughly equivalent to 450/second on 4 nodes with RF 1, or about 112 reads/second for a single node. In other words, the poorest number you got from Cassandra, compared against the best number you reported for MySQL, shows a factor of 3 difference.

What is this caused by? I'm not sure. But if the MySQL key cache hypothesis is correct, that might account for a factor of 2, leaving MySQL 50% faster for the remainder. Add to this that the 2.5 GB heap given to Cassandra may be large relative to MySQL's footprint, leaving less room for OS caching.

I'm not trying to make excuses for Cassandra, and perhaps the MySQL key cache is not even at play here. But it's difficult to offer an accurate explanation for what exactly is happening based on the information provided, or to say whether the cause of the performance difference would hold for the real production use case. This is not to complain that you are not providing enough information, but to make the point that the performance of these systems can be subtly affected by many things, and it is difficult to predict expected performance to a high degree of precision - particularly when multiple different forms of caching are involved in the comparison.

-- 
/ Peter Schuller