>  - performance (it should not be much worse than a shard of MySQL and
> should scale linearly; we want no more than 10K inserts per second
> of writes, and probably no more than 1K/s reads, which will be mostly
> random)
>  - ability to store large amounts of data (it now looks like we will
> have about 50GB of uncompressed data per day)

I'm curious what the writes look like, if you expect to do 10k inserts
per second with MySQL on large data sets that do not fit in RAM. Are
they of such a nature that the writes turn mostly sequential even in
MySQL?

In particular, consider that in the MySQL case, if you're inserting
values randomly with respect to the index order of indexed columns,
you can expect your writes to incur seeks even if your database is
logically "append only" in the sense of not having deletes.

Unless the write activity happens to be structured so unusually that
you avoid seeking entirely in a MySQL/Tokyo Cabinet environment, it
seems likely that write performance will be the most significant
issue for you (and cassandra should shine here), given the size of
the data set, given that you say you expect 10x as many writes as
reads, and given that the reads are expected to be random.

With regard to reads, I re-read your original post. I'm trying to
figure out what the most likely cause of the discrepancy is. Even with
zero cache locality, my reading of the index read path is that, in
cassandra, you would expect exactly one disk seek into the index file,
followed by streaming (though this may be turned into multiple seeks
depending on OS readahead behavior). Then, you would expect a single
seek in the data file followed by streaming.

With low cache hit ratios, I would not expect a btree index lookup +
data file seek in mysql/cabinet/postgresql to be significantly faster
in terms of disk seeks, assuming (1) read-ahead is in effect and
avoids the streaming turning into multiple seeks, and (2) that the
data size is large enough that btree index leaves are not cached most
of the time.
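
To make the seek accounting concrete, here is a rough sketch (Python,
purely illustrative) of the cold-cache seek count per point read under
the assumptions above; the btree depth and the number of cached levels
are made-up values for the sake of the example, not measurements:

# Rough, illustrative model of disk seeks for one cold-cache point read.

def cassandra_seeks():
    index_seek = 1   # one seek into the index file, then streaming
    data_seek = 1    # one seek into the data file, then streaming
    return index_seek + data_seek

def mysql_seeks(btree_depth=3, cached_levels=3):
    # Each btree level that misses the key cache costs a seek; if the
    # whole index fits in the key cache, index lookups cost none.
    index_seeks = max(0, btree_depth - cached_levels)
    data_seek = 1
    return index_seeks + data_seek

print(cassandra_seeks())             # 2
print(mysql_seeks())                 # 1 (index fully cached)
print(mysql_seeks(cached_levels=1))  # 3 (mostly uncached index)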

Some random thoughts:

* Is MySQL configured such that most or all of the index will be in
MySQL's internal key cache? If so, the index is not competing with the
data file, which may account for MySQL effectively not needing to go
down to disk for index lookups, where Cassandra may have to. However,
if my understanding is correct, this should amount to no more than a
doubling of the number of seeks required in Cassandra relative to
MySQL.

* For rows that span multiple sstables, you may see a magnifying
effect with respect to seeks across those sstables. My reading of
py_stress is that this should not be what is hitting you, assuming you
populated the database once and then did the reads, since each row key
seems to be populated once for an insertion run. To ensure this is not
an issue, you can try confirming that read performance does not
significantly improve after having triggered a major compaction (and
waiting for it to complete).
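
(In case it is useful: if I recall correctly, a major compaction can
be triggered from the command line with something along the lines of
"nodetool -h <host> compact"; the exact invocation may differ between
versions.)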

As to why you just saw 200 -> 450 instead of 200 -> 800 when moving to
8 nodes and doubling RF, I don't know, given that you say you were
still fully disk bound.

Going from 450 down to 300 when going from RF 2 to 3 seems very much
expected though (almost exactly), given that you're increasing the
underlying reads by 50% (for an underlying read capacity of n seeks
per second and replication factor f, the ideal case would be a read
rate of r = n/f; so r = 450, f = 2 => n = 900, and given n = 900 and
f = 3, r = 900/3 = 300).
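
Spelling that arithmetic out (purely illustrative; the only assumption
is that each logical read costs roughly f underlying reads, e.g.
because every replica ends up doing the read):

# Ideal-case read rate r = n / f for an underlying capacity of n
# seeks/second and replication factor f.

def read_rate(n, f):
    return n / f

n = 450 * 2              # observed r = 450 at f = 2 implies n = 900
print(read_rate(n, 3))   # 300.0, matching the drop you saw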

Anyways... on to the numbers comparison between cassandra and MySQL:

With MySQL you got 340 reads/second @ 21 GB on a single machine. The
poorest number from cassandra (discounting the expected decrease from
going rf 2 -> 3) is 450/second on 8 nodes with rf 2. That is roughly
equivalent to 450/second on 4 nodes with rf 1, or about 112
reads/second for a single node.
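
The same normalization in code (again just illustrative, reusing the
assumption that the replication factor multiplies the underlying read
load):

# Reduce a cluster-wide rate to a per-node figure: rf 2 over 8 nodes
# behaves roughly like rf 1 over nodes/rf = 4 effective nodes.

def per_node_rate(cluster_rate, nodes, rf):
    return cluster_rate / (nodes / rf)

print(per_node_rate(450, 8, 2))   # 112.5 reads/second per node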

In other words, in the end the poorest number you got from cassandra,
compared to the best number you reported for MySQL, shows roughly a
factor of 3 difference. What is this caused by? I'm not sure. But if
the MySQL key cache hypothesis is correct, that might account for a
factor of 2, leaving MySQL roughly 50% faster for the remainder. Add
to this that the 2.5 GB heap given to Cassandra may be large relative
to MySQL's, leaving less room for OS page caching.

I'm not trying to make excuses for cassandra, and perhaps the MySQL
key cache is not even at play here. But it's difficult to offer an
accurate explanation of what exactly is happening based on the
information provided, or to say whether the cause of the performance
difference would be likely to hold for the real production use case.
This is not to complain that you are not providing enough information,
but to make the point that the performance of these systems can be
subtly affected by many factors, and it is difficult to predict
performance to a high degree of precision - particularly when multiple
different forms of caching are involved in the comparison.

-- 
/ Peter Schuller
