I understand that my computer may not be as powerful as those used in
the other benchmarks,
but it shouldn't be that far off (1:30), right?
Cassandra has very fast writes; you can have read:write ratios like 1:1000.
Pure read workload on 1 billion rows, without key/row cache, on a 2-node cluster:
Running workload in 10 threads 1000 ops each.
Workload took 88.59 seconds, thruput 112.88 ops/sec
Each node can do about 240 IOPS, which means an average of 4 IOPS per read
in Cassandra on a cold system.
After the OS cache warms up enough to hold the indirect seek blocks, reads
get close to the ideal:
Workload took 79.76 seconds, thruput 200.59 ops/sec
Ideal Cassandra read performance (without caches) is 2 IOPS per read
-> one I/O to read the index, a second for the data.
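A rough back-of-the-envelope check of those figures (just a sketch, using
the 2-node / ~240 IOPS per node numbers above):

    # Expected read throughput from raw disk IOPS (numbers from this post).
    nodes = 2
    iops_per_node = 240
    cluster_iops = nodes * iops_per_node        # ~480 IOPS in total

    cold_ios_per_read = 4   # index blocks not cached yet
    warm_ios_per_read = 2   # one I/O for the index, one for the data

    print(cluster_iops / cold_ios_per_read)     # ~120 reads/sec, measured 112.88
    print(cluster_iops / warm_ios_per_read)     # ~240 reads/sec, measured 200.59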
Pure write workload:
Running workload in 40 threads 100000 ops each.
Workload took 302.51 seconds, thruput 13222.62 ops/sec
Writes are slow here because the nodes are running out of memory, most likely
due to memory leaks in the 1.0 branch. Also, the writes in this test are not batched.
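For illustration, this is roughly what batched writes look like from the
DataStax Python driver (just a sketch - the benchmark above used its own
client, and the keyspace/table/column names here are invented):

    # Unlogged batches group several inserts into one round trip.
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('test_ks')        # invented keyspace
    insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

    rows = [(i, 'payload-%d' % i) for i in range(100000)]
    BATCH_SIZE = 100

    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    pending = 0
    for row_id, payload in rows:
        batch.add(insert, (row_id, payload))
        pending += 1
        if pending == BATCH_SIZE:
            session.execute(batch)              # one round trip per 100 rows
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            pending = 0
    if pending:
        session.execute(batch)                  # flush the remainder

For throughput you generally want unlogged batches (or simply many concurrent
writes); logged batches add coordinator overhead.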
Cassandra is really awesome for its price tag. Getting similar numbers
from Oracle will cost you way too much. For the price of one 2-core Oracle
licence suitable for processing large data you can get about 8 Cassandra
nodes - and don't forget that Oracle needs some hardware too. Transactions
are not always needed for data warehousing - if you are importing chunks of
data, you do not need rollbacks, just schedule failed chunks for later
processing (a sketch of that follows below). If you are able to code your
app to work without transactions, Cassandra is the way to go.
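A minimal sketch of what I mean by scheduling failed chunks instead of
relying on rollbacks (the chunk size and the load_chunk callback are
invented for the example):

    # Import in chunks; a failed chunk goes back on the queue instead of
    # rolling the whole import back.
    from collections import deque

    CHUNK_SIZE = 10000

    def import_without_transactions(rows, load_chunk, max_retries=3):
        chunks = deque((0, rows[i:i + CHUNK_SIZE])
                       for i in range(0, len(rows), CHUNK_SIZE))
        failed = []
        while chunks:
            attempts, chunk = chunks.popleft()
            try:
                load_chunk(chunk)                        # writes one chunk into Cassandra
            except Exception:
                if attempts + 1 < max_retries:
                    chunks.append((attempts + 1, chunk)) # schedule for later
                else:
                    failed.append(chunk)                 # report for manual handling
        return failed

Since Cassandra writes are upserts, replaying a whole chunk is safe, which
is what makes this work without rollbacks.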
Hadoop and Cassandra are very good products for working with large data,
basically for just the price of learning a new technology. Usually Cassandra
is deployed first; it's easy to get it running and day-to-day operations
are simple. Hadoop follows later, after discovering that Cassandra is not
really suitable for large batch jobs because reading data out of it means
doing random access.
We finished the migration from a commercial SQL database to Hadoop/Cassandra
in 3 months; not only does it cost 10x less, we are able to process
datasets about 100 times larger. Our largest dataset has 1200 billion rows.
Problems with this setup are:
- Bloom filters use too much memory; they should be configurable
  for applications where read performance is unimportant.
- Node startup is really slow.
- Data loaded into Cassandra is about 2 times bigger than the CSV export
  (not really a problem, disk space is cheap, but the per-row overhead
  is quite high).
- Writing applications is harder than coding against an SQL backend, and
  Hadoop is much harder to use than Cassandra.
- Lack of good import/export tools for Cassandra, and especially lack of
  monitoring.
- You must know workarounds for Hadoop bugs; Hadoop is not easy to use
  efficiently.
- Index overhead is too big (about 100% slower) compared to the index
  overhead in SQL databases (about 20% slower).
- No delete over an index.
- Repair is slow.