Please do keep us posted. We have a somewhat similar Cassandra
utilization pattern, and I would like to know what your solution is...
On 2012-05-16 20:38:37 +0000, Yiming Sun said:
Thanks Oleg. Another caveat on our side is that we have a very large
data space (imagine picking 100 items out of 3 million; the chance of
having two items from the same bin is pretty low). We will experiment
with the row cache, and hopefully it will help rather than hurt (the
tuning guide says the row cache can be detrimental in some circumstances).
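For reference, this is a sketch of how the row cache is enabled in the
Cassandra 1.x line we are on; the column family name `Volumes` is just an
example, and the exact attribute names depend on the Cassandra version:

```
# cassandra-cli (Cassandra 1.1): cache whole rows for one column family
update column family Volumes with caching = 'rows_only';

# cassandra.yaml (Cassandra 1.1): global row cache capacity
# row_cache_size_in_mb: 512
```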
-- Y.
On Wed, May 16, 2012 at 4:25 PM, Oleg Dulin <oleg.du...@gmail.com> wrote:
Indeed. This is how we are trying to solve this problem.
Our application has a built-in cache that resembles a supercolumn or
standard-column data structure, with an API that resembles a combination
of the Pelops selector and mutator. You can do something like that for
Hector.
The cache is constrained and uses LRU to purge unused items and keep
memory usage steady.
It is not perfect and we still have bugs, but it cuts out about 90% of
our Cassandra reads.
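A minimal sketch of such a bounded LRU cache, using `LinkedHashMap`'s
access-order mode; the class name and capacity are illustrative, not
Oleg's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded, LRU-evicting cache: once capacity is exceeded, the least
// recently accessed entry is purged, keeping memory usage steady.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true evicts the eldest entry.
        return size() > capacity;
    }
}
```

For concurrent access it would need to be wrapped with
`Collections.synchronizedMap` or guarded by a lock.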
On 2012-05-16 20:07:11 +0000, Mike Peters said:
Hi Yiming,
Cassandra is optimized for write-heavy environments.
If you have a read-heavy application, you shouldn't be running your
reads through Cassandra.
On the bright side - Cassandra read throughput will remain consistent,
regardless of your volume. But you are going to have to "wrap" your
reads with memcache (or redis), so that the bulk of your reads can be
served from memory.
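A sketch of that read path: check the in-memory cache first and only fall
through to the backing store on a miss. The `loader` here stands in for
the actual Cassandra read (and the map for memcache/redis); both are
assumptions for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside read path: serve from memory when possible, otherwise
// load from the backing store and populate the cache for next time.
public class ReadThrough<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. a Cassandra slice read

    public ReadThrough(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        // computeIfAbsent invokes the loader only on a cache miss
        return cache.computeIfAbsent(key, loader);
    }
}
```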
Thanks,
Mike Peters
On 5/16/2012 3:59 PM, Yiming Sun wrote:
Hello,
I asked this question as a follow-up in a different thread, but I figure
I should ask it here instead in case the other one gets buried; besides,
I now have a little more information.
"We find the lack of performance disturbing," as we are only able to get
about 3-4 MB/sec read performance out of Cassandra.
We are using Cassandra as the backend for an IR repository of digital
texts. It is a read-mostly repository with occasional writes. Each row
represents a book volume, and each column of a row represents a page of
the volume. Granted, the data size is small: the average size of a
column's text is 2-3 KB, and each row has about 250 columns (though this
varies quite a bit from one volume to another).
Currently we are running a 3-node cluster, which will soon be upgraded
to a 6-node setup. Each node is a VM with 4 cores and 16GB of memory.
All VMs use SAN as disk storage.
To retrieve a volume, we issue a slice query via Hector that specifies
the row key (the volume) and a list of column keys (the pages), with the
consistency level set to ONE. It is typical to retrieve multiple volumes
per request.
The read rate I have been seeing is about 3-4 MB/sec, and that is
reading the raw bytes; using the string serializer, the rate is even
lower, about 2.2 MB/sec.
The server log shows that ParNew GC pauses frequently exceed 200ms,
often in the range of 4-5 seconds, but nowhere near 15 seconds (which
would indicate the JVM heap is being swapped out).
We have not yet added JNA. From a blog post, it seems JNA can increase
performance by about 13%, but we are hoping for something more like
1300% (3-4 MB/sec is just disturbingly low). We are also hesitant to
disable swap entirely, since one of the nodes runs a couple of other
services.
Do you have any suggestions on how we may boost the performance? Thanks!
-- Y.