> I'm quite desperate about Cassandra's performance in our production
> cluster. We have 8 real-HW nodes, 32-core CPU, 32GB memory, 4 disks
> in RAID 10, Cassandra 0.8.8, RF=3 and Hadoop.
> We have four keyspaces; one is the large one, it has 2 CFs, one is a
> kind of index, the other holds data. There are about 7 million rows,
> mean row size is 7kB. We run several mapreduce tasks, most of them
> just read from Cassandra and write to HDFS, but one fetches rows
> from Cassandra, computes something and writes it back; for each row
> we compute three new JSON values, about 1kB each (they get
> overwritten next round).
>
> We get lots and lots of Timeout exceptions, LiveSSTableCount over
> 100. Repair doesn't finish even in 24 hours, and reading from the
> other keyspaces times out as well. We set
> compaction_throughput_mb_per_sec: 0 but it didn't help.

Exactly why you're seeing timeouts depends on quite a few factors. The
key observation, however, is that you have ~100 GB per node on nodes
with 32 GB of memory, *AND* you're running MapReduce jobs.

In general, I would expect that any performance problems you have are
probably due to cache misses and simply bottlenecking on disk I/O.
What to do about it depends very much on the situation and it's
difficult to give a concrete suggestion without more context.

Some things that might mitigate the effects:

- using the row cache for the hot data set (if you have a very small
  hot data set that should work well, since the row cache is
  unaffected by e.g. mapreduce jobs; see the sketch below);
- selecting a different compaction strategy (leveled can be better,
  depending);
- running map reduce against a separate DC that takes writes but is
  isolated from the live cluster that takes reads (unless you're only
  doing batch requests).

But those are just some random things thrown in the air; do not take
them as concrete suggestions for your particular case.

The key is understanding the access pattern and where the bottlenecks
are, in combination with how the database works, and then figuring
out what the most cost-effective solution is.
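To ground that a bit: the usual first step is just to watch the
cluster while the mapreduce jobs are running. Nothing below is
specific to your setup; these are the standard tools:

    nodetool -h localhost tpstats   # pending/blocked stages show what is saturated
    nodetool -h localhost cfstats   # per-CF read latency and live SSTable count
    iostat -x 5                     # device utilization and await on the data disks

If iostat shows the data disks pegged near 100% utilization with high
await while the timeouts are happening, that pretty much confirms the
disk I/O theory.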

Note that if you're bottlenecking on disk I/O, it's not surprising at
all that repairing ~ 100 gigs of data takes more than 24 hours.
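As a very rough back-of-the-envelope (the throughput figure is an
assumption, not a measurement): repair has to run validation
compactions over the data on the node and its replicas and then
stream and compact the differences. If the disks are already
saturated by the mapreduce jobs and repair effectively gets, say, 5
MB/s, just reading 100 GB once is on the order of 20,000 seconds, or
roughly 5-6 hours; with RF=3 that work happens on three replicas,
plus streaming and re-compaction of whatever gets transferred.
Blowing past 24 hours is easy under those conditions.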

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
