2010/9/15 Kamil Gorlo <kgs4...@gmail.com>

> Hey,
>
> we are considering using Cassandra for a quite large project, and
> because of that I made some tests with it. I was testing mainly
> performance and stability.
>
> My main tool was stress.py for benchmarks (or an equivalent written in
> C++ to work around python 2.5's lack of the multiprocessing module).
> I will focus only on reads (random with a normal distribution, which
> is the default in stress.py), because writes were /quite/ good.
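For reference, the "random with normal distribution" read pattern mentioned above can be sketched roughly like this (an approximation on my part; the exact mean/stddev parameters stress.py uses may differ):

```python
import random

def gaussian_key(total_keys, stdev_fraction=0.3):
    # Draw a key index from a normal distribution centred on the middle
    # of the keyspace and clamp it to the valid range -- roughly what
    # stress.py's default read pattern does (parameters are assumptions).
    mean = total_keys / 2.0
    stdev = total_keys * stdev_fraction
    k = int(random.gauss(mean, stdev))
    return min(max(k, 0), total_keys - 1)

keys = [gaussian_key(1000000) for _ in range(1000)]
print(all(0 <= k < 1000000 for k in keys))  # True
```

Keys near the middle of the range are requested far more often than keys at the edges, so a cache would still catch the hot centre, but much less than with a skewed Zipf-like workload.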
>
> I have 8 machines (Xen guests, each with a dedicated pair of 2TB SATA
> disks combined in RAID-0). Every machine has 4 individual cores at
> 2.4 GHz and 4GB RAM.
>
> Cassandra's commitlog and data dirs were on the same disk, I gave
> Cassandra a 2.5GB heap, and the key and row caches were disabled
> (standard Keyspace1 schema, all tests use the Standard1 CF). All
> other options were defaults. I disabled the caches because I was
> testing random (or semi-random - normal distribution) reads, so they
> wouldn't help much (and also because 4GB of RAM is not a lot).
>
> For the first test I installed Cassandra on only one machine, to test
> it and keep the results for later comparison with the large cluster
> and other DBs.
>
> 1) RF was set to 1. I inserted ~20GB of data (the number reported in
> the load column of nodetool ring output) using stress.py (100 columns
> per row). Then I tested reads and got 200 rows/second (reading 100
> columns per row, CL=ONE; the disks were the bottleneck at 100%
> utilization). There were no other operations pending during the reads
> (compaction, insertion, etc.).
>
> 2) So I moved to a bigger cluster, with 8 machines and RF set to 2. I
> inserted about ~20GB of data per node (so 20GB * 8 / 2 = 80GB of
> "real data"). Then I tested reads, exactly the same way as before,
> and got about 450 rows/second (reading 100 columns, though reading
> only 1 in fact makes no difference; CL=ONE; the disks on every
> machine were at 100% utilization because of the random reads).
>
> 3) Then I changed RF from 2 to 3 on the cluster described in 2), so I
> ended up with every node loaded with about 30GB of data. Then, as
> usual, I tested reads, and got only 300 rows/second from the whole
> cluster (100% utilization on every disk).
>
> 4) The last test was with RF=3 as before, but I inserted even more
> data, so every node in the 8-machine cluster had ~100GB of data (8 *
> 100GB / 3 = 266GB of real data). In this case I got only 125
> rows/second.
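One factor worth noting here (an assumption on my part, not something measured above): with a 2.5GB JVM heap on a 4GB box, only about 1.5GB is left for the OS page cache, so the cacheable fraction of each node's data shrinks sharply between tests 3) and 4), which would push more random reads to disk:

```python
# Rough sketch: fraction of each node's data the OS page cache could
# hold, assuming ~1.5GB stays free after the 2.5GB JVM heap (assumed).
page_cache_gb = 4.0 - 2.5
for data_gb, rows_per_s in [(30, 300), (100, 125)]:
    fraction = page_cache_gb / data_gb
    print("%dGB/node: ~%.1f%% cacheable, measured %d rows/s"
          % (data_gb, fraction * 100, rows_per_s))
```

The drop from 300 to 125 rows/second is at least directionally consistent with the cacheable fraction falling from roughly 5% to roughly 1.5% of the data.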
>
> I was using multiple processes and machines to test reads.
>
>
> *So my question is: why are these numbers so low? What is especially
> surprising for me is that changing RF from 2 to 3 drops performance
> from 450 to 300 reads per second. Is this because of read repair?*
>

Yes.
Even when reading at CL=ONE, the request is forwarded to all replicas
for read repair.
As disk access is your bottleneck, it sounds reasonable that 450 x 2 =
300 x 3 = 900 replica reads per second either way.
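Spelled out: with read repair on, every row read at CL=ONE turns into RF replica reads, so the disks serve a roughly constant total regardless of RF:

```python
# rows/s measured in tests 2) and 3), keyed by replication factor
measured = {2: 450, 3: 300}
for rf, rows_per_s in measured.items():
    # each CL=ONE read still touches all RF replicas for read repair,
    # so the cluster's disks serve rows/s * RF replica reads in total
    print("RF=%d: %d replica reads/s" % (rf, rows_per_s * rf))
```

Both configurations come out to the same 900 replica reads per second, i.e. the disks were saturated in both cases and raising RF just divided the same disk budget across more copies.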

>
>
> PS. To compare Cassandra's performance with other DBs, I also tested
> MySQL with almost the same data (one table with two columns, key (int
> PK) and value (VARCHAR(500)), simulating 100 Cassandra columns per
> row). MySQL was installed on the same machine as the Cassandra from
> test 1) (which is one of the 8 machines described before). I inserted
> some data and then tested random reads (which was even worse for
> caching, because I used the standard rand() from C++ to generate
> keys, not a normal distribution). Here are the results:
>
> size of data in db -> reads per second
> 21 GB  -> 340
> 400 GB -> 200
>
> So I got more reads from a single MySQL instance with 400GB of data
> than from 8 machines storing about 266GB. This doesn't look good.
> What am I doing wrong? :)
>
Disabling the row cache is ok, but the key cache should be enabled. It
uses little memory, but read performance will improve a lot.
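For example, in a 0.6-era storage-conf.xml the key cache is configured per column family. This is a hypothetical fragment; the `KeysCached` attribute name and value syntax should be checked against the version you are actually running:

```xml
<Keyspace Name="Keyspace1">
  <!-- KeysCached can be an absolute count or a percentage of keys;
       100% keeps the index position of every row key in memory
       (assumed syntax, verify against your Cassandra version). -->
  <ColumnFamily Name="Standard1" CompareWith="BytesType" KeysCached="100%"/>
</Keyspace>
```

The key cache only stores row-key-to-offset positions, so it avoids the index seek on each read without holding row data in the heap.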


Cheers,
> Kamil
>



-- 
Best Regards,
Chen Xinli
