In some sense, one-for-one performance "almost" does not matter.
Though I bet you can get Cassandra to do better (I remember old-school
YCSB white paper benchmarks against a sharded MySQL).
One of the main bullet points of Cassandra is that if you want to grow
from 4 nodes, to 8 nodes, to 14 nodes, and so on, Cassandra is
elastic and supports adding and removing nodes online. A
do-it-yourself "hash mod this" algorithm really has no upgrade path.
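To illustrate why a plain modulo scheme has no easy upgrade path, here is a minimal Python sketch. It counts how many keys would land on a different shard when the shard count changes under `key % n`. The function name and the key range are made up for illustration, and nothing here reflects how Cassandra itself places data.

```python
def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when resharding with key % n."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_shards != k % new_shards)
    return moved / len(keys)


if __name__ == "__main__":
    vehicle_ids = range(1000)  # hypothetical integer keys
    for old, new in [(2, 3), (4, 8), (8, 14)]:
        frac = moved_fraction(vehicle_ids, old, new)
        print(f"{old} -> {new} shards: {frac:.0%} of keys move")
```

Every key that "moves" is a row that has to be copied to its new shard before reads for it work again, which is the migration the do-it-yourself approach leaves to you.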
Edward
On Sun, Jan 22, 2012 at 9:26 AM, Chris Gerken
<chrisger...@mindspring.com>
wrote:
Howdy Gustavo,
One thing that jumped out at me is your having put two
cassandra images on the same box. There may be enough CPU
and memory for the two images combined but you may be seeing
some other resource not being shared so nicely - network card
bandwidth, for example.
More generally, the real question is what the bottleneck is
(for both db's, actually). Start with Cassandra running in
that configuration and start with one client thread sending
one request a second. Look at the CPU, network and memory
metrics for all boxes (including the client). Nothing should
be even close to maxing out at that throughput. Now
incrementally increase one of the test parameters (number of
clients or number of inserts per second) just a bit (say from
one transaction to 5) and note the above metrics. Keep
slowly increasing the test parameters, one at a time, until
one of the metrics maxes out. That's the bottleneck you're
wondering about. Fix that and the db (be it Cassandra or
MySQL) will move ahead of the other performance-wise. Turn
your attention to the other db and repeat.
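The procedure above can be sketched in Python as a scan over recorded load steps that stops at the first step where some metric saturates. This is only a sketch: the 90% threshold, the metric names, and the sample numbers are assumptions for illustration, not measurements from this thread.

```python
def find_bottleneck(samples, limit=0.9):
    """Return (load, metric) for the first sample where a metric reaches limit.

    samples: list of (load, {metric_name: utilization between 0 and 1}),
    ordered by increasing load. Returns None if nothing saturates.
    """
    for load, metrics in samples:
        saturated = [name for name, used in metrics.items() if used >= limit]
        if saturated:
            # Report the most heavily used metric at this load step.
            worst = max(saturated, key=lambda name: metrics[name])
            return load, worst
    return None


if __name__ == "__main__":
    # Made-up numbers; load is inserts per second requested by the client.
    samples = [
        (1, {"cpu": 0.02, "network": 0.01, "memory": 0.30}),
        (5, {"cpu": 0.05, "network": 0.04, "memory": 0.31}),
        (50, {"cpu": 0.40, "network": 0.35, "memory": 0.35}),
        (500, {"cpu": 0.70, "network": 0.95, "memory": 0.40}),
    ]
    print(find_bottleneck(samples))
```

Collecting one such sample list per box (each database host and the client) keeps the comparison honest, since the client itself can be the first thing to max out.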
- Chris Gerken
On Jan 22, 2012, at 7:10 AM, Gustavo Gustavo wrote:
Hello,
I've set up a testing environment for Cassandra and MySQL, to
compare both, regarding *performance only*. And I must admit
that I was expecting Cassandra to beat MySQL. But I've not
seen this happening up to now.
My application/use case is INSERT intensive, since I'm not
updating anything, just inserting all the time.
To compare both I created virtual machines with Ubuntu
11.10, and installed the latest versions of each datastore.
Each VM has 1GB of RAM. I've used VMs as a way to give both
datastores an equal sandbox.
MySQL is set up as sharded across 2 databases, which
means that records are inserted into a specific instance based
on key % 2. The engine is MyISAM (InnoDB was really slow and
not really needed for my case). There's a primary compound
key (integer and datetime columns) in this test table.
Let's name the "nodes" MySQL1 and MySQL2.
Cassandra is set up to work with 4 nodes, with keys (tokens)
set up to distribute records evenly across the 4 nodes
(nodetool ring reports 25% to each node), replication factor
1 and RandomPartitioner; the other settings are left at their
defaults. Let's name the nodes Cassandra1, Cassandra2,
Cassandra3 and Cassandra4.
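For reference, evenly spaced tokens for RandomPartitioner are usually obtained by dividing its token space (0 to 2**127) into equal slices. A small Python sketch of that arithmetic for a four-node ring follows; the function name is only illustrative, and which token goes to which named node is not stated in the post.

```python
def initial_tokens(node_count, ring_size=2**127):
    """Evenly spaced tokens over RandomPartitioner's 0..2**127 token space."""
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    for position, token in enumerate(initial_tokens(4)):
        print(f"ring position {position}: initial_token = {token}")
```

With tokens spaced like this, each node owns a quarter of the ring, which matches the 25% per node that nodetool ring reports.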
I'm using 2 physical machines (Windows7) to host the 4
(Cassandra) or 2 (MySQL) virtual machines, this way:
Machine1: MySQL1, Cassandra1, Cassandra3
Machine2: MySQL2, Cassandra2, Cassandra4
The machines have enough CPU and RAM to host either the Cassandra
cluster or the MySQL "cluster", one at a time.
The client test application is running on a third physical
machine, with 8 threads doing inserts. The test application
is written in C# (Windows7) using Aquiles high-level client.
My use case is a vehicle tracking system. So, let's suppose,
from minute to minute, the vehicle sends its position
together with some other GPS data and vehicle status
information. The columns in my Cassandra cluster are just
the DateTime (long value) of a position for a specific
vehicle, and the value is all the other data serialized to
binary format. Therefore, my CF really grows in the number of
columns. So all data is inserted into only one CF/table, named
Positions. The key in Cassandra is the VehicleID and in
MySQL it is VehicleID + PositionDateTime (MySQL creates an index
on this automatically). It's important to note that MySQL threw
tons of connection exceptions; even so, each insert was
retried until it got through to MySQL.
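The layout described above can be pictured as one wide row per vehicle, keyed by VehicleID, with each column named by the position timestamp and holding a binary blob. Here is a rough Python sketch that uses plain dictionaries rather than any real client. The field set (latitude, longitude, speed), the `struct` layout, and the millisecond timestamp are assumptions, since the post does not say what is serialized or how.

```python
import struct
from datetime import datetime, timezone

# Hypothetical payload layout: latitude and longitude (doubles), speed (float).
POSITION_FORMAT = ">ddf"


def column_for(when, lat, lon, speed):
    """Build (column_name, column_value) for one position report."""
    name = int(when.timestamp() * 1000)  # DateTime as a long (ms since epoch)
    value = struct.pack(POSITION_FORMAT, lat, lon, speed)
    return name, value


def insert_position(store, vehicle_id, when, lat, lon, speed):
    """Add one column to the vehicle's row in an in-memory stand-in store."""
    name, value = column_for(when, lat, lon, speed)
    store.setdefault(vehicle_id, {})[name] = value


if __name__ == "__main__":
    positions = {}  # stand-in for the Positions column family
    when = datetime(2012, 1, 21, 11, 45, tzinfo=timezone.utc)
    insert_position(positions, 42, when, -23.55, -46.63, 60.0)
    print(len(positions[42]), "column(s) stored for vehicle 42")
```

In this picture every new position adds a column to an existing row, whereas the MySQL table gets a new row under the compound (VehicleID, PositionDateTime) key.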
My test case was to insert 1k positions for 1k vehicles for each
of 10 days - which gives 10.000.000 inserts.
The final throughput that my application had for this
scenario was:
Cassandra x 4
2012-01-21 11:45:38,044 #6 [Logger.Log] INFO - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
2012-01-21 11:45:38,082 #6 [Logger.Log] INFO - >> Total Time: 2:37:03,359
2012-01-21 11:45:38,085 #6 [Logger.Log] INFO - >> Throughput: 1061 inserts/s
And for MySQL x 2
2012-01-21 14:26:25,197 #6 [Logger.Log] INFO - >> Inserted 10000 positions for 1000 vehicles (10000000 inserts):
2012-01-21 14:26:25,250 #6 [Logger.Log] INFO - >> Total Time: 2:06:25,914
2012-01-21 14:26:25,263 #6 [Logger.Log] INFO - >> Throughput: 1318 inserts/s
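As a quick sanity check, the reported throughput figures follow directly from the insert count and the total times in the log lines above. A short Python sketch of that arithmetic follows; the per-node figure is an extra derived number, not something the test application logged.

```python
from datetime import timedelta

TOTAL_INSERTS = 10_000_000


def inserts_per_second(total_inserts, elapsed):
    """Average throughput over the whole run."""
    return total_inserts / elapsed.total_seconds()


# Total times copied from the log output.
cassandra = timedelta(hours=2, minutes=37, seconds=3, milliseconds=359)
mysql = timedelta(hours=2, minutes=6, seconds=25, milliseconds=914)

if __name__ == "__main__":
    for name, elapsed, nodes in [("Cassandra", cassandra, 4), ("MySQL", mysql, 2)]:
        rate = inserts_per_second(TOTAL_INSERTS, elapsed)
        print(f"{name}: {rate:.0f} inserts/s total, {rate / nodes:.0f} per node")
```

The totals differ by roughly 24% in MySQL's favor. Both setups share the same two physical hosts, so the per-node numbers say more about how the hosts were divided than about either database.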
Is there something that I'm missing here? Is this expected?
Or is the problem somewhere else, and that's hard to say
from this description?
Cheers,
Gustavo