Sure, I've tried different batch sizes and thread counts, but
generally I'm running 10-30 threads at a time on the client, each
sending a batch of 100 insert statements per call. I build each batch
with the QueryBuilder.batch() API from the latest DataStax Java driver,
then call the synchronous Session.execute() on the Batch.
I can't post my code, but my client does this on each iteration (a
rough sketch of the pattern follows the list):
-- divides the set of inserts up among the threads
-- stores the current time
-- tells all the threads to send their inserts
-- when they've all returned, checks the elapsed time
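
The shape of it is roughly this. It's only a sketch, not my real code:
assume the 1.x driver, and the keyspace, table, and column names, the
DataRow class, and the runIteration() helper are all placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.datastax.driver.core.Session;
import com.datastax.driver.core.querybuilder.Batch;
import com.datastax.driver.core.querybuilder.Insert;
import com.datastax.driver.core.querybuilder.QueryBuilder;

// Placeholder for one row of time-series data.
class DataRow {
    String a, b, day;
    UUID ts;            // timeuuid clustering value
    String x, y, z;
}

// Each worker thread builds one batch of ~100 inserts and executes it
// synchronously.
class InsertWorker implements Runnable {
    private final Session session;
    private final List<DataRow> rows;   // the ~100 rows assigned to this thread

    InsertWorker(Session session, List<DataRow> rows) {
        this.session = session;
        this.rows = rows;
    }

    public void run() {
        Batch batch = QueryBuilder.batch();
        for (DataRow r : rows) {
            Insert insert = QueryBuilder.insertInto("myks", "events")
                    .value("a", r.a)
                    .value("b", r.b)
                    .value("day", r.day)
                    .value("ts", r.ts)
                    .value("x", r.x)
                    .value("y", r.y)
                    .value("z", r.z);
            batch.add(insert);
        }
        session.execute(batch);   // synchronous: blocks until the batch returns
    }
}

class InsertDriver {
    // One iteration: split the rows across the threads, submit them all,
    // wait for every thread to return, and report the elapsed time.
    static long runIteration(Session session, List<DataRow> allRows, int numThreads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        int chunkSize = (allRows.size() + numThreads - 1) / numThreads;

        long start = System.currentTimeMillis();
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (int i = 0; i < allRows.size(); i += chunkSize) {
            List<DataRow> chunk =
                    allRows.subList(i, Math.min(i + chunkSize, allRows.size()));
            futures.add(pool.submit(new InsertWorker(session, chunk)));
        }
        for (Future<?> f : futures) {
            f.get();   // wait for this thread's batch to complete
        }
        long elapsedMs = System.currentTimeMillis() - start;

        pool.shutdown();
        return elapsedMs;
    }
}
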
At about 2000 rows per iteration, 20 threads with 100 inserts each
finish in about 1 second. For 4000 rows, 40 threads with 100 inserts
each finish in about 1.5-2 seconds, and as I said, all 3 cassandra
nodes are under heavy CPU load while the client is hardly loaded. I've
also tried 10 threads with more inserts per batch, and up to 60 threads
with fewer; it doesn't seem to make much difference.
On 08/19/2013 05:00 PM, Nate McCall wrote:
How big are the batch sizes? In other words, how many rows are you
sending per insert operation?
Other than the above, not much else to suggest without seeing some
example code (on pastebin, gist or similar, ideally).
On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman <8fo...@gmail.com> wrote:
I've got a 3-node cassandra cluster (16GB/4-core VMs, ESXi v5, on
2.5GHz machines not shared with any other VMs). I'm inserting
time-series data into a single column-family using "wide rows"
(timeuuids), with a 3-part partition key, so my primary key is
something like ((a, b, day), in-time-uuid, x, y, z).
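
To make that concrete, the table is shaped roughly like this (the
keyspace, table, and column names are placeholders, and I'm only
issuing the DDL through the driver here for illustration):

// Hypothetical DDL matching the layout described above; myks/events and
// the column names stand in for my real schema.
session.execute(
    "CREATE TABLE myks.events (" +
    "    a text, b text, day text," +
    "    ts timeuuid," +                             // the in-time-uuid
    "    x text, y text, z text," +
    "    PRIMARY KEY ((a, b, day), ts, x, y, z)" +   // 3-part partition key, wide rows
    ")");
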
My java client is feeding rows (about 1KB of raw data each) in
batches using multiple threads, and the fastest I can get it to run
reliably is about 2000 rows/second. Even at that speed, all 3
cassandra nodes are very CPU-bound, with loads of 6-9 each (and
the client machine is hardly breaking a sweat). I've tried
turning off compression on my table, which reduced the loads
slightly but not much. There are no other updates or reads
occurring, except from DataStax OpsCenter.
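
(For reference, turning compression off was just a table-level change,
roughly the following with 1.2-era CQL syntax; the table name is a
placeholder.)

// Disable SSTable compression on the table (Cassandra 1.2/2.0-style syntax).
session.execute(
    "ALTER TABLE myks.events WITH compression = { 'sstable_compression' : '' }");
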
I was expecting to be able to insert at least 10k rows/second with
this configuration, and after a lot of reading of docs, blogs, and
Google results, I can't figure out what's slowing my client down.
When I push the insert rate beyond 2000/second, the server responses
are just too slow and the client falls behind. I had a single-node
MySQL database that could handle 10k of these rows/second, so I
really feel like I'm missing something in Cassandra. Any ideas?