Thrift allows larger, free-form batch construction; the gain comes from doing a lot more work in a single payload message. Otherwise CQL is more efficient.
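To illustrate, a batch_mutate call packs many rows and columns into one payload, roughly like this (a minimal sketch of the raw Thrift API; connection setup and keyspace selection are omitted, and the "events" column family and column names are hypothetical):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;

    public class ThriftBatchSketch {

        // Pack many rows, each with many columns, into a single batch_mutate payload.
        static void writeBatch(Cassandra.Client client, List<ByteBuffer> rowKeys) throws Exception {
            Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap = new HashMap<>();
            long ts = System.currentTimeMillis() * 1000; // microsecond timestamps
            for (ByteBuffer key : rowKeys) {
                List<Mutation> mutations = new ArrayList<>();
                for (int i = 0; i < 40; i++) { // ~40 columns per row, as in this thread
                    Column col = new Column();
                    col.setName(ByteBuffer.wrap(("col" + i).getBytes("UTF-8")));
                    col.setValue(ByteBuffer.wrap(("val" + i).getBytes("UTF-8")));
                    col.setTimestamp(ts);
                    ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                    cosc.setColumn(col);
                    Mutation m = new Mutation();
                    m.setColumn_or_supercolumn(cosc);
                    mutations.add(m);
                }
                Map<String, List<Mutation>> byColumnFamily = new HashMap<>();
                byColumnFamily.put("events", mutations); // hypothetical CF name
                mutationMap.put(key, byColumnFamily);
            }
            // One round trip carries every row and column in the map.
            client.batch_mutate(mutationMap, ConsistencyLevel.ONE);
        }
    }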
If you do build those giant strings, yes, you should see a performance improvement.

On Tue, Aug 20, 2013 at 8:03 PM, Keith Freeman <8fo...@gmail.com> wrote:

> Thanks. Can you tell me why using thrift would improve performance?
>
> Also, if I do try to build those giant strings for a prepared batch statement, should I expect another performance improvement?
>
>
> On 08/20/2013 05:06 PM, Nate McCall wrote:
>
> Ugh - sorry, I knew Sylvain and Michaël had worked on this recently, but it is only in 2.0 - I could have sworn it got marked for inclusion back into 1.2, but I was wrong: https://issues.apache.org/jira/browse/CASSANDRA-4693
>
> This is indeed an issue if you don't know the column count beforehand (or have a very large number of them, as in your case). Again, apologies; I would not have recommended that route had I known it was only in 2.0.
>
> I would be willing to bet you could hit those insert numbers pretty easily with thrift, given the shape of your mutation.
>
>
> On Tue, Aug 20, 2013 at 5:00 PM, Keith Freeman <8fo...@gmail.com> wrote:
>
>> So I tried inserting prepared statements separately (no batch), and my server nodes' load definitely dropped significantly. Throughput from my client improved a bit, but only a few percent. I was able to *almost* get 5000 rows/sec (sort of) by also reducing the rows per insert thread to 20-50 and eliminating all overhead from the timing, i.e. timing only the tight for loop of inserts. But that's still a lot slower than I expected.
>>
>> I couldn't do batches because the driver doesn't allow prepared statements in a batch (QueryBuilder API). It appears the batch itself could possibly be a prepared statement, but since I have 40+ columns on each insert, that would take some ugly code to build, so I haven't tried it yet.
>>
>> I'm using CL "ONE" on the inserts and RF 2 in my schema.
>>
>>
>> On 08/20/2013 08:04 AM, Nate McCall wrote:
>>
>> John makes a good point re: prepared statements (I'd increase batch sizes again once you do this as well - separate, incremental runs, of course, so you can gauge the effect of each). That should take out some of the processing overhead of statement validation on the server (some - that load spike still seems high, though).
>>
>> I'd actually be really interested in what your results are after doing so - I've not tried any A/B testing here for prepared statements on inserts.
>>
>> Given that your load is on the server, I'm not sure adding more async indirection on the client would buy you too much, though.
>>
>> Also, at what RF and consistency level are you writing?
>>
>>
>> On Tue, Aug 20, 2013 at 8:56 AM, Keith Freeman <8fo...@gmail.com> wrote:
>>
>>> Ok, I'll try prepared statements. But while sending my statements async might speed up my client, it wouldn't improve throughput on the cassandra nodes, would it? They're running at pretty high loads and only about 10% idle, so my concern is that they can't handle the data any faster, i.e. that something's wrong on the server side. I don't really think there's anything on the client side that matters for this problem.
>>>
>>> Of course I know there are obvious h/w things I can do to improve server performance: SSDs, more RAM, more cores, etc. But I thought the servers I have would be able to handle more rows/sec than, say, MySQL, since write speed is supposed to be one of Cassandra's strengths.
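For reference, the "giant string" approach discussed above would look something like this: prepare one statement whose text is a BATCH of N identical INSERTs, then bind all the values at once (a minimal sketch; the ks.events table and its seven columns are hypothetical stand-ins for the real 40+ column schema):

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class PreparedBatchSketch {

        static final int ROWS_PER_BATCH = 100;
        static final int COLS_PER_ROW = 7; // stand-in for the 40+ real columns

        // Build one prepared statement whose text is a BATCH of N identical INSERTs.
        static PreparedStatement prepareBatch(Session session) {
            StringBuilder cql = new StringBuilder("BEGIN BATCH\n");
            for (int i = 0; i < ROWS_PER_BATCH; i++) {
                cql.append("INSERT INTO ks.events (a, b, day, ts, x, y, z)")
                   .append(" VALUES (?, ?, ?, ?, ?, ?, ?);\n");
            }
            cql.append("APPLY BATCH;");
            return session.prepare(cql.toString()); // prepared once, at startup
        }

        // flatValues must hold ROWS_PER_BATCH * COLS_PER_ROW values, in placeholder order.
        static void insertBatch(Session session, PreparedStatement ps, Object[] flatValues) {
            BoundStatement bound = ps.bind(flatValues);
            session.execute(bound);
        }
    }

Ugly, yes, but the server parses and validates the statement once at prepare time instead of on every batch.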
>>>
>>>
>>> On 08/19/2013 09:03 PM, John Sanda wrote:
>>>
>>> I'd suggest using prepared statements that you initialize at application start-up, and switching to Session.executeAsync coupled with the Google Guava Futures API to get better throughput on the client side.
>>>
>>>
>>> On Mon, Aug 19, 2013 at 10:14 PM, Keith Freeman <8fo...@gmail.com> wrote:
>>>
>>>> Sure, I've tried different numbers of batches and threads, but generally I'm running 10-30 threads at a time on the client, each sending a batch of 100 insert statements in every call, using the QueryBuilder.batch() API from the latest DataStax java driver, then calling the Session.execute() function (synchronous) on the batch.
>>>>
>>>> I can't post my code, but my client does this on each iteration:
>>>> -- divides up the set of inserts by the number of threads
>>>> -- stores the current time
>>>> -- tells all the threads to send their inserts
>>>> -- when they've all returned, checks the elapsed time
>>>>
>>>> At about 2000 rows per iteration, 20 threads with 100 inserts each finish in about 1 second. For 4000 rows, 40 threads with 100 inserts each finish in about 1.5-2 seconds, and as I said, all 3 cassandra nodes have a heavy CPU load while the client is hardly loaded. I've tried 10 threads with more inserts per batch, and up to 60 threads with fewer; it doesn't seem to make a lot of difference.
>>>>
>>>>
>>>> On 08/19/2013 05:00 PM, Nate McCall wrote:
>>>>
>>>> How big are the batch sizes? In other words, how many rows are you sending per insert operation?
>>>>
>>>> Other than the above, not much else to suggest without seeing some example code (on pastebin, gist, or similar, ideally).
>>>>
>>>> On Mon, Aug 19, 2013 at 5:49 PM, Keith Freeman <8fo...@gmail.com> wrote:
>>>>
>>>>> I've got a 3-node cassandra cluster (16G/4-core VMs, ESXi v5, on 2.5GHz machines not shared with any other VMs). I'm inserting time-series data into a single column family using "wide rows" (timeuuids) and a 3-part partition key, so my primary key is something like ((a, b, day), in-time-uuid), with columns x, y, z.
>>>>>
>>>>> My java client is feeding rows (about 1k of raw data each) in batches using multiple threads, and the fastest I can get it to run reliably is about 2000 rows/second. Even at that speed, all 3 cassandra nodes are very CPU-bound, with loads of 6-9 each (while the client machine is hardly breaking a sweat). I've tried turning off compression in my table, which reduced the loads slightly but not much. There are no other updates or reads occurring, except the DataStax OpsCenter.
>>>>>
>>>>> I was expecting to be able to insert at least 10k rows/second with this configuration, and after a lot of reading of docs, blogs, and google, I can't really figure out what's slowing my client down. When I increase the insert speed of my client beyond 2000/second, the server responses are just too slow and the client falls behind. I had a single-node MySQL database that could handle 10k of these data rows/second, so I really feel like I'm missing something in Cassandra. Any ideas?
>>>
>>> --
>>>
>>> - John
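For what it's worth, John's suggestion in code form looks roughly like this (a sketch against the DataStax Java driver plus Guava; the Semaphore and its limit of 128 are one illustrative way to cap in-flight requests, not part of either API):

    import java.util.concurrent.Semaphore;

    import com.datastax.driver.core.BoundStatement;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;

    public class AsyncInsertSketch {

        // Cap in-flight requests so the client can't queue work faster than
        // the cluster drains it; 128 is a starting guess to tune.
        private final Semaphore inFlight = new Semaphore(128);

        void insertAsync(Session session, BoundStatement statement) throws InterruptedException {
            inFlight.acquire();
            ResultSetFuture future = session.executeAsync(statement);
            Futures.addCallback(future, new FutureCallback<ResultSet>() {
                @Override
                public void onSuccess(ResultSet rows) {
                    inFlight.release();
                }
                @Override
                public void onFailure(Throwable t) {
                    inFlight.release(); // real code should log and/or retry here
                }
            });
        }
    }

Without some backpressure like the semaphore, an async client will happily queue requests faster than the cluster can absorb them, which just moves the bottleneck instead of removing it.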