On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
> Well, I was using a large number of clients: I tried configuring a hector
> pool of 20-200 to see what effect that had on throughput. There's definitely
> a point after which there's no gain, so I dialed it back down. To clarify a
> few other things: when I say inserts I mean increments, as this test only
> involves counters. When I say a row is full, I mean I've written all of the
> data into it that belongs in it from our application's perspective. Not all
> rows are the same length, and the average number of subcolumns per
> supercolumn is higher for some rows than for others.
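(For reference, the kind of client-side loop being described above is roughly the following. This is only a sketch assuming Hector 0.8's counter API, Mutator#incrementCounter; the keyspace/CF names, host, and pool size are placeholders rather than values from this thread, and the real test writes counters inside supercolumns rather than a plain counter CF.)

    // Rough sketch of the client-side load described above, assuming Hector 0.8's
    // counter API (Mutator#incrementCounter). Names, host, and pool size are
    // placeholders; the actual test uses counters inside supercolumns.
    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class CounterLoad {
        public static void main(String[] args) {
            CassandraHostConfigurator hosts = new CassandraHostConfigurator("10.0.0.54:9160");
            hosts.setMaxActive(20); // the "hector pool" size being tuned (20-200 in the test)
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster", hosts);
            Keyspace ks = HFactory.createKeyspace("Stats", cluster);
            Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());

            // Fill one small set of rows completely before moving on to the next,
            // as described; all increments for a given row go to the same replica
            // set, which is why one box bounds the throughput.
            for (int row = 0; row < 100; row++) {
                for (int col = 0; col < 10000; col++) {
                    mutator.incrementCounter("row-" + row, "Counts", "col-" + col, 1L);
                }
            }
        }
    }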
Oh yes, you're using supercolumns. That could magnify the problem a bit, given
that to read any subcolumn of a supercolumn, we have to read the whole
supercolumn. It depends on how many counters per supercolumn we're talking
about.

> I'm curious, is it necessary for a read to happen for replication when
> using RF=1?

No, it's not necessary with RF=1 ... but we still do it :(. Ok, it's basically
a bug, we'll fix that for 0.8.2.

> I am currently running a test against a single node cluster, RF=1, CL.ONE,
> 4 connections/concurrent increments, and graphing various data points. If
> something interesting happens, I'll send a graph.
>
> When you say stagger writes, do you mean just lower the overall number of
> increments/sec my client is attempting,

Yes, this is what I mean.

> or do you mean I should stagger the load amongst many CFs/rows at once?
> Right now I'm writing heavily into a small set of rows, each going into
> 3 CFs, before moving on to the next small set of rows.
>
>
> On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:
>
>> When you do a counter increment at CL.ONE, the write is acknowledged as
>> soon as the first replica getting the write has pushed the increment into
>> its memtable. However, there is a read happening for the replication to
>> the other replicas (this is necessary for the counter design). What is
>> happening is that our nodes are not able to do those reads fast enough.
>> Or, another way to see it is that you are writing faster than your
>> cluster can handle.
>>
>> I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
>> something about that, but for now you'll have to stagger your writes.
>> I'm more surprised that tasks on the replicate on write stage would stack
>> up at QUORUM, unless you're using a very large number of clients.
>>
>> --
>> Sylvain
>>
>> On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne <dha...@gmx.3crowd.com>
>> wrote:
>>> This is now my fourth attempt to get the message through. Apologies if
>>> you see multiple copies.
>>>
>>> I've tried to give as much relevant data as I can think of, but please
>>> let me know if you need any other info. I spent the day getting jmxtrans
>>> to talk to statsd with the cassandra JMX data, so I can provide graphs
>>> of just about anything you want. I can also re-run the tests to
>>> replicate the problem if need be. As always, thanks for taking a look!
>>>
>>> I have a 4-host test cluster running 0.8.1 that I'm writing counters
>>> into with hector at CL.ONE or CL.QUORUM (the test was run with each, and
>>> both gave the same results). Partitioner is the RandomPartitioner, and
>>> RF=3. nodetool ring shows that the actual data is well balanced:
>>>
>>> Address    DC           Rack   Status  State   Load     Owns    Token
>>>                                                                  127605887595351923798765477786913079296
>>> 10.0.0.57  datacenter1  rack1  Up      Normal  1.59 GB  25.00%  0
>>> 10.0.0.56  datacenter1  rack1  Up      Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
>>> 10.0.0.55  datacenter1  rack1  Up      Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
>>> 10.0.0.54  datacenter1  rack1  Up      Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>>>
>>> I also tested against a 1-box test cluster with RF=1.
>>>
>>> Both show the same results: high performance for a while, and then
>>> ReplicateOnWrite backs way up (I've seen it as high as a million), the
>>> cassandra process becomes unresponsive, and the hector client starts
>>> throwing exceptions.
>>> High performance in this case equates to about 5-20k inserts/sec on both
>>> the 4-box and the 1-box cluster (I'm inserting one long row at a time, so
>>> the performance is bounded by the one box owning the row). The cassandra
>>> process never seems to recover, even if left for 24 hours. It's still
>>> chewing through those ReplicateOnWrite pending tasks.
>>>
>>> GC doesn't seem to be a factor. Logs show GCs completing in < 1s on all
>>> 4 boxes in the test cluster, as well as on the 1 standalone server.
>>>
>>> I'm also seeing highly erratic performance from the cluster in general:
>>> inserts/sec usually start at around 3k, but then maybe creep up towards
>>> 6k, then drop back down to below 1k for an extended period of time, then
>>> maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't
>>> last more than 30 seconds.
>>>
>>> Here's the tpstats output from all 4 boxes:
>>>
>>> Slowest/most problematic/highest cpu usage one:
>>>
>>> This tpstats output is also representative of the one-box cluster about
>>> 20 minutes into my insert run.
>>>
>>> 4 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0    59095840
>>> MutationStage                 0         0    31182942
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage        32    928042    28246716
>>> GossipStage                   1         1      449464
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0         445
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0         445
>>> FILEUTILS-DELETE-POOL         0         0          38
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
>>>
>>> Second highest cpu usage one:
>>>
>>> 3 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0    27413910
>>> MutationStage                 0         0    27523094
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage         0         0    11111971
>>> GossipStage                   0         0      335422
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0          66
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0          66
>>> FILEUTILS-DELETE-POOL         0         0          31
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
>>>
>>> The other 2 that see minimal cpu usage:
>>>
>>> 2 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0     9213999
>>> MutationStage                 0         0    33233114
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage         0         0           0
>>> GossipStage                   0         0      336016
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0          77
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0          77
>>> FILEUTILS-DELETE-POOL         0         0          40
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
>>>
>>> 1 $ ./nodetool -h localhost tpstats
>>> Pool Name                Active   Pending   Completed
>>> ReadStage                     0         0           0
>>> RequestResponseStage          0         0    13796875
>>> MutationStage                 0         0    36116662
>>> ReadRepairStage               0         0           0
>>> ReplicateOnWriteStage         0         0           0
>>> GossipStage                   0         0      336496
>>> AntiEntropyStage              0         0           0
>>> MigrationStage                0         0           4
>>> MemtablePostFlusher           0         0          91
>>> StreamStage                   0         0           0
>>> FlushWriter                   0         0          91
>>> FILEUTILS-DELETE-POOL         0         0          43
>>> MiscStage                     0         0           0
>>> FlushSorter                   0         0           0
>>> InternalResponseStage         0         0           4
>>> HintedHandoff                 0         0           0
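To stagger the writes on the client side, the simplest thing is to cap the number of increments/sec instead of pushing as fast as the connection pool allows. A minimal sketch of such a pacer (plain Java; the target rate and the doIncrement() call in the usage comment are placeholders for whatever your client actually does):

    // Minimal fixed-rate pacer to "stagger" the writes: cap increments/sec on the
    // client instead of writing as fast as the pool allows. The target rate and
    // doIncrement() below are placeholders.
    public class Throttle {
        private final long nanosPerOp;
        private long next = System.nanoTime();

        public Throttle(int opsPerSecond) {
            this.nanosPerOp = 1000000000L / opsPerSecond;
        }

        public synchronized void acquire() throws InterruptedException {
            long now = System.nanoTime();
            if (next > now) {
                Thread.sleep((next - now) / 1000000L); // wait until our slot comes up
            }
            next = Math.max(now, next) + nanosPerOp;   // schedule the next slot
        }
    }

    // Usage (one throttle shared by all writer threads):
    //   Throttle throttle = new Throttle(2000);       // e.g. cap at 2k increments/sec
    //   while (moreWork()) { throttle.acquire(); doIncrement(); }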
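And since you already have jmxtrans talking to statsd: the numbers tpstats prints come from the per-stage thread pool MBeans, so you can graph the ReplicateOnWrite backlog directly. A quick polling sketch; I believe the MBean on 0.8 is org.apache.cassandra.request:type=ReplicateOnWriteStage and the default JMX port is 7199, but verify both in jconsole first:

    // Quick JMX poll of the ReplicateOnWrite backlog (same numbers as tpstats).
    // The MBean name and port 7199 are what I believe 0.8 uses; check in jconsole.
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RowPending {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName row = new ObjectName("org.apache.cassandra.request:type=ReplicateOnWriteStage");
            while (true) {
                System.out.println("ReplicateOnWrite pending = " + mbs.getAttribute(row, "PendingTasks")
                    + ", active = " + mbs.getAttribute(row, "ActiveCount"));
                Thread.sleep(5000);
            }
        }
    }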