Re: ReplicateOnWrite issues

Sylvain Lebresne Tue, 12 Jul 2011 12:26:40 -0700

When you do a counter increment at CL.ONE, the write is acknowledged as
soon as the first replica receiving the write has pushed the
increment into its memtable. However, there is still a read happening
for the replication to the other replicas (this is necessary for the
counter design). What is happening is that your nodes are not able to
do those reads fast enough. Or, another way to see it is that you are
writing faster than your cluster can handle.
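
As a rough illustration of why the pending count keeps climbing, here is
a toy queue model. It is not Cassandra code, and the function name and
the rates used are made up for the example. It only assumes that each
acknowledged increment enqueues one replication read and that the stage
can complete a fixed number of those reads per second.

```python
def pending_backlog(write_rate, read_rate, seconds):
    """Return the pending task count at the end of each second.

    write_rate: increments acknowledged per second (hypothetical number)
    read_rate:  replication reads the node can complete per second
                (hypothetical number)
    """
    pending = 0
    history = []
    for _ in range(seconds):
        # Each acknowledged increment enqueues one replication read.
        pending += write_rate
        # The stage drains at most read_rate tasks in that second.
        pending -= min(pending, read_rate)
        history.append(pending)
    return history


if __name__ == "__main__":
    # Writes outpace reads: the backlog grows without bound.
    print(pending_backlog(5000, 4000, 3))
    # Reads keep up: the backlog stays empty.
    print(pending_backlog(3000, 4000, 3))
```

In this model the backlog only stops growing once the write rate drops to
or below the rate at which the reads complete.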


I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
something about that, but for now you'll have to stagger your writes.
I'm more surprised that tasks on the replicate on write stage would
stack up at QUORUM, unless you're using a very large number of
clients.
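
One way to stagger writes on the client side is to pace them at a fixed
maximum rate. The sketch below is only an assumption about how that could
look: WritePacer is a hypothetical helper, not part of hector or
Cassandra, and the right rate has to be found by experiment.

```python
import time


class WritePacer:
    """Spaces out calls so that at most max_per_sec proceed each second.

    Hypothetical helper for illustration; not a hector or Cassandra API.
    """

    def __init__(self, max_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_slot = clock()

    def wait(self):
        """Block until the next write is allowed to go out."""
        now = self.clock()
        if now < self.next_slot:
            # Too early: sleep until the reserved slot.
            self.sleep(self.next_slot - now)
            now = self.next_slot
        # Reserve the following slot. A caller that fell behind does not
        # get to burst to catch up.
        self.next_slot = max(now, self.next_slot) + self.interval
```

Calling pacer.wait() immediately before each counter increment keeps a
single client at or below the chosen rate. With many clients, the total
rate is the sum over all of them.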

--
Sylvain

On Tue, Jul 12, 2011 at 8:57 AM, David Hawthorne <dha...@gmx.3crowd.com> wrote:
> This is now my fourth attempt to get the message through.  Apologies if you 
> see multiple copies.
>
> I've tried to give as much relevant data as I can think of, but please let me 
> know if you need any other info.  I spent the day getting jmxtrans to talk to 
> statsd with the cassandra JMX data, so I can provide graphs of just about 
> anything you want.  I can also re-run the tests to replicate the problem if 
> need be.  As always, thanks for taking a look!
>
> I have a 4 host test cluster that I'm writing counters into running 0.8.1, 
> and I'm writing to it with hector with CL.ONE or CL.QUORUM (test was run with 
> each and both gave the same results).  Partitioner is the RandomPartitioner, 
> and RF=3.  nodetool ring shows that actual data is well balanced:
>
> Address      DC          Rack   Status State   Load     Owns    Token
>                                                                 127605887595351923798765477786913079296
> 10.0.0.57    datacenter1 rack1  Up     Normal  1.59 GB  25.00%  0
> 10.0.0.56    datacenter1 rack1  Up     Normal  1.39 GB  25.00%  42535295865117307932921825928971026432
> 10.0.0.55    datacenter1 rack1  Up     Normal  1.24 GB  25.00%  85070591730234615865843651857942052864
> 10.0.0.54    datacenter1 rack1  Up     Normal  1.34 GB  25.00%  127605887595351923798765477786913079296
>
> I also tested against a 1-box test cluster with RF=1.
>
> Both show the same results:  high performance for a while, and then 
> ReplicateOnWrite backs way up (I've seen it as high as a million), the 
> cassandra process becomes unresponsive, and the hector client starts throwing 
> exceptions.  High performance in this case equates to about 5-20k inserts/sec 
> on both the 4-box and 1-box cluster (I'm inserting one long row at a time, so 
> the performance is bounded by the one box owning the row).  The cassandra 
> process never seems to recover, even if left for 24 hours.  It's still 
> chewing through those ReplicateOnWrite pending tasks.
>
> GC doesn't seem to be a factor.  Logs show GCs completing in < 1s on all 4 
> boxes in the test cluster, as well as on the 1 standalone server.
>
> I'm also seeing highly erratic performance from the cluster in general:  
> inserts/sec usually start at around 3k, but then maybe creep up towards 6k, 
> then drop back down to below 1k for an extended period of time, then maybe 1% 
> of the time they'll spike up to 8k or even 15k, but that doesn't last more 
> than 30 seconds.
>
> Here's the tpstats output from all 4 boxes:
>
> Slowest/most problematic/highest cpu usage one:
>
> This tpstats output is also representative of the one box cluster about 20 
> minutes into my insert run.
>
> 4 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending      Completed
> ReadStage                         0         0              0
> RequestResponseStage              0         0       59095840
> MutationStage                     0         0       31182942
> ReadRepairStage                   0         0              0
> ReplicateOnWriteStage            32    928042       28246716
> GossipStage                       1         1         449464
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0              4
> MemtablePostFlusher               0         0            445
> StreamStage                       0         0              0
> FlushWriter                       0         0            445
> FILEUTILS-DELETE-POOL             0         0             38
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              4
> HintedHandoff                     0         0              0
>
> Second highest cpu usage one:
>
> 3 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending      Completed
> ReadStage                         0         0              0
> RequestResponseStage              0         0       27413910
> MutationStage                     0         0       27523094
> ReadRepairStage                   0         0              0
> ReplicateOnWriteStage             0         0       11111971
> GossipStage                       0         0         335422
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0              4
> MemtablePostFlusher               0         0             66
> StreamStage                       0         0              0
> FlushWriter                       0         0             66
> FILEUTILS-DELETE-POOL             0         0             31
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              4
> HintedHandoff                     0         0              0
>
> The other 2 that see minimal cpu usage:
>
> 2 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending      Completed
> ReadStage                         0         0              0
> RequestResponseStage              0         0        9213999
> MutationStage                     0         0       33233114
> ReadRepairStage                   0         0              0
> ReplicateOnWriteStage             0         0              0
> GossipStage                       0         0         336016
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0              4
> MemtablePostFlusher               0         0             77
> StreamStage                       0         0              0
> FlushWriter                       0         0             77
> FILEUTILS-DELETE-POOL             0         0             40
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              4
> HintedHandoff                     0         0              0
>
>
> 1 $ ./nodetool -h localhost tpstats
> Pool Name                    Active   Pending      Completed
> ReadStage                         0         0              0
> RequestResponseStage              0         0       13796875
> MutationStage                     0         0       36116662
> ReadRepairStage                   0         0              0
> ReplicateOnWriteStage             0         0              0
> GossipStage                       0         0         336496
> AntiEntropyStage                  0         0              0
> MigrationStage                    0         0              4
> MemtablePostFlusher               0         0             91
> StreamStage                       0         0              0
> FlushWriter                       0         0             91
> FILEUTILS-DELETE-POOL             0         0             43
> MiscStage                         0         0              0
> FlushSorter                       0         0              0
> InternalResponseStage             0         0              4
> HintedHandoff                     0         0              0
