I think there might be a misunderstanding as to the nature of the problem. Say I have test load T, and I have two identical servers A and B.
- I tested that server A (singly) is able to handle load T.
- I tested that server B (singly) is able to handle load T.
- I then join A and B in a cluster and set replication=2. This means that each server effectively has to handle the full test load individually (because there are two servers and replication=2, every write to the cluster lands on both servers).

Under these circumstances it is reasonable to expect that the cluster A+B will be able to handle load T, because each server is able to do so individually.
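The per-node write load in the scenarios above can be sketched with a little arithmetic (plain Python; the function name and numbers are mine, purely for illustration, not any Cassandra API):

```python
def writes_per_node(total_writes, replication_factor, num_nodes):
    """Each write is applied on replication_factor nodes, and that
    work is spread evenly across num_nodes."""
    return total_writes * replication_factor / num_nodes

# Two standalone nodes, each taking half of load T (effectively RF=1):
print(writes_per_node(100, 1, 2))  # 50.0 writes per node

# Two clustered nodes with replication=2: every node sees the full load T.
print(writes_per_node(100, 2, 2))  # 100.0 writes per node
```

By this reasoning a 2-node cluster at replication=2 should at best match a single node's throughput, not exceed it; what it should not do is drop to a third of it.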
HOWEVER, this is not the case. In fact, A+B together are only able to handle less than 1/3 of T, DESPITE the fact that A and B individually handle T just fine. I think there's something wrong with Cassandra replication (possibly as simple as me misconfiguring something) -- it shouldn't be three times faster to write to two separate nodes in parallel than to write to a 2-node Cassandra cluster with replication=2.

Edward Capriolo wrote
> Say you are doing 100 inserts at RF=1 on two nodes. That is 50 inserts a node.
> If you go to RF=2, that is 100 inserts a node. If you were at 75% capacity
> on each node, you are now at 150%, which is not possible, so things bog down.
>
> To figure out what is going on we would need to see tpstats, iostat, and
> top information.
>
> I think you're looking at the performance the wrong way. Starting off at RF=1
> is not the way to understand Cassandra performance.
>
> The benefits of "scale out" don't happen until you fix your RF and increase
> your node count. I.e. 5 nodes at RF=3 is fast; 10 nodes at RF=3 is even better.
> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@...> wrote:
>> I already do a lot of in-memory aggregation before writing to Cassandra.
>>
>> The question here is what is wrong with Cassandra (or its configuration)
>> that causes a huge performance drop when moving from 1-replication to
>> 2-replication for counters -- and more importantly how to resolve the
>> problem. A 2x-3x drop when moving from 1-replication to 2-replication on
>> two nodes would be reasonable. 6x is not. Like I said, with this kind of
>> performance degradation it makes more sense to run two clusters with
>> replication=1 in parallel rather than rely on Cassandra replication.
>>
>> And yes, Rainbird was the inspiration for what we are trying to do here :)
>>
>> Edward Capriolo wrote
>>> Cassandra's counters read on increment.
>>> Additionally they are distributed,
>>> so there can be multiple reads on increment. If they are not fast enough
>>> and you have exhausted all tuning options, add more servers to handle
>>> the load.
>>>
>>> In many cases incrementing the same counter n times can be avoided.
>>>
>>> Twitter's Rainbird did just that. It avoided multiple counter increments
>>> by batching them.
>>>
>>> I have done a similar thing using Cassandra and Kafka.
>>>
>>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>>>
>>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@...> wrote:
>>>> Hi, thanks for your suggestions.
>>>>
>>>> Regarding replicate=2 vs replicate=1 performance: I expected that the
>>>> below configurations would have similar performance:
>>>> - single node, replicate = 1
>>>> - two nodes, replicate = 2 (okay, this probably should be a bit slower
>>>>   due to the additional overhead)
>>>>
>>>> However, what I'm seeing is that the second option (replicate=2) is
>>>> about THREE times slower than a single node.
>>>>
>>>> Regarding replicate_on_write -- it is, in fact, a dangerous option. As
>>>> the JIRA discusses, if you make changes to your ring (moving tokens and
>>>> such) you will *silently* lose data. That is on top of whatever data
>>>> you might end up losing if you run replicate_on_write=false and the
>>>> only node that got the data fails.
>>>>
>>>> But what is much worse -- with replicate_on_write set to false the data
>>>> will NOT ever be replicated (in my tests) unless you explicitly request
>>>> the cell. Then it will return the wrong result, and only on subsequent
>>>> reads will it return adequate results. I haven't tested it, but the
>>>> documentation states that a range query will NOT do 'read repair' and
>>>> thus will not force replication.
>>>> The test I did went like this:
>>>> - replicate_on_write = false
>>>> - write something to node A (which should in theory replicate to node B)
>>>> - wait for a long time (the longest was on the order of 5 hours)
>>>> - read from node B (and here I was getting a null / wrong result)
>>>> - read from node B again (here you get what you'd expect after read
>>>>   repair)
>>>>
>>>> In essence, using replicate_on_write=false with rarely read data will
>>>> practically defeat the purpose of having replication in the first place
>>>> (failover, data redundancy).
>>>>
>>>> Or, in other words, this option doesn't look to be applicable to my
>>>> situation.
>>>>
>>>> It looks like I will get much better performance by simply writing to
>>>> two separate clusters rather than using a single cluster with
>>>> replicate=2. Which is kind of stupid :) I think something's fishy with
>>>> counters and replication.
>>>>
>>>> Edward Capriolo wrote
>>>>> I misspoke really. It is not dangerous; you just have to understand
>>>>> what it means. This JIRA discusses it.
>>>>>
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>>>>>
>>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@...> wrote:
>>>>>
>>>>>> We're having a similar performance problem. Setting
>>>>>> 'replicate_on_write: false' fixes the performance issue in our tests.
>>>>>>
>>>>>> How dangerous is it? What exactly could go wrong?
>>>>>>
>>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>>>>>>
>>>>>> The difference between replication factor = 1 and replication
>>>>>> factor > 1 is significant. Also, it sounds like your cluster is 2
>>>>>> nodes, so going from RF=1 to RF=2 means double the load on both nodes.
>>>>>>
>>>>>> You may want to experiment with the very dangerous column family
>>>>>> attribute:
>>>>>>
>>>>>> - replicate_on_write: Replicate every counter update from the leader
>>>>>> to the follower replicas.
>>>>>> Accepts the values true and false.
>>>>>>
>>>>>> Edward
>>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@...> wrote:
>>>>>>
>>>>>>> Are you writing with QUORUM consistency or ONE?
>>>>>>>
>>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@...> wrote:
>>>>>>>
>>>>>>> >Hi Juan,

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
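On the QUORUM-vs-ONE question above: with 2 nodes and replication=2, QUORUM behaves the same as ALL, which falls straight out of Cassandra's majority rule (QUORUM = floor(RF/2) + 1). A minimal sketch in plain Python (nothing Cassandra-specific is imported; the function name is mine):

```python
def quorum(replication_factor):
    # Cassandra's QUORUM is a strict majority of replicas: floor(RF/2) + 1.
    return replication_factor // 2 + 1

print(quorum(2))  # 2 -> every write must be acked by BOTH nodes (same as ALL)
print(quorum(3))  # 2 -> one replica can be slow or down without failing the write
```

So on this 2-node RF=2 cluster, QUORUM writes give no headroom at all: a hiccup on either node stalls the write path, whereas ONE would ack after a single replica.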