I think there might be a misunderstanding as to the nature of the problem. Say I have test load T, and I have two identical servers A and B.
- I tested that server A (singly) is able to handle load T.
- I tested that server B (singly) is able to handle load T.
- I then join A and B in a cluster and set replication=2. This means that each server effectively has to handle the full test load individually (because there are two servers and replication=2, every write to the cluster lands on both servers).

Under these circumstances it is reasonable to expect that the cluster A+B will be able to handle load T, because each server is able to do so individually.
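The per-node write load in the scenarios above can be sketched with a little arithmetic (plain Python; the function name and numbers are mine, purely for illustration, not any Cassandra API):

```python
def writes_per_node(total_writes, replication_factor, num_nodes):
    """Each write is applied on replication_factor nodes, and that
    work is spread evenly across num_nodes."""
    return total_writes * replication_factor / num_nodes

# Two standalone nodes, each taking half of load T (effectively RF=1):
print(writes_per_node(100, 1, 2))  # 50.0 writes per node

# Two clustered nodes with replication=2: every node sees the full load T.
print(writes_per_node(100, 2, 2))  # 100.0 writes per node
```

By this reasoning a 2-node cluster at replication=2 should at best match a single node's throughput, not exceed it; what it should not do is drop to a third of it.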
HOWEVER, this is not the case. In fact, A+B together are only able to handle less than 1/3 of T, DESPITE the fact that A and B individually handle T just fine. I think there's something wrong with Cassandra replication (possibly as simple as me misconfiguring something) -- it shouldn't be three times faster to write to two separate nodes in parallel than to write to a 2-node Cassandra cluster with replication=2.

Edward Capriolo wrote
> Say you are doing 100 inserts at RF=1 on two nodes. That is 50 inserts a node.
> If you go to RF=2, that is 100 inserts a node. If you were at 75% capacity
> on each node, you are now at 150%, which is not possible, so things bog down.
>
> To figure out what is going on we would need to see tpstats, iostat, and
> top information.
>
> I think you're looking at the performance the wrong way. Starting off at RF=1
> is not the way to understand Cassandra performance.
>
> The benefits of "scale out" don't happen until you fix your RF and increase
> your node count. I.e. 5 nodes at RF=3 is fast; 10 nodes at RF=3 is even better.
> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@...> wrote:
>> I already do a lot of in-memory aggregation before writing to Cassandra.
>>
>> The question here is what is wrong with Cassandra (or its configuration)
>> that causes a huge performance drop when moving from 1-replication to
>> 2-replication for counters -- and more importantly how to resolve the
>> problem. A 2x-3x drop when moving from 1-replication to 2-replication on
>> two nodes would be reasonable. 6x is not. Like I said, with this kind of
>> performance degradation it makes more sense to run two clusters with
>> replication=1 in parallel rather than rely on Cassandra replication.
>>
>> And yes, Rainbird was the inspiration for what we are trying to do here :)
>>
>> Edward Capriolo wrote
>>> Cassandra's counters read on increment.
>>> Additionally they are distributed,
>>> so there can be multiple reads on increment. If they are not fast enough
>>> and you have exhausted all tuning options, add more servers to handle
>>> the load.
>>>
>>> In many cases incrementing the same counter n times can be avoided.
>>>
>>> Twitter's Rainbird did just that. It avoided multiple counter increments
>>> by batching them.
>>>
>>> I have done a similar thing using Cassandra and Kafka.
>>>
>>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>>>
>>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@...> wrote:
>>>> Hi, thanks for your suggestions.
>>>>
>>>> Regarding replicate=2 vs replicate=1 performance: I expected that the
>>>> below configurations would have similar performance:
>>>> - single node, replicate = 1
>>>> - two nodes, replicate = 2 (okay, this probably should be a bit slower
>>>>   due to the additional overhead)
>>>>
>>>> However, what I'm seeing is that the second option (replicate=2) is
>>>> about THREE times slower than a single node.
>>>>
>>>> Regarding replicate_on_write -- it is, in fact, a dangerous option. As
>>>> the JIRA discusses, if you make changes to your ring (moving tokens and
>>>> such) you will *silently* lose data. That is on top of whatever data
>>>> you might end up losing if you run replicate_on_write=false and the
>>>> only node that got the data fails.
>>>>
>>>> But what is much worse -- with replicate_on_write set to false the data
>>>> will NOT ever be replicated (in my tests) unless you explicitly request
>>>> the cell. Then it will return the wrong result, and only on subsequent
>>>> reads will it return adequate results. I haven't tested it, but the
>>>> documentation states that a range query will NOT do 'read repair' and
>>>> thus will not force replication.
>>>> The test I did went like this:
>>>> - replicate_on_write = false
>>>> - write something to node A (which should in theory replicate to node B)
>>>> - wait for a long time (the longest was on the order of 5 hours)
>>>> - read from node B (and here I was getting a null / wrong result)
>>>> - read from node B again (here you get what you'd expect after read
>>>>   repair)
>>>>
>>>> In essence, using replicate_on_write=false with rarely read data will
>>>> practically defeat the purpose of having replication in the first place
>>>> (failover, data redundancy).
>>>>
>>>> Or, in other words, this option doesn't look to be applicable to my
>>>> situation.
>>>>
>>>> It looks like I will get much better performance by simply writing to
>>>> two separate clusters rather than using a single cluster with
>>>> replicate=2. Which is kind of stupid :) I think something's fishy with
>>>> counters and replication.
>>>>
>>>> Edward Capriolo wrote
>>>>> I misspoke really. It is not dangerous; you just have to understand
>>>>> what it means. This JIRA discusses it.
>>>>>
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>>>>>
>>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@...> wrote:
>>>>>
>>>>>> We're having a similar performance problem. Setting
>>>>>> 'replicate_on_write: false' fixes the performance issue in our tests.
>>>>>>
>>>>>> How dangerous is it? What exactly could go wrong?
>>>>>>
>>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>>>>>>
>>>>>> The difference between replication factor = 1 and replication
>>>>>> factor > 1 is significant. Also, it sounds like your cluster is 2
>>>>>> nodes, so going from RF=1 to RF=2 means double the load on both nodes.
>>>>>>
>>>>>> You may want to experiment with the very dangerous column family
>>>>>> attribute:
>>>>>>
>>>>>> - replicate_on_write: Replicate every counter update from the leader
>>>>>> to the follower replicas.
>>>>>> Accepts the values true and false.
>>>>>>
>>>>>> Edward
>>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@...> wrote:
>>>>>>
>>>>>>> Are you writing with QUORUM consistency or ONE?
>>>>>>>
>>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@...> wrote:
>>>>>>>
>>>>>>> >Hi Juan,

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
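On the QUORUM-vs-ONE question above: with 2 nodes and replication=2, QUORUM behaves the same as ALL, which falls straight out of Cassandra's majority rule (QUORUM = floor(RF/2) + 1). A minimal sketch in plain Python (nothing Cassandra-specific is imported; the function name is mine):

```python
def quorum(replication_factor):
    # Cassandra's QUORUM is a strict majority of replicas: floor(RF/2) + 1.
    return replication_factor // 2 + 1

print(quorum(2))  # 2 -> every write must be acked by BOTH nodes (same as ALL)
print(quorum(3))  # 2 -> one replica can be slow or down without failing the write
```

So on this 2-node RF=2 cluster, QUORUM writes give no headroom at all: a hiccup on either node stalls the write path, whereas ONE would ack after a single replica.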