Not sure whether it's an option for you, but you might consider doing some in-memory aggregation of counter values and flushing only once every X updates or every X seconds. This will decrease both load and latency and increase throughput. However, this is not possible in every use case.
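To make the idea concrete, here is a minimal sketch of that kind of buffering. It is only an illustration under stated assumptions: `BufferedCounters` and its parameters are hypothetical names, and the `flush` callback stands in for whatever code actually sends one increment per key to Cassandra. The time limit is only checked when an increment arrives, so an idle buffer will not flush itself.

```python
import time
from collections import Counter
from typing import Callable, Dict


class BufferedCounters:
    """Aggregate counter increments in memory and hand them off in batches.

    Hypothetical helper: `flush` is an assumed callback that receives a dict
    of {counter key: summed delta} and writes it to the data store.
    """

    def __init__(self,
                 flush: Callable[[Dict[str, int]], None],
                 max_updates: int = 1000,
                 max_age_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._max_updates = max_updates
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending: Counter = Counter()
        self._updates = 0
        self._last_flush = clock()

    def increment(self, key: str, delta: int = 1) -> None:
        """Record an increment; flush if either threshold has been reached."""
        self._pending[key] += delta
        self._updates += 1
        too_many = self._updates >= self._max_updates
        too_old = self._clock() - self._last_flush >= self._max_age
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Send the aggregated deltas (if any) and reset the buffer."""
        if self._pending:
            self._flush(dict(self._pending))
        self._pending.clear()
        self._updates = 0
        self._last_flush = self._clock()
```

With `max_updates=1000`, a thousand increments spread over a handful of keys turn into one write per key. The trade-off is that anything still in the buffer is lost if the process dies before the next flush, which is one reason this does not fit every use case.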
Best regards,

Robin Verlangen
*Software engineer*
W http://www.robinverlangen.nl
E ro...@us2.nl

On Wed, Nov 28, 2012 at 9:24 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:

> Counter replication works differently from that of "normal" writes. Namely, a counter update is written to a first replica, then a read is performed and the result of that read is replicated to the other nodes. With RF=1, since there is only one replica, no read is involved, but in a way it's a degenerate case. So there are two reasons why RF>1 is much slower than RF=1:
> 1) It involves a read to replicate, and that read takes time. Especially if that read hits the disk, it may even dominate the insertion time.
> 2) The replication to the first replica and the one to the rest of the replicas are not done in parallel but sequentially. Note that this is only true for the first replica versus the others. In other words, from RF=2 to RF=3 you should not see a significant performance degradation.
>
> Note that while there is nothing you can do for 2), you can try to speed up 1) by using the row cache, for instance (in case you weren't).
>
> In other words, with counters, it is expected that RF=1 be potentially much faster than RF>1. That is the way counters work.
>
> And don't get me wrong, I'm not suggesting you should use RF=1 at all. What I am saying is that the performance you see with RF=2 is the performance of counters in Cassandra.
>
> --
> Sylvain
>
>
> On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir <solf.li...@gmail.com> wrote:
>
>> I think there might be a misunderstanding as to the nature of the problem.
>>
>> Say I have test set T, and I have two identical servers A and B.
>> - I tested that server A (singly) is able to handle load of T.
>> - I tested that server B (singly) is able to handle load of T.
>> - I then join A and B in the cluster and set replication=2 -- this means that each server in effect has to handle the full test load individually (because there are two servers and replication=2, each server effectively has to handle all the data written to the cluster). Under these circumstances it is reasonable to assume that cluster A+B shall be able to handle load T because each server is able to do so individually.
>>
>> HOWEVER, this is not the case. In fact, A+B together are only able to handle less than 1/3 of T DESPITE the fact that A and B individually are able to handle T just fine.
>>
>> I think there's something wrong with Cassandra replication (possibly as simple as me misconfiguring something) -- it shouldn't be three times faster to write to two separate nodes in parallel as compared to writing to a 2-node Cassandra cluster with replication=2.
>>
>>
>> Edward Capriolo wrote
>> > Say you are doing 100 inserts at rf 1 on two nodes. That is 50 inserts a node. If you go to rf 2 that is 100 inserts a node. If you were at 75% capacity on each node you're now at 150%, which is not possible, so things bog down.
>> >
>> > To figure out what is going on we would need to see tpstats, iostat, and top information.
>> >
>> > I think you're looking at the performance the wrong way. Starting off at rf 1 is not the way to understand Cassandra performance.
>> >
>> > You do not get the benefits of "scale out" until you fix your rf and increment your node count.
>> > I.e., 5 nodes at rf 3 is fast; 10 nodes at rf 3 is even better.
>> > On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
>> >> I already do a lot of in-memory aggregation before writing to Cassandra.
>> >>
>> >> The question here is what is wrong with Cassandra (or its configuration) that causes a huge performance drop when moving from 1-replication to 2-replication for counters -- and more importantly how to resolve the problem. A 2x-3x drop when moving from 1-replication to 2-replication on two nodes is reasonable. 6x is not. Like I said, with this kind of performance degradation it makes more sense to run two clusters with replication=1 in parallel rather than rely on Cassandra replication.
>> >>
>> >> And yes, Rainbird was the inspiration for what we are trying to do here :)
>> >>
>> >>
>> >>
>> >> Edward Capriolo wrote
>> >>> Cassandra's counters read on increment. Additionally, they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have exhausted all tuning options, add more servers to handle the load.
>> >>>
>> >>> In many cases incrementing the same counter n times can be avoided.
>> >>>
>> >>> Twitter's Rainbird did just that. It avoided multiple counter increments by batching them.
>> >>>
>> >>> I have done a similar thing using Cassandra and Kafka.
>> >>>
>> >>> https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java
>> >>>
>> >>>
>> >>> On Tuesday, November 27, 2012, Sergey Olefir <solf.lists@> wrote:
>> >>>> Hi, thanks for your suggestions.
>> >>>>
>> >>>> Regarding replicate=2 vs replicate=1 performance: I expected that the configurations below would have similar performance:
>> >>>> - single node, replicate = 1
>> >>>> - two nodes, replicate = 2 (okay, this probably should be a bit slower due to additional overhead).
>> >>>>
>> >>>> However, what I'm seeing is that the second option (replicate=2) is about THREE times slower than a single node.
>> >>>>
>> >>>>
>> >>>> Regarding replicate_on_write -- it is, in fact, a dangerous option. As the JIRA discusses, if you make changes to your ring (moving tokens and such) you will *silently* lose data. That is on top of whatever data you might end up losing if you run replicate_on_write=false and the only node that got the data fails.
>> >>>>
>> >>>> But what is much worse -- with replicate_on_write being false the data will NOT be replicated (in my tests) ever unless you explicitly request the cell. Then it will return the wrong result. And only on subsequent reads will it return adequate results. I haven't tested it, but the documentation states that a range query will NOT do 'read repair' and thus will not force replication.
>> >>>> The test I did went like this:
>> >>>> - replicate_on_write = false
>> >>>> - write something to node A (which should in theory replicate to node B)
>> >>>> - wait for a long time (longest was on the order of 5 hours)
>> >>>> - read from node B (and here I was getting null / wrong result)
>> >>>> - read from node B again (here you get what you'd expect after read repair)
>> >>>>
>> >>>> In essence, using replicate_on_write=false with rarely read data will practically defeat the purpose of having replication in the first place (failover, data redundancy).
>> >>>>
>> >>>>
>> >>>> Or, in other words, this option doesn't look to be applicable to my situation.
>> >>>>
>> >>>> It looks like I will get much better performance by simply writing to two separate clusters rather than using a single cluster with replicate=2. Which is kind of stupid :) I think something's fishy with counters and replication.
>> >>>>
>> >>>>
>> >>>>
>> >>>> Edward Capriolo wrote
>> >>>>> I misspoke, really. It is not dangerous; you just have to understand what it means. This JIRA discusses it.
>> >>>>>
>> >>>>> https://issues.apache.org/jira/browse/CASSANDRA-3868
>> >>>>>
>> >>>>> On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@> wrote:
>> >>>>>
>> >>>>>> We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests.
>> >>>>>>
>> >>>>>> How dangerous is it? What exactly could go wrong?
>> >>>>>>
>> >>>>>> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>> >>>>>>
>> >>>>>> The difference between replication factor = 1 and replication factor > 1 is significant. Also, it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes.
>> >>>>>>
>> >>>>>> You may want to experiment with the very dangerous column family attribute:
>> >>>>>>
>> >>>>>> - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false.
>> >>>>>>
>> >>>>>> Edward
>> >>>>>> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjellman@> wrote:
>> >>>>>>
>> >>>>>>> Are you writing with QUORUM consistency or ONE?
>> >>>>>>>
>> >>>>>>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.lists@> wrote:
>> >>>>>>>
>> >>>>>>> >Hi Juan,
>> >>
>> >> --
>> >> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584014.html
>> >> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
>>
>> --
>> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584031.html
>> Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
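
For reference, the counter write path Sylvain describes in the quoted message (write to a first replica, read the result back, then replicate it to the remaining replicas) can be put into a toy cost model. This is only a sketch: the function name and the millisecond figures are made-up assumptions, not measurements, and the model ignores consistency level, concurrency and caching.

```python
def counter_increment_cost_ms(rf: int,
                              local_write_ms: float,
                              read_ms: float,
                              remote_write_ms: float) -> float:
    """Rough time spent on one counter increment for a replication factor.

    Assumed model, following the description quoted above:
      - RF=1: only the write to the single replica.
      - RF>1: write to the first replica, then a read of the result, then
        replication to the other replicas. The other replicas are written
        in parallel with each other, so that step is counted once
        regardless of how many there are.
    """
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    if rf == 1:
        return local_write_ms
    return local_write_ms + read_ms + remote_write_ms


# Illustrative numbers only (assumptions): a cheap write, and a read that
# is slower because it may have to go to disk.
for rf in (1, 2, 3):
    print(rf, counter_increment_cost_ms(rf, 0.5, 2.0, 0.5))
```

In this model the step from RF=1 to RF=2 adds the read and the sequential replication, while the step from RF=2 to RF=3 adds nothing. Lowering `read_ms` corresponds to the row cache suggestion for point 1).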