Re: counters + replication = awful performance?

Edward Capriolo Tue, 27 Nov 2012 15:21:41 -0800

I misspoke, really. It is not dangerous; you just have to understand what it
means. This JIRA discusses it:


https://issues.apache.org/jira/browse/CASSANDRA-3868

On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <sco...@mailchannels.com> wrote:

> We're having a similar performance problem. Setting
> 'replicate_on_write: false' fixes the performance issue in our tests.
>
> How dangerous is it?  What exactly could go wrong?
>
> On 12-11-27 01:44 PM, Edward Capriolo wrote:
>
> The difference between replication factor = 1 and replication factor > 1
> is significant. Also, it sounds like your cluster has 2 nodes, so going
> from RF=1 to RF=2 means double the load on both nodes.
>
> You may want to experiment with the very dangerous column family
> attribute:
>
> - replicate_on_write: Replicate every counter update from the leader to
>   the follower replicas. Accepts the values true and false.
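To make the trade-off behind replicate_on_write concrete, here is a toy Python model of one counter on a two-node cluster. It is a simplified sketch, not Cassandra's implementation: it assumes each replica keeps one "shard" per leader replica and reports the sum of the shards it holds, and the names `ToyCounterReplica` and `increment` are hypothetical. With replication disabled, the follower never learns about the leader's increments, so a read served by the follower alone comes back low, and those increments live only on the leader until some other mechanism, such as repair, copies them over.

```python
class ToyCounterReplica:
    """Hypothetical, heavily simplified replica: it stores one shard per
    leader replica, and a counter's value is the sum of the shards it holds."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.shards: dict[str, int] = {}  # leader name -> leader's running total

    def value(self) -> int:
        return sum(self.shards.values())


def increment(leader: ToyCounterReplica, followers: list[ToyCounterReplica],
              delta: int, replicate_on_write: bool) -> None:
    # The leader always applies the increment to its own shard.
    leader.shards[leader.name] = leader.shards.get(leader.name, 0) + delta
    if replicate_on_write:
        # Ship the leader's current shard total (not just the delta) to followers.
        for follower in followers:
            follower.shards[leader.name] = leader.shards[leader.name]


a, b = ToyCounterReplica("A"), ToyCounterReplica("B")
for _ in range(3):
    increment(a, [b], 1, replicate_on_write=False)
print(a.value(), b.value())  # 3 0  -> a read served only by B is stale

increment(a, [b], 1, replicate_on_write=True)
print(a.value(), b.value())  # 4 4  -> B catches up once the shard is shipped
```

In this toy model the follower catches up as soon as one replicated write ships the whole shard; if the leader were lost before that happened, the unreplicated increments would be gone.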
>
> Edward
>
> On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman <mkjell...@barracuda.com> wrote:
>
>> Are you writing with QUORUM consistency or ONE?
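The consistency level matters here because Cassandra's QUORUM requires acknowledgements from a majority of the replicas, floor(RF/2) + 1, while ONE waits for a single replica. A quick Python check of that formula (the helper name `quorum` is just for illustration) shows that at RF=2 a QUORUM write has to wait for both nodes of a two-node cluster, so there is no slack at all.

```python
def quorum(rf: int) -> int:
    # QUORUM = a majority of the replicas: floor(rf / 2) + 1.
    return rf // 2 + 1

for rf in (1, 2, 3):
    print(f"RF={rf}: QUORUM needs {quorum(rf)} replica(s), ONE needs 1")
```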
>>
>> On 11/27/12 9:52 AM, "Sergey Olefir" <solf.li...@gmail.com> wrote:
>>
>> >Hi Juan,
>> >
>> >thanks for your input!
>> >
>> >In my case, however, I doubt this is the issue -- clients are able to
>> >push many more updates than are needed to saturate the
>> >replication_factor=2 case (e.g. I'm doing as many as 6x more increments
>> >when testing a 2-node cluster with replication_factor=1), so bandwidth
>> >between clients and server should be sufficient.
>> >
>> >Bandwidth between nodes in the cluster should also be quite sufficient
>> >since they are both in the same DC. But it is something to check, thanks!
>> >
>> >Best regards,
>> >Sergey
>> >
>> >
>> >Juan Valencia wrote
>> >> Hi Sergey,
>> >>
>> >> I know I've had similar issues with counters which were bottlenecked by
>> >> network throughput. You might be seeing a problem with throughput
>> >> between the clients and Cass or between the two Cass nodes. It might
>> >> not be your case, but that was what happened to me :-)
>> >>
>> >> Juan
>> >>
>> >>
>> >> On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I have a serious problem with counter performance and I can't seem to
>> >>> figure it out.
>> >>>
>> >>> Basically I'm building a system for accumulating some statistics "on
>> >>> the fly" via Cassandra distributed counters. For this I need counter
>> >>> updates to work "really fast", and herein lies my problem -- as soon as
>> >>> I enable replication_factor = 2, the performance goes down the drain.
>> >>> This happens in my tests using both 1.0.x and 1.1.6.
>> >>>
>> >>> Let me elaborate:
>> >>>
>> >>> I have two boxes (virtual servers on top of physical servers rented
>> >>> specifically for this purpose, i.e. it's not a cloud, nor is it shared;
>> >>> the virtual servers are managed by our admins as a way to limit damage,
>> >>> I suppose :)). The Cassandra partitioner is set to
>> >>> ByteOrderedPartitioner because I want to be able to do some range
>> >>> queries.
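A rough Python illustration of why ByteOrderedPartitioner helps with range queries: it uses the raw key bytes as the token, so keys that sort together stay together on the ring, whereas a hash-based partitioner such as RandomPartitioner derives the token from an MD5 digest of the key and scatters neighbouring keys. This is a simplified sketch: the plain big-endian integer below is not RandomPartitioner's exact token computation, and the `stats:...` key layout is hypothetical.

```python
import hashlib

def ordered_token(key: bytes) -> bytes:
    # ByteOrderedPartitioner-style: the token is the key itself.
    return key

def hashed_token(key: bytes) -> int:
    # Hash-partitioner-style: token derived from the key's MD5 digest
    # (simplified; not RandomPartitioner's exact computation).
    return int.from_bytes(hashlib.md5(key).digest(), "big")

# Hypothetical keys for one day of hourly statistics.
keys = [f"stats:2012-11-27:{hour:02d}".encode() for hour in range(24)]

# Ordered tokens: a contiguous key range is a contiguous slice of the ring.
print(sorted(keys, key=ordered_token) == keys)  # True

# Hashed tokens: the same keys end up in hash order around the ring.
print([k.decode()[-2:] for k in sorted(keys, key=hashed_token)][:5])
```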
>> >>>
>> >>> First, I set up Cassandra individually on each box (not in a cluster)
>> >>> and test counter-increment performance (exclusively increments, no
>> >>> reads). For the tests I use code that is intended to somewhat resemble
>> >>> the expected load pattern -- in particular, the majority of increments
>> >>> create new counters, with some updating (adding to) already existing
>> >>> counters. In this test each single node exhibits respectable
>> >>> performance -- something on the order of 70k (seventy thousand)
>> >>> increments per second.
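The test load described above (mostly brand-new counters, plus some increments to counters that already exist) could be approximated with a small key generator like this Python sketch. It only produces the sequence of counter keys to increment; sending them to Cassandra is left out, and both the 80% new-counter ratio and the `counter-N` key format are hypothetical values, not figures from the original test.

```python
import random
from collections.abc import Iterator

def counter_workload(n_ops: int, new_fraction: float = 0.8,
                     seed: int = 0) -> Iterator[str]:
    """Yield counter keys: most operations create a new counter, the rest
    increment a randomly chosen, previously used one."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    seen: list[str] = []
    for _ in range(n_ops):
        if not seen or rng.random() < new_fraction:
            key = f"counter-{len(seen)}"  # a counter that doesn't exist yet
            seen.append(key)
        else:
            key = rng.choice(seen)  # add to an already existing counter
        yield key

sample = list(counter_workload(10))
```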
>> >>>
>> >>> I then join both of these nodes into a single cluster (using
>> >>> SimpleSnitch and SimpleStrategy, nothing fancy yet). I then run the
>> >>> same test using replication_factor=1. The performance is on the order
>> >>> of 120k increments per second -- which seems to be a reasonable
>> >>> increase over the single-node performance.
>> >>>
>> >>>
>> >>> HOWEVER, I then rerun the same test on the two-node cluster using
>> >>> replication_factor=2 -- which is the least I'll need in actual
>> >>> production for redundancy purposes. And the performance I get is
>> >>> absolutely horrible -- much, MUCH worse than even single-node
>> >>> performance -- something on the order of less than 25k increments per
>> >>> second. In addition to clients not being able to push updates fast
>> >>> enough, I also see a lot of 'messages dropped' messages in the
>> >>> Cassandra log under this load.
>> >>>
>> >>> Could anyone advise what could be causing such a drastic performance
>> >>> drop under replication_factor=2? I was expecting something on the
>> >>> order of single-node performance, not approximately 3x less.
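Sergey's expectation can be written down as a back-of-envelope model: if every replica write costs about the same as a single-node write, a cluster of N nodes at replication factor RF can absorb roughly (per-node rate x N) / RF client increments per second. The Python sketch below plugs in the ~70k/s single-node and <25k/s RF=2 figures from this message; the function name is hypothetical, and the model deliberately ignores any counter-specific replication overhead such as the replicate_on_write step, which is one plausible place for the gap given the report elsewhere in this thread that disabling replicate_on_write fixes the problem.

```python
def naive_capacity(per_node_rate: float, nodes: int, rf: int) -> float:
    """Rough upper bound on cluster-wide increments/s, assuming each of the
    rf replica writes costs the same as one single-node write."""
    if not 1 <= rf <= nodes:
        raise ValueError("rf must be between 1 and the number of nodes")
    # Each client increment turns into rf replica writes spread over all nodes.
    return per_node_rate * nodes / rf

single_node = 70_000   # ~70k increments/s measured on one node
observed_rf2 = 25_000  # upper end of the "< 25k/s" observed at RF=2

expected_rf1 = naive_capacity(single_node, nodes=2, rf=1)
expected_rf2 = naive_capacity(single_node, nodes=2, rf=2)
print(expected_rf1)  # 140000.0 (observed: ~120k)
print(expected_rf2)  # 70000.0
print(round(expected_rf2 / observed_rf2, 1))  # 2.8, i.e. "approximately 3x less"
```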
>> >>>
>> >>>
>> >>> When testing replication_factor=2 on 1.1.6 I can see that CPU usage
>> >>> goes through the roof. On 1.0.x I think it looked more like disk
>> >>> overload, but I'm not sure (being on a virtual server, I apparently
>> >>> can't see true iostats).
>> >>>
>> >>> I do have Cassandra data on a separate disk; the commit log and cache
>> >>> are currently on the same disk as the system. I experimented with
>> >>> commit log flush modes and even with disabling the commit log
>> >>> entirely -- but it doesn't seem to have a noticeable impact on the
>> >>> performance under replication_factor=2.
>> >>>
>> >>>
>> >>> Any suggestions and hints will be much appreciated :) And please let
>> >>> me know if I need to share additional information about the
>> >>> configuration I'm running on.
>> >>>
>> >>> Best regards,
>> >>> Sergey
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993.html
>> >>> Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
>> >>>
>> >>
>
> --
> *Scott McKay*, Sr. Software Developer
> MailChannels
>
> Tel: +1 604 685 7488 x 509
> www.mailchannels.com
>
