Hi, I have a serious problem with counter performance and I can't seem to figure it out.
Basically, I'm building a system for accumulating some statistics "on the fly" via Cassandra distributed counters. For this I need counter updates to work "really fast", and herein lies my problem: as soon as I enable replication_factor = 2, the performance goes down the drain. This happens in my tests with both 1.0.x and 1.1.6.

Let me elaborate. I have two boxes (virtual servers on top of physical servers rented specifically for this purpose, i.e. it's not a cloud, nor is it shared; the virtual servers are managed by our admins, I suppose as a way to limit damage :)). The Cassandra partitioner is set to ByteOrderedPartitioner because I want to be able to do some range queries.

First, I set up Cassandra individually on each box (not in a cluster) and test counter increment performance (exclusively increments, no reads). For the tests I use code that is intended to roughly resemble the expected load pattern: in particular, the majority of increments create new counters, while the rest add to already existing counters. In this test each single node shows respectable performance, something on the order of 70k (seventy thousand) increments per second.

I then join both of these nodes into a single cluster (using SimpleSnitch and SimpleStrategy, nothing fancy yet) and run the same test with replication_factor=1. The performance is on the order of 120k increments per second, which seems a reasonable increase over single-node performance.

HOWEVER, I then rerun the same test on the two-node cluster with replication_factor=2, which is the minimum I'll need in production for redundancy. And the performance I get is absolutely horrible: much, MUCH worse than even single-node performance, something on the order of less than 25k increments per second. In addition to clients not being able to push updates fast enough, I also see a lot of 'messages dropped' messages in the Cassandra log under this load.
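To make the load pattern described above concrete, here is a minimal Python sketch of how such a test could generate its increments. It is not the actual test code and does not talk to Cassandra at all; it only yields (row key, counter column, delta) tuples. The helper name `generate_increments`, the 80% share of new counters, the 1000-row key space and the delta range are all hypothetical values chosen for illustration.

```python
import random
from typing import Iterator, List, Tuple

# Hypothetical load-pattern generator (NOT the original test code).
# Most increments create a brand-new counter; the rest add to a counter
# that was created earlier in the same run.

Increment = Tuple[str, str, int]  # (row key, counter column name, delta)


def generate_increments(n: int, new_fraction: float = 0.8,
                        seed: int = 42) -> Iterator[Increment]:
    """Yield n increments; roughly `new_fraction` of them hit new counters.

    new_fraction=0.8 is an assumed value for illustration only.
    """
    rng = random.Random(seed)
    existing: List[Tuple[str, str]] = []  # counters created so far
    for i in range(n):
        if not existing or rng.random() < new_fraction:
            # New counter: the column name c{i} is unique, so this pair is new.
            counter = (f"row{rng.randrange(1000)}", f"c{i}")
            existing.append(counter)
        else:
            # Add to an already existing counter, chosen uniformly at random.
            counter = rng.choice(existing)
        yield (counter[0], counter[1], rng.randint(1, 10))


if __name__ == "__main__":
    ops = list(generate_increments(10_000))
    distinct = len({(k, c) for k, c, _ in ops})
    print(f"{len(ops)} increments touching {distinct} distinct counters")
```

In a real test each tuple would be turned into one counter increment against the cluster; the split between "new" and "existing" counters matters because creating a counter and adding to an existing one may not cost the same on the server side.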
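The numbers above can be compared against a simple back-of-envelope model, sketched in Python below. The model is a simplifying assumption, not something Cassandra documents: every increment has to be applied on `rf` replicas, so `nodes` equally loaded nodes can absorb at most `nodes * per_node / rf` client increments per second. The function name `ideal_cluster_rate` is hypothetical; the 70k, 120k and 25k figures are the measurements quoted above.

```python
# Back-of-envelope model (a simplifying assumption, not a Cassandra formula):
# each increment is applied on `rf` replicas, so total cluster work is
# rf * (client increments), spread over `nodes` equally loaded nodes.


def ideal_cluster_rate(per_node: float, nodes: int, rf: int) -> float:
    """Upper bound on client increments/s under the 'rf x work' model."""
    if not 1 <= rf <= nodes:
        raise ValueError("replication factor must be between 1 and the node count")
    return per_node * nodes / rf


SINGLE_NODE = 70_000                 # measured: one standalone node
MEASURED = {1: 120_000, 2: 25_000}   # measured on 2 nodes (RF=2 was "less than 25k")

for rf, measured in MEASURED.items():
    ideal = ideal_cluster_rate(SINGLE_NODE, nodes=2, rf=rf)
    print(f"RF={rf}: ideal ~{ideal:,.0f}/s, measured ~{measured:,}/s, "
          f"{measured / ideal:.0%} of ideal")

print(f"RF=2 vs single node: {SINGLE_NODE / MEASURED[2]:.1f}x slower")
```

Under this model, replication_factor=1 reaches about 86% of the ideal 140k/s, while replication_factor=2 reaches only about 36% of the ideal 70k/s, i.e. at least 2.8x below single-node throughput.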
Could anyone advise what could be causing such a drastic performance drop under replication_factor=2? I was expecting something on the order of single-node performance, not roughly 3x less.

When testing replication_factor=2 on 1.1.6, I can see that CPU usage goes through the roof. On 1.0.x I think it looked more like disk overload, but I'm not sure (being on a virtual server, I apparently can't see true iostats). I do have the Cassandra data on a separate disk; the commit log and cache are currently on the same disk as the system. I experimented with commit log flush modes and even with disabling the commit log altogether, but neither seems to have a noticeable impact on performance under replication_factor=2.

Any suggestions and hints will be much appreciated :) And please let me know if I need to share additional information about the configuration I'm running.

Best regards,
Sergey

-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.