As something of a conclusion to this thread: we have resolved the major issue with the hotspots. Our node count was balanced across availability zones in AWS EC2 us-east (a, b, c), but we didn't alternate by rack in the token order. We are using the property file snitch and had defined each node as being part of a single DC (for now) and a rack (a-c). What we should have done is assign tokens so that, in ring order, the nodes cycle through racks a, b, c.
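For anyone following along, our cassandra-topology.properties ends up looking roughly like this (the IPs, DC name, and node count here are just illustrative, not our actual file) - the snitch maps each node to a DC and a rack, but note that it says nothing about token order:

# cassandra-topology.properties - PropertyFileSnitch entries are <ip>=<dc>:<rack>
10.0.1.1=DC1:a
10.0.1.2=DC1:b
10.0.1.3=DC1:c
10.0.1.4=DC1:a
10.0.1.5=DC1:b
10.0.1.6=DC1:c
# fallback for nodes not listed above
default=DC1:a

The alternation itself has to come from how initial_token is assigned to each of those nodes, which is the part we had gotten wrong.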
So as an example of alternating:

<token1>: <node in rack b>
<token2>: <node in rack a>
<token3>: <node in rack c>
<token4>: <node in rack b>
…

So we took some time and shifted things around, and things now appear to be much more evenly spread across the cluster.

As a side note on how we discovered the root of the problem: we kept poring over logs and stats, but what really helped was trialing DataStax's OpsCenter product to get a view of the cluster. All of that info is available via JMX, so we could have mapped out the node replication ourselves or with another monitoring product, but that's what did it for us. That, along with help from Brandon and others in the community, is how we found the reason. Thanks again.

The reason the order matters: when replicating, I believe NetworkTopologyStrategy currently uses a pattern of choosing the local node for the first replica, an in-rack node for the second replica, and an off-rack node for the next replica (depending on the replication factor). It appears that when choosing the non-local replicas, it looks for the next token in the ring in the same rack, or the next token in a different rack (depending on which it is looking for). That is why alternating by rack is important. It would be nice if this were smarter in the future - to not have to care and just let Cassandra spread the replication around intelligently. (I've put a rough sketch of the ring walk I'm describing at the bottom of this mail.)

On Aug 23, 2011, at 6:02 AM, Jeremy Hanna wrote:

> 
> On Aug 23, 2011, at 3:43 AM, aaron morton wrote:
> 
>> Dropped messages in ReadRepair is odd. Are you also dropping mutations ?
>> 
>> There are two tasks performed on the ReadRepair stage. The digests are compared on this stage, and secondly the repair happens on the stage. Comparing digests is quick. Doing the repair could take a bit longer, all the cf's returned are collated, filtered and deletes removed.
>> 
>> We don't do background Read Repair on range scans, they do have foreground digest checking though.
>> 
>> What CL are you using ?
> 
> CL.ONE for hadoop writes, CL.QUORUM for hadoop reads
> 
>> 
>> begin crazy theory:
>> 
>> Could there be a very big row that is out of sync ? The increased RR would be resulting in mutations been sent back to the replicas. Which would give you a hot spot in mutations.
>> 
>> Check max compacted row size on the hot nodes.
>> 
>> Turn the logging up to DEBUG on the hot machines for o.a.c.service.RowRepairResolver and look for the "resolve:…" message it has the time taken.
> 
> The max compacted size didn't seem unreasonable - about a MB. I turned up logging to DEBUG for that class and I get plenty of dropped READ_REPAIR messages, but nothing coming out of DEBUG in the logs to indicate the time taken that I can see.
> 
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 23/08/2011, at 7:52 PM, Jeremy Hanna wrote:
>> 
>>> 
>>> On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:
>>> 
>>>>> We've been having issues where as soon as we start doing heavy writes (via hadoop) recently, it really hammers 4 nodes out of 20. We're using random partitioner and we've set the initial tokens for our 20 nodes according to the general spacing formula, except for a few token offsets as we've replaced dead nodes.
>>>> 
>>>> Is the hadoop job iterating over keys in the cluster in token order perhaps, and you're generating writes to those keys? That would explain a "moving hotspot" along the cluster.
>>> 
>>> Yes - we're iterating over all the keys of particular column families, doing joins using pig as we enrich and perform measure calculations. When we write, we're usually writing out for a certain small subset of keys which shouldn't have hotspots with RandomPartitioner afaict.
>>> 
>>>> 
>>>> --
>>>> / Peter Schuller (@scode on twitter)
>>> 
>> 
> 
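P.S. In case it helps to see why the alternation matters, here's a rough Python simulation of the rack-aware ring walk as I understand it. This is not the actual NetworkTopologyStrategy code, just a sketch under that assumption, and the node names, rack layout, and replication factor are made up for illustration:

# rack_walk_sketch.py - rough simulation of rack-aware replica placement.
# My understanding only, not the real NetworkTopologyStrategy code:
# for each token, walk the ring clockwise and prefer racks not used yet.

RF = 3  # assumes RF <= number of nodes in the ring

def place_replicas(ring, rf=RF):
    """ring: list of (node, rack) tuples in token order -> {node: ranges it replicates}."""
    counts = {node: 0 for node, _ in ring}
    n = len(ring)
    for i in range(n):  # one token range per node
        replicas, racks_seen = [], set()
        # first pass: take the next nodes on the ring that are in racks we haven't used
        for j in range(n):
            node, rack = ring[(i + j) % n]
            if rack not in racks_seen:
                replicas.append(node)
                racks_seen.add(rack)
            if len(replicas) == rf:
                break
        # if there are fewer than rf racks, top up with the next nodes regardless of rack
        j = 0
        while len(replicas) < rf:
            node, _ = ring[(i + j) % n]
            if node not in replicas:
                replicas.append(node)
            j += 1
        for node in replicas:
            counts[node] += 1
    return counts

# 12 nodes, 3 racks.  "grouped" is racks in token order aaaa bbbb cccc (what we had);
# "alternating" is abc abc abc abc (what we moved to).
grouped = [("n%d" % i, "abc"[i // 4]) for i in range(12)]
alternating = [("n%d" % i, "abc"[i % 3]) for i in range(12)]

print("grouped:    ", place_replicas(grouped))      # n0, n4, n8 replicate 9 ranges each, the rest 1
print("alternating:", place_replicas(alternating))  # every node replicates exactly 3 ranges

With the grouped layout, the first node encountered in each rack soaks up the replicas for every range in the rack before it, which lines up with the handful of hot nodes we were seeing; with the alternating layout the replicas spread out evenly.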