Dropped messages on the ReadRepair stage are odd. Are you also dropping mutations?

There are two tasks performed on the ReadRepair stage: first the digests are 
compared, and second the repair itself is done. Comparing digests is quick. 
Doing the repair can take a bit longer, as all the CFs returned are collated, 
filtered, and the deletes removed.
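
Purely as an illustration (this is not the actual RowRepairResolver code; the 
Col record and method names are made up), a rough Java sketch of those two 
tasks could look like:

    import java.util.*;

    class ReadRepairSketch {
        // hypothetical stand-in for a column: name, value, timestamp, tombstone flag
        record Col(String name, String value, long timestamp, boolean isTombstone) {}

        // task 1: the quick digest comparison across the replica responses
        static boolean digestsMatch(List<byte[]> digests) {
            for (byte[] d : digests)
                if (!Arrays.equals(d, digests.get(0))) return false;
            return true;
        }

        // task 2: the slower resolve: collate every replica's columns,
        // keep the newest version of each, then drop the deleted ones
        static Map<String, Col> resolve(List<Map<String, Col>> replicaRows) {
            Map<String, Col> merged = new HashMap<>();
            for (Map<String, Col> row : replicaRows)
                for (Col c : row.values())
                    merged.merge(c.name(), c,
                            (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
            merged.values().removeIf(Col::isTombstone);
            return merged;
        }
    }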

We don't do background Read Repair on range scans; they do have foreground 
digest checking, though.

What CL are you using?

begin crazy theory:

        Could there be a very big row that is out of sync? The increased RR 
would result in mutations being sent back to the replicas, which would give 
you a hot spot in mutations.
        
        Check max compacted row size on the hot nodes. 
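        For example, something like this against one of the hot nodes 
(<hot_node> is just a placeholder):

                nodetool -h <hot_node> cfstats

        then look at the compacted row maximum size reported for the suspect CF.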
        
        Turn the logging up to DEBUG on the hot machines for 
o.a.c.service.RowRepairResolver and look for the "resolve:…" message; it 
includes the time taken.
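
        Assuming the stock log4j setup, a line like this in 
conf/log4j-server.properties on those machines should turn it on:

                log4j.logger.org.apache.cassandra.service.RowRepairResolver=DEBUG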

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 7:52 PM, Jeremy Hanna wrote:

> 
> On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:
> 
>>> We've been having issues where as soon as we start doing heavy writes (via 
>>> hadoop) recently, it really hammers 4 nodes out of 20.  We're using random 
>>> partitioner and we've set the initial tokens for our 20 nodes according to 
>>> the general spacing formula, except for a few token offsets as we've 
>>> replaced dead nodes.
>> 
>> Is the hadoop job iterating over keys in the cluster in token order
>> perhaps, and you're generating writes to those keys? That would
>> explain a "moving hotspot" along the cluster.
> 
> Yes - we're iterating over all the keys of particular column families, doing 
> joins using pig as we enrich and perform measure calculations.  When we 
> write, we're usually writing out for a certain small subset of keys which 
> shouldn't have hotspots with RandomPartitioner afaict.
> 
>> 
>> -- 
>> / Peter Schuller (@scode on twitter)
> 
