> I’d request data, nothing would be returned, I would then re-request the data 
> and it would correctly be returned:
> 
What consistency level (CL) are you using for reads and writes?

> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total 
> ‘MutationStage’ count.
> 
Dropped mutations in a multi-DC setup may be a sign of network congestion or 
overloaded nodes. A replica that drops a mutation is left out of date until 
hinted handoff, read repair, or an anti-entropy repair brings it back in sync. 
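To put a number on it, the dropped count can be compared against the completed MutationStage count from `nodetool tpstats`. A quick sketch (the figures below are made up for illustration, not taken from your nodes):

```python
# Hypothetical counts read off `nodetool tpstats` output; substitute the
# real figures from your nodes.
mutation_stage_completed = 1_000_000  # MutationStage "Completed" column
dropped_mutations = 48_000            # MUTATION row in the dropped-message section

pct = 100.0 * dropped_mutations / mutation_stage_completed
print(f"dropped MUTATION: {pct:.1f}% of completed mutations")  # 4.8%
```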


> - Could anybody suggest anything specific to look at to see why the 
> repair operations aren’t having the desired effect? 
> 
I would first build a test case to ensure correct operation when using strong 
consistency, i.e. QUORUM writes and reads. Because you are using RF 2 per DC, I 
assume you are not using LOCAL_QUORUM: with RF 2 that is 2 replicas, so you 
would have no redundancy in the DC. 
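The arithmetic behind that: Cassandra computes a quorum as floor(RF / 2) + 1, so with RF 2 a LOCAL_QUORUM of 2 means every local replica must answer. A small illustration of just the formula:

```python
def quorum(rf):
    # Quorum size for a given replication factor: floor(rf / 2) + 1.
    return rf // 2 + 1

# RF=2 per DC: LOCAL_QUORUM needs both replicas, so a single down node in
# the DC fails LOCAL_QUORUM requests -- no redundancy.
print(quorum(2))  # 2
# RF=3 would tolerate one down replica at quorum.
print(quorum(3))  # 2
```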

 
> 
> - Would increasing logging level to ‘DEBUG’ show read-repair 
> activity (to confirm that this is happening, when & for what proportion of 
> total requests)?
It would, but the INFO-level logging for the AntiEntropyService (AES) is pretty 
good. I would hold off for now. 

> 
> - Is there something obvious that I could be missing here?
When a new AES session starts it logs this:

    logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s",
        getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));

When it completes it logs this:

    logger.info(String.format("[repair #%s] session completed successfully", getName()));

Or this on failure:

    logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);
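If it helps to audit a long repair run, those three messages can be paired up by session id. A rough sketch (the regexes follow the log lines above; the sample lines and reading from the node's system.log are assumptions):

```python
import re

# Match the AES messages shown above by their "[repair #<id>]" prefix.
START = re.compile(r"\[repair #(\S+)\] new session")
OK    = re.compile(r"\[repair #(\S+)\] session completed successfully")
FAIL  = re.compile(r"\[repair #(\S+)\] session completed with the following error")

def repair_outcomes(lines):
    """Map each repair session id to 'running', 'ok' or 'failed'."""
    outcome = {}
    for line in lines:
        if (m := START.search(line)):
            outcome.setdefault(m.group(1), "running")
        elif (m := OK.search(line)):
            outcome[m.group(1)] = "ok"
        elif (m := FAIL.search(line)):
            outcome[m.group(1)] = "failed"
    return outcome

# Made-up sample lines in the format above; in practice read them from
# the node's system.log.
sample = [
    "INFO [repair #f2a7] new session: will sync /10.0.0.1 on range (0,100] for ks.cf",
    "INFO [repair #f2a7] session completed successfully",
    "INFO [repair #9c31] new session: will sync /10.0.0.2 on range (100,200] for ks.cf",
]
print(repair_outcomes(sample))  # {'f2a7': 'ok', '9c31': 'running'}
```

A session left at 'running' at the end of the log is worth a closer look, since a failed or hung session leaves that range unrepaired.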


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/02/2013, at 9:56 PM, Brian Fleming <bigbrianflem...@gmail.com> wrote:

> 
>  
> 
> Hi,
> 
>  
> 
> I have a 20 node cluster running v1.0.7 split between 5 data centres, each 
> with an RF of 2, containing a ~1TB unique dataset/~10TB of total data. 
> 
>  
> 
> I’ve had some intermittent issues with data consistency & availability in a 
> new data centre (3 nodes, RF=2) I brought online late last year: I’d request 
> data, nothing would be returned, then I’d re-request the data and it would 
> correctly be returned, i.e. read-repair appeared to be occurring.  However, 
> running repairs on the nodes (both general ‘repair’ commands and targeted 
> keyspace commands) didn’t alter the behaviour.
> 
>  
> 
> After a lot of fruitless investigation, I decided to wipe & 
> re-install/re-populate the nodes.  The re-install & repair operations are now 
> complete: I see the expected amount of data on the nodes, however I am still 
> seeing the same behaviour, i.e. I only get data after one failed attempt.
> 
>  
> 
> When I run repair commands, I don’t see any errors in the logs. 
> 
> I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during 
> repair sessions.
> 
> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total 
> ‘MutationStage’ count.
> 
>  
> 
> Questions :
> 
> - Could anybody suggest anything specific to look at to see why the 
> repair operations aren’t having the desired effect? 
> 
> - Would increasing logging level to ‘DEBUG’ show read-repair 
> activity (to confirm that this is happening, when & for what proportion of 
> total requests)?
> 
> - Is there something obvious that I could be missing here?
> 
>  
> 
> Many thanks,
> 
> Brian
> 
>  
> 
