> I’d request data, nothing would be returned, I would then re-request the data
> and it would correctly be returned:

What CL are you using for reads and writes?
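For context on why the CL matters here, a quick sketch of the quorum arithmetic for this topology (5 DCs at RF 2; the class and method names are just for illustration). A quorum is (replication factor / 2) + 1, rounded down:

```java
// Illustrative quorum arithmetic for a 5-DC cluster with RF 2 per DC.
public class QuorumMath {
    // quorum(n) = floor(n / 2) + 1
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    public static void main(String[] args) {
        int dcCount = 5;
        int rfPerDc = 2;
        int totalRf = dcCount * rfPerDc; // 10 replicas cluster-wide

        // Cluster-wide QUORUM: 6 of 10 replicas must respond.
        System.out.println("QUORUM: " + quorum(totalRf) + " of " + totalRf);

        // LOCAL_QUORUM with RF 2: 2 of 2 local replicas, i.e. no redundancy
        // in the DC -- one node down and local quorum reads/writes fail.
        System.out.println("LOCAL_QUORUM: " + quorum(rfPerDc) + " of " + rfPerDc);
    }
}
```

This is the arithmetic behind the point below about RF 2 per DC: LOCAL_QUORUM would require both local replicas, so a single node outage breaks it.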
> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total
> ‘MutationStage’ count.

Dropped mutations in a multi DC setup may be a sign of network congestion or overloaded nodes.

> - Could anybody suggest anything specific to look at to see why the
> repair operations aren’t having the desired effect?

I would first build a test case to ensure correct operation when using strong consistency, i.e. QUORUM write and read. Because you are using RF 2 per DC, I assume you are not using LOCAL_QUORUM, because that is 2 and you would not have any redundancy in the DC.

> - Would increasing logging level to ‘DEBUG’ show read-repair
> activity (to confirm that this is happening, when & for what proportion of
> total requests)?

It would, but the INFO logging for the AES is pretty good. I would hold off for now.

> - Is there something obvious that I could be missing here?

When a new AES session starts it logs this:

    logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s", getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));

When it completes it logs this:

    logger.info(String.format("[repair #%s] session completed successfully", getName()));

Or this on failure:

    logger.error(String.format("[repair #%s] session completed with the following error", getName()), exception);

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/02/2013, at 9:56 PM, Brian Fleming <bigbrianflem...@gmail.com> wrote:

> Hi,
>
> I have a 20 node cluster running v1.0.7 split between 5 data centres, each
> with an RF of 2, containing a ~1TB unique dataset/~10TB of total data.
>
> I’ve had some intermittent issues with a new data centre (3 nodes, RF=2) I
> brought online late last year with data consistency & availability: I’d
> request data, nothing would be returned, I would then re-request the data and
> it would correctly be returned: i.e.
> read-repair appeared to be occurring.
>
> However running repairs on the nodes didn’t resolve this (I tried general
> ‘repair’ commands as well as targeted keyspace commands) – this didn’t alter
> the behaviour.
>
> After a lot of fruitless investigation, I decided to wipe &
> re-install/re-populate the nodes. The re-install & repair operations are now
> complete: I see the expected amount of data on the nodes, however I am still
> seeing the same behaviour, i.e. I only get data after one failed attempt.
>
> When I run repair commands, I don’t see any errors in the logs.
>
> I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during
> repair sessions.
>
> I see a number of dropped ‘MUTATION’ operations : just under 5% of the total
> ‘MutationStage’ count.
>
> Questions :
>
> - Could anybody suggest anything specific to look at to see why the
> repair operations aren’t having the desired effect?
>
> - Would increasing logging level to ‘DEBUG’ show read-repair
> activity (to confirm that this is happening, when & for what proportion of
> total requests)?
>
> - Is there something obvious that I could be missing here?
>
> Many thanks,
>
> Brian
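As a practical follow-up to the AES log markers quoted in the reply above, one way to confirm that every repair session that started also completed is to tally those markers in the system log. A rough sketch (the log path is an assumption -- adjust for your install):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: tally AES repair-session outcomes by scanning the Cassandra
// system log for the INFO/ERROR markers the AES emits.
public class RepairLogCheck {
    // Count log lines containing the given session marker.
    static long count(List<String> lines, String marker) {
        return lines.stream().filter(l -> l.contains(marker)).count();
    }

    public static void main(String[] args) throws IOException {
        // Assumed default log location; pass your own path as needed.
        List<String> lines = Files.readAllLines(Path.of("/var/log/cassandra/system.log"));

        long started   = count(lines, "new session: will sync");
        long completed = count(lines, "session completed successfully");
        long failed    = count(lines, "session completed with the following error");

        // If started != completed + failed, some sessions never finished.
        System.out.println(started + " started, " + completed
                + " completed, " + failed + " failed");
    }
}
```

A mismatch between started and completed sessions would point at repairs silently stalling rather than failing with a logged error.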