I have indeed had some of those in the past. But my point is not so much to understand how I can get different counts depending on the node (I consider that a known weakness of counters), it is rather to understand why those inconsistent, divergent counter values never converge, even after a repair. Your last comment on that JIRA summarizes our problem quite well.
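For what it's worth, here is a minimal sketch of how we observe the symptom, assuming the DataStax Python driver and a hypothetical counter table ks.page_views (the node names, keyspace, table and column are all made up for illustration):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['node1', 'node2', 'node3'])  # hypothetical contact points
    session = cluster.connect('ks')

    query = "SELECT views FROM page_views WHERE page_id = 'home'"

    # At CL.ONE a single replica answers, so successive reads can return
    # different counts when the replicas have diverged.
    read_one = SimpleStatement(query, consistency_level=ConsistencyLevel.ONE)
    for _ in range(3):
        print(session.execute(read_one).one().views)

    # At CL.QUORUM the coordinator reconciles a majority of replicas, which
    # hides the divergence but does not fix the stale values on disk.
    read_quorum = SimpleStatement(query,
                                  consistency_level=ConsistencyLevel.QUORUM)
    print(session.execute(read_quorum).one().views)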
I hope that committers will figure something out.

2013/5/16 Janne Jalkanen <janne.jalka...@ecyrd.com>

> Might you be experiencing this?
> https://issues.apache.org/jira/browse/CASSANDRA-4417
>
> /Janne
>
> On May 16, 2013, at 14:49 , Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> @Rob: Thanks for the feedback.
>
> Yet I still have one unexplained behavior around repair. Are counters
> supposed to be "repaired" too? When reading at CL.ONE I can get
> different values depending on which node answers, even after a read
> repair or a full repair. Shouldn't a repair fix these discrepancies?
>
> The only way I have found to always get the same count is to read at
> CL.QUORUM, but that is a workaround, since the data itself remains
> wrong on some nodes.
>
> Any clue?
>
> Alain
>
> 2013/5/15 Edward Capriolo <edlinuxg...@gmail.com>
>
>> http://basho.com/introducing-riak-1-3/
>>
>> Introduced Active Anti-Entropy. Riak now has active anti-entropy. In
>> distributed systems, inconsistencies can arise between replicas due
>> to failure modes, concurrent updates, and physical data loss or
>> corruption. Pre-1.3 Riak already had several features for repairing
>> this “entropy”, but they all required some form of user intervention.
>> Riak 1.3 introduces automatic, self-healing properties that repair
>> entropy on an ongoing basis.
>>
>> On Wed, May 15, 2013 at 5:32 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>
>>> On Wed, May 15, 2013 at 1:27 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>> > Rob, I was wondering something. Are you a committer working on
>>> > improving repair or something similar?
>>>
>>> I am not a committer [1], but I have an active interest in potential
>>> improvements to the best practices for repair. The specific change I
>>> am considering is a modification to the default gc_grace_seconds
>>> value, which seems to have been picked out of a hat at 10 days. My
>>> view is that the current implementation of repair has such negative
>>> performance consequences that holding onto tombstones for longer
>>> than 10 days cannot possibly be as bad as the fixed cost of running
>>> repair once every 10 days. I believe this value is too low for a
>>> default (it also does not map cleanly to the work week!) and should
>>> likely be increased to 14, 21 or 28 days.
>>>
>>> > Anyway, if a committer (or any other expert) could give us some
>>> > feedback on our comments (whether we are doing things right,
>>> > whether what we observe is normal or unexplained, what is going to
>>> > be improved about repair in the future...)
>>>
>>> 1) you are doing things according to best practice
>>> 2) unfortunately, your experience of significantly degraded
>>> performance, including a blocked go-live due to repair bloat, is
>>> pretty typical
>>> 3) the things you are experiencing are part of the current
>>> implementation of repair and are also typical; however, I do not
>>> believe they are fully "explained" [2]
>>> 4) as has been mentioned further down the thread, there are
>>> discussions regarding improvements (some already committed) to both
>>> the current repair paradigm and an evolution to a new paradigm
>>>
>>> Thanks to all for the responses so far, please keep them coming! :D
>>>
>>> =Rob
>>> [1] hence the (unofficial) tag for this thread. I do have minor
>>> patches accepted to the codebase, but always merged by an actual
>>> committer.
>>> :)
>>> [2] driftx@#cassandra feels that these things are explained/understood
>>> by the core team, and points to
>>> https://issues.apache.org/jira/browse/CASSANDRA-5280 as a useful
>>> approach to minimizing same.
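PS: following Rob's remark on gc_grace_seconds, the sketch below is the
kind of change we are considering on our side. It only illustrates the
mechanics; the table name is again hypothetical, and the snippet assumes
the same Python driver as above:

    from cassandra.cluster import Cluster

    cluster = Cluster(['node1'])  # hypothetical contact point
    session = cluster.connect()

    # The shipped default is 864000 s (10 days); 1209600 s is 14 days.
    # gc_grace_seconds must stay longer than the interval between
    # successful repairs, or deleted data can reappear.
    session.execute(
        "ALTER TABLE ks.page_views WITH gc_grace_seconds = 1209600"
    )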