I have indeed had some of those in the past. But my point is not so much to understand how I can get different counts depending on the node (I consider that a known weakness of counters), it is rather to understand why those inconsistent, divergent counter values never converge, even after a repair. Your last comment on that JIRA summarizes our problem quite well.
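For what it's worth, here is a minimal sketch of how we observe the symptom, assuming the DataStax Python driver and a hypothetical counter table ks.page_views (the node names, keyspace, table and column are all made up for illustration):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['node1', 'node2', 'node3'])  # hypothetical contact points
    session = cluster.connect('ks')

    query = "SELECT views FROM page_views WHERE page_id = 'home'"

    # At CL.ONE a single replica answers, so successive reads can return
    # different counts when the replicas have diverged.
    read_one = SimpleStatement(query, consistency_level=ConsistencyLevel.ONE)
    for _ in range(3):
        print(session.execute(read_one).one().views)

    # At CL.QUORUM the coordinator reconciles a majority of replicas, which
    # hides the divergence but does not fix the stale values on disk.
    read_quorum = SimpleStatement(query,
                                  consistency_level=ConsistencyLevel.QUORUM)
    print(session.execute(read_quorum).one().views)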
I hope that committers will figure something out.

2013/5/16 Janne Jalkanen <janne.jalka...@ecyrd.com>

> Might you be experiencing this?
> https://issues.apache.org/jira/browse/CASSANDRA-4417
>
> /Janne
>
> On May 16, 2013, at 14:49 , Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> @Rob: Thanks for the feedback.
>
> Yet I still have one unexplained behavior around repair. Are counters
> supposed to be "repaired" too? When reading at CL.ONE I can get
> different values depending on which node answers, even after a read
> repair or a full repair. Shouldn't a repair fix these discrepancies?
>
> The only way I have found to always get the same count is to read at
> CL.QUORUM, but that is a workaround, since the data itself remains
> wrong on some nodes.
>
> Any clue?
>
> Alain
>
> 2013/5/15 Edward Capriolo <edlinuxg...@gmail.com>
>
>> http://basho.com/introducing-riak-1-3/
>>
>> Introduced Active Anti-Entropy. Riak now has active anti-entropy. In
>> distributed systems, inconsistencies can arise between replicas due
>> to failure modes, concurrent updates, and physical data loss or
>> corruption. Pre-1.3 Riak already had several features for repairing
>> this “entropy”, but they all required some form of user intervention.
>> Riak 1.3 introduces automatic, self-healing properties that repair
>> entropy on an ongoing basis.
>>
>> On Wed, May 15, 2013 at 5:32 PM, Robert Coli <rc...@eventbrite.com> wrote:
>>
>>> On Wed, May 15, 2013 at 1:27 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>>> > Rob, I was wondering something. Are you a committer working on
>>> > improving repair or something similar?
>>>
>>> I am not a committer [1], but I have an active interest in potential
>>> improvements to the best practices for repair. The specific change I
>>> am considering is a modification to the default gc_grace_seconds
>>> value, which seems to have been picked out of a hat at 10 days. My
>>> view is that the current implementation of repair has such negative
>>> performance consequences that holding onto tombstones for longer
>>> than 10 days cannot possibly be as bad as the fixed cost of running
>>> repair once every 10 days. I believe this value is too low for a
>>> default (it also does not map cleanly to the work week!) and should
>>> likely be increased to 14, 21 or 28 days.
>>>
>>> > Anyway, if a committer (or any other expert) could give us some
>>> > feedback on our comments (whether we are doing things right,
>>> > whether what we observe is normal or unexplained, what is going to
>>> > be improved about repair in the future...)
>>>
>>> 1) you are doing things according to best practice
>>> 2) unfortunately, your experience of significantly degraded
>>> performance, including a blocked go-live due to repair bloat, is
>>> pretty typical
>>> 3) the things you are experiencing are part of the current
>>> implementation of repair and are also typical; however, I do not
>>> believe they are fully "explained" [2]
>>> 4) as has been mentioned further down the thread, there are
>>> discussions regarding improvements (some already committed) to both
>>> the current repair paradigm and an evolution to a new paradigm
>>>
>>> Thanks to all for the responses so far, please keep them coming! :D
>>>
>>> =Rob
>>> [1] hence the (unofficial) tag for this thread. I do have minor
>>> patches accepted to the codebase, but always merged by an actual
>>> committer.
>>> :)
>>> [2] driftx@#cassandra feels that these things are explained/understood
>>> by the core team, and points to
>>> https://issues.apache.org/jira/browse/CASSANDRA-5280 as a useful
>>> approach to minimizing same.
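PS: following Rob's remark on gc_grace_seconds, the sketch below is the
kind of change we are considering on our side. It only illustrates the
mechanics; the table name is again hypothetical, and the snippet assumes
the same Python driver as above:

    from cassandra.cluster import Cluster

    cluster = Cluster(['node1'])  # hypothetical contact point
    session = cluster.connect()

    # The shipped default is 864000 s (10 days); 1209600 s is 14 days.
    # gc_grace_seconds must stay longer than the interval between
    # successful repairs, or deleted data can reappear.
    session.execute(
        "ALTER TABLE ks.page_views WITH gc_grace_seconds = 1209600"
    )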