That makes sense going forward (assuming it works), but this is still
pretty surprising behaviour. Even disregarding the read repair factor
entirely, and accepting that the result will *eventually* become true once
the tombstones are purged, we're still returning a result that doesn't
match what we have on disk. I guess the question becomes whether it is
less surprising to return what is actually on disk versus what the data
will look like in the future. Read repair throws a spanner in the works
because it could fix this issue, but there's no guarantee that it would
(the tombstones may be purged prior to the read). TBH the read repair not
working is what makes this most noticeable, and most surprising. If it did
work, you'd probably be in for nastier, harder-to-identify surprises down
the line. Ideally the read repair wouldn't be attempted at all and no one
would ever have to investigate what's in the SSTables :p. That may be
possible and not terribly disruptive, but in the scheme of things it's
just nitpicking.

Anyway, at the very least we should document this behaviour (if we haven't
already), because it's pretty complicated and someone is bound to run into it
again. I know I'm probably going to forget about it and get confused when we
run into it next time.



On 20 June 2018 at 16:35, sankalp kohli <kohlisank...@gmail.com> wrote:

> I agree with Stefan that we should use incremental repair and use the patches
> from Marcus to drop tombstones only from repaired data.
> Regarding deep repair, you can bump the gc grace and run the repair. The issue
> will be that you will stream a lot of data, and your blocking read repairs
> will also go up when you bump the gc grace to a higher value.
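>
> A rough sketch of that workaround, using the table from the repro further down
> the thread (the 10-day value and the exact nodetool invocation are illustrative
> assumptions, not something prescribed here):
>
> ALTER TABLE foo.bar WITH gc_grace_seconds = 864000;  -- temporarily raise gc_grace (10 days)
> $ nodetool repair -full foo bar                      # propagate the tombstones to the node that missed them
> ALTER TABLE foo.bar WITH gc_grace_seconds = 30;      -- restore the original value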
>
> On Wed, Jun 20, 2018 at 1:10 AM Stefan Podkowinski <s...@apache.org>
> wrote:
>
> > Sounds like an older issue that I tried to address two years ago:
> > https://issues.apache.org/jira/browse/CASSANDRA-11427
> >
> > As you can see, the result wasn't as expected and we got some
> > unintended side effects from the patch. I'm not sure I'd be willing
> > to give this another try, considering that the behaviour we'd like to fix
> > in the first place is rather harmless and the read repairs shouldn't happen
> > at all for users who regularly run repairs within gc_grace.
> >
> > What I'd suggest is to think more into the direction of a
> > post-full-repair-world and to fully embrace incremental repairs, as
> > fixed by Blake in 4.0. In that case, we should stop doing read repairs
> > at all for repaired data, as described in
> > https://issues.apache.org/jira/browse/CASSANDRA-13912. RRs are certainly
> > useful, but can be very risky if not very very carefully implemented. So
> > I'm wondering if we shouldn't disable RRs for everything but unrepaired
> > data. I'd btw also be interested to hear any opinions on this in the
> > context of transient replicas.
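> >
> > For what it's worth, 4.0 also adds a coarser, table-level read_repair
> > option; the example below (using the table from the repro further down)
> > turns read repair off for the whole table rather than just for repaired
> > data, so it's only an approximation of the above:
> >
> > ALTER TABLE foo.bar WITH read_repair = 'NONE';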
> >
> >
> > On 20.06.2018 03:07, Jay Zhuang wrote:
> > > Hi,
> > >
> > > We know that deleted data may re-appear if repair is not run within
> > > gc_grace_seconds: when the tombstone is not propagated to all nodes, the
> > > data will re-appear. But this also causes the following 2 issues before
> > > the tombstone is compacted away:
> > > a. inconsistent query result
> > >
> > > With consistency level ONE or QUORUM, it may or may not return the value.
> > >
> > > b. lots of read repairs that don't repair anything
> > >
> > > With consistency level ALL, it always triggers a read repair.
> > > With consistency level QUORUM, it also very likely (2/3 of the time)
> > > triggers a read repair. But the read repair doesn't fix the data, so it
> > > happens on every read.
> > >
> > >
> > > Here are the steps to reproduce it:
> > >
> > > 1. Create a 3 nodes cluster
> > > 2. Create a table (with small gc_grace_seconds):
> > >
> > > CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy',
> > > 'replication_factor': 3};
> > > CREATE TABLE foo.bar (
> > >     id int PRIMARY KEY,
> > >     name text
> > > ) WITH gc_grace_seconds=30;
> > >
> > > 3. Insert data with consistency all:
> > >
> > > INSERT INTO foo.bar (id, name) VALUES(1, 'cstar');
> > >
> > > 4. stop 1 node
> > >
> > > $ ccm node2 stop
> > >
> > > 5. Delete the data with consistency quorum:
> > >
> > > DELETE FROM foo.bar WHERE id=1;
> > >
> > > 6. Wait 30 seconds and then start node2:
> > >
> > > $ ccm node2 start
> > >
> > > Now the tombstone is on node1 and node3 but not on node2.
> > >
> > > With a quorum read, it may or may not return the value, and read repair
> > > will send the data from node2 to node1 and node3, but it doesn't repair
> > > anything.
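> > >
> > > One way to see both symptoms from cqlsh and on disk (the sstable paths are
> > > placeholders, not real file names):
> > >
> > > cqlsh> CONSISTENCY QUORUM;
> > > cqlsh> SELECT * FROM foo.bar WHERE id = 1;   -- sometimes returns the row, sometimes nothing
> > >
> > > $ ccm node1 nodetool flush                   # flush so the tombstone is in an sstable (same for node2)
> > > $ sstabledump <node1 Data.db file>           # shows the (now purgeable) tombstone for id=1
> > > $ sstabledump <node2 Data.db file>           # shows the live row that read repair keeps resending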
> > >
> > > I'd like to discuss a few potential solutions and workarounds:
> > >
> > > 1. Can hints replay send GCed tombstones?
> > >
> > > 2. Can we have a "deep repair" which detects such issues and repairs the
> > > GCed tombstones? Or temporarily increase gc_grace_seconds for the repair?
> > >
> > > What other suggestions do you have if a user is hitting this issue?
> > >
> > >
> > > Thanks,
> > >
> > > Jay
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>
