We've seen this before but couldn't tie it to GCGS, so we ended up
forgetting about it. Now, with a reproducible test case, things make much
more sense and we should be able to fix it.
This is almost certainly a bug in how partition deletions interact with GC
grace seconds: read repair isn't propagating the partition-level deletion
when the deletion is older than GCGS. As a workaround, increasing GCGS so
that the problematic deletion falls back inside the grace period, and then
triggering a read repair, will fix the issue. The only way to avoid the
problem entirely is to repair the deletion within GCGS.
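
For anyone who needs that workaround in the meantime, a rough sketch in
cqlsh against the repro table below (values are illustrative; gc_grace_seconds
just needs to exceed the age of the problematic deletion, and should
presumably be reset once the tombstone has been propagated):

ALTER TABLE foo.bar WITH gc_grace_seconds = 86400;
CONSISTENCY ALL
SELECT * FROM foo.bar WHERE id = 1;  -- triggers a read repair that now carries the tombstone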

I'm looking into the bug now... I took the liberty of creating a JIRA for
the issue: https://issues.apache.org/jira/browse/CASSANDRA-14532


On 20 June 2018 at 01:07, Jay Zhuang <z...@uber.com.invalid> wrote:

> Hi,
>
> We know that deleted data may re-appear if repair is not run within
> gc_grace_seconds: when the tombstone is not propagated to all nodes, the
> data will re-appear. But it also causes the following 2 issues before the
> tombstone is compacted away:
> a. Inconsistent query results
>
> With consistency level ONE or QUORUM, a read may or may not return the value.
>
> b. Lots of read repairs that don't repair anything
>
> With consistency level ALL, a read always triggers a read repair. With
> consistency level QUORUM, a read is also very likely (2/3 of the time) to
> trigger one. But the read repair doesn't fix the data, so it keeps
> happening on every query.
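>
> For example, with the foo.bar table from the repro steps below, the reads
> look roughly like this in cqlsh:
>
> CONSISTENCY QUORUM
> SELECT * FROM foo.bar WHERE id = 1;  -- sometimes returns the row, sometimes nothing
>
> CONSISTENCY ALL
> SELECT * FROM foo.bar WHERE id = 1;  -- always a read repair, but nothing gets fixed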
>
>
> Here are the steps to reproduce:
>
> 1. Create a 3-node cluster
> 2. Create a table (with small gc_grace_seconds):
>
> CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': 3};
> CREATE TABLE foo.bar (
>     id int PRIMARY KEY,
>     name text
> ) WITH gc_grace_seconds=30;
>
> 3. Insert data with consistency all:
>
> INSERT INTO foo.bar (id, name) VALUES(1, 'cstar');
>
> 4. Stop 1 node:
>
> $ ccm node2 stop
>
> 5. Delete the data with consistency quorum:
>
> DELETE FROM foo.bar WHERE id=1;
>
> 6. Wait 30 seconds and then start node2:
>
> $ ccm node2 start
>
> Now the tombstone is on node1 and node3 but not on node2.
>
> With a quorum read, it may or may not return the value, and read repair will
> send the data from node2 to node1 and node3, but it doesn't repair anything.
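>
> One way to confirm where the tombstone actually lives (commands and paths
> are illustrative for a ccm cluster): flush each node and dump the sstables.
> node1 and node3 should show a partition-level deletion for id=1, while
> node2 only has the live row:
>
> $ ccm node1 nodetool flush        # repeat for node2 and node3
> $ sstabledump <path to node1's foo.bar Data.db sstable>   # shows the partition-level deletion
> $ sstabledump <path to node2's foo.bar Data.db sstable>   # shows only the live row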
>
> I'd like to discuss a few potential solutions and workarounds:
>
> 1. Can hints replay send GCed tombstones?
>
> 2. Can we have a "deep repair" which detects such issues and repairs the GCed
> tombstones? Or temporarily increase gc_grace_seconds for the repair?
>
> What other suggestions do you have if a user is hitting this issue?
>
>
> Thanks,
>
> Jay
>
