Hi folks,

I have a question about a design choice on how expiring cells are
reconciled with tombstones.  For two cells with the same timestamp, if
one is expiring and one is a tombstone, Cassandra *always* prefers the
tombstone.  This matches its behavior for normal/non-expiring cells, but
the folks in my organization worry about what it may imply for nodes
experiencing clock skew.  Specifically, we're concerned about scenarios
like the following:

1) An expiring cell is committed via some node with a non-skewed clock.
2) Another replica for that cell experiences forward clock skew and
decides that the cell is expired.  It eventually runs a compaction that
converts the cell to a tombstone.
3) The tombstone propagates to other nodes via, e.g., node repair.
4) The other nodes all eventually run their own compactions.  Because of
the reconciliation logic, the expiring cell is purged on all of the
replicas, leaving behind only the tombstone.

If the cell should have still been live at (4), the reconciliation logic
will result in it being prematurely purged.  We have confirmed this
behavior experimentally.

My organization may be more concerned about clock skew than the larger
community, so we're not inclined to propose an upstream patch at this
time.  But to account for this kind of scenario we would like to patch
our internal version of Cassandra to conditionally prefer expiring cells
to tombstones if the node believes they should still be live; i.e., in
reconcile() in *ExpiringCell.java, instead of:

        if (cell instanceof DeletedCell)
            return cell;

use:

        if (cell instanceof DeletedCell)
            return isLive() ? this : cell;
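
To make the difference concrete, here is a minimal, self-contained sketch
of the two tie-breaking rules.  It is a simplified model, not Cassandra's
actual classes: ReconcileSketch, Kind, Cell, reconcileExisting() and
reconcileProposed() are hypothetical names, the value comparison that
real cells perform on a timestamp tie is omitted, and "now" is passed in
explicitly rather than read from the node's clock.  It assumes, as the
scenario above implies, that the tombstone carries the same timestamp as
the expiring cell it replaced.

```java
// Simplified model of the tie-break between an expiring cell and a tombstone.
// All names here are hypothetical; this is not Cassandra's implementation.
final class ReconcileSketch {
    enum Kind { EXPIRING, TOMBSTONE }

    static final class Cell {
        final Kind kind;
        final long timestamp;          // write timestamp, arbitrary units
        final int localExpirationTime; // seconds; ignored for tombstones

        Cell(Kind kind, long timestamp, int localExpirationTime) {
            this.kind = kind;
            this.timestamp = timestamp;
            this.localExpirationTime = localExpirationTime;
        }

        // An expiring cell is live until this node's clock reaches its expiration time.
        boolean isLive(int nowInSeconds) {
            return kind == Kind.EXPIRING && nowInSeconds < localExpirationTime;
        }
    }

    // Existing rule: on a timestamp tie, the tombstone always wins.
    static Cell reconcileExisting(Cell expiring, Cell other) {
        if (expiring.timestamp != other.timestamp)
            return expiring.timestamp < other.timestamp ? other : expiring;
        if (other.kind == Kind.TOMBSTONE)
            return other;
        return expiring; // real code would compare values here
    }

    // Proposed rule: on a tie, keep the expiring cell while this node thinks it is live.
    static Cell reconcileProposed(Cell expiring, Cell other, int nowInSeconds) {
        if (expiring.timestamp != other.timestamp)
            return expiring.timestamp < other.timestamp ? other : expiring;
        if (other.kind == Kind.TOMBSTONE)
            return expiring.isLive(nowInSeconds) ? expiring : other;
        return expiring; // real code would compare values here
    }

    public static void main(String[] args) {
        Cell expiring = new Cell(Kind.EXPIRING, 100L, 2000);
        Cell tombstone = new Cell(Kind.TOMBSTONE, 100L, 0); // produced by the skewed replica
        int now = 1000; // a non-skewed node's clock: the cell should still be live

        System.out.println("existing: " + reconcileExisting(expiring, tombstone).kind);
        System.out.println("proposed: " + reconcileProposed(expiring, tombstone, now).kind);
    }
}
```

In this model, with now = 1000 and an expiration time of 2000, the
existing rule returns the tombstone and the proposed rule returns the
expiring cell.  Once now reaches 2000, both rules return the tombstone.
Note that the proposed rule makes the result depend on the local clock,
so two replicas with different clocks could reconcile the same pair of
cells differently.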

Before we do so, however, we'd like to understand the rationale for the
existing behavior and the risks of making changes to it.  Why does
Cassandra consistently prefer tombstones to other kinds of cells?  By
modifying this behavior in this particular case, do we risk hitting
bizarre corner cases?

Thanks,
SK
