Hi folks,

I have a question about a design choice in how expiring cells are reconciled with tombstones. For two cells with the same timestamp, if one is expiring and one is a tombstone, Cassandra *always* prefers the tombstone. This matches its behavior for normal/non-expiring cells, but the folks in my organization worry about what it may imply for nodes experiencing clock skew. Specifically, we're concerned about scenarios like the following:

1) An expiring cell is committed via some node with a non-skewed clock.
2) Another replica for that cell experiences forward clock skew and decides that the cell is expired. It eventually runs a compaction that converts the cell to a tombstone.
3) The tombstone propagates to other nodes via, e.g., node repair.
4) The other nodes all eventually run their own compactions. Because of the reconciliation logic, the expiring cell is purged on all of the replicas, leaving behind only the tombstone.

If the cell should have still been live at (4), the reconciliation logic will result in it being prematurely purged. We have confirmed this behavior experimentally.

My organization may be more concerned about clock skew than the larger community, so I don't think we're inclined to propose a patch at this time. But to account for this kind of scenario we would like to patch our internal version of Cassandra to conditionally prefer expiring cells to tombstones if the node believes they should still be live; i.e., in reconcile() in *ExpiringCell.java, instead of:

    if (cell instanceof DeletedCell) return cell;

use:

    if (cell instanceof DeletedCell) return isLive() ? this : cell;

Before we do so, however, we'd like to understand the rationale for the existing behavior and the risks of making changes to it. Why does Cassandra consistently prefer tombstones to other kinds of cells? By modifying this behavior in this particular case, do we risk hitting bizarre corner cases?

Thanks,
SK
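
For reference, here is a minimal, self-contained sketch of the tie-breaking rule discussed above, under both the existing and the proposed behavior. It is a simplified model and not Cassandra's actual code: the class ReconcileSketch, the Cell type, its fields, and the explicit nowSec parameter are hypothetical names used only for illustration, and tie-breaking between two non-tombstone cells is left out.

```java
// Simplified, hypothetical model of the timestamp tie-break; not Cassandra's real classes.
final class ReconcileSketch {
    enum Kind { EXPIRING, TOMBSTONE }

    static final class Cell {
        final Kind kind;
        final long timestamp;   // write timestamp used for reconciliation
        final int localTimeSec; // EXPIRING: expiration time; TOMBSTONE: deletion time

        Cell(Kind kind, long timestamp, int localTimeSec) {
            this.kind = kind;
            this.timestamp = timestamp;
            this.localTimeSec = localTimeSec;
        }

        // An expiring cell is live while this node's clock is before its expiration time.
        boolean isLive(int nowSec) {
            return kind == Kind.EXPIRING && nowSec < localTimeSec;
        }
    }

    // Existing behavior as described above: on a timestamp tie, the tombstone always wins.
    static Cell reconcileCurrent(Cell expiring, Cell other) {
        if (expiring.timestamp != other.timestamp)
            return expiring.timestamp < other.timestamp ? other : expiring;
        if (other.kind == Kind.TOMBSTONE)
            return other;
        return expiring; // further tie-breaking omitted in this sketch
    }

    // Proposed behavior: on a tie, keep the expiring cell while this node still considers it live.
    static Cell reconcileProposed(Cell expiring, Cell other, int nowSec) {
        if (expiring.timestamp != other.timestamp)
            return expiring.timestamp < other.timestamp ? other : expiring;
        if (other.kind == Kind.TOMBSTONE)
            return expiring.isLive(nowSec) ? expiring : other;
        return expiring; // further tie-breaking omitted in this sketch
    }

    public static void main(String[] args) {
        // Cell written with a TTL that ends at t=200 on a correct clock.
        Cell expiring = new Cell(Kind.EXPIRING, 1000L, 200);
        // Tombstone produced early by a replica with forward clock skew, same write timestamp.
        Cell tombstone = new Cell(Kind.TOMBSTONE, 1000L, 100);
        int now = 150; // a healthy node's clock: the TTL has not elapsed yet

        System.out.println("current:  " + reconcileCurrent(expiring, tombstone).kind);
        System.out.println("proposed: " + reconcileProposed(expiring, tombstone, now).kind);
    }
}
```

In this model, a healthy node at t=150 drops the still-live cell under the existing rule and keeps it under the proposed one; once its own clock passes t=200, both rules return the tombstone.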