maedhroz commented on PR #4748:
URL: https://github.com/apache/cassandra/pull/4748#issuecomment-4270929240
I've made a couple changes to ensure that the blocks we write in the cursor
path are the same as the non-cursor path. Here's Claude's summary:
### The Bug
In the original `addIndexBlock()`:
```java
// first clustering is already in, write last entry
if (rowIndexEntryLastClustering.clusteringLength() == 0) {
// first entry is the last entry, copy it
...
} else {
writeClusteringToRowIndexEntries(rowIndexEntryLastClustering);
rowIndexEntryLastClustering.resetClustering();
}
```
`clusteringLength() == 0` was used as a sentinel meaning *"no last
clustering was set, so the first clustering is also the last"*. But this
conflates two distinct states:
1. **Only one unfiltered in the block** → `rowIndexEntryLastClustering` was
never populated via `copy()`, so its length is 0 by default. The first entry
should be copied as the last. ✓
2. **A range tombstone bound with zero clustering columns** (e.g., on a
table with no clustering columns, or a bound that serializes to 0 bytes) →
`clusteringLength()` is also 0, but a distinct last clustering *was* set. The
first entry should **not** be copied as the last. ✗
In case 2, the original code incorrectly copies the first clustering as the
last, producing a malformed `IndexInfo` entry where `firstName == lastName`
even though the block spans multiple unfilteds. This corrupts the index block
boundaries, causing the `IndexState` seek logic to misidentify which block to
start reading from during range queries — specifically, it can skip the block
containing the range tombstone bound entirely, making the open marker invisible
to the reader.
### Why the Three Tests Catch This
The tests use `(pk text, ck text, ...)` — `ck` is a single `text` clustering
column, so bounds like `INCL_START_BOUND(r027)` have non-zero clustering bytes.
The bug wouldn't trigger on *these specific* bounds. However, the
`testMidBlockRangeTombstoneInLastBlock` test writes both an `INCL_START_BOUND`
and an `EXCL_END_BOUND` (or `INCL_END_BOUND`) into the last block. The sequence
within that block is:
```
row r023 ← first clustering of block (written to rowIndexEntries directly)
row r024
INCL_START_BOUND(r024) ← range tombstone open
INCL_END_BOUND(r025) ← range tombstone close ← last clustering tracked
in rowIndexEntryLastClustering
row r026
...
```
The `lastClusteringEmpty` flag is needed for a subtler case: when
`unfilteredDescriptor.clusteringLength() == 0` is detected in the `else` branch
of `updateMetadataAndIndexBlock`, it means a real unfiltered with empty
clustering bytes was encountered as the last item. Without the flag,
`addIndexBlock` can't distinguish this from "no last clustering set."
### The Fix — Line by Line
**1. Field declaration moved to class level (was already there in the
original, but the patch shows it being added):**
```java
private boolean lastClusteringEmpty = false;
```
This is the discriminator flag.
**2. Reset in `writePartitionStart`:**
```java
lastClusteringEmpty = false;
```
Correct — must be reset per partition, not just per index block, since
`addIndexBlock` resets it at the end of each block call.
**3. Set the flag in `updateMetadataAndIndexBlock`:**
```java
else
{
lastClusteringEmpty = unfilteredDescriptor.clusteringLength() == 0;
rowIndexEntryLastClustering.copy(unfilteredDescriptor);
}
```
When the last-seen unfiltered has zero clustering bytes, record that fact
separately from the clustering data itself.
**4. The guard in `addIndexBlock`:**
```java
if (!lastClusteringEmpty && rowIndexEntryLastClustering.clusteringLength()
== 0)
{
// first entry is the last entry, copy it
...
}
else
{
writeClusteringToRowIndexEntries(rowIndexEntryLastClustering);
rowIndexEntryLastClustering.resetClustering();
}
lastClusteringEmpty = false;
```
The condition now correctly means: *"the last clustering was never set (only
one unfiltered in this block) AND it's not the case that we explicitly saw a
zero-length clustering"*. The `lastClusteringEmpty = false` reset at the end is
correct — it's scoped per index block.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]