Re: [PR] CASSANDRA-21255 Differentiate between legitimate cases where the first entry is the same as the last entry and empty bounds in SSTableCursorWriter#addIndexBlock() [cassandra]

via GitHub Fri, 17 Apr 2026 13:06:51 -0700


maedhroz commented on PR #4748:
URL: https://github.com/apache/cassandra/pull/4748#issuecomment-4270929240


   I've made a couple changes to ensure that the blocks we write in the cursor 
path are the same as the non-cursor path. Here's Claude's summary:
   
   ### The Bug
   
   In the original `addIndexBlock()`:
   
   ```java
   // first clustering is already in, write last entry
   if (rowIndexEntryLastClustering.clusteringLength() == 0) {
       // first entry is the last entry, copy it
       ...
   } else {
       writeClusteringToRowIndexEntries(rowIndexEntryLastClustering);
       rowIndexEntryLastClustering.resetClustering();
   }
   ```
   
   `clusteringLength() == 0` was used as a sentinel meaning *"no last 
clustering was set, so the first clustering is also the last"*. But this 
conflates two distinct states:
   
   1. **Only one unfiltered in the block** → `rowIndexEntryLastClustering` was 
never populated via `copy()`, so its length is 0 by default. The first entry 
should be copied as the last. ✓
   2. **A range tombstone bound with zero clustering columns** (e.g., on a 
table with no clustering columns, or a bound that serializes to 0 bytes) → 
`clusteringLength()` is also 0, but a distinct last clustering *was* set. The 
first entry should **not** be copied as the last. ✗
   
   In case 2, the original code incorrectly copies the first clustering as the 
last, producing a malformed `IndexInfo` entry where `firstName == lastName` 
even though the block spans multiple unfilteds. This corrupts the index block 
boundaries, causing the `IndexState` seek logic to misidentify which block to 
start reading from during range queries — specifically, it can skip the block 
containing the range tombstone bound entirely, making the open marker invisible 
to the reader.
   
   ### Why the Three Tests Catch This
   
   The tests use `(pk text, ck text, ...)` — `ck` is a single `text` clustering 
column, so bounds like `INCL_START_BOUND(r027)` have non-zero clustering bytes. 
The bug wouldn't trigger on *these specific* bounds. However, the 
`testMidBlockRangeTombstoneInLastBlock` test writes both an `INCL_START_BOUND` 
and an `EXCL_END_BOUND` (or `INCL_END_BOUND`) into the last block. The sequence 
within that block is:
   
   ```
   row r023  ← first clustering of block (written to rowIndexEntries directly)
   row r024
   INCL_START_BOUND(r024)  ← range tombstone open
   INCL_END_BOUND(r025)    ← range tombstone close  ← last clustering tracked 
in rowIndexEntryLastClustering
   row r026
   ...
   ```
   
   The `lastClusteringEmpty` flag is needed for a subtler case: when 
`unfilteredDescriptor.clusteringLength() == 0` is detected in the `else` branch 
of `updateMetadataAndIndexBlock`, it means a real unfiltered with empty 
clustering bytes was encountered as the last item. Without the flag, 
`addIndexBlock` can't distinguish this from "no last clustering set."
   
   ### The Fix — Line by Line
   
   **1. Field declaration moved to class level (was already there in the 
original, but the patch shows it being added):**
   ```java
   private boolean lastClusteringEmpty = false;
   ```
   This is the discriminator flag.
   
   **2. Reset in `writePartitionStart`:**
   ```java
   lastClusteringEmpty = false;
   ```
   Correct — must be reset per partition, not just per index block, since 
`addIndexBlock` resets it at the end of each block call.
   
   **3. Set the flag in `updateMetadataAndIndexBlock`:**
   ```java
   else
   {
       lastClusteringEmpty = unfilteredDescriptor.clusteringLength() == 0;
       rowIndexEntryLastClustering.copy(unfilteredDescriptor);
   }
   ```
   When the last-seen unfiltered has zero clustering bytes, record that fact 
separately from the clustering data itself.
   
   **4. The guard in `addIndexBlock`:**
   ```java
   if (!lastClusteringEmpty && rowIndexEntryLastClustering.clusteringLength() 
== 0)
   {
       // first entry is the last entry, copy it
       ...
   }
   else
   {
       writeClusteringToRowIndexEntries(rowIndexEntryLastClustering);
       rowIndexEntryLastClustering.resetClustering();
   }
   lastClusteringEmpty = false;
   ```
   
   The condition now correctly means: *"the last clustering was never set (only 
one unfiltered in this block) AND it's not the case that we explicitly saw a 
zero-length clustering"*. The `lastClusteringEmpty = false` reset at the end is 
correct — it's scoped per index block.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] CASSANDRA-21255 Differentiate between legitimate cases where the first entry is the same as the last entry and empty bounds in SSTableCursorWriter#addIndexBlock() [cassandra]

Reply via email to