Shubham Roy created HBASE-29974:
-----------------------------------
Summary: Filter seek hints underutilized due to early circuit
breaks in scan pipeline, causing unnecessary cell-level iteration
Key: HBASE-29974
URL: https://issues.apache.org/jira/browse/HBASE-29974
Project: HBase
Issue Type: Improvement
Components: Filters, Scanners
Affects Versions: 2.5.13, 2.6.4
Reporter: Shubham Roy
Assignee: Shubham Roy
h1. Summary
The filter seek-hint infrastructure (SEEK_NEXT_USING_HINT / getNextCellHint) is
only reachable through one narrow path in the scan pipeline. Multiple earlier
circuit breaks — time range mismatch, column mismatch, version exhaustion, and
filterRowKey rejection — all short-circuit before the filter is consulted,
forcing the scanner to advance one cell at a time even when the filter could
provide a large forward jump.
h1. Background
HBase's filter API supports SEEK_NEXT_USING_HINT + getNextCellHint() to allow a
filter to tell the scanner "jump directly to this cell, skipping everything in
between." This is the most powerful skip primitive available. However, it is
only reachable via one path in matchColumn:
{code:java}
// All three must pass for filterCell to be reached:
tr.compare(timestamp) == 0 // time range gate
columns.checkColumn() == INCLUDE // column gate
columns.checkVersions() == INCLUDE* // version gate
→ filter.filterCell(cell) // only here can SEEK_NEXT_USING_HINT be returned
{code}
Every other code path bypasses filterCell entirely.
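The gate order above can be illustrated with a small toy model. This is a sketch, not HBase code: GateOrderSketch, Code, and CountingFilter are hypothetical stand-ins, and the three boolean parameters stand in for the results of the time range, column, and version checks.

```java
// Toy model only: every type below is a hypothetical stand-in, not an HBase class.
final class GateOrderSketch {
    enum Code { SKIP, SEEK_NEXT_COL, SEEK_NEXT_USING_HINT }

    /** Stand-in filter that always offers a hint and counts how often it is asked. */
    static final class CountingFilter {
        int calls = 0;

        Code filterCell() {
            calls++;
            return Code.SEEK_NEXT_USING_HINT;
        }
    }

    /** Applies the gates in the order listed above; the filter sits behind all three. */
    static Code match(boolean inTimeRange, boolean columnWanted,
                      boolean versionAllowed, CountingFilter filter) {
        if (!inTimeRange) {
            return Code.SKIP;            // time range gate: filter bypassed
        }
        if (!columnWanted) {
            return Code.SEEK_NEXT_COL;   // column gate: filter bypassed
        }
        if (!versionAllowed) {
            return Code.SEEK_NEXT_COL;   // version gate: filter bypassed
        }
        return filter.filterCell();      // the only place a hint can surface
    }

    public static void main(String[] args) {
        CountingFilter filter = new CountingFilter();
        System.out.println(match(false, true, true, filter));
        System.out.println(match(true, false, true, filter));
        System.out.println(match(true, true, false, filter));
        System.out.println(match(true, true, true, filter));
        System.out.println("filterCell calls: " + filter.calls);
    }
}
```

Of the four calls in main, only the last one reaches the stand-in filter, even though that filter would have offered a hint every time.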
h1. Problem
h2. Problem 1 — Uninteresting rows (filterRowKey=true)
When filterRowKey() returns true, the scanner calls nextRow(), which scans
forward one cell at a time via storeHeap.next(MOCKED_LIST). Inside this path,
matcher.match() is called per cell, but filterCell is only reached if a cell
passes the time range check. For rows with no cells in the scan's time range,
the time range gate fires for every cell, filterCell is never called, and the
filter's hint is unreachable. The scanner pays O(cells-in-row) cost per
rejected row rather than seeking directly to the next location.
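A rough cost model of this case, as a sketch under simplifying assumptions: cells are modelled as sorted "row/qualifier" strings, a seek is modelled as a binary search, and RejectedRowCostSketch with its methods is hypothetical. A real seek has its own cost (for example index lookups), so the numbers only illustrate how the two approaches scale.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy cost model: cells are sorted "row/qualifier" strings, not HBase cells.
final class RejectedRowCostSketch {
    static List<String> buildCells(int rows, int cellsPerRow) {
        List<String> cells = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cellsPerRow; c++) {
                cells.add(String.format(Locale.ROOT, "row%03d/q%03d", r, c));
            }
        }
        return cells; // zero-padded keys, so insertion order is already sorted
    }

    /** Cell-at-a-time skipping: one next() per cell that sorts before the target row. */
    static int nextCallsWithoutHint(List<String> cells, String targetRow) {
        int calls = 0;
        while (calls < cells.size() && cells.get(calls).compareTo(targetRow) < 0) {
            calls++;
        }
        return calls;
    }

    /** Hint-based skipping: one seek (binary search here) to the first cell >= target. */
    static int indexAfterSeek(List<String> cells, String targetRow) {
        int pos = Collections.binarySearch(cells, targetRow);
        return pos >= 0 ? pos : -(pos + 1);
    }

    public static void main(String[] args) {
        List<String> cells = buildCells(100, 20);
        System.out.println("next() calls without hint: "
            + nextCallsWithoutHint(cells, "row050"));
        System.out.println("index reached by one seek: "
            + indexAfterSeek(cells, "row050"));
    }
}
```

Both approaches land on the same cell. The cell-at-a-time path pays one call per skipped cell, while the hinted path pays for a single seek.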
h2. Problem 2 — Rows with cells outside the time range (filterRowKey=false)
Even when a row is not rejected at the row key level, cells outside the time
range hit:
{code:java}
if (tsCmp > 0) { return MatchCode.SKIP; } // filter bypassed
if (tsCmp < 0) { return columns.getNextRowOrNextColumn(cell); } // filter bypassed
{code}
The filter is never consulted. If the filter could determine a better skip
target for these cells, that capability is wasted.
h2. Problem 3 — Cells failing column or version gates (filterRowKey=false, cell in time range)
Even for cells within the time range, two further gates can short-circuit
before filterCell:
# checkColumn() ≠ INCLUDE → returns the column tracker's match code (e.g., SEEK_NEXT_COL) without consulting the filter
# checkVersions() = SKIP or SEEK_NEXT_COL → returns without consulting the filter
The column tracker can only suggest the next column or row. The filter may know
a much better target (e.g., skip several columns, or skip to a completely
different row), but is never asked.
h1. Impact
In all three cases, the scanner is forced into a cell-by-cell or row-by-row
iteration that it could avoid if the filter's hint were consulted. Filters with
efficient seeking logic (e.g., FuzzyRowFilter, ColumnRangeFilter, custom range
filters) incur unnecessary I/O proportional to the number of skipped cells/rows.
h1. Root Cause
The filter hint mechanism and the scan pipeline's short-circuit mechanism are
disconnected. Short-circuits exist for correctness and efficiency reasons (time
range, column set, version limits), but they each bypass the filter as a side
effect. The filter has no opportunity to provide a hint unless a cell passes
every prior gate.
h1. Solution
Two new purpose-built API methods are introduced on Filter (with concrete
default implementations returning null for full backward compatibility):
Filter.getHintForRejectedRow(Cell firstRowCell)
Addresses Problem 1. Called in RegionScannerImpl immediately after filterRowKey()
returns true; filterCell() is not called for that row. Gives the filter an
opportunity to provide a seek target so the scanner can bypass the cell-by-cell,
row-by-row scan through nextRow().
Contract:
* Only called after filterRowKey returns true for the same cell
* May use state derived from filterRowKey (e.g., current range pointer in
MultiRowRangeFilter)
* Must not invoke filterCell logic — callers guarantee filterCell has not been
called for this row
* Default returns null (falls through to existing nextRow() behavior)
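A minimal sketch of how this contract could look from the scanner's side. It is not the proposed patch: rows are plain strings, and RowFilter, RangeFilter, and rowsVisited are hypothetical simplifications of the Filter API and the region scanner loop.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model only: rows are plain strings and RowFilter is a hypothetical stand-in.
final class RejectedRowHintSketch {
    interface RowFilter {
        boolean filterRowKey(String row);              // true = reject the row

        default String getHintForRejectedRow(String row) {
            return null;                               // default: no hint
        }
    }

    /** Accepts rows in [start, end) and hints at start for rows sorted before it. */
    static final class RangeFilter implements RowFilter {
        private final String start;
        private final String end;

        RangeFilter(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public boolean filterRowKey(String row) {
            return row.compareTo(start) < 0 || row.compareTo(end) >= 0;
        }

        @Override
        public String getHintForRejectedRow(String row) {
            return row.compareTo(start) < 0 ? start : null;
        }
    }

    /** Counts how many rows the scan loop has to look at. */
    static int rowsVisited(NavigableSet<String> rows, RowFilter filter) {
        int visited = 0;
        String row = rows.isEmpty() ? null : rows.first();
        while (row != null) {
            visited++;
            String next = rows.higher(row);            // fallback: advance one row
            if (filter.filterRowKey(row)) {
                // The hint is requested only after filterRowKey rejected this row.
                String hint = filter.getHintForRejectedRow(row);
                if (hint != null && hint.compareTo(row) > 0) {
                    next = rows.ceiling(hint);         // seek to first row >= hint
                }
            }
            row = next;
        }
        return visited;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 100; i++) {
            rows.add(i < 10 ? "r0" + i : "r" + i);
        }
        RowFilter plain = r -> r.compareTo("r90") < 0 || r.compareTo("r95") >= 0;
        System.out.println("without hint: " + rowsVisited(rows, plain));
        System.out.println("with hint:    " + rowsVisited(rows, new RangeFilter("r90", "r95")));
    }
}
```

The lambda filter keeps the null default and falls back to row-at-a-time advancement, which mirrors the backward-compatible path. Only the filter that overrides the hint method triggers a seek.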
Filter.getSkipHint(Cell skippedCell)
Addresses Problems 2 and 3. Called at every structural short-circuit in matchColumn
before filterCell is reached. Gives the filter an opportunity to provide a seek
target for cells skipped by the time range, column, or version gate.
Contract:
* May be called for cells that have not been passed through filterCell
* Must not modify filter state (completely stateless)
* Only filters with immutable, configuration-based hint computation should
override this
* Default returns null (falls through to existing skip/seek behavior)
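A sketch of a hint that satisfies the stateless requirement, under stated assumptions: ToyCell, HintingFilter, MinQualifierFilter, and onGateMiss are hypothetical and only model the shape of the proposal. The filter holds a single final field, so the hint depends only on configuration and on the skipped cell.

```java
// Toy model only: ToyCell and HintingFilter are hypothetical stand-ins, not HBase types.
final class SkipHintSketch {
    static final class ToyCell {
        final String row;
        final String qualifier;
        final long timestamp;

        ToyCell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    interface HintingFilter {
        /** Default mirrors the proposal: null keeps the existing skip/seek behavior. */
        default ToyCell getSkipHint(ToyCell skippedCell) {
            return null;
        }
    }

    /** Holds only immutable configuration, so the hint never depends on scan state. */
    static final class MinQualifierFilter implements HintingFilter {
        private final String minQualifier;

        MinQualifierFilter(String minQualifier) {
            this.minQualifier = minQualifier;
        }

        @Override
        public ToyCell getSkipHint(ToyCell skippedCell) {
            if (skippedCell.qualifier.compareTo(minQualifier) < 0) {
                // Assumption: Long.MAX_VALUE stands for "newest version first" in this toy model.
                return new ToyCell(skippedCell.row, minQualifier, Long.MAX_VALUE);
            }
            return null;
        }
    }

    /** What a structural gate could do once it has decided to skip the cell. */
    static String onGateMiss(ToyCell skippedCell, HintingFilter filter) {
        ToyCell hint = filter.getSkipHint(skippedCell);
        return hint == null ? "SKIP" : "SEEK " + hint.row + "/" + hint.qualifier;
    }

    public static void main(String[] args) {
        HintingFilter filter = new MinQualifierFilter("q50");
        System.out.println(onGateMiss(new ToyCell("r1", "q03", 10L), filter));
        System.out.println(onGateMiss(new ToyCell("r1", "q70", 10L), filter));
    }
}
```

Because getSkipHint reads no mutable state, it is safe to call for cells that never went through filterCell, which is the situation the contract describes.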