[PR] Approach 2 (micro-row-group) reader prototype [parquet-java]

via GitHub Fri, 22 May 2026 15:28:46 -0700


alkis opened a new pull request, #3578:
URL: https://github.com/apache/parquet-java/pull/3578


   ## Summary
   
   Prototype reader path for a format extension that lets multiple 
`BlockMetaData` entries share a single physical column chunk ("Approach 2", aka 
micro-row-group). This PR is **draft / RFC** — opened for early feedback on the 
on-wire signaling and reader dispatch, not for merge.
   
   - **On-wire signaling**: `ColumnChunkMetaData.SENTINEL_OFFSET = -1` reuses 
`data_page_offset` to mark a column chunk as physically shared. 
`ColumnChunkMetaData.isPhysicallyShared()` and `BlockMetaData.isApproach2()` 
are the probes.
   - **Page lookup**: each shared block's pages are located via its 
`OffsetIndex` sidecar, with `PageLocation.first_row_index` interpreted as a 
**file-absolute** row index (not block-relative).
   - **Row trimming**: the new `ColumnChunkPageReadStore` carries an absolute 
`RowRanges` window (`RowRanges.createBetween(from, to)`) so 
`SynchronizingColumnReader` discards rows that spill across block boundaries 
when a physical page is listed by two blocks.
   - **Reader dispatch**: `ParquetFileReader.readNextRowGroup()` checks 
`block.isApproach2()` and delegates to a new `internalReadApproach2RowGroup()`. 
Legacy contiguous chunks take the existing path unchanged.
   - **Dictionary pages**: read out-of-band by `readDictionaryPageDirect()` 
because their compressed size is not in the OffsetIndex.
   
   ## Known prototype limitations
   
   Documented inline at the call sites — calling out here because they shape 
the review:
   
   - No cross-block dictionary cache: each block re-fetches and re-decodes the 
shared dict page.
   - No cross-block page cache: a `PhysicalChunkPageSource` is built per 
`readNextRowGroup()` call rather than hoisted to the file level.
   - Boundary pages listed by two adjacent blocks are read from disk twice.
   - Encrypted column chunks always return `isPhysicallyShared() == false` — 
the encrypted-metadata interaction is deliberately out of scope here.
   - Writer-side changes are not in this PR; the reader is testable today 
against fixtures produced by a separate writer prototype.
   
   ## What's in this PR
   
   - `RowRanges.createBetween(from, to)` — immutable single-range constructor 
with absolute coordinates.
   - `ColumnChunkMetaData.SENTINEL_OFFSET`, `isPhysicallyShared()`, 
sentinel-aware `getStartingPos()`.
   - `BlockMetaData.isApproach2()`, sentinel-aware `getStartingPos()` doc.
   - `ParquetFileReader.internalReadApproach2RowGroup()` + 
`readDictionaryPageDirect()` + `drainDataPagesQueue()` helpers, plus dispatch 
in `readNextRowGroup()`.
   - New `PhysicalChunkPageSource` scaffolding (single-call, not yet hoisted).
   - Unit tests for `RowRanges.createBetween`, the sentinel probe on 
`ColumnChunkMetaData`, and `BlockMetaData.isApproach2()` mode detection.
   
   ## Test plan
   
   - [ ] `mvn -pl parquet-column -Dtest=TestRowRanges test`
   - [ ] `mvn -pl parquet-hadoop 
-Dtest=TestColumnChunkMetaData,TestApproach2BlockMetaData test`
   - [ ] End-to-end read of an Approach 2 fixture file (fixture + writer 
prototype tracked separately)
   - [ ] Confirm a regular (non-Approach 2) file still takes the legacy path 
with no behavior change
   
   This pull request and its description were written by Isaac.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Approach 2 (micro-row-group) reader prototype [parquet-java]

Reply via email to