alkis opened a new pull request, #3578:
URL: https://github.com/apache/parquet-java/pull/3578
## Summary
Prototype reader path for a format extension that lets multiple
`BlockMetaData` entries share a single physical column chunk ("Approach 2", aka
micro-row-group). This PR is **draft / RFC** — opened for early feedback on the
on-wire signaling and reader dispatch, not for merge.
- **On-wire signaling**: `ColumnChunkMetaData.SENTINEL_OFFSET = -1` reuses
`data_page_offset` to mark a column chunk as physically shared.
`ColumnChunkMetaData.isPhysicallyShared()` and `BlockMetaData.isApproach2()`
are the probes.
- **Page lookup**: each shared block's pages are located via its
`OffsetIndex` sidecar, with `PageLocation.first_row_index` interpreted as a
**file-absolute** row index (not block-relative).
- **Row trimming**: the new `ColumnChunkPageReadStore` carries an absolute
`RowRanges` window (`RowRanges.createBetween(from, to)`) so
`SynchronizingColumnReader` discards rows that spill across block boundaries
when a physical page is listed by two blocks.
- **Reader dispatch**: `ParquetFileReader.readNextRowGroup()` checks
`block.isApproach2()` and delegates to a new `internalReadApproach2RowGroup()`.
Legacy contiguous chunks take the existing path unchanged.
- **Dictionary pages**: read out-of-band by `readDictionaryPageDirect()`
because their compressed size is not in the OffsetIndex.
## Known prototype limitations
Documented inline at the call sites — calling out here because they shape
the review:
- No cross-block dictionary cache: each block re-fetches and re-decodes the
shared dict page.
- No cross-block page cache: a `PhysicalChunkPageSource` is built per
`readNextRowGroup()` call rather than hoisted to the file level.
- Boundary pages listed by two adjacent blocks are read from disk twice.
- Encrypted column chunks always return `isPhysicallyShared() == false` —
the encrypted-metadata interaction is deliberately out of scope here.
- Writer-side changes are not in this PR; the reader is testable today
against fixtures produced by a separate writer prototype.
## What's in this PR
- `RowRanges.createBetween(from, to)` — immutable single-range constructor
with absolute coordinates.
- `ColumnChunkMetaData.SENTINEL_OFFSET`, `isPhysicallyShared()`,
sentinel-aware `getStartingPos()`.
- `BlockMetaData.isApproach2()`, sentinel-aware `getStartingPos()` doc.
- `ParquetFileReader.internalReadApproach2RowGroup()` +
`readDictionaryPageDirect()` + `drainDataPagesQueue()` helpers, plus dispatch
in `readNextRowGroup()`.
- New `PhysicalChunkPageSource` scaffolding (single-call, not yet hoisted).
- Unit tests for `RowRanges.createBetween`, the sentinel probe on
`ColumnChunkMetaData`, and `BlockMetaData.isApproach2()` mode detection.
## Test plan
- [ ] `mvn -pl parquet-column -Dtest=TestRowRanges test`
- [ ] `mvn -pl parquet-hadoop
-Dtest=TestColumnChunkMetaData,TestApproach2BlockMetaData test`
- [ ] End-to-end read of an Approach 2 fixture file (fixture + writer
prototype tracked separately)
- [ ] Confirm a regular (non-Approach 2) file still takes the legacy path
with no behavior change
This pull request and its description were written by Isaac.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]