schenksj opened a new issue, #22517:
URL: https://github.com/apache/datafusion/issues/22517
## Is your feature request related to a problem or challenge?
`ParquetSource` / `ParquetOpener` (in `datafusion-datasource-parquet`)
cannot emit the parquet reader's **row-number virtual column**, even though the
underlying `parquet` crate (58.x) fully supports it:
```rust
let row_number_field = Field::new("row_number", DataType::Int64, false)
    .with_extension_type(parquet::arrow::...::RowNumber);
let builder = builder.with_virtual_columns(vec![row_number_field])?;
```
The row-number virtual column gives each row its **true physical position
within the file even under row-group / page / row-filter pruning**. This is
exactly what engines need to reconstruct stable per-row identity while still
benefiting from predicate pushdown.
Concretely, this blocks **Delta Lake row tracking** (`_metadata.row_id` =
`baseRowId + physical_row_index`) on top of DataFusion: to keep the synthesized
`row_id`/`row_index` correct, an integrating engine must currently *disable*
data-filter pushdown (so the reader returns every row in physical order and a
running counter stays aligned). That defeats row-group skipping whenever
`_metadata.row_id` is projected alongside a selective filter.
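To make the misalignment concrete, here is a minimal, self-contained sketch in plain Rust (no DataFusion or parquet types). The function names and the sample values are purely illustrative assumptions. It compares a `row_id` derived from the true physical index with one derived from a running counter over the rows a filtering reader actually returns:

```rust
/// Physical indices of the rows that survive a predicate, in file order.
/// This stands in for what a reader with filter pushdown would return.
fn surviving_rows(values: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<usize> {
    values
        .iter()
        .enumerate()
        .filter(|&(_, &v)| predicate(v))
        .map(|(i, _)| i)
        .collect()
}

/// row_id = base_row_id + physical_row_index (what a row-number column enables).
fn row_ids_from_physical_index(base_row_id: i64, physical: &[usize]) -> Vec<i64> {
    physical.iter().map(|&i| base_row_id + i as i64).collect()
}

/// row_id = base_row_id + running counter over the rows actually returned.
/// This is only correct when no rows were skipped.
fn row_ids_from_running_counter(base_row_id: i64, returned: usize) -> Vec<i64> {
    (0..returned as i64).map(|n| base_row_id + n).collect()
}

fn main() {
    // Hypothetical file contents and base row id, chosen for illustration.
    let values = [5, 50, 7, 70, 9, 90];
    let base_row_id = 1000;

    let kept = surviving_rows(&values, |v| v > 10);
    let correct = row_ids_from_physical_index(base_row_id, &kept);
    let drifted = row_ids_from_running_counter(base_row_id, kept.len());

    println!("physical indices kept: {:?}", kept);
    println!("row ids from physical index: {:?}", correct);
    println!("row ids from running counter: {:?}", drifted);
}
```

With the filter pushed down, the counter-based ids no longer match the physical positions, which is why pushdown currently has to be disabled.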
There is no hook to inject this today:
- `ParquetOpener` never calls `with_virtual_columns`, and its
`expr_adapter_factory` field is `pub(crate)`, so the opener can't be
reused/extended from outside the crate.
- `ParquetSource` exposes no builder-customization hook.
- The `ParquetFileReaderFactory` provides only the `AsyncFileReader`, not
builder configuration.
So the only workaround is to re-implement a custom `FileOpener` (duplicating
projection / row-filter / pruning plumbing), which is what we're doing
downstream in Apache DataFusion Comet (apache/datafusion-comet — Delta contrib).
## Describe the solution you'd like
Expose virtual columns on `ParquetSource` / `ParquetOpener`, e.g.:
```rust
let source = ParquetSource::new(schema)
    .with_virtual_columns(vec![row_number_field]); // RowNumber-extension field(s)
```
…and have `ParquetOpener` forward them to
`ParquetRecordBatchStreamBuilder::with_virtual_columns(...)` and include them
in the projected output schema, so the rest of the existing
pruning/row-filter/projection logic is reused unchanged.
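As a rough illustration of the intended shape only, the toy model below uses stand-in types. `ToySource` and its methods are hypothetical and are not the real `ParquetSource` API. Appending the virtual columns after the file columns in the output schema is likewise an assumption, not a settled design decision:

```rust
/// Stand-in for a scan source. NOT the real `ParquetSource`.
#[derive(Debug, Clone, Default)]
struct ToySource {
    /// Column names physically stored in the file.
    file_columns: Vec<String>,
    /// Column names synthesized by the reader (e.g. a row number).
    virtual_columns: Vec<String>,
}

impl ToySource {
    fn new(file_columns: Vec<String>) -> Self {
        Self { file_columns, virtual_columns: Vec::new() }
    }

    /// Builder-style setter mirroring the proposed `with_virtual_columns`.
    fn with_virtual_columns(mut self, virtual_columns: Vec<String>) -> Self {
        self.virtual_columns = virtual_columns;
        self
    }

    /// Output schema: file columns first, then virtual columns (assumed order).
    fn output_schema(&self) -> Vec<String> {
        self.file_columns
            .iter()
            .chain(self.virtual_columns.iter())
            .cloned()
            .collect()
    }
}

fn main() {
    let source = ToySource::new(vec!["id".to_string(), "value".to_string()])
        .with_virtual_columns(vec!["row_number".to_string()]);
    println!("{:?}", source.output_schema());
}
```

In the real change, the opener would pass the configured fields through to the stream builder rather than simply concatenating names. The toy only shows that callers opt in through the source and see the extra column in the output schema.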
## Describe alternatives you've considered
- Re-implementing a custom `FileOpener` that builds the stream with
`with_virtual_columns` (our current downstream approach — works, but duplicates
a lot of well-tested opener logic and is a maintenance burden).
- A reader-factory hook — insufficient, since virtual columns are configured
on the stream *builder*, not the reader.
## Additional context
Downstream consumer: Apache DataFusion Comet's native Delta Lake scan
(apache/datafusion-comet#4366). We'd be happy to contribute a PR if the API
shape above is agreeable.