adriangb opened a new pull request, #22024:
URL: https://github.com/apache/datafusion/pull/22024

   ## Which issue does this PR close?
   
   - Relates to #21624 (`datafusion.execution.collect_statistics` on wide 
tables).
   
   This PR extracts the first layer of #22000 — the opt-in `ParquetSource` 
sampling primitives — as a self-contained change. The TABLESAMPLE SQL surface, 
`Sample` logical/physical nodes, and `SamplePushdown` rule are deliberately not 
included; they will land in follow-ups.
   
   ## Rationale for this change
   
   DataFusion has the machinery for fine-grained parquet sampling 
(`ParquetAccessPlan` with `Skip` / `Scan` / `Selection(RowSelection)`) but no 
public way to ask for a sample without constructing the access plan by hand and 
stuffing it into `PartitionedFile.extensions`. That works for one-off code but 
is awkward for ad-hoc data exploration, layered helpers that want to compute 
approximate stats over a bounded slice, and `EXPLAIN ANALYZE`-driven debug runs 
against a representative slice.
   
   This PR adds the lowest layer: opt-in builders on `ParquetSource` that 
translate fractions into a `ParquetAccessPlan` lazily inside the opener (after 
the footer is loaded, so we sample by real row-group index). It is additive and 
has no behavior change for existing scans. The SQL surface in #22000 is built 
on top of these primitives.
   
   ## What changes are included in this PR?
   
   ```rust
   ParquetSource::new(schema)
       .with_row_group_sampling(0.1)   // keep ~10% of row groups per file
       .with_row_fraction(0.05)        // within each kept row group, keep ~5% 
of rows
       .with_row_cluster_size(8192);   // controls window granularity (default 
32 768)
   ```
   
   `with_row_group_sampling(fraction)`:
   
   * Selection is deferred until the opener has loaded the parquet footer, so 
we sample by real row-group index.
   * Deterministic per `(file_index, row_group_count, fraction)` — re-runs 
match. The opener passes the execution `partition_index` as the stable 
`file_index`, so sampling is reproducible across environments without depending 
on object-store paths.
   * Always keeps at least one row group (target = `max(1, ceil(N * 
fraction))`).
   * No-op when `fraction >= 1.0`.
   
   `with_row_fraction(fraction)`:
   
   * Translates the per-row-group target into K contiguous windows spread 
evenly across the row group, each placed at a random offset within its stride. 
Window count = `ceil(target / cluster_size)`.
   * Materializes a `RowSelection` per kept row group; the parquet reader uses 
the page index to read only the data pages covering the selected rows. This 
gives \"page-level\" IO savings without requiring per-column page alignment 
(which doesn't exist in parquet).
   * Falls back gracefully when the page index is missing — the reader still 
returns the right rows, the IO win just disappears.
   * Deterministic per `(file_index, row_group_index, fraction, cluster_size)`.
   * Window-layout math is extracted into a dedicated 
`build_row_window_selectors` function and fuzz-tested across thousands of 
configurations to guarantee no overlap, in-bounds positions, and full coverage.
   
   The two layers compose: `row_group_fraction = 0.1` × `row_fraction = 0.1` 
reads ~1% of the rows from ~10% of the row groups, with windows spread out so 
the sample isn't clustered at one end of each row group.
   
   ### Internals
   
   * New `ParquetSampling` struct re-exported at the crate root.
   * Plumbed through `ParquetMorselizer` → `PreparedParquetOpen`.
   * Two free functions invoked from `prune_row_groups` right after 
`create_initial_plan`.
   * New dep: `rand` with the `small_rng` feature (already in workspace 
`Cargo.toml`).
   
   ### Differences vs. the original commit in #22000
   
   Two pieces of review feedback on the parent PR are folded in here:
   
   * `apply_*_sampling` keys on a stable `file_index: usize` rather than 
`file_name: &str`, addressing 
https://github.com/apache/datafusion/pull/22000#discussion_r3187286356. The 
opener passes the execution `partition_index`. This removes the on-disk-path 
dependency from the seed inputs while still decorrelating files in different 
partitions.
   * The row-window layout math is extracted into `build_row_window_selectors` 
and fuzz-tested 
(https://github.com/apache/datafusion/pull/22000#discussion_r3187392811). 
Fuzzing surfaced an overlap bug at `target_rows ≈ total_rows` where 
`window_size = ceil(target / n_windows)` could exceed `stride = total_rows / 
n_windows`; the fix caps `window_size` at `stride`.
   
   ## Are these changes tested?
   
   12 tests in `datafusion-datasource-parquet`:
   
   **Row-group sampling (`sampling::tests`):**
   * `row_group_sampling_keeps_target_count` — `ceil(N * fraction)` math.
   * `row_group_sampling_is_deterministic` — same inputs → same selection.
   * `row_group_sampling_differs_per_file_index` — different `file_index` → 
different sample.
   * `row_group_sampling_no_op_when_fraction_is_one` — fraction ≥ 1.0 keeps 
everything.
   * `row_group_sampling_target_at_least_one` — `fraction = 0.001` over 100 row 
groups still keeps 1.
   * `row_group_sampling_no_op_when_unset` — `None` is a no-op.
   
   **Row-window selection (`sampling::tests`):**
   * `row_window_selection_basic_layout` — hand-checked anchor case.
   * `row_window_selection_returns_none_on_invalid_input` — degenerate inputs 
(zero row group, zero target, zero cluster) return `None`.
   * `row_window_selection_full_target_no_overlap` — the previously-buggy 
`target_rows == total_rows` case.
   * `row_window_selection_fuzz_invariants` — 5 000 randomized `(total_rows, 
target_rows, cluster_size, seed)` configurations, asserting full coverage, 
in-bounds positions, and no overlap.
   * `row_window_selection_fuzz_determinism` — 1 000 iterations verifying 
identical seeds produce identical layouts.
   
   **End-to-end (`opener::test`):**
   * `row_group_sampling_end_to_end` — writes a 4-row-group parquet to 
`InMemory`, scans with `fraction = 0.5`, asserts exactly 6 rows out (2 row 
groups × 3 rows).
   * `row_fraction_end_to_end` — writes a 100-row single-row-group parquet, 
scans with `row_fraction = 0.1` and `cluster_size = 4`, asserts the result is 
in the expected range.
   
   `cargo build`, `cargo fmt --all`, and `cargo clippy -p 
datafusion-datasource-parquet --all-targets --all-features -- -D warnings` are 
clean.
   
   ## Are there any user-facing changes?
   
   * **New public Rust API on `ParquetSource`:** `with_row_group_sampling`, 
`with_row_fraction`, `with_row_cluster_size`, `sampling()`, plus the 
`ParquetSampling` struct.
   * Existing scans / queries that don't opt in are unaffected.
   * No SQL surface yet — that lands in a follow-up to #22000.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to