geoffreyclaude commented on code in PR #22000:
URL: https://github.com/apache/datafusion/pull/22000#discussion_r3187286356


##########
datafusion/datasource-parquet/src/sampling.rs:
##########
@@ -74,6 +74,23 @@ pub struct ParquetSampling {
     /// `ceil(target / row_cluster_size)` windows distributed across
     /// the row group with a random offset within each stride.
     pub row_cluster_size: usize,
+    /// Internal coordination channel between
+    /// [`crate::source::ParquetSource::try_push_sample`] and the
+    /// parquet opener. Not part of the public sampling API — direct
+    /// callers configure sampling via the per-axis fields above. See
+    /// the [`TABLESAMPLE clause`] section of the SQL reference for
+    /// the pushdown strategy this implements.
+    ///
+    /// [`TABLESAMPLE clause`]: https://datafusion.apache.org/user-guide/sql/select.html#tablesample-clause
+    pub(crate) system_target_remaining: Option<f64>,
+    /// Optional `REPEATABLE(seed)` value plumbed through from
+    /// `TABLESAMPLE`. When set, the row-group and row-fraction
+    /// samplers ignore the file path and key only off the seed +

Review Comment:
   Should we reconsider ignoring the file path entirely for `REPEATABLE`? I 
agree we want to avoid environment-specific inputs (such as full paths) so 
results stay reproducible across environments, but keying only on `(seed, 
row_group_index, fraction, cluster_size)` means files with identical layout 
will select the same row groups / row windows, creating cross-file correlation.
   
   How about using the file index (or some other stable id) instead of the 
full path, so `REPEATABLE` stays reproducible without correlating files?
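
   For illustration, here is a std-only sketch of the seed derivation I have 
in mind. This is not the PR's code: `repeatable_seed` and the `file_index` 
parameter (the file's position in the scan's file list) are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical variant: key the REPEATABLE seed on a stable file
/// index instead of the full path. Still reproducible across
/// environments, but identically laid-out files now decorrelate.
fn repeatable_seed(seed: u64, file_index: usize, row_group_index: usize) -> u64 {
    let mut hasher = DefaultHasher::new();
    "REPEATABLE-row-group".hash(&mut hasher);
    seed.hash(&mut hasher);
    file_index.hash(&mut hasher); // decorrelates files with identical layout
    row_group_index.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Same inputs reproduce; a different file index diverges.
    assert_eq!(repeatable_seed(42, 0, 3), repeatable_seed(42, 0, 3));
    assert_ne!(repeatable_seed(42, 0, 3), repeatable_seed(42, 1, 3));
}
```

   (One caveat: `DefaultHasher`'s algorithm isn't guaranteed stable across 
Rust releases, so if cross-version reproducibility matters a pinned hash 
would be safer.)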



##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -882,11 +887,66 @@ impl FiltersPreparedParquetOpen {
 
         // Determine which row groups to actually read. The idea is to skip
         // as many row groups as possible based on the metadata and query
-        let mut row_groups = RowGroupAccessPlanFilter::new(create_initial_plan(
+        let mut initial_plan = create_initial_plan(
             &prepared.file_name,
             prepared.extensions.clone(),
             rg_metadata.len(),
-        )?);
+        )?;
+
+        // SYSTEM-mode adaptive split: when the SamplePushdown rule

Review Comment:
   These two comments have a lot of duplication. How about keeping only the 
large inner block for details?



##########
docs/source/user-guide/sql/select.md:
##########
@@ -76,6 +76,124 @@ Example:
 SELECT t.a FROM table AS t
 ```
 
+## TABLESAMPLE clause
+
+`TABLESAMPLE` returns a random subset of rows from a table. It's
+useful for ad-hoc data exploration ("give me roughly 1% of this
+table"), bounded `EXPLAIN ANALYZE` runs against representative data,
+and any analytics workload where an approximate answer is acceptable
+in exchange for reading less data.
+
+```sql
+SELECT * FROM table TABLESAMPLE SYSTEM (10);             -- ~10% of the table
+SELECT * FROM table TABLESAMPLE SYSTEM (5) REPEATABLE (42);  -- deterministic
+```
+
+The percentage is in the range `(0, 100]`. `REPEATABLE(seed)` makes
+the sample deterministic — the same seed against the same data always
+returns the same rows.
+
+### What `SYSTEM` means
+
+`SYSTEM` is **block-level** sampling: instead of evaluating a
+per-row coin flip, the scan keeps or drops whole blocks of rows
+chosen at random. This is the same behaviour PostgreSQL documents
+for `TABLESAMPLE SYSTEM` and what Hive calls `BLOCK` sampling
+(DataFusion accepts `BLOCK` as an alias for `SYSTEM`).
+
+The trade-off vs. row-level sampling (`BERNOULLI`):
+
+- **`SYSTEM`** is fast — the scan can skip blocks entirely, so its
+  I/O is roughly proportional to the requested fraction. Rows
+  inside each kept block are correlated, so it's statistically
+  lossier than per-row sampling.
+- **`BERNOULLI`** evaluates `random() < p` per row, so every row is
+  read but only some are kept. Statistically tighter, but no I/O
+  saving.
+
+DataFusion only ships `SYSTEM` out of the box. To add `BERNOULLI` or
+other forms, register a [`RelationPlanner`] extension; see the
+[extending SQL] guide and the [`relation_planner` example].
+
+[`RelationPlanner`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/planner/trait.RelationPlanner.html
+[extending SQL]: ../../library-user-guide/extending-sql.md
+[`relation_planner` example]: https://github.com/apache/datafusion/tree/main/datafusion-examples/examples/relation_planner
+
+### How `SYSTEM` is implemented for Parquet
+
+For a `ParquetSource`, `TABLESAMPLE SYSTEM(p%)` is pushed all the way
+into the scan rather than evaluated as a post-scan filter. The plan
+contains no `SampleExec` — instead, `ParquetSource` itself drops
+files, row groups, and row-clusters in proportion to `p`.
+
+The selection uses an **adaptive cube-root hybrid** that splits the
+budget across three independent levels — file, row-group, and row
+— and collapses the split when an axis can't reduce:
+
+1. **File level** (in the `SamplePushdown` rule, where the file count
+   is known): with `n_files ≥ 2`, keep `⌈n_files * cbrt(p)⌉` files
+   chosen by a seeded shuffle. With `n_files = 1`, the file axis
+   can't reduce, so the full budget flows to the opener.
+2. **Row-group level** (in the parquet opener, after the footer is
+   loaded so the row-group count is known): with multiple row groups
+   in a file, the residual fraction is split as `sqrt(remaining)` at
+   the row-group and row axes. With a single row group, the row-group
+   axis is skipped and the full residual is applied at the row level.
+3. **Row level** — within each kept row group, the kept fraction is
+   materialised as a small number of contiguous `RowSelection`
+   windows so the parquet reader can use the page index to skip data
+   pages entirely.
+
+The product across all axes is always `p`, so the expected result
+size is `p × N` rows regardless of how many files or row groups
+the scan happens to have. Spreading the reduction across all three
+axes means the I/O win at small fractions does not concentrate at
+a single granularity: dropping 90% of files (1/0.1 ≈ 10× fewer files)
+produces a coarser sample than dropping 90% across all axes evenly.
+Small inputs degenerate gracefully — a single-file scan still hits
+the requested `p`; a single-file / single-row-group scan reduces to
+pure row-level sampling.
+
+`REPEATABLE(seed)` mixes the seed into every random draw, so all
+levels produce the same selection across runs. The selection also
+depends on the file name, the row-group index within the file, and

Review Comment:
   It currently actually *doesn't* depend on the file name — see my related 
comment, where I suggest keying on the file index instead.
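
   Separately, while reviewing the "adaptive cube-root hybrid" section above 
I sanity-checked the budget arithmetic with a quick std-only snippet. The 
numbers below are illustrative, not taken from the PR:

```rust
fn main() {
    let p: f64 = 0.10; // requested fraction
    let n_files: f64 = 7.0;

    // File axis: keep ceil(n_files * cbrt(p)) files.
    let kept_files = (n_files * p.cbrt()).ceil();
    let file_fraction = kept_files / n_files;

    // Residual budget forwarded to the opener, then split sqrt-wise
    // across the row-group and row axes.
    let remaining = p / file_fraction;
    let rg_fraction = remaining.sqrt();
    let row_fraction = remaining.sqrt();

    // Product across all three axes recovers p exactly, as the docs claim.
    let product = file_fraction * rg_fraction * row_fraction;
    assert!((product - p).abs() < 1e-12);
    println!("kept {kept_files}/{n_files} files, product = {product:.3}");
}
```

   The ceiling at the file axis makes the kept fraction overshoot `cbrt(p)`, 
and the division by `file_fraction` is what passes the slack down, so the 
"adaptive" part checks out.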



##########
datafusion-examples/examples/relation_planner/table_sample.rs:
##########
@@ -273,6 +300,32 @@ async fn run_examples(ctx: &SessionContext) -> Result<()> {
     +---------+---------+---------+---------+
     ");
 
+    // Example 7: SYSTEM sampling — handled by the built-in
+    // `TableSampleSystemPlanner`, **not** by this example's planner.
+    // Our planner returns `Original` for SYSTEM, so the chain falls
+    // through. Routed against the parquet-backed copy of the table so
+    // the `SamplePushdown` rule can absorb the sample into the scan.
+    // `REPEATABLE(42)` makes the rows deterministic across runs and
+    // across machines (the seed dominates the file path in the

Review Comment:
   Same issue as my other `REPEATABLE` comments: the file path is currently 
ignored when a seed is set.



##########
datafusion/physical-optimizer/src/sample_pushdown.rs:
##########
@@ -0,0 +1,128 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! `SamplePushdown` — push a [`SampleExec`] into the source it sits
+//! above.
+//!
+//! The rule walks the physical plan top-down. For each `SampleExec`
+//! it finds, it asks the immediate child via
+//! [`ExecutionPlan::try_push_sample`] what to do:
+//!
+//! * [`SamplePushdownResult::Absorbed`]: the child has incorporated
+//!   the sample. Replace the `SampleExec` with the new child.
+//! * [`SamplePushdownResult::Passthrough`]: the child is row-preserving
+//!   for sampling (filter, projection, coalesce, repartition,
+//!   non-fetch sort). Recurse into the child's single input and, if
+//!   that recursion succeeds, rebuild the child with the new
+//!   grandchild and drop the original `SampleExec`.
+//! * [`SamplePushdownResult::Unsupported`]: pushdown stops here.
+//!   Today this is a planning error — a generic post-scan filter
+//!   exec is a follow-up.
+//!
+//! Run order: this rule must come *after* the rules that may
+//! introduce or rewrite scan nodes (filter pushdown, projection
+//! pushdown). It runs before `SanityCheckPlan` so any leftover
+//! `SampleExec` produces a clean error.
+//!
+//! [`SampleExec`]: datafusion_physical_plan::sample::SampleExec
+//! [`SamplePushdownResult`]: datafusion_physical_plan::sample_pushdown::SamplePushdownResult
+
+use std::sync::Arc;
+
+use datafusion_common::Result;
+use datafusion_common::config::ConfigOptions;
+use datafusion_common::tree_node::{Transformed, TransformedResult, TreeNode};
+use datafusion_physical_plan::ExecutionPlan;
+use datafusion_physical_plan::sample::SampleExec;
+use datafusion_physical_plan::sample_pushdown::{SamplePushdownResult, SampleSpec};
+
+use crate::PhysicalOptimizerRule;
+
+/// Optimizer rule that attempts to push every `SampleExec` into the
+/// source below it.
+#[derive(Default, Debug)]
+pub struct SamplePushdown;
+
+impl PhysicalOptimizerRule for SamplePushdown {
+    fn optimize(
+        &self,
+        plan: Arc<dyn ExecutionPlan>,
+        _config: &ConfigOptions,
+    ) -> Result<Arc<dyn ExecutionPlan>> {
+        plan.transform_down(|node| {
+            let Some(sample) = node.as_any().downcast_ref::<SampleExec>() else {
+                return Ok(Transformed::no(node));
+            };
+            let spec = sample.spec();
+            let child = Arc::clone(sample.input());
+            match push_into(child, &spec)? {
+                Pushdown::Pushed(new_child) => Ok(Transformed::yes(new_child)),
+                Pushdown::Failed(reason) => {
+                    datafusion_common::plan_err!(
+                        "TABLESAMPLE is not supported for this source: {reason}. \

Review Comment:
   `source` here can be confusing if pushdown failed at an intermediate node. 
How about: "TABLESAMPLE could not be pushed down: {reason}"?



##########
datafusion/sql/src/sample.rs:
##########
@@ -0,0 +1,169 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Built-in [`RelationPlanner`] for `TABLESAMPLE SYSTEM(p%)`.
+//!
+//! Auto-registered via [`SessionStateDefaults::default_relation_planners`]
+//! so SQL `TABLESAMPLE SYSTEM (10) [REPEATABLE (n)]` works out of the
+//! box on any default `SessionContext`. Other `TABLESAMPLE` flavours
+//! (`BERNOULLI`, `ROW`, `BUCKET ... OUT OF ...`, `OFFSET`) are rejected
+//! at planning time — implementing those is left to a downstream
+//! `RelationPlanner` (see `datafusion-examples/examples/relation_planner/`).
+//!
+//! `SessionStateBuilder::register_relation_planner` inserts new planners
+//! at the front of the chain, so a downstream planner that returns
+//! `Planned` for the same `TABLESAMPLE` syntax wins. Returning
+//! `Original` falls through to this default.
+//!
+//! [`SessionStateDefaults::default_relation_planners`]: ../../datafusion/execution/session_state/struct.SessionStateDefaults.html
+
+use std::sync::Arc;
+
+use datafusion_common::{Result, not_impl_err, plan_datafusion_err, plan_err};
+use datafusion_expr::logical_plan::sample::{SampleMethod, sample_plan};
+use datafusion_expr::planner::{
+    PlannedRelation, RelationPlanner, RelationPlannerContext, RelationPlanning,
+};
+use sqlparser::ast::{
+    self, TableFactor, TableSampleKind, TableSampleMethod, TableSampleUnit,
+};
+
+/// Built-in `RelationPlanner` that lifts `TABLESAMPLE SYSTEM(p%)`
+/// (with optional `REPEATABLE(seed)`) into the core
+/// [`Sample`](datafusion_expr::logical_plan::sample::Sample) extension
+/// node so the `SamplePushdown` optimizer rule can absorb the sample
+/// into the scan.
+///
+/// Rejects every other form of `TABLESAMPLE` with a `not_impl_err`. To
+/// support `BERNOULLI`, row counts, or `BUCKET`, register your own
+/// `RelationPlanner` ahead of this one — `register_relation_planner`
+/// pushes to the front and the first `Planned` wins.
+#[derive(Debug, Default)]
+pub struct TableSampleSystemPlanner;
+
+impl RelationPlanner for TableSampleSystemPlanner {
+    fn plan_relation(
+        &self,
+        relation: TableFactor,
+        context: &mut dyn RelationPlannerContext,
+    ) -> Result<RelationPlanning> {
+        // Only act on Table relations carrying a `TABLESAMPLE` clause.
+        // Everything else (derived, function, unnest, join) falls
+        // through to the next planner / DataFusion's default logic.
+        let TableFactor::Table {
+            sample: Some(sample),
+            alias,
+            name,
+            args,
+            with_hints,
+            version,
+            with_ordinality,
+            partitions,
+            json_path,
+            index_hints,
+        } = relation
+        else {
+            return Ok(RelationPlanning::Original(Box::new(relation)));
+        };
+
+        let ts = match sample {
+            TableSampleKind::BeforeTableAlias(s)
+            | TableSampleKind::AfterTableAlias(s) => *s,
+        };
+
+        if ts.bucket.is_some() {
+            return not_impl_err!(
+                "TABLESAMPLE BUCKET is not supported (only SYSTEM PERCENT). \
+                 Register a custom RelationPlanner before the built-in \
+                 TableSampleSystemPlanner to handle other forms."
+            );
+        }
+        if ts.offset.is_some() {
+            return not_impl_err!(
+                "TABLESAMPLE OFFSET is not supported (only SYSTEM PERCENT)"
+            );
+        }
+        match ts.name {
+            // The built-in planner only handles SYSTEM (and BLOCK as an
+            // alias for SYSTEM, matching Hive). Anything else is a
+            // semantics commitment we don't want to make in core.
+            Some(TableSampleMethod::System) | Some(TableSampleMethod::Block) | None => {}

Review Comment:
   The `None` arm means that when no method is specified, the default is 
`SYSTEM`: is this what we want?



##########
datafusion/datasource-parquet/src/sampling.rs:
##########
@@ -0,0 +1,335 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Random sampling primitives for parquet scans.
+//!
+//! [`ParquetSampling`] holds the sampling configuration carried on
+//! [`crate::source::ParquetSource`]. The two `apply_*_sampling`
+//! methods mutate a [`ParquetAccessPlan`] in place — they are invoked
+//! by the parquet [`crate::opener`] once the file footer is loaded.
+//!
+//! Selection within a row group is deterministic-but-random per
+//! `(file_name, row_group_index, fraction, cluster_size)`: the methods
+//! seed a `SmallRng` from a hash of those inputs so re-runs match.
+
+use crate::access_plan::ParquetAccessPlan;
+
+/// Hierarchical sampling config for parquet scans.
+///
+/// All fractions are in `(0.0, 1.0]`. `None` (or `1.0`) means "no
+/// sampling".
+///
+///   * `row_group_fraction` — within each scanned file, keep this
+///     fraction of row groups. Decision made inside the opener after
+///     the footer is loaded so we sample by actual row-group index.
+///   * `row_fraction` — within each kept row group, keep this fraction
+///     of rows by translating to a `RowSelection` of K small contiguous
+///     windows spread across the row group. The parquet reader uses
+///     the page index to read only the data pages covering the
+///     selected rows, so this gives "page-level" IO savings without
+///     requiring per-column page alignment. Falls back to scanning
+///     whole pages if the page index is missing.
+///   * `row_cluster_size` — controls how the per-row-group target is
+///     split into contiguous windows. Smaller = more diversity, more
+///     page-index lookups; larger = cheaper, fewer regions covered.
+///
+/// **Why this lives here, not as a one-shot `ParquetAccessPlan`:** the
+/// natural entry-point for "I want a sample" is at config time, before
+/// any metadata IO has happened. The actual *which row groups* /
+/// *which rows* selection still needs to be deferred until the opener
+/// has the footer — that's why these fractions get carried through and
+/// applied lazily.
+///
+/// **Why no file-level fraction:** [`crate::source::ParquetSource`]
+/// doesn't own the file list — that lives on `FileScanConfig.file_groups`.
+/// Callers that want to drop files should rebuild the `FileScanConfig`
+/// with a reduced `file_groups`. Adding a file-fraction setter here
+/// would have been a no-op and confusing.
+///
+/// Selection within a row group is deterministic-but-random per
+/// `(file_name, row_group_index, fraction, cluster_size)`: we seed a
+/// `SmallRng` from a hash of those inputs so re-runs match exactly.
+#[derive(Debug, Clone)]
+pub struct ParquetSampling {
+    /// Fraction of row groups to keep in each scanned file.
+    pub row_group_fraction: Option<f64>,
+    /// Fraction of rows to keep within each scanned row group.
+    pub row_fraction: Option<f64>,
+    /// Maximum size (in rows) of each contiguous window emitted by
+    /// row-fraction sampling. The total target rows are split into
+    /// `ceil(target / row_cluster_size)` windows distributed across
+    /// the row group with a random offset within each stride.
+    pub row_cluster_size: usize,
+    /// Internal coordination channel between
+    /// [`crate::source::ParquetSource::try_push_sample`] and the
+    /// parquet opener. Not part of the public sampling API — direct
+    /// callers configure sampling via the per-axis fields above. See
+    /// the [`TABLESAMPLE clause`] section of the SQL reference for
+    /// the pushdown strategy this implements.
+    ///
+    /// [`TABLESAMPLE clause`]: https://datafusion.apache.org/user-guide/sql/select.html#tablesample-clause
+    pub(crate) system_target_remaining: Option<f64>,
+    /// Optional `REPEATABLE(seed)` value plumbed through from
+    /// `TABLESAMPLE`. When set, the row-group and row-fraction
+    /// samplers ignore the file path and key only off the seed +
+    /// row-group index, so the same query is reproducible across
+    /// environments. When unset, sampling is keyed on the file path
+    /// so re-runs against the same file are stable while different
+    /// files produce uncorrelated samples.
+    pub(crate) seed: Option<u64>,
+}
+
+impl Default for ParquetSampling {
+    fn default() -> Self {
+        Self {
+            row_group_fraction: None,
+            row_fraction: None,
+            row_cluster_size: 32_768,
+            system_target_remaining: None,
+            seed: None,
+        }
+    }
+}
+
+impl ParquetSampling {
+    /// Mutate `plan` in-place to keep only a random subset of its
+    /// currently-scannable row groups, sized to `self.row_group_fraction`
+    /// of the total. No-op if `row_group_fraction` is `None`, `>= 1.0`,
+    /// or out of range.
+    ///
+    /// When `self.seed` is `Some`, selection is keyed on
+    /// `(seed, row_group_count, fraction)` and ignores the file path
+    /// — the same query is reproducible across environments (this is
+    /// the `REPEATABLE(seed)` SQL semantics). When `seed` is `None`,
+    /// selection is keyed on `(file_name, row_group_count, fraction)`
+    /// — re-runs against the same file are stable, and different
+    /// files produce uncorrelated samples.
+    pub(crate) fn apply_row_group_sampling(
+        &self,
+        plan: &mut ParquetAccessPlan,
+        row_group_count: usize,
+        file_name: &str,
+    ) {
+        use rand::SeedableRng;
+        use rand::seq::SliceRandom;
+        use std::hash::{Hash, Hasher};
+
+        let Some(fraction) = self.row_group_fraction else {
+            return;
+        };
+        if row_group_count == 0 || !(0.0..1.0).contains(&fraction) {
+            return; // no-op for fraction >= 1.0 (keep all) or invalid input
+        }
+        let target = ((row_group_count as f64) * fraction).ceil().max(1.0) as usize;
+        if target >= row_group_count {
+            return;
+        }
+
+        let mut hasher = std::collections::hash_map::DefaultHasher::new();
+        match self.seed {
+            Some(s) => {
+                "REPEATABLE-row-group".hash(&mut hasher);
+                s.hash(&mut hasher);
+            }
+            None => file_name.hash(&mut hasher),
+        }
+        row_group_count.hash(&mut hasher);
+        fraction.to_bits().hash(&mut hasher);
+        let seed = hasher.finish();
+
+        let mut indices: Vec<usize> = (0..row_group_count).collect();
+        let mut rng = rand::rngs::SmallRng::seed_from_u64(seed);
+        indices.shuffle(&mut rng);
+        let kept: std::collections::HashSet<usize> =
+            indices.into_iter().take(target).collect();
+
+        for i in 0..row_group_count {
+            if !kept.contains(&i) {
+                plan.skip(i);
+            }
+        }
+    }
+
+    /// For each row group still marked `Scan`, replace it with a
+    /// `Selection` that covers `self.row_fraction` of the row group's
+    /// rows in `ceil(target / self.row_cluster_size)` spread-out
+    /// contiguous windows. No-op if `row_fraction` is `None`, `>= 1.0`,
+    /// or out of range.
+    ///
+    /// Selection is deterministic per `(file_name, row_group_index,
+    /// fraction, cluster_size)`.
+    ///
+    /// If the parquet file has page indexes, the reader only reads the
+    /// data pages covering the selected rows. If page indexes are
+    /// missing, the reader still has to read whole pages and discard
+    /// rows; the IO win disappears in that case but correctness is
+    /// unaffected.
+    pub(crate) fn apply_row_fraction_sampling(
+        &self,
+        plan: &mut ParquetAccessPlan,
+        rg_metadata: &[parquet::file::metadata::RowGroupMetaData],
+        file_name: &str,
+    ) {
+        use parquet::arrow::arrow_reader::{RowSelection, RowSelector};
+        use rand::Rng;
+        use rand::SeedableRng;
+        use std::hash::{Hash, Hasher};
+
+        let Some(fraction) = self.row_fraction else {
+            return;
+        };
+        if !(0.0..1.0).contains(&fraction) {
+            return; // no-op for fraction >= 1.0 (keep all) or invalid input
+        }
+        let cluster_size = self.row_cluster_size.max(1);
+
+        for (idx, rg) in rg_metadata.iter().enumerate() {
+            if !plan.should_scan(idx) {
+                continue;
+            }
+            let total_rows = rg.num_rows() as usize;
+            if total_rows == 0 {
+                continue;
+            }
+
+            let mut hasher = std::collections::hash_map::DefaultHasher::new();
+            match self.seed {
+                Some(s) => {
+                    "REPEATABLE-row-fraction".hash(&mut hasher);
+                    s.hash(&mut hasher);
+                }
+                None => file_name.hash(&mut hasher),
+            }
+            idx.hash(&mut hasher);
+            fraction.to_bits().hash(&mut hasher);
+            cluster_size.hash(&mut hasher);
+            let seed = hasher.finish();
+            let mut rng = rand::rngs::SmallRng::seed_from_u64(seed);
+
+            let target_rows = ((total_rows as f64) * fraction).ceil().max(1.0) as usize;

Review Comment:
   This code seems pretty risky and error-prone, especially with regard to 
possible duplicate sampling in edge cases. Can you extract it into a dedicated 
function and fuzz it to make sure there are no window overlaps, out-of-bounds 
selections, or other weirdness?
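
   To sketch what I mean, here's the kind of invariant check I'd run. The 
`layout_windows` function below is a stand-in model of the window layout 
(one window per equal stride, randomly offset within its stride), not the 
PR's actual logic:

```rust
/// Stand-in window generator: split `target` rows into windows of at
/// most `cluster` rows, one per stride, randomly offset via a small
/// xorshift RNG. NOT the PR's code, just a model to fuzz against.
fn layout_windows(total: usize, target: usize, cluster: usize, rng: &mut u64) -> Vec<(usize, usize)> {
    let n_windows = target.div_ceil(cluster);
    let stride = total / n_windows;
    let mut out = Vec::new();
    let mut remaining = target;
    for w in 0..n_windows {
        let len = remaining.min(cluster).min(stride);
        *rng ^= *rng << 13; *rng ^= *rng >> 7; *rng ^= *rng << 17;
        let slack = stride - len;
        let offset = if slack == 0 { 0 } else { (*rng as usize) % (slack + 1) };
        out.push((w * stride + offset, len));
        remaining -= len;
    }
    out
}

fn main() {
    let mut rng = 0x9E3779B97F4A7C15u64;
    for total in [1usize, 5, 100, 32_768, 100_000] {
        for frac in [0.01f64, 0.1, 0.5, 0.99] {
            let target = ((total as f64) * frac).ceil().max(1.0) as usize;
            let windows = layout_windows(total, target, 1000, &mut rng);
            // Invariants I'd want fuzzed in the real implementation:
            let mut prev_end = 0;
            let mut covered = 0;
            for &(start, len) in &windows {
                assert!(start >= prev_end, "windows overlap");
                assert!(start + len <= total, "window out of bounds");
                prev_end = start + len;
                covered += len;
            }
            assert!(covered <= target, "selected more rows than targeted");
        }
    }
    println!("all invariants held");
}
```

   Once the layout is a pure function like this, a proptest/fuzz harness over 
`(total, fraction, cluster_size, seed)` would make these guarantees cheap to 
keep.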



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

