[PR] Support data source sampling with TABLESAMPLE [datafusion]

via GitHub Sat, 07 Jun 2025 11:26:51 -0700


theirix opened a new pull request, #16325:
URL: https://github.com/apache/datafusion/pull/16325


   ## Which issue does this PR close?
   
   - Closes #13563
   
   ## Rationale for this change
   
   Explained in #13563 in detail with known syntax examples.
   Thanks to [changes to 
sqlparser](https://github.com/apache/datafusion-sqlparser-rs/pull/1580), it is 
now possible to use the `TABLESAMPLE` and `SAMPLE` constructs in a logical plan 
builder.
   
   Added a rewrite function to `datafusion-sql` which produces an additional 
`LogicalPlan::Filter` based on `TableSample` from sqlparser. There is no need 
to remove anything from SQL AST since it's not used anywhere.
   
   ## What changes are included in this PR?
   
   
   ## Are these changes tested?
   
   -    Unit tests (added a few)
   -    Regression tests (added to select.slt)
   -    Manual test (see below)
   
   ```sql
   create external table data stored as parquet location 'sample.parquet';
   select COUNT(*) from data TABLESAMPLE SYSTEM (13) where double_col < 42.0;
   ```
   
   Leads to the initial logical plan
   ```
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: random() < Float64(42.6) / Float64(100)
             Filter: data.double_col < Float64(42)
               TableScan: data
   ```
   
   The physical plan is somewhat unusual, as volatile functions are pushed down 
to the data source (`datafusion.execution.parquet.pushdown_filters` is enabled 
by default), which was addressed in #13268.
   
   ```
    ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
         AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
           CoalescePartitionsExec
             AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
               ProjectionExec: expr=[]
                 CoalesceBatchesExec: target_batch_size=8192
                   FilterExec: double_col@0 < 42 AND random() < 0.426
                     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1
                       DataSourceExec: file_groups={1 group: 
[[Users/irix/projects/third-party/datafusion-data/sample.parquet]]}, 
projection=[double_col], file_type=parquet, predicate=double_col@0 < 42 AND 
random() < 0.426, pruning_predicate=double_col_null_count@1 != row_count@2 AND 
double_col_min@0 < 42, required_guarantees=[]
   ```
   
   More details:
   <details>
   DataFusion CLI v48.0.0
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Function(Function { 
name: ObjectName([Identifier(Ident { value: "COUNT", quote_style: None, span: 
Span(Location(1,8)..Location(1,13)) })]), uses_odbc_syntax: false, parameters: 
None, args: List(FunctionArgumentList { duplicate_treatment: None, args: 
[Unnamed(Wildcard)], clauses: [] }), filter: None, null_treatment: None, over: 
None, within_group: [] })
   [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() 
TokenWithSpan { token: Word(Word { value: "from", quote_style: None, keyword: 
FROM }), span: Span(Location(1,17)..Location(1,21)) }
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Value(ValueWithSpan { 
value: Number("42.6", false), span: Span(Location(1,39)..Location(1,43)) })
   [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() 
TokenWithSpan { token: Word(Word { value: "where", quote_style: None, keyword: 
WHERE }), span: Span(Location(1,44)..Location(1,49)) }
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Identifier(Ident { 
value: "double_col", quote_style: None, span: 
Span(Location(1,50)..Location(1,60)) })
   [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() 
TokenWithSpan { token: Lt, span: Span(Location(1,61)..Location(1,62)) }
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 20
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Value(ValueWithSpan { 
value: Number("42.0", false), span: Span(Location(1,63)..Location(1,67)) })
   [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() 
TokenWithSpan { token: SemiColon, span: Span(Location(1,67)..Location(1,68)) }
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0
   [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() 
TokenWithSpan { token: SemiColon, span: Span(Location(1,67)..Location(1,68)) }
   [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] 
resolve_grouping_function:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: random() < Float64(42.6) / Float64(100)
             Filter: data.double_col < Float64(42)
               TableScan: data
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] type_coercion:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: random() < Float64(42.6) / Float64(100)
             Filter: data.double_col < Float64(42)
               TableScan: data
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Final analyzed plan:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: random() < Float64(42.6) / Float64(100)
             Filter: data.double_col < Float64(42)
               TableScan: data
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::analyzer] Analyzer took 0 
ms
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimizer input 
(pass 0):
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: random() < Float64(42.6) / Float64(100)
             Filter: data.double_col < Float64(42)
               TableScan: data
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_nested_union' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] 
simplify_expressions:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: random() < Float64(0.426)
             Filter: data.double_col < Float64(42)
               TableScan: data
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'replace_distinct_aggregate' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_join' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'decorrelate_predicate_subquery' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'scalar_subquery_to_join' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'decorrelate_lateral_join' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'extract_equijoin_predicate' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_duplicated_expr' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_filter' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_cross_join' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_limit' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'propagate_empty_relation' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_one_union' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'filter_null_join_keys' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_outer_join' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'push_down_limit' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] push_down_filter:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
             TableScan: data, partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'single_distinct_aggregation_to_group_by' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_group_by_constant' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'common_sub_expression_eliminate' (pass 0)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] 
optimize_projections:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimized plan 
(pass 0):
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimizer input 
(pass 1):
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_nested_union' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'simplify_expressions' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'replace_distinct_aggregate' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_join' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'decorrelate_predicate_subquery' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'scalar_subquery_to_join' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'decorrelate_lateral_join' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'extract_equijoin_predicate' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_duplicated_expr' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_filter' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_cross_join' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_limit' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'propagate_empty_relation' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_one_union' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'filter_null_join_keys' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_outer_join' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'push_down_limit' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] push_down_filter:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'single_distinct_aggregation_to_group_by' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'eliminate_group_by_constant' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged 
by optimizer rule 'common_sub_expression_eliminate' (pass 1)
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] 
optimize_projections:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimized plan 
(pass 1):
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] optimizer pass 
1 did not make changes
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Final optimized 
plan:
       Projection: count(Int64(1)) AS count(*)
         Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
           Projection:
             Filter: data.double_col < Float64(42) AND random() < Float64(0.426)
               TableScan: data projection=[double_col], 
partial_filters=[data.double_col < Float64(42)]
   
   [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Optimizer took 
7 ms
   [2025-06-07T18:14:32Z DEBUG datafusion::physical_planner] Input physical 
plan:
       ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
         AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
           AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
             ProjectionExec: expr=[]
               FilterExec: double_col@0 < 42 AND random() < 0.426
                 DataSourceExec: file_groups={1 group: 
[[Users/irix/projects/third-party/datafusion-data/sample.parquet]]}, 
projection=[double_col], file_type=parquet
   
   
   [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error 
building pruning expression: Error during planning: Multi-column expressions 
are not currently supported
   [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error 
building pruning expression: Error during planning: Multi-column expressions 
are not currently supported
   [2025-06-07T18:14:32Z DEBUG datafusion_datasource_parquet::page_filter] 
Ignoring always true page pruning predicate: random() < 0.426
   [2025-06-07T18:14:32Z DEBUG datafusion::physical_planner] Optimized physical 
plan:
       ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
         AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
           CoalescePartitionsExec
             AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
               ProjectionExec: expr=[]
                 CoalesceBatchesExec: target_batch_size=8192
                   FilterExec: double_col@0 < 42 AND random() < 0.426
                     RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1
                       DataSourceExec: file_groups={1 group: 
[[Users/irix/projects/third-party/datafusion-data/sample.parquet]]}, 
projection=[double_col], file_type=parquet, predicate=double_col@0 < 42 AND 
random() < 0.426, pruning_predicate=double_col_null_count@1 != row_count@2 AND 
double_col_min@0 < 42, required_guarantees=[]
   
   
   
   [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error 
building pruning expression: Error during planning: Multi-column expressions 
are not currently supported
   [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error 
building pruning expression: Error during planning: Multi-column expressions 
are not currently supported
   [2025-06-07T18:14:32Z DEBUG datafusion_datasource_parquet::page_filter] 
Ignoring always true page pruning predicate: random() < 0.426
   [2025-06-07T18:14:32Z DEBUG datafusion_datasource_parquet::page_filter] Use 
filter and page index to create RowSelection RowSelection { selectors: 
[RowSelector { row_count: 7300, skip: false }] } from predicate: BinaryExpr { 
left: BinaryExpr { left: Column { name: "double_col_null_count", index: 1 }, 
op: NotEq, right: Column { name: "row_count", index: 2 }, fail_on_overflow: 
false }, op: And, right: BinaryExpr { left: Column { name: "double_col_min", 
index: 0 }, op: Lt, right: Literal { value: Float64(42), field: Field { name: 
"42", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, 
metadata: {} } }, fail_on_overflow: false }, fail_on_overflow: false }
   +----------+
   | count(*) |
   +----------+
   | 1576     |
   +----------+
   1 row(s) fetched.
   Elapsed 0.042 seconds.
   </details>
   
   ## Are there any user-facing changes?
   
   No. The behaviour is changed only if a user specifies a new `SAMPLE` / 
`TABLESAMPLE` SQL syntax, which was not supported before.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Support data source sampling with TABLESAMPLE [datafusion]

Reply via email to