theirix opened a new pull request, #16325: URL: https://github.com/apache/datafusion/pull/16325
## Which issue does this PR close? - Closes #13563 ## Rationale for this change Explained in #13563 in detail with known syntax examples. Thanks to [changes to sqlparser](https://github.com/apache/datafusion-sqlparser-rs/pull/1580), it is now possible to use the `TABLESAMPLE` and `SAMPLE` constructs in a logical plan builder. Added a rewrite function to `datafusion-sql` which produces an additional `LogicalPlan::Filter` based on `TableSample` from sqlparser. There is no need to remove anything from SQL AST since it's not used anywhere. ## What changes are included in this PR? ## Are these changes tested? - Unit tests (added a few) - Regression tests (added to select.slt) - Manual test (see below) ```sql create external table data stored as parquet location 'sample.parquet'; select COUNT(*) from data TABLESAMPLE SYSTEM (13) where double_col < 42.0; ``` Leads to the initial logical plan ``` Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: random() < Float64(42.6) / Float64(100) Filter: data.double_col < Float64(42) TableScan: data ``` The physical plan is somewhat unusual, as volatile functions are pushed down to the data source (`datafusion.execution.parquet.pushdown_filters` is enabled by default), which was addressed in #13268. ``` ProjectionExec: expr=[count(Int64(1))@0 as count(*)] AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))] CoalescePartitionsExec AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] ProjectionExec: expr=[] CoalesceBatchesExec: target_batch_size=8192 FilterExec: double_col@0 < 42 AND random() < 0.426 RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1 DataSourceExec: file_groups={1 group: [[Users/irix/projects/third-party/datafusion-data/sample.parquet]]}, projection=[double_col], file_type=parquet, predicate=double_col@0 < 42 AND random() < 0.426, pruning_predicate=double_col_null_count@1 != row_count@2 AND double_col_min@0 < 42, required_guarantees=[] ``` More details: <details> DataFusion CLI v48.0.0 [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Function(Function { name: ObjectName([Identifier(Ident { value: "COUNT", quote_style: None, span: Span(Location(1,8)..Location(1,13)) })]), uses_odbc_syntax: false, parameters: None, args: List(FunctionArgumentList { duplicate_treatment: None, args: [Unnamed(Wildcard)], clauses: [] }), filter: None, null_treatment: None, over: None, within_group: [] }) [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() TokenWithSpan { token: Word(Word { value: "from", quote_style: None, keyword: FROM }), span: Span(Location(1,17)..Location(1,21)) } [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0 [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Value(ValueWithSpan { value: Number("42.6", false), span: Span(Location(1,39)..Location(1,43)) }) [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() TokenWithSpan { token: Word(Word { value: "where", quote_style: None, keyword: WHERE }), span: Span(Location(1,44)..Location(1,49)) } [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0 [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Identifier(Ident { value: "double_col", quote_style: None, span: Span(Location(1,50)..Location(1,60)) }) [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() TokenWithSpan { token: Lt, span: Span(Location(1,61)..Location(1,62)) } [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 20 [2025-06-07T18:14:32Z DEBUG sqlparser::parser] parsing expr [2025-06-07T18:14:32Z DEBUG sqlparser::parser] prefix: Value(ValueWithSpan { value: Number("42.0", false), span: Span(Location(1,63)..Location(1,67)) }) [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() TokenWithSpan { token: SemiColon, span: Span(Location(1,67)..Location(1,68)) } [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0 [2025-06-07T18:14:32Z DEBUG sqlparser::dialect] get_next_precedence_full() TokenWithSpan { token: SemiColon, span: Span(Location(1,67)..Location(1,68)) } [2025-06-07T18:14:32Z DEBUG sqlparser::parser] next precedence: 0 [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] resolve_grouping_function: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: random() < Float64(42.6) / Float64(100) Filter: data.double_col < Float64(42) TableScan: data [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] type_coercion: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: random() < Float64(42.6) / Float64(100) Filter: data.double_col < Float64(42) TableScan: data [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Final analyzed plan: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: random() < Float64(42.6) / Float64(100) Filter: data.double_col < Float64(42) TableScan: data [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::analyzer] Analyzer took 0 ms [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimizer input (pass 0): Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: random() < Float64(42.6) / Float64(100) Filter: data.double_col < Float64(42) TableScan: data [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_nested_union' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] simplify_expressions: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: random() < Float64(0.426) Filter: data.double_col < Float64(42) TableScan: data [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'replace_distinct_aggregate' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_join' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'decorrelate_predicate_subquery' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'scalar_subquery_to_join' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'decorrelate_lateral_join' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'extract_equijoin_predicate' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_duplicated_expr' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_filter' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_cross_join' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_limit' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'propagate_empty_relation' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_one_union' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'filter_null_join_keys' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_outer_join' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'push_down_limit' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] push_down_filter: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data, partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'single_distinct_aggregation_to_group_by' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_group_by_constant' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'common_sub_expression_eliminate' (pass 0) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] optimize_projections: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimized plan (pass 0): Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimizer input (pass 1): Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_nested_union' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'simplify_expressions' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'replace_distinct_aggregate' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_join' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'decorrelate_predicate_subquery' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'scalar_subquery_to_join' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'decorrelate_lateral_join' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'extract_equijoin_predicate' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_duplicated_expr' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_filter' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_cross_join' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_limit' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'propagate_empty_relation' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_one_union' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'filter_null_join_keys' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_outer_join' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'push_down_limit' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] push_down_filter: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'single_distinct_aggregation_to_group_by' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'eliminate_group_by_constant' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Plan unchanged by optimizer rule 'common_sub_expression_eliminate' (pass 1) [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] optimize_projections: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Optimized plan (pass 1): Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] optimizer pass 1 did not make changes [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::utils] Final optimized plan: Projection: count(Int64(1)) AS count(*) Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] Projection: Filter: data.double_col < Float64(42) AND random() < Float64(0.426) TableScan: data projection=[double_col], partial_filters=[data.double_col < Float64(42)] [2025-06-07T18:14:32Z DEBUG datafusion_optimizer::optimizer] Optimizer took 7 ms [2025-06-07T18:14:32Z DEBUG datafusion::physical_planner] Input physical plan: ProjectionExec: expr=[count(Int64(1))@0 as count(*)] AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))] AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] ProjectionExec: expr=[] FilterExec: double_col@0 < 42 AND random() < 0.426 DataSourceExec: file_groups={1 group: [[Users/irix/projects/third-party/datafusion-data/sample.parquet]]}, projection=[double_col], file_type=parquet [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error building pruning expression: Error during planning: Multi-column expressions are not currently supported [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error building pruning expression: Error during planning: Multi-column expressions are not currently supported [2025-06-07T18:14:32Z DEBUG datafusion_datasource_parquet::page_filter] Ignoring always true page pruning predicate: random() < 0.426 [2025-06-07T18:14:32Z DEBUG datafusion::physical_planner] Optimized physical plan: ProjectionExec: expr=[count(Int64(1))@0 as count(*)] AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))] CoalescePartitionsExec AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))] ProjectionExec: expr=[] CoalesceBatchesExec: target_batch_size=8192 FilterExec: double_col@0 < 42 AND random() < 0.426 RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1 DataSourceExec: file_groups={1 group: [[Users/irix/projects/third-party/datafusion-data/sample.parquet]]}, projection=[double_col], file_type=parquet, predicate=double_col@0 < 42 AND random() < 0.426, pruning_predicate=double_col_null_count@1 != row_count@2 AND double_col_min@0 < 42, required_guarantees=[] [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error building pruning expression: Error during planning: Multi-column expressions are not currently supported [2025-06-07T18:14:32Z DEBUG datafusion_physical_optimizer::pruning] Error building pruning expression: Error during planning: Multi-column expressions are not currently supported [2025-06-07T18:14:32Z DEBUG datafusion_datasource_parquet::page_filter] Ignoring always true page pruning predicate: random() < 0.426 [2025-06-07T18:14:32Z DEBUG datafusion_datasource_parquet::page_filter] Use filter and page index to create RowSelection RowSelection { selectors: [RowSelector { row_count: 7300, skip: false }] } from predicate: BinaryExpr { left: BinaryExpr { left: Column { name: "double_col_null_count", index: 1 }, op: NotEq, right: Column { name: "row_count", index: 2 }, fail_on_overflow: false }, op: And, right: BinaryExpr { left: Column { name: "double_col_min", index: 0 }, op: Lt, right: Literal { value: Float64(42), field: Field { name: "42", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, fail_on_overflow: false }, fail_on_overflow: false } +----------+ | count(*) | +----------+ | 1576 | +----------+ 1 row(s) fetched. Elapsed 0.042 seconds. </details> ## Are there any user-facing changes? No. The behaviour is changed only if a user specifies a new `SAMPLE` / `TABLESAMPLE` SQL syntax, which was not supported before. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org