Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

via GitHub Wed, 18 Jun 2025 12:55:52 -0700


theirix commented on PR #16325:
URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2985522134


   > According to PostgreSQL's reference: 
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation#SYSTEM_Option I 
believe the `SYSTEM` option is equivalent to keeping or dropping an entire 
`RecordBatch` with the specified probability. The rewrite rule implemented here 
samples row by row, which follows the behavior of the `BERNOULLI` option. Since 
DataFusion has vectorized execution, evaluating a `random() < x` filter should be 
efficient, so I think we can apply this implementation to both the `SYSTEM` and 
`BERNOULLI` options to keep it simple.
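
To make the difference between the two options concrete, here is a minimal sketch in plain Python, not DataFusion code. The function names `sample_system` and `sample_bernoulli` are hypothetical, and lists of integers stand in for `RecordBatch`es. It assumes `SYSTEM` keeps or drops each batch as a whole, while `BERNOULLI` applies an independent `random() < p` test to every row.

```python
import random

def sample_system(batches, p, rng):
    """Block-level sampling: keep or drop each batch as a whole."""
    return [row for batch in batches if rng.random() < p for row in batch]

def sample_bernoulli(batches, p, rng):
    """Row-level sampling: each row passes an independent `random() < p` test."""
    return [row for batch in batches for row in batch if rng.random() < p]

rng = random.Random(42)
# 100 toy batches of 100 rows each, standing in for RecordBatches
batches = [list(range(i * 100, (i + 1) * 100)) for i in range(100)]
system_rows = sample_system(batches, 0.1, rng)
bernoulli_rows = sample_bernoulli(batches, 0.1, rng)
```

Both variants return roughly 10% of the rows on average, but the `SYSTEM`-style sample always consists of complete batches, so its size varies more and its rows are clustered.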
   
   @2010YOUY01 I'd like to double-check whether pushing a volatile filter down 
to a Parquet executor is expected. In the mentioned PR, I disabled the pushdown 
of volatile predicates in the logical plan optimiser, but the physical 
optimiser still seems to push this predicate down to the executor. While that 
gives us automatic sampling, the results could be wrong. What do you think: 
should we implement a similar mechanism to mark volatile predicates as 
unsupported filters?
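
As a rough illustration of the idea of treating volatile predicates as unsupported filters, here is a toy model in plain Python. It does not use DataFusion's API: the `Predicate` class, the `volatile` flag, the `split_for_pushdown` function, and the `a > 10` predicate are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    sql: str
    volatile: bool  # True if the expression calls a volatile function such as random()

def split_for_pushdown(predicates):
    """Return (pushed, kept): volatile predicates are treated as unsupported
    by the source, so they stay in the filter above the scan."""
    pushed = [p for p in predicates if not p.volatile]
    kept = [p for p in predicates if p.volatile]
    return pushed, kept

# Hypothetical predicates: one deterministic, one volatile
preds = [
    Predicate("a > 10", volatile=False),
    Predicate("random() < 0.1", volatile=True),
]
pushed, kept = split_for_pushdown(preds)
```

Under this assumption only `a > 10` would reach the data source, and `random() < 0.1` would remain in a filter node that evaluates it once per row.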
   
   Before:
   ```
   [2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical plan by LimitedDistinctAggregation:
       OutputRequirementExec
         ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
           AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
             AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
               FilterExec: random() < 0.1
                 DataSourceExec: file_groups={1 group: [[sample.parquet]]}, file_type=parquet
   ```
   
   After:
   ```
   [2025-06-18T18:20:07Z TRACE datafusion::physical_planner] Optimized physical plan by FilterPushdown:
       OutputRequirementExec
         ProjectionExec: expr=[count(Int64(1))@0 as count(*)]
           AggregateExec: mode=Final, gby=[], aggr=[count(Int64(1))]
             AggregateExec: mode=Partial, gby=[], aggr=[count(Int64(1))]
               DataSourceExec: file_groups={1 group: [[sample.parquet]]}, file_type=parquet, predicate=random() < 0.1
   ```
   
   Data:
   <details>
   set datafusion.execution.parquet.pushdown_filters=true;
   create external table data stored as parquet location 'sample.parquet';
   SELECT count(*) FROM data WHERE random() < 0.1;
   </details>
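
The comment does not spell out how the results could go wrong. One possible failure mode, assumed here purely for illustration and not confirmed for DataFusion, is that a pushed-down volatile predicate ends up being evaluated more than once per row (for example once inside the scan and once in a retained filter). The plain-Python simulation below, with the hypothetical helper `count_sampled`, shows how that would shrink the effective sampling rate from `p` to `p * p`.

```python
import random

def count_sampled(n_rows, p, evaluations, rng):
    """Count rows that pass `random() < p` in every one of
    `evaluations` independent checks."""
    return sum(
        all(rng.random() < p for _ in range(evaluations))
        for _ in range(n_rows)
    )

rng = random.Random(0)
n_rows = 100_000
once = count_sampled(n_rows, 0.1, 1, rng)   # expected fraction: p = 0.1
twice = count_sampled(n_rows, 0.1, 2, rng)  # expected fraction: p * p = 0.01
```

With a deterministic predicate a second evaluation would be harmless, which is why volatility is the property that matters for pushdown.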
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

