alamb opened a new issue, #12115:
URL: https://github.com/apache/datafusion/issues/12115

   ### Is your feature request related to a problem or challenge?
   
   We have several forms of predicate pushdown in DataFusion's Parquet reader. 
The code path taken depends on the exact data layout and the predicates defined.
   
   @itsjunetime  is working on https://github.com/apache/datafusion/issues/4028 
to improve performance by being more clever about some of these predicates. 
   
   The code path currently taken depends on:
   1. Row group size
   2. Sort order of the data within the file
   3. File repartitioning size (how many partitions are read)
   4. Number of row groups
   5. Data page size
   6. Whether predicate pushdown is enabled
   7. Whether predicate reordering is enabled
   
   
   ### Describe the solution you'd like
   
   I would like additional correctness test coverage for reading Parquet 
files with the various forms of pushdown enabled, because each combination of 
settings can exercise a different code path.
   
   ### Describe alternatives you've considered
   
   I would like to have a test that:
   1. Creates multiple Parquet files with different orderings, row group 
distributions, etc.
   2. Runs the same query against each file (the same logical input)
   3. Compares the results of the different runs and ensures they are the same
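
The comparison in step 3 could be sketched as below. This is only an illustration, not DataFusion code: `Row`, `normalize`, and `mismatched_runs` are hypothetical names, rows are assumed to be rendered as strings, and the query is assumed to have no `ORDER BY`, so results are compared without regard to row order.

```rust
/// One result row, rendered as strings so rows from any schema can be compared.
/// (Hypothetical representation, chosen only for this sketch.)
type Row = Vec<String>;

/// Sort the rows so that results differing only in row order compare equal.
fn normalize(mut rows: Vec<Row>) -> Vec<Row> {
    rows.sort();
    rows
}

/// Treat the first run as the baseline and return the names of all other
/// runs whose rows differ from it. An empty result means every run agreed.
fn mismatched_runs(runs: &[(String, Vec<Row>)]) -> Vec<String> {
    let Some((_, first)) = runs.first() else {
        return Vec::new();
    };
    let expected = normalize(first.clone());
    runs.iter()
        .skip(1)
        .filter(|(_, rows)| normalize(rows.clone()) != expected)
        .map(|(name, _)| name.clone())
        .collect()
}
```

A test would collect one `(name, rows)` pair per file or configuration and assert that `mismatched_runs` returns an empty list. Reporting the names of the disagreeing runs makes it easier to see which layout or setting triggered the difference.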
   
   
   Parameters to check:
   1. Row group size
   2. Sort order
   3. Number of row groups
   4. Data page size
   5. Use predicate pushdown
   6. Use predicate reordering
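
One way to cover these parameters is to enumerate every combination and run the same query under each. The sketch below only builds that list. `Scenario` and `all_scenarios` are hypothetical names, the fields are illustrative rather than actual DataFusion or Parquet option names, and sort order is simplified to a single sorted/unsorted flag. Row group size and number of row groups are treated as independent here, although for a fixed input they may be coupled in practice.

```rust
/// Hypothetical description of one test scenario. Field names are
/// illustrative only and are not real configuration keys.
#[derive(Debug, Clone, PartialEq)]
struct Scenario {
    row_group_size: usize,
    num_row_groups: usize,
    data_page_size: usize,
    sorted: bool,
    pushdown_filters: bool,
    reorder_filters: bool,
}

/// Build the full cross product of the given sizes with the three
/// boolean switches (sorted, pushdown, reordering).
fn all_scenarios(
    row_group_sizes: &[usize],
    num_row_groups: &[usize],
    data_page_sizes: &[usize],
) -> Vec<Scenario> {
    let mut out = Vec::new();
    for &row_group_size in row_group_sizes {
        for &num_row_groups in num_row_groups {
            for &data_page_size in data_page_sizes {
                for sorted in [false, true] {
                    for pushdown_filters in [false, true] {
                        for reorder_filters in [false, true] {
                            out.push(Scenario {
                                row_group_size,
                                num_row_groups,
                                data_page_size,
                                sorted,
                                pushdown_filters,
                                reorder_filters,
                            });
                        }
                    }
                }
            }
        }
    }
    out
}
```

The number of scenarios grows multiplicatively (each size list contributes its length, and the three flags contribute a factor of eight), so a real test would likely keep the size lists short.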
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

