connec opened a new issue, #15179:
URL: https://github.com/apache/datafusion/issues/15179

   ### Is your feature request related to a problem or challenge?
   
   We are using DataFusion as described in #13456 – i.e. querying Parquet files 
directly from object storage (S3). I've recently been exploring the possibility 
of introducing cached parquet metadata into query planning and execution 
(related: #12592).
   
   After a bit of a journey, I was able to get a pre-deserialized 
`ParquetMetaData` plumbed into a custom `TableProvider`, which would then refer 
to the ready-set metadata throughout planning and execution (for execution, a 
custom `ParquetFileReaderFactory` and `AsyncFileReader` are also required).
   
   For queries with a filter (e.g. `column = value`), this had the expected 
effect of removing the metadata calls from our traces – success!
   
   However, I was surprised to see that traces for queries *without* a filter 
(only a limit) showed dramatically different traces. Specifically:
   
   - Using the default setup, one query would make **6** object store calls (5 
for metadata and 1 for data), and would scan **~137Mb** of data.
   - Using the caching setup, the same query would make **29** object store 
calls (all for data), and would scan **~4.2Mb** of data.
   
   I had thought my changes to be fairly surgical, and would not have 
introduced this kind of difference in the query execution. After picking 
through the implementation again, I think this is caused by this logic in 
`ParquetOpener`:
   
   
https://github.com/apache/datafusion/blob/efb75f3fe34b0de51ff7aed1959451202e7733a4/datafusion/datasource-parquet/src/opener.rs#L119-L122
   
   This calls `should_enable_page_index`, which is defined as:
   
   
https://github.com/apache/datafusion/blob/efb75f3fe34b0de51ff7aed1959451202e7733a4/datafusion/datasource-parquet/src/mod.rs#L537-L547
   
   The resulting `enable_page_index` seems to ultimately be what controls 
whether the page index is read and used inside `ParquetOpener`.
   
   In my caching implementation, the whole `ParquetMetaData` is always 
available and so presumably some of the range pruning logic inside 
`ParquetOpener` has an effect, where otherwise that metadata would not be 
loaded.
   
   For the example query above, this leads to a significant (~30%) reduction in 
latency.
   
   ### Describe the solution you'd like
   
   The latency reduction we have observed comes almost entirely from using the 
page index even though there's no predicate.
   
   I.e., although caching does reduce the number of object calls for otherwise 
like-for-like execution plans (e.g. with a predicate), it has a small impact on 
latency.
   
   For our use-case, we would get most of the benefit if there was some way to 
"force" the use of the page index, even without a pruning predicate (and/or for 
this decision to made heuristically based on the likely savings).
   
   And/or, the documentation around 
[`enable_page_index`](https://docs.rs/datafusion/latest/datafusion/config/struct.ParquetOptions.html#structfield.enable_page_index)
 could clarify that actual use of the metadata will be heuristic.
   
   ### Describe alternatives you've considered
   
   Even without caching, it still takes quite a lot of plumbing to get the page 
index to be used. The smallest implementation I could come up with uses:
   
   - A `FileFormat`, whose `create_physical_plan` injects a custom...
   - `ParquetFileReaderFactory`, whose `create_reader`:
     -  Builds a `ParquetObjectReader` and sets 
`with_preload_column_index`/`with_preload_offset_index` to `true`
     - Returns a custom...
   - `AsyncFileReader` (not necessary for functionality, but 
`ParquetFileReader` used by `DefaultParquetFileReaderFactory` is private).
   
   This is ~200 lines of boilerplate consisting mostly of trait forwarding and 
copied implementations in order to add 
`.with_preload_column_index(true).with_preload_offset_index(true)` at the right 
point.
   
   I've also not done a thorough investigation into the performance – it's 
possible that not using the page index is the best choice in the general case.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to