[I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

via GitHub Tue, 10 Jun 2025 21:48:34 -0700


alamb opened a new issue, #16365:
URL: https://github.com/apache/datafusion/issues/16365


   ### Is your feature request related to a problem or challenge?
   
   - Part of https://github.com/apache/datafusion/pull/16300/files
   
   While testing https://github.com/apache/datafusion/pull/16300, I 
(re-noticed) that datafusion-cli is really slow when reading remote files
   
   For example, 
   
   The initial table creation takes 7.5 seconds on my pretty crappy connection:
   ```shell
   DataFusion CLI v48.0.0
   > CREATE EXTERNAL TABLE nyc_taxi_rides
   STORED AS PARQUET LOCATION 
's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
   
   0 row(s) fetched.
   Elapsed 7.492 seconds.
   ```
   
   However then simple queries just to get the count take 8-10 seconds 😱 
   
   ```sql
   > select count(*) from nyc_taxi_rides;
   +------------+
   | count(*)   |
   +------------+
   | 1310903963 |
   +------------+
   1 row(s) fetched.
   Elapsed 8.945 seconds.
   ```
   
   ```sql
   > select count(*) from nyc_taxi_rides;
   +------------+
   | count(*)   |
   +------------+
   | 1310903963 |
   +------------+
   1 row(s) fetched.
   Elapsed 10.456 seconds.
   ```
   
   I am almost certain this delay is due to having to read the footers of the 
parquet files for each query. 
   
   Note the speed is much faster when `collect_statistics` is on
   
   ```sl
   > set datafusion.execution.collect_statistics = true;
   0 row(s) fetched.
   Elapsed 0.012 seconds.
   
   > CREATE EXTERNAL TABLE nyc_taxi_rides
   STORED AS PARQUET LOCATION 
's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/';
   0 row(s) fetched.
   Elapsed 7.770 seconds.
   ```
   
   The first query is still slow for some reason: (5 seconds)
   ```sql
   > select count(*) from nyc_taxi_rides;
   +------------+
   | count(*)   |
   +------------+
   | 1310903963 |
   +------------+
   1 row(s) fetched.
   Elapsed 5.114 seconds.
   ```
   
   But subsequent queries are quite fast:
   
   ```sql
   > select count(*) from nyc_taxi_rides;
   +------------+
   | count(*)   |
   +------------+
   | 1310903963 |
   +------------+
   1 row(s) fetched.
   Elapsed 0.297 seconds.
   
   ### Describe the solution you'd like
   
   I would like:
   1. `datafusion-cli` to be faster for such queries
   2. `datafusion-cli` to be an easy to follow model for how to cache metadata 
when working with ListingTabe that others who build with DataFusion could follow
   
   ### Describe alternatives you've considered
   
   I think the obvious thing that is needed is a cache for the ParquetMetadata
   
   I think the actual cache should be in `datafusion-cli` but NOT in the 
datafusion core crate as I think what and how to cache will be different across 
systems. 
   
   What I envision is:
   1. APIs in the ListingTable / RuntimeEnv / etc for adding caching of 
ParquetMetadata
   2. An implementation of those APIs in `datafusion-cli`
   3. Bonus Points: documentation / examples that show how to use those APIs in 
other system
   
   
   There is some vestigal code in the cache_manager crate that I think could 
provide a home for such caching APIs: * 
https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/index.html
   
   
   
   ### Additional context
   
   Related issues
   - https://github.com/apache/datafusion/issues/15585
   - https://github.com/apache/datafusion/issues/15582


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

Reply via email to