alamb opened a new issue, #16365: URL: https://github.com/apache/datafusion/issues/16365
### Is your feature request related to a problem or challenge? - Part of https://github.com/apache/datafusion/pull/16300/files While testing https://github.com/apache/datafusion/pull/16300, I (re-noticed) that datafusion-cli is really slow when reading remote files For example, The initial table creation takes 7.5 seconds on my pretty crappy connection: ```shell DataFusion CLI v48.0.0 > CREATE EXTERNAL TABLE nyc_taxi_rides STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/'; 0 row(s) fetched. Elapsed 7.492 seconds. ``` However then simple queries just to get the count take 8-10 seconds 😱 ```sql > select count(*) from nyc_taxi_rides; +------------+ | count(*) | +------------+ | 1310903963 | +------------+ 1 row(s) fetched. Elapsed 8.945 seconds. ``` ```sql > select count(*) from nyc_taxi_rides; +------------+ | count(*) | +------------+ | 1310903963 | +------------+ 1 row(s) fetched. Elapsed 10.456 seconds. ``` I am almost certain this delay is due to having to read the footers of the parquet files for each query. Note the speed is much faster when `collect_statistics` is on ```sl > set datafusion.execution.collect_statistics = true; 0 row(s) fetched. Elapsed 0.012 seconds. > CREATE EXTERNAL TABLE nyc_taxi_rides STORED AS PARQUET LOCATION 's3://altinity-clickhouse-data/nyc_taxi_rides/data/tripdata_parquet/'; 0 row(s) fetched. Elapsed 7.770 seconds. ``` The first query is still slow for some reason: (5 seconds) ```sql > select count(*) from nyc_taxi_rides; +------------+ | count(*) | +------------+ | 1310903963 | +------------+ 1 row(s) fetched. Elapsed 5.114 seconds. ``` But subsequent queries are quite fast: ```sql > select count(*) from nyc_taxi_rides; +------------+ | count(*) | +------------+ | 1310903963 | +------------+ 1 row(s) fetched. Elapsed 0.297 seconds. ### Describe the solution you'd like I would like: 1. `datafusion-cli` to be faster for such queries 2. `datafusion-cli` to be an easy to follow model for how to cache metadata when working with ListingTabe that others who build with DataFusion could follow ### Describe alternatives you've considered I think the obvious thing that is needed is a cache for the ParquetMetadata I think the actual cache should be in `datafusion-cli` but NOT in the datafusion core crate as I think what and how to cache will be different across systems. What I envision is: 1. APIs in the ListingTable / RuntimeEnv / etc for adding caching of ParquetMetadata 2. An implementation of those APIs in `datafusion-cli` 3. Bonus Points: documentation / examples that show how to use those APIs in other system There is some vestigal code in the cache_manager crate that I think could provide a home for such caching APIs: * https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/index.html ### Additional context Related issues - https://github.com/apache/datafusion/issues/15585 - https://github.com/apache/datafusion/issues/15582 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org