BlakeOrth commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3177075544
@alamb I have some freshly minted test results from using both my POC for normalizing the access patterns of hive-partitioned and flat datasets when using the `ListingTable`, and enabling the existing list files cache by default along with the new parquet metadata cache. Obviously, getting a (very rough) concept of either of these workflows isn't the hard part; as you've already mentioned, communicating the expected behavior of the default configuration to users is going to be the difficult thing. However, I think the performance here likely speaks for itself for simple queries:

Current performance (copied from above):

```sql
> CREATE EXTERNAL TABLE overture_maps STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/';
0 row(s) fetched.
Elapsed 10.764 seconds.

> select count(*) from overture_maps where type='address';
list_partitions_from_paths: listing from release/2025-07-23.0
list_partitions_from_paths: found 512 files
list_partitions_from_paths: built 22 partitions
Listed 22 partitions in 136.81548ms
Pruning yielded 1 partitions in 0.147322ms
file_list duration: 136.97224ms
full group files duration: 353.54166ms
+-----------+
| count(*)  |
+-----------+
| 446544475 |
+-----------+
1 row(s) fetched.
Elapsed 0.360 seconds.

> select count(*) from overture_maps where type='address';
list_partitions_from_paths: listing from release/2025-07-23.0
list_partitions_from_paths: found 512 files
list_partitions_from_paths: built 22 partitions
Listed 22 partitions in 181.16426ms
Pruning yielded 1 partitions in 0.092711ms
file_list duration: 181.26404ms
full group files duration: 181.33081ms
+-----------+
| count(*)  |
+-----------+
| 446544475 |
+-----------+
1 row(s) fetched.
Elapsed 0.186 seconds.
```

POC performance:

```sql
DataFusion CLI v49.0.0
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE EXTERNAL TABLE overture_maps STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/theme=addresses/';
0 row(s) fetched.
Elapsed 1.134 seconds.

> select count(*) from overture_maps where type='address';
OPTIMIZED: Starting file listing from: release/2025-07-23.0/theme=addresses
OPTIMIZED: Listed all files in 0.0040539997ms
file_list duration: 0.033161ms
full group files duration: 340.62543ms
+-----------+
| count(*)  |
+-----------+
| 446544475 |
+-----------+
1 row(s) fetched.
Elapsed 0.342 seconds.

> select count(*) from overture_maps where type='address';
OPTIMIZED: Starting file listing from: release/2025-07-23.0/theme=addresses
OPTIMIZED: Listed all files in 0.004288ms
file_list duration: 0.017247ms
full group files duration: 0.334348ms
+-----------+
| count(*)  |
+-----------+
| 446544475 |
+-----------+
1 row(s) fetched.
Elapsed 0.001 seconds.
```
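
For readers unfamiliar with what the `list_partitions_from_paths` debug lines in the first transcript describe (list files, build partitions, prune), here is a minimal Python sketch of hive-style partition discovery and pruning. It is only an illustration of the general idea, not DataFusion's `ListingTable` implementation: the function names and the example file paths are hypothetical, and the real code works against an object store rather than an in-memory list.

```python
from collections import defaultdict


def parse_partition_values(path: str) -> dict[str, str]:
    """Extract `key=value` directory segments from a hive-style path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def group_by_partition(paths):
    """Group file paths by their full set of partition values."""
    partitions = defaultdict(list)
    for p in paths:
        key = tuple(sorted(parse_partition_values(p).items()))
        partitions[key].append(p)
    return dict(partitions)


def prune(partitions, column, wanted):
    """Keep only partitions whose value for `column` equals `wanted`."""
    return {k: v for k, v in partitions.items() if dict(k).get(column) == wanted}


# Hypothetical listing result; these paths are made up for illustration.
files = [
    "release/example/theme=addresses/type=address/part-0.parquet",
    "release/example/theme=addresses/type=address/part-1.parquet",
    "release/example/theme=base/type=water/part-0.parquet",
    "release/example/theme=places/type=place/part-0.parquet",
]

partitions = group_by_partition(files)
kept = prune(partitions, "type", "address")
print(f"built {len(partitions)} partitions, pruning kept {len(kept)}")
```

In this sketch the listing step is the only part that would touch remote storage, which is consistent with the transcripts above, where `file_list duration` is the figure that changes most between the current and POC runs.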