BlakeOrth commented on issue #16365:
URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3177075544

   @alamb I have some freshly minted test results from using my POC for 
normalizing the access patterns of hive partitioned and flat datasets when 
using the `ListingTable`, together with enabling the existing list files cache 
by default and enabling the new parquet metadata cache. Obviously, getting a 
(very rough) concept of either of these workflows isn't the hard part; as 
you've already mentioned, communicating the expected behavior of the default 
configuration to users is going to be the difficult thing. However, I think 
the performance here likely speaks for itself for simple queries:
   
   Current performance (copied from above):
   ```sql
   > CREATE EXTERNAL TABLE overture_maps
   STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/';
   0 row(s) fetched.
   Elapsed 10.764 seconds.
   
   > select count(*) from overture_maps where type='address';
   list_partitions_from_paths: listing from release/2025-07-23.0
   list_partitions_from_paths: found 512 files
   list_partitions_from_paths: built 22 partitions
   Listed 22 partitions in 136.81548ms
   Pruning yielded 1 partitions in 0.147322ms
   file_list duration: 136.97224ms
   full group files duration: 353.54166ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.360 seconds.
   
   > select count(*) from overture_maps where type='address';
   list_partitions_from_paths: listing from release/2025-07-23.0
   list_partitions_from_paths: found 512 files
   list_partitions_from_paths: built 22 partitions
   Listed 22 partitions in 181.16426ms
   Pruning yielded 1 partitions in 0.092711ms
   file_list duration: 181.26404ms
   full group files duration: 181.33081ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.186 seconds.
   ```
   
   POC performance:
   ```sql
   DataFusion CLI v49.0.0
   > set datafusion.execution.parquet.cache_metadata = true;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > CREATE EXTERNAL TABLE overture_maps
   STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/theme=addresses/';
   0 row(s) fetched.
   Elapsed 1.134 seconds.
   
   > select count(*) from overture_maps where type='address';
   OPTIMIZED: Starting file listing from: release/2025-07-23.0/theme=addresses
   OPTIMIZED: Listed all files in 0.0040539997ms
   file_list duration: 0.033161ms
   full group files duration: 340.62543ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.342 seconds.
   
   > select count(*) from overture_maps where type='address';
   OPTIMIZED: Starting file listing from: release/2025-07-23.0/theme=addresses
   OPTIMIZED: Listed all files in 0.004288ms
   file_list duration: 0.017247ms
   full group files duration: 0.334348ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.001 seconds.
   ```
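
For a quick side-by-side, here is a minimal sketch that turns the `Elapsed` values from the two sessions above into rough speedup ratios. The dictionary keys and the `speedup` helper are just illustrative names, not anything from DataFusion. The `CREATE EXTERNAL TABLE` times are left out on purpose, since the two sessions point at different locations, and the 0.001 second figure is at the resolution of the CLI timer, so the second ratio should be read as an order of magnitude rather than a precise number.

```python
# Elapsed times in seconds, copied from the two CLI sessions above.
current = {"first query": 0.360, "repeat query": 0.186}
poc = {"first query": 0.342, "repeat query": 0.001}


def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is compared to `before`."""
    return before / after


# Ratio of current elapsed time to POC elapsed time for each query.
ratios = {name: speedup(current[name], poc[name]) for name in current}

for name, ratio in ratios.items():
    print(f"{name}: {current[name]:.3f}s -> {poc[name]:.3f}s (~{ratio:.1f}x)")
```

The first query is roughly unchanged because the caches are still cold, while the repeated query is where the cached file listing and parquet metadata show up.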


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

