BlakeOrth commented on issue #16365:
URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3165993099

   I've been investigating performance when using the `ListingTable` with 
remote storage, and since `datafusion-cli` ultimately uses the `ListingTable` 
I'm curious if my findings might be interesting and related to this issue. Most 
of my focus has been on using the `ListingTable` with hive partitioned data. I 
observed there was a very large discrepancy in the performance between hive 
partitioned datasets and flat partitioned datasets, even when the number of 
underlying objects was similar. This was actually noted in an unresolved bug 
from about a year ago #9654 .
   
   I put a few simple targeted timing debug prints in the code and learned that 
in many cases most of the time spent for remote queries is dominated by listing 
the files in the backing object store. Due to the current implementation around 
listing files for hive partitioned data this can often be doubly true for such 
datasets (the depth of the partitioning structure impacts the number of 
round-trip latency costs that are paid every query). Here are some examples: 
   ```sql
   DataFusion CLI v49.0.0
   > CREATE EXTERNAL TABLE athena_partitioned
   STORED AS PARQUET LOCATION 
's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
   0 row(s) fetched.
   Elapsed 4.382 seconds.
   
   > select count(*) from athena_partitioned;
   file_list duration: 0.0110909995ms
   full group files duration: 2028.1296ms
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 2.031 seconds.
   
   > select count(*) from athena_partitioned;
   file_list duration: 0.0058609997ms
   -- NOTE: this is likely substantially faster due to server side caching of 
the list request
   -- A similar pattern can be seen for nearly all of these examples
   full group files duration: 156.28888ms
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 0.159 seconds.
   
   > select count(*) from 
's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
   file_list duration: 0.004357ms
   full group files duration: 1681.8896ms
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 4.231 seconds.
   
   > select count(*) from 
's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
   file_list duration: 0.004645ms
   full group files duration: 1461.2577ms
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 3.538 seconds.
   
   > CREATE EXTERNAL TABLE overture_maps
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-07-23.0/';
   0 row(s) fetched.
   Elapsed 10.764 seconds.
   
   > select count(*) from overture_maps where type='address';
   list_partitions_from_paths: listing from release/2025-07-23.0
   list_partitions_from_paths: found 512 files
   list_partitions_from_paths: built 22 partitions
   Listed 22 partitions in 136.81548ms
   Pruning yielded 1 partitions in 0.147322ms
   file_list duration: 136.97224ms
   full group files duration: 353.54166ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.360 seconds.
   
   > select count(*) from overture_maps where type='address';
   list_partitions_from_paths: listing from release/2025-07-23.0
   list_partitions_from_paths: found 512 files
   list_partitions_from_paths: built 22 partitions
   Listed 22 partitions in 181.16426ms
   Pruning yielded 1 partitions in 0.092711ms
   file_list duration: 181.26404ms
   full group files duration: 181.33081ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.186 seconds.
   
   > select count(*) from overture_maps where type='address' and 
theme='addresses';
   list_partitions_from_paths: listing from 
release/2025-07-23.0/theme=addresses/type=address
   list_partitions_from_paths: found 18 files
   list_partitions_from_paths: built 2 partitions
   Listed 2 partitions in 126.23396ms
   Pruning yielded 1 partitions in 0.084972ms
   file_list duration: 126.32848ms
   full group files duration: 164.76274ms
   +-----------+
   | count(*)  |
   +-----------+
   | 446544475 |
   +-----------+
   1 row(s) fetched.
   Elapsed 0.170 seconds.
   ```
   
   I have a POC that, at least initially, appears to normalize the performance 
between hive partitioned datasets and flat partitioned datasets and would allow 
both partitioned and flat datasets a like to leverage the existing 
`ListFilesCache` for users that create a `ListingTable` manually (such as 
myself). Would there be interest in looking further at the POC or discussing 
additional strategies for normalizing and reducing the amount of time spent 
listing objects?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to