BlakeOrth commented on issue #17211:
URL: https://github.com/apache/datafusion/issues/17211#issuecomment-3272025207

   Ok, I've done some data collection using the in progress instrumented object 
store that illustrates some differences in request patterns. I'll share and go 
over the entire output, including the setup so it can be reproduced. Note that 
all of the following sections were run in a single `datafusion-cli` session, 
I'm just breaking it up a bit to make it easier to understand and discuss. One 
item to note is I'm running a `debug` build instead of `release-nonlto` because 
I think we're mostly interested in the IO operations, not necessarily whether 
or not they're running at optimized speeds.
   
   ## Analysis
   ### Table Setup
   ```sql
   DataFusion CLI v50.0.0
   > set datafusion.execution.listing_table_factory_infer_partitions=false;
   0 row(s) fetched.
   Elapsed 0.002 seconds.
   
   > \object_store_profiling summary
   ObjectStore Profile mode set to Summary
   > CREATE EXTERNAL TABLE overture_flat
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-07-23.0/';
   0 row(s) fetched.
   Elapsed 34.223 seconds.
   
   Object Store Profiling
   List Summary:
     count: 1
   
   List Summary:
     count: 1
   
   Get Summary:
     count: 1536
     duration min: 0.045644s
     duration max: 0.593886s
     duration avg: 0.101914s
     size min: 8 B
     size max: 1328116 B
     size avg: 329311 B
     size sum: 505822728 B
   
   > set datafusion.execution.listing_table_factory_infer_partitions=true;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   Object Store Profiling
   > CREATE EXTERNAL TABLE overture_partitioned
   STORED AS PARQUET LOCATION 
's3://overturemaps-us-west-2/release/2025-07-23.0/';
   0 row(s) fetched.
   Elapsed 35.500 seconds.
   
   Object Store Profiling
   List Summary:
     count: 1
   
   List Summary:
     count: 3
   
   Get Summary:
     count: 1536
     duration min: 0.042258s
     duration max: 0.314563s
     duration avg: 0.100344s
     size min: 8 B
     size max: 1328116 B
     size avg: 329311 B
     size sum: 505822728 B
   ```
   
   This is the test case setup. In order to keep things as fair as possible, we 
can set up two tables using the same underlying data. One of the tables is a 
flat table with no partitions, the other lets the ListingTable operate on the 
partitions defined by the paths. Initially I'm using the summary mode for 
profiling due to the size of the underlying data, but even here we can already 
see some discrepancies between the flat and partitioned table emerge. In spite 
of the same number of `GET` requests, the partitioned table executes 2 
additional `LIST` operations.
   
   ### First queries
   ```sql
   > select count(*) from overture_flat;
   +------------+
   | count(*)   |
   +------------+
   | 4198576839 |
   +------------+
   1 row(s) fetched.
   Elapsed 36.787 seconds.
   
   Object Store Profiling
   List Summary:
     count: 1
   
   Get Summary:
     count: 1536
     duration min: 0.044221s
     duration max: 0.390369s
     duration avg: 0.102023s
     size min: 8 B
     size max: 1328116 B
     size avg: 329311 B
     size sum: 505822728 B
   
   > select count(*) from overture_partitioned;
   +------------+
   | count(*)   |
   +------------+
   | 4198576839 |
   +------------+
   1 row(s) fetched.
   Elapsed 37.422 seconds.
   
   Object Store Profiling
   List Summary:
     count: 22
     duration min: 0.042043s
     duration max: 0.148611s
     duration avg: 0.096775s
   
   Get Summary:
     count: 1536
     duration min: 0.045364s
     duration max: 0.375742s
     duration avg: 0.099633s
     size min: 8 B
     size max: 1328116 B
     size avg: 329311 B
     size sum: 505822728 B
   ```
   
   These first queries really begin to show discrepancies. In spite of the 
`GET` requests being identical between the two queries, the partitioned table 
executes 22 `LIST` operations compared to the flat table's 1 `LIST` operation. 
Curiously, the `GET` requests appear to be identical between the table creation 
above and the first queries. It merits additional investigation, but the fact 
the minimum size for the `GET` is `8`, I suspect this is fetching parquet 
footers and subsequently the metadata. This may point to an opportunity to seed 
the metadata cache earlier, which could improve performance for the first query 
to the `ListingTable` in general.
   
   ### Subsequent Queries (cached metadata)
   ```sql
   > \object_store_profiling trace
   ObjectStore Profile mode set to Trace
   > select count(*) from overture_flat;
   +------------+
   | count(*)   |
   +------------+
   | 4198576839 |
   +------------+
   1 row(s) fetched.
   Elapsed 0.791 seconds.
   
   Object Store Profiling
   2025-09-09T17:53:07.108517313+00:00 operation=List path=release/2025-07-23.0
   List Summary:
     count: 1
   
   > select count(*) from overture_partitioned;
   +------------+
   | count(*)   |
   +------------+
   | 4198576839 |
   +------------+
   1 row(s) fetched.
   Elapsed 0.943 seconds.
   
   Object Store Profiling
   2025-09-09T17:53:17.991269778+00:00 operation=List duration=0.142341s 
path=release/2025-07-23.0
   2025-09-09T17:53:18.133630478+00:00 operation=List duration=0.042730s 
path=release/2025-07-23.0/theme=addresses
   2025-09-09T17:53:18.133986930+00:00 operation=List duration=0.084354s 
path=release/2025-07-23.0/theme=base
   2025-09-09T17:53:18.134506596+00:00 operation=List duration=0.109531s 
path=release/2025-07-23.0/theme=divisions
   2025-09-09T17:53:18.218349909+00:00 operation=List duration=0.042024s 
path=release/2025-07-23.0/theme=base/type=bathymetry
   2025-09-09T17:53:18.134958553+00:00 operation=List duration=0.128599s 
path=release/2025-07-23.0/theme=transportation
   2025-09-09T17:53:18.134252722+00:00 operation=List duration=0.131287s 
path=release/2025-07-23.0/theme=buildings
   2025-09-09T17:53:18.134744436+00:00 operation=List duration=0.135978s 
path=release/2025-07-23.0/theme=places
   2025-09-09T17:53:18.176435159+00:00 operation=List duration=0.097157s 
path=release/2025-07-23.0/theme=addresses/type=address
   2025-09-09T17:53:18.218771771+00:00 operation=List duration=0.075236s 
path=release/2025-07-23.0/theme=base/type=land
   2025-09-09T17:53:18.218580227+00:00 operation=List duration=0.086389s 
path=release/2025-07-23.0/theme=base/type=infrastructure
   2025-09-09T17:53:18.219139311+00:00 operation=List duration=0.087766s 
path=release/2025-07-23.0/theme=base/type=land_use
   2025-09-09T17:53:18.219346240+00:00 operation=List duration=0.089985s 
path=release/2025-07-23.0/theme=base/type=water
   2025-09-09T17:53:18.244078850+00:00 operation=List duration=0.070198s 
path=release/2025-07-23.0/theme=divisions/type=division
   2025-09-09T17:53:18.244290215+00:00 operation=List duration=0.072455s 
path=release/2025-07-23.0/theme=divisions/type=division_area
   2025-09-09T17:53:18.265775691+00:00 operation=List duration=0.071635s 
path=release/2025-07-23.0/theme=buildings/type=building_part
   2025-09-09T17:53:18.218957344+00:00 operation=List duration=0.124570s 
path=release/2025-07-23.0/theme=base/type=land_cover
   2025-09-09T17:53:18.263595467+00:00 operation=List duration=0.091649s 
path=release/2025-07-23.0/theme=transportation/type=connector
   2025-09-09T17:53:18.244510237+00:00 operation=List duration=0.115401s 
path=release/2025-07-23.0/theme=divisions/type=division_boundary
   2025-09-09T17:53:18.270813563+00:00 operation=List duration=0.093048s 
path=release/2025-07-23.0/theme=places/type=place
   2025-09-09T17:53:18.263816680+00:00 operation=List duration=0.120937s 
path=release/2025-07-23.0/theme=transportation/type=segment
   2025-09-09T17:53:18.265574453+00:00 operation=List duration=0.179032s 
path=release/2025-07-23.0/theme=buildings/type=building
   List Summary:
     count: 22
     duration min: 0.042024s
     duration max: 0.179032s
     duration avg: 0.099650s
   ```
   
   These final queries have been switched over to `trace` for the profiling now 
that metadata has been cached and the output won't be dominated by over 1500 
`GET` requests. The difference in operation between the two tables is pretty 
clear here. Curiously, the `LIST` operations for the partitioned table are 
being emitted with a `duration`, which implies it's actually running 
`list_with_delimiter()` which has allowed the instrumented object store to 
collect accurate duration data!
   
   Paying close attention to the timestamps here we can see the breadth first 
search listing strategy used on partitioned tables in action. First the root 
path is listed, blocking until the result is returned:
   ```
   2025-09-09T17:53:17.991269778+00:00 operation=List duration=0.142341s 
path=release/2025-07-23.0
   ```
   $`17:53:17.991269778 + 0.142341s = 17:53:18.133610778`$
   
   
   Then, the first level of the table is listed (this is the partition of 
`theme`). This set of requests are executed in parallel, but block their 
respective next level in the search. To keep things brief I'll just show the 
math for `theme=addresses`.
   ```
   2025-09-09T17:53:18.133630478+00:00 operation=List duration=0.042730s 
path=release/2025-07-23.0/theme=addresses
   2025-09-09T17:53:18.133986930+00:00 operation=List duration=0.084354s 
path=release/2025-07-23.0/theme=base
   2025-09-09T17:53:18.134506596+00:00 operation=List duration=0.109531s 
path=release/2025-07-23.0/theme=divisions
   2025-09-09T17:53:18.134958553+00:00 operation=List duration=0.128599s 
path=release/2025-07-23.0/theme=transportation
   2025-09-09T17:53:18.134252722+00:00 operation=List duration=0.131287s 
path=release/2025-07-23.0/theme=buildings
   2025-09-09T17:53:18.134744436+00:00 operation=List duration=0.135978s 
path=release/2025-07-23.0/theme=places
   ```
   $`17:53:18.133630478 + 0.042730s = 17:53:18.176360478`$
   
   Finally the last set of directories get listed as their respective upstream 
list requests complete
   ```
   2025-09-09T17:53:18.176435159+00:00 operation=List duration=0.097157s 
path=release/2025-07-23.0/theme=addresses/type=address
   2025-09-09T17:53:18.218349909+00:00 operation=List duration=0.042024s 
path=release/2025-07-23.0/theme=base/type=bathymetry
   2025-09-09T17:53:18.218771771+00:00 operation=List duration=0.075236s 
path=release/2025-07-23.0/theme=base/type=land
   2025-09-09T17:53:18.218580227+00:00 operation=List duration=0.086389s 
path=release/2025-07-23.0/theme=base/type=infrastructure
   2025-09-09T17:53:18.219139311+00:00 operation=List duration=0.087766s 
path=release/2025-07-23.0/theme=base/type=land_use
   2025-09-09T17:53:18.219346240+00:00 operation=List duration=0.089985s 
path=release/2025-07-23.0/theme=base/type=water
   2025-09-09T17:53:18.218957344+00:00 operation=List duration=0.124570s 
path=release/2025-07-23.0/theme=base/type=land_cover
   ... etc
   ```
   
   Additionally, comparing the first `LIST` timestamp to the tail `LIST` 
timestamp and how long it took to complete, we can see that we spent 0.453s 
executing just listing operations of our total 0.943s query execution time.
   $`(17:53:18.265574453 + 0.179032s) - 17:53:17.991269778 = 0.453336675`$
   
   ##
   @alamb Given the above analysis, in the spirit of talking about this in 
terms of IO reduction, normalizing the IO pattern between partitioned and flat 
tables in this specific case would eliminate 21 `LIST` calls. Unfortunately, 
it's difficult to talk about this in absolutes because the number of `LIST` 
calls for partitioned tables depends on the table structure (and actually on 
the query run as well, but I think this initial analysis is already long 
enough). This data also suggests that if a `ListFilesCache` were to be 
implemented and enabled it could eliminate  all the IO for simple queries that 
can be computed entirely from metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to