BlakeOrth commented on issue #17211: URL: https://github.com/apache/datafusion/issues/17211#issuecomment-3272025207
Ok, I've done some data collection using the in progress instrumented object store that illustrates some differences in request patterns. I'll share and go over the entire output, including the setup so it can be reproduced. Note that all of the following sections were run in a single `datafusion-cli` session, I'm just breaking it up a bit to make it easier to understand and discuss. One item to note is I'm running a `debug` build instead of `release-nonlto` because I think we're mostly interested in the IO operations, not necessarily whether or not they're running at optimized speeds. ## Analysis ### Table Setup ```sql DataFusion CLI v50.0.0 > set datafusion.execution.listing_table_factory_infer_partitions=false; 0 row(s) fetched. Elapsed 0.002 seconds. > \object_store_profiling summary ObjectStore Profile mode set to Summary > CREATE EXTERNAL TABLE overture_flat STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/'; 0 row(s) fetched. Elapsed 34.223 seconds. Object Store Profiling List Summary: count: 1 List Summary: count: 1 Get Summary: count: 1536 duration min: 0.045644s duration max: 0.593886s duration avg: 0.101914s size min: 8 B size max: 1328116 B size avg: 329311 B size sum: 505822728 B > set datafusion.execution.listing_table_factory_infer_partitions=true; 0 row(s) fetched. Elapsed 0.001 seconds. Object Store Profiling > CREATE EXTERNAL TABLE overture_partitioned STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/'; 0 row(s) fetched. Elapsed 35.500 seconds. Object Store Profiling List Summary: count: 1 List Summary: count: 3 Get Summary: count: 1536 duration min: 0.042258s duration max: 0.314563s duration avg: 0.100344s size min: 8 B size max: 1328116 B size avg: 329311 B size sum: 505822728 B ``` This is the test case setup. In order to keep things as fair as possible, we can set up two tables using the same underlying data. One of the tables is a flat table with no partitions, the other lets the ListingTable operate on the partitions defined by the paths. Initially I'm using the summary mode for profiling due to the size of the underlying data, but even here we can already see some discrepancies between the flat and partitioned table emerge. In spite of the same number of `GET` requests, the partitioned table executes 2 additional `LIST` operations. ### First queries ```sql > select count(*) from overture_flat; +------------+ | count(*) | +------------+ | 4198576839 | +------------+ 1 row(s) fetched. Elapsed 36.787 seconds. Object Store Profiling List Summary: count: 1 Get Summary: count: 1536 duration min: 0.044221s duration max: 0.390369s duration avg: 0.102023s size min: 8 B size max: 1328116 B size avg: 329311 B size sum: 505822728 B > select count(*) from overture_partitioned; +------------+ | count(*) | +------------+ | 4198576839 | +------------+ 1 row(s) fetched. Elapsed 37.422 seconds. Object Store Profiling List Summary: count: 22 duration min: 0.042043s duration max: 0.148611s duration avg: 0.096775s Get Summary: count: 1536 duration min: 0.045364s duration max: 0.375742s duration avg: 0.099633s size min: 8 B size max: 1328116 B size avg: 329311 B size sum: 505822728 B ``` These first queries really begin to show discrepancies. In spite of the `GET` requests being identical between the two queries, the partitioned table executes 22 `LIST` operations compared to the flat table's 1 `LIST` operation. Curiously, the `GET` requests appear to be identical between the table creation above and the first queries. It merits additional investigation, but the fact the minimum size for the `GET` is `8`, I suspect this is fetching parquet footers and subsequently the metadata. This may point to an opportunity to seed the metadata cache earlier, which could improve performance for the first query to the `ListingTable` in general. ### Subsequent Queries (cached metadata) ```sql > \object_store_profiling trace ObjectStore Profile mode set to Trace > select count(*) from overture_flat; +------------+ | count(*) | +------------+ | 4198576839 | +------------+ 1 row(s) fetched. Elapsed 0.791 seconds. Object Store Profiling 2025-09-09T17:53:07.108517313+00:00 operation=List path=release/2025-07-23.0 List Summary: count: 1 > select count(*) from overture_partitioned; +------------+ | count(*) | +------------+ | 4198576839 | +------------+ 1 row(s) fetched. Elapsed 0.943 seconds. Object Store Profiling 2025-09-09T17:53:17.991269778+00:00 operation=List duration=0.142341s path=release/2025-07-23.0 2025-09-09T17:53:18.133630478+00:00 operation=List duration=0.042730s path=release/2025-07-23.0/theme=addresses 2025-09-09T17:53:18.133986930+00:00 operation=List duration=0.084354s path=release/2025-07-23.0/theme=base 2025-09-09T17:53:18.134506596+00:00 operation=List duration=0.109531s path=release/2025-07-23.0/theme=divisions 2025-09-09T17:53:18.218349909+00:00 operation=List duration=0.042024s path=release/2025-07-23.0/theme=base/type=bathymetry 2025-09-09T17:53:18.134958553+00:00 operation=List duration=0.128599s path=release/2025-07-23.0/theme=transportation 2025-09-09T17:53:18.134252722+00:00 operation=List duration=0.131287s path=release/2025-07-23.0/theme=buildings 2025-09-09T17:53:18.134744436+00:00 operation=List duration=0.135978s path=release/2025-07-23.0/theme=places 2025-09-09T17:53:18.176435159+00:00 operation=List duration=0.097157s path=release/2025-07-23.0/theme=addresses/type=address 2025-09-09T17:53:18.218771771+00:00 operation=List duration=0.075236s path=release/2025-07-23.0/theme=base/type=land 2025-09-09T17:53:18.218580227+00:00 operation=List duration=0.086389s path=release/2025-07-23.0/theme=base/type=infrastructure 2025-09-09T17:53:18.219139311+00:00 operation=List duration=0.087766s path=release/2025-07-23.0/theme=base/type=land_use 2025-09-09T17:53:18.219346240+00:00 operation=List duration=0.089985s path=release/2025-07-23.0/theme=base/type=water 2025-09-09T17:53:18.244078850+00:00 operation=List duration=0.070198s path=release/2025-07-23.0/theme=divisions/type=division 2025-09-09T17:53:18.244290215+00:00 operation=List duration=0.072455s path=release/2025-07-23.0/theme=divisions/type=division_area 2025-09-09T17:53:18.265775691+00:00 operation=List duration=0.071635s path=release/2025-07-23.0/theme=buildings/type=building_part 2025-09-09T17:53:18.218957344+00:00 operation=List duration=0.124570s path=release/2025-07-23.0/theme=base/type=land_cover 2025-09-09T17:53:18.263595467+00:00 operation=List duration=0.091649s path=release/2025-07-23.0/theme=transportation/type=connector 2025-09-09T17:53:18.244510237+00:00 operation=List duration=0.115401s path=release/2025-07-23.0/theme=divisions/type=division_boundary 2025-09-09T17:53:18.270813563+00:00 operation=List duration=0.093048s path=release/2025-07-23.0/theme=places/type=place 2025-09-09T17:53:18.263816680+00:00 operation=List duration=0.120937s path=release/2025-07-23.0/theme=transportation/type=segment 2025-09-09T17:53:18.265574453+00:00 operation=List duration=0.179032s path=release/2025-07-23.0/theme=buildings/type=building List Summary: count: 22 duration min: 0.042024s duration max: 0.179032s duration avg: 0.099650s ``` These final queries have been switched over to `trace` for the profiling now that metadata has been cached and the output won't be dominated by over 1500 `GET` requests. The difference in operation between the two tables is pretty clear here. Curiously, the `LIST` operations for the partitioned table are being emitted with a `duration`, which implies it's actually running `list_with_delimiter()` which has allowed the instrumented object store to collect accurate duration data! Paying close attention to the timestamps here we can see the breadth first search listing strategy used on partitioned tables in action. First the root path is listed, blocking until the result is returned: ``` 2025-09-09T17:53:17.991269778+00:00 operation=List duration=0.142341s path=release/2025-07-23.0 ``` $`17:53:17.991269778 + 0.142341s = 17:53:18.133610778`$ Then, the first level of the table is listed (this is the partition of `theme`). This set of requests are executed in parallel, but block their respective next level in the search. To keep things brief I'll just show the math for `theme=addresses`. ``` 2025-09-09T17:53:18.133630478+00:00 operation=List duration=0.042730s path=release/2025-07-23.0/theme=addresses 2025-09-09T17:53:18.133986930+00:00 operation=List duration=0.084354s path=release/2025-07-23.0/theme=base 2025-09-09T17:53:18.134506596+00:00 operation=List duration=0.109531s path=release/2025-07-23.0/theme=divisions 2025-09-09T17:53:18.134958553+00:00 operation=List duration=0.128599s path=release/2025-07-23.0/theme=transportation 2025-09-09T17:53:18.134252722+00:00 operation=List duration=0.131287s path=release/2025-07-23.0/theme=buildings 2025-09-09T17:53:18.134744436+00:00 operation=List duration=0.135978s path=release/2025-07-23.0/theme=places ``` $`17:53:18.133630478 + 0.042730s = 17:53:18.176360478`$ Finally the last set of directories get listed as their respective upstream list requests complete ``` 2025-09-09T17:53:18.176435159+00:00 operation=List duration=0.097157s path=release/2025-07-23.0/theme=addresses/type=address 2025-09-09T17:53:18.218349909+00:00 operation=List duration=0.042024s path=release/2025-07-23.0/theme=base/type=bathymetry 2025-09-09T17:53:18.218771771+00:00 operation=List duration=0.075236s path=release/2025-07-23.0/theme=base/type=land 2025-09-09T17:53:18.218580227+00:00 operation=List duration=0.086389s path=release/2025-07-23.0/theme=base/type=infrastructure 2025-09-09T17:53:18.219139311+00:00 operation=List duration=0.087766s path=release/2025-07-23.0/theme=base/type=land_use 2025-09-09T17:53:18.219346240+00:00 operation=List duration=0.089985s path=release/2025-07-23.0/theme=base/type=water 2025-09-09T17:53:18.218957344+00:00 operation=List duration=0.124570s path=release/2025-07-23.0/theme=base/type=land_cover ... etc ``` Additionally, comparing the first `LIST` timestamp to the tail `LIST` timestamp and how long it took to complete, we can see that we spent 0.453s executing just listing operations of our total 0.943s query execution time. $`(17:53:18.265574453 + 0.179032s) - 17:53:17.991269778 = 0.453336675`$ ## @alamb Given the above analysis, in the spirit of talking about this in terms of IO reduction, normalizing the IO pattern between partitioned and flat tables in this specific case would eliminate 21 `LIST` calls. Unfortunately, it's difficult to talk about this in absolutes because the number of `LIST` calls for partitioned tables depends on the table structure (and actually on the query run as well, but I think this initial analysis is already long enough). This data also suggests that if a `ListFilesCache` were to be implemented and enabled it could eliminate all the IO for simple queries that can be computed entirely from metadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org