BlakeOrth opened a new issue, #19273: URL: https://github.com/apache/datafusion/issues/19273
### Is your feature request related to a problem or challenge?

The current implementation of the `DefaultListFilesCache` stores and retrieves entries from the cache using a provided `Path` as the key:

https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L147

When using tables that have partitions, DataFusion will attempt to list files for a specific prefix if a user's query filters can be evaluated to exact, known partition values. E.g.

```sql
select * from my_table where a=1
```

will use `my_table/a=1/` as the `Path` if that partition exists. In these scenarios, it's possible that the key for `my_table`, with all of the files backing the table, already exists in the cache. However, the `DefaultListFilesCache` would not be able to fetch data for `my_table/a=1/` because the keys would not match. A cache miss in this scenario is undesirable for two reasons:

1. DataFusion will execute a `List` request to backing storage to fetch file listings that already exist in the cache
2. DataFusion will add `my_table/a=1/` as a key to the cache, duplicating data in the cache

### Describe the solution you'd like

I would like to enhance the `DefaultListFilesCache` to be "prefix aware" when attempting to fetch data. The cache infrastructure currently provides a `get_with_extra` method:

https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L269

I think it should be possible to define `type Extra = Path` (or perhaps `Vec<PathPart>`?) where the `extra` parameter represents the prefix and the standard `key` parameter represents the base `table_url`. This should allow the `DefaultListFilesCache` to find the entries for a table and filter them down to the requested path prefix. Care would need to be taken to ensure that adding prefixed data to the cache does not cause subsequent queries against the table to return incomplete results.
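To make the idea concrete, here is a rough, standalone sketch of what a prefix-aware lookup could look like. It deliberately does not use the real DataFusion cache traits or `object_store` types: `FileEntry`, `PrefixAwareListCache`, `get_with_prefix`, and `normalize` are hypothetical names invented for illustration, and paths are plain strings. It also assumes that only complete table listings are ever stored, which is one possible way to avoid the incomplete-results problem mentioned above.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for one cached listing entry.
#[derive(Debug, Clone, PartialEq)]
pub struct FileEntry {
    /// Full object path, e.g. "my_table/a=1/f1.parquet".
    pub location: String,
    pub size: u64,
}

/// Hypothetical cache keyed by the table's base path only.
#[derive(Default)]
pub struct PrefixAwareListCache {
    entries: HashMap<String, Vec<FileEntry>>,
}

impl PrefixAwareListCache {
    /// Store the listing for a whole table. This sketch assumes callers
    /// only ever insert complete table listings, so a partial (prefixed)
    /// listing can never be mistaken for the full table.
    pub fn put(&mut self, table: &str, files: Vec<FileEntry>) {
        self.entries.insert(normalize(table), files);
    }

    /// Look up `table` and, if `prefix` is given, keep only the files
    /// whose location falls under that prefix. Returns `None` when the
    /// table itself is not cached.
    pub fn get_with_prefix(
        &self,
        table: &str,
        prefix: Option<&str>,
    ) -> Option<Vec<FileEntry>> {
        let files = self.entries.get(&normalize(table))?;
        match prefix {
            None => Some(files.clone()),
            Some(p) => {
                let p = normalize(p);
                Some(
                    files
                        .iter()
                        .filter(|f| f.location.starts_with(&p))
                        .cloned()
                        .collect(),
                )
            }
        }
    }
}

/// Force exactly one trailing '/' so that "my_table/a=1" cannot
/// accidentally match files under "my_table/a=10/".
fn normalize(path: &str) -> String {
    format!("{}/", path.trim_end_matches('/'))
}
```

With something like this, a query for `my_table/a=1/` would be served from the existing `my_table` entry without a second `List` request or a duplicate cache key. Cloning the filtered entries is just for simplicity here; a real implementation would presumably share them behind an `Arc`.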
### Describe alternatives you've considered

It's possible that using a different keying mechanism for the cache entirely could work as well. There's likely a potential solution that uses `key: Vec<PathPart>` or something similar and has the cache itself internalize the logic for deciding when entries may or may not match. The difficulty here would likely be efficiently determining which parts of a path belong to a table vs a prefix.

### Additional context

_No response_
