BlakeOrth opened a new issue, #19273:
URL: https://github.com/apache/datafusion/issues/19273

   ### Is your feature request related to a problem or challenge?
   
   The current implementation of the `DefaultListFilesCache` stores and 
retrieves entries from the cache using a provided `Path` as the key:
   
https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L147
   
   When using tables that have partitions, DataFusion will attempt to list 
files for a specific prefix if a user's query filters can be evaluated to 
exact, known partition values. E.g.
   ```sql
   select * from my_table where a=1
   ```
   will use `my_table/a=1/` as the Path if that partition exists.
   
   In these scenarios, it's possible that the key for `my_table` with all of 
the files backing the table already exists in the cache, however the 
`DefaultListFilesCache` would not be able to fetch data for `my_table/a=1/` 
because they keys would not match. 
   
   A cache miss in this scenario is undesirable for two reasons:
   1. DataFusion will execute a `List` request to backing storage to fetch a 
key that already exists in the cache
   2. DataFusion will add `my_table/a=1/` as a key to the cache, duplicating 
data in the cache
   
   ### Describe the solution you'd like
   
   I would like to enhance the `DefaultListFilesCache` to be "prefix aware" 
when attempting to fetch data.
   
   The cache infrastructure currently allows a `get_with_extra` method:
   
https://github.com/apache/datafusion/blob/c1aa1b530ab2fa73efcdeb8896dbb50c30c492f0/datafusion/execution/src/cache/list_files_cache.rs#L269
   
   I think it should be possible to define `type Extra = Path` (or perhaps 
Vec<PathPart>?) where the `extra` parameter could represent the prefix, and the 
standard `key` parameter can represent the base `table_url`. This should allow 
the `DefaultListFilesCache` to find and filter entries for a table to the 
requested path prefix.
   
   Care would need to be taken to ensure that adding prefixed data to the cache 
does not return incomplete results for subsequent queries to the table.
   
   ### Describe alternatives you've considered
   
   It's possible using a different keying mechanism for the cache entirely 
could work as well. There's likely a potential solution that uses `key: 
Vec<PathPart>` or something similar and have the cache itself internalize the 
management of understanding when entries may or may not match. The difficulty 
here would likely be efficiently determining which parts of a path belong to a 
table vs a prefix.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to