sgrebnov commented on issue #12779:
URL: https://github.com/apache/datafusion/issues/12779#issuecomment-2401494063

   This is very cool! I wanted to share a few ideas/concepts that worked well for 
[https://github.com/spiceai/spiceai](https://github.com/spiceai/spiceai) – we 
recently added simple caching using DataFusion with a slightly different 
approach, one that does not solve the delta problem between prefetched/cached 
and actual data. In most cases we control dataset updates and simply perform 
cache invalidation; for the remaining cases, responding with outdated 
information within a configurable TTL is expected behavior.
   1. The cache is keyed on the root `LogicalPlan`: the query is first 
transformed into a logical plan, which is then used as the cache key.
   We used the root logical plan, but I imagine this can be generalized to the 
execution plan level, returning cached items instead of re-executing. That 
would work well for scenarios where predicate pushdown is not fully supported 
or Parquet-encoded statistics are unavailable, so the same inputs are executed 
repeatedly even across different queries.
   1. I like the idea of not having opinions about where the cached data is 
stored; for us, [moka](https://docs.rs/moka/latest/moka/) worked best. We 
compared it with a few other libraries on performance and other criteria.
   1. What worked well for us: specifying a maximum (configurable) cache size. 
We operate on streams, so we [wrap the response record batch 
stream](https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/utils.rs#L33)
 and try caching records until the response grows too large to be worth 
caching. The total size limit is enforced by assigning weights (actual size) 
to cached items, which is part of moka's functionality.
   1. What also worked well for us: the cache invalidation approach – tracking 
input datasets as part of each cache entry, based on the [logical plan 
information](https://github.com/spiceai/spiceai/blob/45532b1fd73936586aed1085a07f81061f767947/crates/cache/src/utils.rs#L78),
 allows simple cache invalidation when a dataset is updated. We do this when 
we update the local (materialized) dataset copy.
   1. The eviction algorithm for cached items should be independent, IMO. We 
use LRU plus a configurable TTL.
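   The plan-as-cache-key idea from item 1 can be sketched in a few lines. This is a hedged, std-only illustration: in DataFusion one would hash something like the canonical display string of the `LogicalPlan` (after normalization), but here `plan_cache_key` just hashes an arbitrary plan string, and the plan texts are made up.

   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   /// Derive a cache key from a query's logical-plan representation.
   /// Keying on the plan (not the raw SQL text) means textually different
   /// queries that normalize to the same plan share a cache entry.
   fn plan_cache_key(plan_repr: &str) -> u64 {
       let mut hasher = DefaultHasher::new();
       plan_repr.hash(&mut hasher);
       hasher.finish()
   }

   fn main() {
       // Hypothetical plan display strings for two equivalent queries.
       let plan_a = "Projection: a\n  TableScan: t";
       let plan_b = "Projection: a\n  TableScan: t";
       assert_eq!(plan_cache_key(plan_a), plan_cache_key(plan_b));
       println!("key = {}", plan_cache_key(plan_a));
   }
   ```

   The same keying scheme would apply unchanged at the execution-plan level; only the string being hashed differs.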
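   Items 3–5 (size budget via weights, dataset-based invalidation, TTL) can be combined into one small sketch. This is a std-only toy, not the spiceai implementation (which uses moka); all type and field names are hypothetical, and eviction here is naive rather than true LRU.

   ```rust
   use std::collections::{HashMap, HashSet};
   use std::time::{Duration, Instant};

   /// Stand-in for a cached, serialized query result.
   struct Entry {
       value: Vec<u8>,            // e.g. encoded record batches
       weight: usize,             // actual size, counted against the budget
       inserted: Instant,         // for TTL expiry
       datasets: HashSet<String>, // input tables taken from the logical plan
   }

   struct ResultCache {
       entries: HashMap<u64, Entry>,
       ttl: Duration,
       max_weight: usize,
       total_weight: usize,
   }

   impl ResultCache {
       fn new(ttl: Duration, max_weight: usize) -> Self {
           Self { entries: HashMap::new(), ttl, max_weight, total_weight: 0 }
       }

       /// Skip caching when a single result exceeds the budget (analogous to
       /// abandoning a wrapped stream once the response grows too large).
       fn put(&mut self, key: u64, value: Vec<u8>, datasets: HashSet<String>) {
           let weight = value.len();
           if weight > self.max_weight {
               return;
           }
           // Naive eviction: drop arbitrary entries until the new one fits.
           while self.total_weight + weight > self.max_weight {
               let victim = match self.entries.keys().next().copied() {
                   Some(k) => k,
                   None => break,
               };
               self.remove(victim);
           }
           self.total_weight += weight;
           self.entries.insert(key, Entry { value, weight, inserted: Instant::now(), datasets });
       }

       fn get(&mut self, key: u64) -> Option<&Vec<u8>> {
           let expired = match self.entries.get(&key) {
               Some(e) => e.inserted.elapsed() > self.ttl,
               None => return None,
           };
           if expired {
               self.remove(key);
               return None;
           }
           self.entries.get(&key).map(|e| &e.value)
       }

       /// Invalidate every cached result whose plan scanned `dataset`.
       fn invalidate_dataset(&mut self, dataset: &str) {
           let stale: Vec<u64> = self.entries.iter()
               .filter(|(_, e)| e.datasets.contains(dataset))
               .map(|(k, _)| *k)
               .collect();
           for k in stale {
               self.remove(k);
           }
       }

       fn remove(&mut self, key: u64) {
           if let Some(e) = self.entries.remove(&key) {
               self.total_weight -= e.weight;
           }
       }
   }

   fn main() {
       let mut cache = ResultCache::new(Duration::from_secs(60), 1024);
       cache.put(42, vec![0u8; 100], HashSet::from(["orders".to_string()]));
       assert!(cache.get(42).is_some());
       // Updating the materialized "orders" copy invalidates dependent entries.
       cache.invalidate_dataset("orders");
       assert!(cache.get(42).is_none());
   }
   ```

   moka provides the weigher, TTL, and eviction pieces out of the box; the main extra state one has to carry is the dataset set per entry for invalidation.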
   
   Cache implementation: 
https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/lru_cache.rs
   Usage example: 
https://github.com/spiceai/spiceai/blob/trunk/crates/runtime/src/datafusion/query.rs#L167


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

