We are considering using an on-disk - this is planned for later. Even with an on-disk cache, we still need an eviction policy to ensure that Gandiva doesn't use up the entire disk.
For now, we are assuming that we can measure the cost accurately - the assumption is that the query engine would use Gandiva on a thread that is pinned to a core. For other engines, an alternate estimate of cost can be the complexity of the expression. On Tue, Apr 20, 2021 at 2:46 PM Antoine Pitrou <anto...@python.org> wrote: > > Hi Projjal, > > The main issue here is to compute the cost accurately (is it computation > runtime? memory footprint? can you measure the computation time > accurately, regardless of system noise - e.g. other threads and > processes?). > > Intuitively, if the LRU cache shows too many misses, a simple measure is > to increase its size ;-) > > Last question: have you considered a second level on-disk cache? Numba > uses such a cache with good results: > https://numba.readthedocs.io/en/stable/developer/caching.html > > Regards > > Antoine. > > > Le 20/04/2021 à 06:28, Projjal Chanda a écrit : > > Hi, > > We currently have a cache[1] in gandiva that caches the built projector > or filter module with LRU based eviction policy. However since the cost of > building different expressions is not uniform it makes sense to have a > different eviction policy that takes into account an associated cost of a > cache miss (while also discounting the items which have not been recently > used). We are planning to use an algorithm called GreedyDual-Size Algorithm > [2] which seems fit for the purpose. The algorithm is quite simple - > > Each item has a cost (build time in our case) and item with lowest cost > (c_min) is evicted. All other items cost are deducted by (c_min) > > On cache hit, the item cost is restored to the original value > > > > This can be implemented using a priority queue and an efficient > implementation of this can handle both cache hit and eviction in O(logk) > time. > > > > Does anybody have any other suggestions or ideas on this? > > > > [1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/cache.h > <https://github.com/apache/arrow/blob/master/cpp/src/gandiva/cache.h> > > [2] > https://www.usenix.org/legacy/publications/library/proceedings/usits97/full_papers/cao/cao_html/node8.html > < > https://www.usenix.org/legacy/publications/library/proceedings/usits97/full_papers/cao/cao_html/node8.html > > > > > > Regards, > > Projjal > > >