We would love to use Gandiva in Apache Calcite [1], but we are blocked: the JAR on Maven Central doesn't work on macOS, Linux, or Windows [2], and there seems to be no interest in fixing the problem. So I doubt that anyone is using Gandiva in production (unless they have built the artifacts themselves).
Once Gandiva is working for us we will have an opinion about caching.

Julian

[1] https://issues.apache.org/jira/browse/CALCITE-2040
[2] https://issues.apache.org/jira/browse/ARROW-11135

On Tue, Apr 20, 2021 at 2:58 AM Vivekanand Vellanki <vi...@dremio.com> wrote:
>
> We are considering using an on-disk cache - this is planned for later. Even
> with an on-disk cache, we still need an eviction policy to ensure that
> Gandiva doesn't use up the entire disk.
>
> For now, we are assuming that we can measure the cost accurately - the
> assumption is that the query engine would use Gandiva on a thread that is
> pinned to a core. For other engines, an alternative estimate of cost could
> be the complexity of the expression.
>
> On Tue, Apr 20, 2021 at 2:46 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi Projjal,
> >
> > The main issue here is computing the cost accurately (is it computation
> > runtime? memory footprint? can you measure the computation time
> > accurately, regardless of system noise - e.g. other threads and
> > processes?).
> >
> > Intuitively, if the LRU cache shows too many misses, a simple measure is
> > to increase its size ;-)
> >
> > Last question: have you considered a second-level on-disk cache? Numba
> > uses such a cache with good results:
> > https://numba.readthedocs.io/en/stable/developer/caching.html
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 20/04/2021 at 06:28, Projjal Chanda wrote:
> > > Hi,
> > > We currently have a cache [1] in Gandiva that caches the built
> > > projector or filter module with an LRU-based eviction policy. However,
> > > since the cost of building different expressions is not uniform, it
> > > makes sense to have an eviction policy that takes into account the
> > > cost associated with a cache miss (while also discounting items that
> > > have not been used recently). We are planning to use the
> > > GreedyDual-Size algorithm [2], which seems fit for the purpose. The
> > > algorithm is quite simple:
> > > Each item has a cost (the build time, in our case), and the item with
> > > the lowest cost (c_min) is evicted; the costs of all other items are
> > > reduced by c_min.
> > > On a cache hit, the item's cost is restored to its original value.
> > >
> > > This can be implemented with a priority queue, and an efficient
> > > implementation can handle both a cache hit and an eviction in
> > > O(log k) time.
> > >
> > > Does anybody have any other suggestions or ideas on this?
> > >
> > > [1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/cache.h
> > > [2] https://www.usenix.org/legacy/publications/library/proceedings/usits97/full_papers/cao/cao_html/node8.html
> > >
> > > Regards,
> > > Projjal
> >
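
For reference, here is a minimal C++ sketch of the GreedyDual-Size eviction
Projjal describes. Rather than deducting c_min from every remaining item on
eviction, it keeps a shared offset L (the usual "inflation value" trick) and
stores each item's priority as L + cost, which is what gives the O(log k) bound
mentioned in the thread. The class name, Put/Get methods, and cost parameter
are illustrative only; this is not the actual Gandiva cache API in cache.h, and
the cost could be a measured build time or an expression-complexity estimate as
discussed above.

#include <cstddef>
#include <map>
#include <optional>
#include <set>
#include <utility>

template <typename Key, typename Value>
class GreedyDualSizeCache {
 public:
  explicit GreedyDualSizeCache(std::size_t capacity) : capacity_(capacity) {}

  // Insert a built item with its cost (e.g. measured build time).
  void Put(const Key& key, Value value, double cost) {
    if (entries_.count(key) != 0) return;     // keep the sketch simple: no update
    if (entries_.size() >= capacity_) Evict();
    double priority = inflation_ + cost;      // H(item) = L + cost(item)
    queue_.insert({priority, key});
    entries_.emplace(key, Entry{std::move(value), cost, priority});
  }

  // Look up an item; on a hit, restore its priority to L + original cost.
  std::optional<Value> Get(const Key& key) {
    auto it = entries_.find(key);
    if (it == entries_.end()) return std::nullopt;   // cache miss
    Entry& e = it->second;
    queue_.erase({e.priority, key});                 // O(log k)
    e.priority = inflation_ + e.cost;                // restore original cost
    queue_.insert({e.priority, key});                // O(log k)
    return e.value;
  }

 private:
  struct Entry {
    Value value;
    double cost;      // original cost (e.g. build time)
    double priority;  // current H value held in the queue
  };

  // Evict the item with the lowest priority. Raising the shared inflation
  // value L is equivalent to deducting c_min from every remaining item,
  // without touching them individually.
  void Evict() {
    if (queue_.empty()) return;
    auto lowest = queue_.begin();
    inflation_ = lowest->first;                // L advances to the evicted H
    entries_.erase(lowest->second);
    queue_.erase(lowest);
  }

  std::size_t capacity_;
  double inflation_ = 0.0;                     // L
  std::set<std::pair<double, Key>> queue_;     // ordered by (priority, key)
  std::map<Key, Entry> entries_;
};

A Gandiva-style use would be roughly cache.Put(expr_signature, module,
build_seconds) after building a projector or filter, and
cache.Get(expr_signature) before building one; both paths cost O(log k) via the
std::set, matching the bound quoted above.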