This sounds interesting. We have an increasing ability to execute
interpreted expressions in C++, too, though if all of the functions
are supported in Gandiva, then the Gandiva version may be faster.
We're working on relational operators (e.g. Aggregate has been under
heavy development the last few months), so it would be interesting to
see what we can do to enable you to make use of them.

On Wed, Apr 21, 2021 at 1:33 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
>
> We are building an “Arrow Adapter”. The initial goal is to read Arrow data 
> and apply simple projects (field references) and filters (expressions such as 
> comparisons, ANDs, ORs).
>
> But I would like to go further, build what we call an “engine” rather than an 
> “adapter”, so that we can implement all core relational operators (Project, 
> Filter, Aggregate, Sort, Join, Union, Values) on Arrow data, single- or 
> multi-threaded, on a single node. Then this could replace Enumerable 
> (generated Java Iterator classes that implement the Volcano execution model) 
> as Calcite’s default implementation. If we were, say, joining CSV data to 
> data from MongoDB, we would first convert both data streams to Arrow.
>
> I gather Gandiva doesn’t do much more than Project and Filter at this point. 
> But I think it ought to do the other operations, or at least implement the 
> low-level support to build them (pushing a batch of records into a hash 
> table, applying Bloom filters, sorting a batch, etc.).
>
> For more details, see https://issues.apache.org/jira/browse/CALCITE-2040.
>
> Julian
>
>
> > On Apr 21, 2021, at 12:46 AM, Vivekanand Vellanki <vi...@dremio.com> wrote:
> >
> > Julian, How do you plan to use Gandiva in Apache Calcite?
> >
> > On Tue, Apr 20, 2021 at 9:57 PM Julian Hyde <jh...@apache.org> wrote:
> >
> >> We would love to use Gandiva in Apache Calcite [1] but we are blocked
> >> because the JAR on Maven Central doesn't work on macOS, Linux or
> >> Windows [2], and there seems to be no interest in fixing the problem.
> >> So I doubt whether anyone is using Gandiva in production (unless they
> >> have built the artifacts for themselves).
> >>
> >> Once Gandiva is working for us we will have an opinion about caching.
> >>
> >> Julian
> >>
> >> [1] https://issues.apache.org/jira/browse/CALCITE-2040
> >>
> >> [2] https://issues.apache.org/jira/browse/ARROW-11135
> >>
> >> On Tue, Apr 20, 2021 at 2:58 AM Vivekanand Vellanki <vi...@dremio.com> wrote:
> >>>
> >>> We are considering using an on-disk cache - this is planned for later.
> >>> Even with an on-disk cache, we still need an eviction policy to ensure
> >>> that Gandiva doesn't use up the entire disk.
> >>>
> >>> For now, we are assuming that we can measure the cost accurately - the
> >>> assumption is that the query engine would use Gandiva on a thread that is
> >>> pinned to a core. For other engines, an alternate estimate of cost can be
> >>> the complexity of the expression.
> >>>
> >>> On Tue, Apr 20, 2021 at 2:46 PM Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>>>
> >>>> Hi Projjal,
> >>>>
> >>>> The main issue here is to compute the cost accurately (is it
> >>>> computation runtime? memory footprint? can you measure the computation
> >>>> time accurately, regardless of system noise - e.g. other threads and
> >>>> processes?).
> >>>>
> >>>> Intuitively, if the LRU cache shows too many misses, a simple measure
> >>>> is to increase its size ;-)
> >>>>
> >>>> Last question: have you considered a second level on-disk cache?  Numba
> >>>> uses such a cache with good results:
> >>>> https://numba.readthedocs.io/en/stable/developer/caching.html
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>> On 20/04/2021 at 06:28, Projjal Chanda wrote:
> >>>>> Hi,
> >>>>> We currently have a cache [1] in Gandiva that caches the built
> >>>>> projector or filter module, with an LRU-based eviction policy.
> >>>>> However, since the cost of building different expressions is not
> >>>>> uniform, it makes sense to have an eviction policy that takes into
> >>>>> account the cost associated with a cache miss (while also discounting
> >>>>> items that have not been used recently). We are planning to use the
> >>>>> GreedyDual-Size algorithm [2], which seems fit for the purpose. The
> >>>>> algorithm is quite simple:
> >>>>> - Each item has a cost (the build time, in our case), and the item
> >>>>>   with the lowest cost (c_min) is evicted. All other items' costs are
> >>>>>   reduced by c_min.
> >>>>> - On a cache hit, the item's cost is restored to its original value.
> >>>>>
> >>>>> This can be implemented with a priority queue; an efficient
> >>>>> implementation can handle both a cache hit and an eviction in
> >>>>> O(log k) time.
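[Editor's note: the eviction scheme described above can be sketched in C++. This is a minimal illustration, not Gandiva's actual cache code - the class and method names are hypothetical - and it uses the standard "inflation clock" formulation of GreedyDual, which is equivalent to deducting c_min from every entry on eviction but avoids that O(k) pass, keeping each hit and eviction at the O(log k) cost noted above.]

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

// Sketch of the GreedyDual eviction policy with a global "clock" L:
// each entry's priority is L + cost at the time it was last touched;
// eviction removes the minimum-priority entry and advances L to that
// priority, which has the same effect as reducing every cost by c_min.
class GreedyDualCache {
 public:
  explicit GreedyDualCache(std::size_t capacity) : capacity_(capacity) {}

  // Insert a key with its build cost, evicting the lowest-priority
  // entry if the cache is full. Re-inserting an existing key is a no-op
  // here; hits are handled by Get().
  void Put(const std::string& key, double cost) {
    if (entries_.count(key) > 0) return;
    if (entries_.size() >= capacity_) Evict();
    double priority = clock_ + cost;
    entries_[key] = {cost, priority};
    queue_.insert({priority, key});
  }

  // On a hit, restore the entry's priority to clock_ + original cost.
  bool Get(const std::string& key) {
    auto it = entries_.find(key);
    if (it == entries_.end()) return false;
    queue_.erase({it->second.priority, key});
    it->second.priority = clock_ + it->second.cost;
    queue_.insert({it->second.priority, key});
    return true;
  }

  bool Contains(const std::string& key) const {
    return entries_.count(key) > 0;
  }
  std::size_t size() const { return entries_.size(); }

 private:
  struct Entry {
    double cost;      // original build cost
    double priority;  // clock_ + cost at last touch
  };

  void Evict() {
    auto lowest = *queue_.begin();  // entry with minimum priority
    clock_ = lowest.first;          // advance clock instead of deducting
    entries_.erase(lowest.second);
    queue_.erase(queue_.begin());
  }

  std::size_t capacity_;
  double clock_ = 0.0;  // running "inflation" value L
  std::map<std::string, Entry> entries_;
  std::set<std::pair<double, std::string>> queue_;  // (priority, key)
};
```

With this structure an expensive-to-build module survives eviction rounds until enough cheap entries have been evicted (raising the clock) without it being re-hit, which is exactly the "discount items not used recently" behavior described above.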
> >>>>>
> >>>>> Does anybody have any other suggestions or ideas on this?
> >>>>>
> >>>>> [1] https://github.com/apache/arrow/blob/master/cpp/src/gandiva/cache.h
> >>>>> [2] https://www.usenix.org/legacy/publications/library/proceedings/usits97/full_papers/cao/cao_html/node8.html
> >>>>>
> >>>>> Regards,
> >>>>> Projjal
> >>>>>
> >>>>
> >>
>
