Thanks Jim, that's helpful. In the long run, we'd like to make sure that the Arrow libraries can serve as a robust dependency for computational systems like OAMap. It doesn't strike me that OAMap is a library that could be used within the Arrow project, though there are some things which could be implemented as part of a function kernel library or function code-generator. The streaming data / IPC machinery we have developed could be useful for working with large on-disk datasets or efficient data movement between nodes in a cluster.
Let us know as you run into issues or missing features so we can incorporate them into our development roadmap.

best
Wes

On Mon, Jun 25, 2018 at 5:25 PM, Jim Pivarski <jpivar...@gmail.com> wrote:
> What Martin said about OAMap and ROOT is true: there's no dependence, ROOT
> and Arrow are both backends.
>
> What Wes said about embeddability is also right: OAMap is pure Python+Numpy
> and would be hard to use within a C++ framework (or Java). This has already
> been an issue with a potential user in the ALICE experiment. They're
> considering Arrow in a C++ framework.
>
> I've been quiet for a few months because I've been trying to work OAMap
> into Dask and having trouble with the fact that OAMap data are "loose": the
> schema is a separate thing from the data and different columns of the data
> can be in different files, in different file formats, on different machines.
> I've found another reformulation of the problem in a bottom-up way, in
> which the data are in an extended set of array types: like Numpy arrays,
> but they can be chunked, jagged, AOS-SOA split, generated on demand, etc.
> The schema is implicit in the nesting of these objects, rather than being
> explicit and separate from the objects, and they reproduce all of the
> desirable effects of OAMap.
>
> So, my project being a research project, I'm following this promising line
> of attack. But it probably cuts this discussion short if it was about Gandiva
> vs OAMap as duplicating effort. Martin, I was planning on sending you a
> note about this when I had a working example, particularly if I could
> convince Dask that these arrays can be Dask lazy arrays. (I started working
> on this alternate approach exactly two weeks ago; it's very recent.)
>
> Meanwhile, I'm also going to look into Gandiva and possibly recommend it to
> the ALICE experiment.
>
> Cheers,
> Jim
>
>
> On Mon, Jun 25, 2018, 4:00 PM Martin Durant <martin.dur...@utoronto.ca>
> wrote:
>
>> Let me just quickly correct a couple of points, for clarity.
>> The following module
>> https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py
>> is a proof-of-concept of oamap running directly on arrow memory - this is
>> the original reason I raised the topic here.
>>
>> There are also POCs showing operation on numpy records and parquet files.
>> oamap was written with ROOT in mind, but is not necessarily tied to it;
>> it just so happens that ROOT data tends to have just the kind of deeply
>> nested structures that tend to perform terribly in a pandas apply
>> situation or other python object-based processing. oamap depends on
>> numba/llvm-lite.
>>
>> So indeed maybe this all belongs as a conversation in pyarrow rather than
>> arrow, but insomuch as it enables - or may in the future enable -
>> machine-speed computation on in-memory nested arrow data, I think oamap
>> should be on everyone’s radar as an interesting and useful project.
>>
>> > On 25 Jun 2018, at 16:45, Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> > hi Martin,
>> >
>> > These projects are very different. Many analytic databases feature
>> > code generation (recently a lot of these use LLVM -- see Hyper, Apache
>> > Impala, and others) on the hot paths for function evaluation (e.g. for
>> > evaluating the expressions in the SELECT part or the WHERE part) --
>> > the reason people are excited about Gandiva is that it makes this type
>> > of functionality available as a library running atop an open standard
>> > memory format (Arrow columnar), so can be used in any programming
>> > language assuming suitable bindings can be developed. This is very
>> > much in line with our vision for creating a "deconstructed database"
>> > (see a talk that Julien gave on this topic:
>> > https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
>> > )
>> >
>> > I have not looked a great deal at oamap, but it does not use the Arrow
>> > columnar format AFAIK. It is written in Python and presumes some other
>> > technologies in use (like the ROOT format).
>> >
>> > So to summarize:
>> >
>> > Gandiva
>> > * Compiles analytical expressions to execute against Arrow columnar
>> >   format
>> > * Is written in C++ and can be embedded in other systems (Dremio is
>> >   using it from Java)
>> >
>> > oamap
>> > * Does not use the Arrow columnar format
>> > * Presumes other technologies in use (ROOT)
>> > * Is written in Python, and would be challenging to use as an embedded
>> >   system component
>> >
>> > I'm certain these projects can learn from each other -- I have spoken
>> > with Jim (one of the developers of oamap) in the past, so welcome
>> > further discussion here on the mailing list.
>> >
>> > Thanks,
>> > Wes
>> >
>> > On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
>> > <martin.dur...@utoronto.ca> wrote:
>> >> I am a little surprised by the very positive reception to Gandiva
>> >> (which doubtless is very useful - I know very little about it) versus
>> >> when I brought up the prospect of using oamap
>> >> ( https://github.com/diana-hep/oamap ) on this mailing list.
>> >>
>> >> oamap uses numba to compile *python* functions at run-time and can walk
>> >> complex nested schema down to leaf nodes in native python syntax
>> >> (for-loops and attribute/item lookup) but at full machine speed, and
>> >> without materialising any objects along the way. It was written for the
>> >> ROOT format, but has implementations for simple types in parquet and
>> >> arrow, which each do the nested lists and dict things similarly but
>> >> differently.
>> >>
>> >> Would someone care to explain the silence over oamap?
>> >>
>> >>> On 25 Jun 2018, at 02:06, Praveen Kumar <prav...@dremio.com> wrote:
>> >>>
>> >>> Hi Everyone,
>> >>>
>> >>> I am Praveen, another engineer working on Gandiva. The interest and
>> >>> speed of engagement around this is great!! Excited to engage with you
>> >>> folks on this.
>> >>>
>> >>> Thx.
>> >>>
>> >>> On 2018/06/22 18:09:42, Julian Hyde <j...@apache.org> wrote:
>> >>>> This is exciting. We have wanted to build an Arrow adapter in Calcite
>> >>>> for some time and have a prototype (see
>> >>>> https://issues.apache.org/jira/browse/CALCITE-2173) but I hope that
>> >>>> we can use Gandiva. I know that Gandiva has Java bindings, but will
>> >>>> these allow queries to be compiled and executed from a pure Java
>> >>>> process?
>> >>>>
>> >>>> Can you describe Gandiva’s governance model? Without an open
>> >>>> governance model, companies that compete with Dremio may be wary
>> >>>> about contributing.
>> >>>>
>> >>>> Can you compare and contrast your approach to Hyper[1]? Hyper is also
>> >>>> concerned with efficient use of the bus, and also uses LLVM, but it
>> >>>> has a different memory format and places much emphasis on lock-free
>> >>>> data structures.
>> >>>>
>> >>>> I just attended SIGMOD and there were interesting industry papers
>> >>>> from MemSQL[2][3] and Oracle RAPID[4]. I was impressed with some of
>> >>>> the tricks MemSQL uses to achieve SIMD parallelism on queries such as
>> >>>> “select k4, sum(x) from t group by k4” (where k4 has 4 values).
>> >>>>
>> >>>> I missed part of the RAPID talk, but I got the impression that they
>> >>>> are using disk-based algorithms (e.g. hybrid hash join) to handle
>> >>>> data spread between fast and slow memory.
>> >>>>
>> >>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this
>> >>>> would be a good target for Gandiva also. It is a table scan with a
>> >>>> range filter (returning 98% of rows), a low-cardinality aggregate
>> >>>> (grouping by two fields with 3 values each), and several aggregate
>> >>>> functions, the arguments of which contain common sub-expressions.
>> >>>>
>> >>>> SELECT
>> >>>>   l_returnflag,
>> >>>>   l_linestatus,
>> >>>>   sum(l_quantity),
>> >>>>   sum(l_extendedprice),
>> >>>>   sum(l_extendedprice * (1 - l_discount)),
>> >>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>> >>>>   avg(l_quantity),
>> >>>>   avg(l_extendedprice),
>> >>>>   avg(l_discount),
>> >>>>   count(*)
>> >>>> FROM lineitem
>> >>>> WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>> >>>> GROUP BY
>> >>>>   l_returnflag,
>> >>>>   l_linestatus
>> >>>> ORDER BY
>> >>>>   l_returnflag,
>> >>>>   l_linestatus;
>> >>>>
>> >>>> Julian
>> >>>>
>> >>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>> >>>>
>> >>>> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>> >>>>
>> >>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>> >>>>
>> >>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>> >>>>
>> >>>>> On Jun 22, 2018, at 7:22 AM, ravind...@gmail.com wrote:
>> >>>>>
>> >>>>> Hi everyone,
>> >>>>>
>> >>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do
>> >>>>> believe that the combination of arrow and llvm for efficient
>> >>>>> expression evaluation is powerful, and has a broad range of
>> >>>>> use-cases. We've just started and hope to finesse and add a lot of
>> >>>>> functionality over the next few months.
>> >>>>>
>> >>>>> Welcome your feedback and participation in gandiva!!
>> >>>>>
>> >>>>> thanks & regards,
>> >>>>> ravindra.
>> >>>>>
>> >>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
>> >>>>>> Hey Guys,
>> >>>>>>
>> >>>>>> Dremio just open sourced a new framework for processing data in
>> >>>>>> Arrow data structures [1], built on top of the Apache Arrow C++
>> >>>>>> APIs and leveraging LLVM (Apache licensed). It also includes Java
>> >>>>>> APIs that leverage the Apache Arrow Java libraries. I expect the
>> >>>>>> developers who have been working on this will introduce themselves
>> >>>>>> soon. To read more about it, take a look at our Ravindra's blog
>> >>>>>> post (he's the lead developer driving this work): [2].
>> >>>>>> Hopefully people will find this interesting/useful.
>> >>>>>>
>> >>>>>> Let us know what you all think!
>> >>>>>>
>> >>>>>> thanks,
>> >>>>>> Jacques
>> >>>>>>
>> >>>>>> [1] https://github.com/dremio/gandiva
>> >>>>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>> >>>>
>> >>
>> >> --
>> >> Martin Durant
>> >> martin.dur...@utoronto.ca
>>
>> --
>> Martin Durant
>> martin.dur...@utoronto.ca
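
For anyone reading this thread later: Jim's description of arrays that can be "jagged", with the schema implicit in the nesting, may be easier to picture with a small sketch. This is not OAMap's code or Jim's implementation. The class name JaggedArray and its fields and methods are hypothetical, and the offsets-plus-flat-content layout is an assumption chosen because it resembles how the Arrow columnar format stores variable-length lists. Pure Python, no dependencies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class JaggedArray:
    """Hypothetical jagged array: flat content plus per-row offsets."""

    offsets: List[int]    # len(rows) + 1 entries
    content: List[float]  # every leaf value, stored contiguously

    @classmethod
    def from_lists(cls, rows: Sequence[Sequence[float]]) -> "JaggedArray":
        # Flatten nested rows once; afterwards the nesting lives only in offsets.
        offsets = [0]
        content: List[float] = []
        for row in rows:
            content.extend(row)
            offsets.append(len(content))
        return cls(offsets, content)

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> List[float]:
        # Row i is the slice content[offsets[i]:offsets[i + 1]].
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.content[self.offsets[i]:self.offsets[i + 1]]

    def row_sums(self) -> List[float]:
        # Walk the two flat buffers by index; no per-row list is built.
        sums = []
        for i in range(len(self)):
            total = 0.0
            for j in range(self.offsets[i], self.offsets[i + 1]):
                total += self.content[j]
            sums.append(total)
        return sums


events = JaggedArray.from_lists([[1.0, 2.0], [], [3.5]])
print(len(events), events[0], events.row_sums())
```

The index-based loop in row_sums is the kind of code a compiler such as numba could in principle turn into a tight native loop, which is the "without materialising any objects" point Martin makes above. Whether OAMap or Jim's newer array types are organised this way is not something this sketch claims.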