I missed the talk but watched the video, which was fascinating. It helped me get the whole picture of what DataFusion does, which is impressive. In my previous job, I built a data analysis engine on a smaller scale in Java, so some of the problems that DataFusion tackles are familiar to me.
The initial implementation of my engine would load some data from a relational DB into a columnar memory store that I implemented (very much like Arrow). It would then perform various transformations analogous to the logical plan in DataFusion (sort, group, filter, aggregate, etc.), while also supporting OLAP-like multi-level hierarchies and cubes. This query model didn't have a language of its own; the UI manipulated an object model that contained the logical plan (although unfortunately the query model was tangled with other layers).
This was later enhanced to generate SQL queries so you wouldn't have to load everything into memory, but you could still do in-memory operations on top of the SQL result. I came up with an expression language close to SQL that could be translated into either in-memory or SQL operations. I had to do something like the merge operator in DataFusion to support multi-stage aggregation (e.g. implement count(x) -> sum(count(x)), average(x) -> sum(sum(x))/sum(count(x)), etc.).
Like I said, my framework was nowhere near as heavy-duty as DataFusion + Arrow, but my familiarity with the power of in-memory columnar stores is what drew me to Arrow in the first place.
I am curious about how the various Arrow language implementations are developing computation frameworks. For Rust there is DataFusion, and I noticed that there has been a lot of work going on in C++/Python. For Java, it seems like this would be in the realm of Gandiva or the Dremio product... and of course there's Spark! I am still surveying the terrain, but any pointers to work people are doing in Java would be welcome.
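To make the multi-stage aggregation rewrite above concrete, here is a minimal Java sketch. It is not code from the engine described in this message, nor DataFusion's implementation; the class and method names (MultiStageAggregation, Partial, of, merge, mergeAll) are hypothetical and chosen only for illustration. Each partition is reduced to a partial state of (count, sum), and the final stage merges those states, so count(x) becomes sum(count(x)) and average(x) becomes sum(sum(x))/sum(count(x)).

```java
import java.util.List;

public class MultiStageAggregation {

    /** Partial state for one partition: enough to finish count, sum and average later. */
    public static final class Partial {
        public final long count;
        public final double sum;

        public Partial(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }

        /** Stage 1: aggregate the raw values of a single partition. */
        public static Partial of(double[] values) {
            double s = 0.0;
            for (double v : values) {
                s += v;
            }
            return new Partial(values.length, s);
        }

        /** Stage 2: merge two partial states (counts add up, sums add up). */
        public Partial merge(Partial other) {
            return new Partial(this.count + other.count, this.sum + other.sum);
        }

        /** Final average, computed from the merged sum and count. */
        public double average() {
            return count == 0 ? Double.NaN : sum / count;
        }
    }

    /** Merge any number of per-partition states into one final state. */
    public static Partial mergeAll(List<Partial> partials) {
        Partial acc = new Partial(0L, 0.0);
        for (Partial p : partials) {
            acc = acc.merge(p);
        }
        return acc;
    }

    public static void main(String[] args) {
        Partial a = Partial.of(new double[] {1, 2, 3});
        Partial b = Partial.of(new double[] {4, 5});
        Partial total = mergeAll(List.of(a, b));
        System.out.println("count=" + total.count + " avg=" + total.average());
    }
}
```

The average is carried as (sum, count) rather than as a per-partition average because averaging the partition averages would weight a two-row partition the same as a three-row one and give the wrong result.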
On 2021/03/12 19:39:16, Andrew Lamb <al...@influxdata.com> wrote:
> Here are links to the content, should anyone be interested:
>
> Query Engine Design and the Rust-Based DataFusion in Apache Arrow
> recording: https://www.youtube.com/watch?v=K6eCAVEk4kU
> slides: (datafusion content starts on slide 6):
> https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
>
> On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > In case anyone is interested in the topic in general or DataFusion in
> > particular, I plan a tech talk [1] next week about "Query Engine Design and
> > the Rust based DataFusion in Apache Arrow."
> >
> > If you are curious how (SQL) query engines in general are structured, I
> > plan to describe the typical high level architecture, using DataFusion as
> > an exemplar.
> >
> > It will be held next Wednesday, March 10, 2021 at 8:00 am PST | 4:00 pm
> > GMT, and posted publicly afterwards.
> >
> > Andrew
> >
> > [1] https://www.influxdata.com/community-showcase/influxdb-tech-talks/