I missed the talk but watched the video, which was fascinating. It helped me 
get the whole picture of what DataFusion does, which is impressive. In my 
previous job, I built a data analysis engine on a smaller scale in Java, so 
some of the problems that DataFusion tackles are familiar to me.

The initial implementation of my engine would load some data from a relational 
DB into a columnar memory store that I implemented (very much like Arrow); it 
would then perform various transformations analogous to the logical plan in 
DataFusion (sort, group, filter, aggregate, etc.), while also supporting 
OLAP-like multi-level hierarchies and cubes. This query model had no language 
of its own; the UI manipulated an object model which contained the logical plan 
(although unfortunately the query model was tangled with other layers).

This was later enhanced to generate SQL queries so you wouldn't have to load 
everything into memory, but you could do in-memory operations on top of the SQL 
result. I came up with an expression language close to SQL which could be 
translated into in-memory or SQL operations. I had to do something like the 
merge operator in DataFusion to support multi-stage aggregation (e.g. rewriting 
count(x) -> sum(count(x)), average(x) -> sum(sum(x))/sum(count(x)), etc.).
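
To make the multi-stage rewrite concrete, here is a minimal sketch in Java. It 
is only an illustration: the class and method names (PartialAgg, ofPartition, 
merge) are hypothetical and are not taken from DataFusion or from the engine 
described above. Each partition produces a partial state of sum(x) and 
count(x), and a second stage merges those states.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical partial-aggregate state: one partition's sum(x) and count(x).
final class PartialAgg {
    final double sum;
    final long count;

    PartialAgg(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Stage 1: aggregate the raw values of a single partition.
    static PartialAgg ofPartition(double[] values) {
        double s = 0.0;
        for (double v : values) {
            s += v;
        }
        return new PartialAgg(s, values.length);
    }

    // Stage 2: merge partial states.
    // count(x) becomes sum(count(x)); sum(x) becomes sum(sum(x)).
    static PartialAgg merge(List<PartialAgg> parts) {
        double s = 0.0;
        long c = 0;
        for (PartialAgg p : parts) {
            s += p.sum;
            c += p.count;
        }
        return new PartialAgg(s, c);
    }

    // average(x) = sum(sum(x)) / sum(count(x)).
    // Returning NaN for empty input is an assumption of this sketch.
    double average() {
        return count == 0 ? Double.NaN : sum / count;
    }
}

final class MultiStageDemo {
    public static void main(String[] args) {
        PartialAgg a = PartialAgg.ofPartition(new double[] {1.0, 2.0, 3.0});
        PartialAgg b = PartialAgg.ofPartition(new double[] {10.0});
        PartialAgg merged = PartialAgg.merge(Arrays.asList(a, b));
        System.out.println("count=" + merged.count + " avg=" + merged.average());
    }
}
```

The reason average(x) has to carry both sum and count is that averaging the 
per-partition averages weights partitions equally regardless of size. With the 
two partitions above that would give (2 + 10) / 2 = 6, while the merged state 
gives 16 / 4 = 4.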

Like I said, my framework was nowhere near as heavy-duty as DataFusion + Arrow, 
but my familiarity with the power of in-memory columnar stores is what drew me 
to Arrow in the first place. 

I am curious about how the various language implementations of Arrow are 
evolving their computation frameworks: for Rust there is DataFusion, and I 
noticed that a lot of work has been going on in C++/Python. For Java, it seems 
like this would be the realm of Gandiva or the Dremio product... and of 
course there's Spark! I am still surveying the terrain, but any pointers to 
work people are doing in Java would be welcome.

On 2021/03/12 19:39:16, Andrew Lamb <al...@influxdata.com> wrote: 
> Here are links to the content, should anyone be interested:
> 
> Query Engine Design and the Rust-Based DataFusion in Apache Arrow
> recording: https://www.youtube.com/watch?v=K6eCAVEk4kU
> slides: (datafusion content starts on slide 6):
> https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
> 
> On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb <al...@influxdata.com> wrote:
> 
> > In case anyone is interested in the topic in general or DataFusion in
> > particular, I plan a tech talk [1] next week about "Query Engine Design and
> > the Rust based DataFusion in Apache Arrow."
> >
> > If you are curious how (SQL) query engines in general are structured, I
> > plan to describe the typical high level architecture, using DataFusion as
> > an exemplar.
> >
> > It will be held next Wednesday, March 10, 2021 at 8:00 am PST | 4:00 pm
> > GMT, and posted publicly afterwards.
> >
> > Andrew
> >
> > [1] https://www.influxdata.com/community-showcase/influxdb-tech-talks/
> >
> >
> 
