The system you describe sounds quite cool. I don't know what is going on in the Java world. As you say, I think there is work afoot in C++ on technologies with a use case similar to DataFusion's (though I suspect the implementation will be fairly different).

On Wed, Mar 17, 2021 at 5:37 PM bobtins <bobti...@gmail.com> wrote:

> I missed the talk but watched the video, which was fascinating. It helped
> me get the whole picture of what DataFusion does, which is impressive. In
> my previous job, I built a data analysis engine on a smaller scale in Java,
> so some of the problems that DataFusion tackles are familiar to me.
>
> The initial implementation of my engine would load some data from a
> relational DB into a columnar memory store that I implemented (very much
> like Arrow); it would then perform various transformations analogous to the
> logical plan in DataFusion (sort, group, filter, aggregate, etc), but also
> supporting OLAP-like multi-level hierarchies and cubes. This query model
> didn't have a language itself; the UI manipulated an object model which
> contained the logical plan (although unfortunately the query model was
> tangled with other layers).
>
> This was later enhanced to generate SQL queries so you wouldn't have to
> load everything into memory, but you could do in-memory operations on top
> of the SQL result. I came up with an expression language close to SQL which
> could be translated into in-memory or SQL operations. I had to do something
> like the merge operator in DataFusion to support multi-stage aggregation
> (e.g. implement count(x) -> sum(count(x)), average(x) ->
> sum(sum(x))/sum(count(x)), etc.).
>
> Like I said, my framework was nowhere near as heavy-duty as DataFusion +
> Arrow, but my familiarity with the power of in-memory columnar stores is
> what drew me to Arrow in the first place.
>
> I am curious about how the various language implementations in Arrow are
> evolving computation frameworks; for Rust, there is DataFusion, and I
> noticed that there has been a lot of work going on in C++/Python. For Java,
> it seems like this would be in the realm of Gandiva or the dremio
> product...and of course there's Spark! I am still surveying the terrain,
> but any pointers to work people are doing in Java would be welcome.
>
> On 2021/03/12 19:39:16, Andrew Lamb <al...@influxdata.com> wrote:
> > Here are links to the content, should anyone be interested:
> >
> > Query Engine Design and the Rust-Based DataFusion in Apache Arrow
> > recording: https://www.youtube.com/watch?v=K6eCAVEk4kU
> > slides: (datafusion content starts on slide 6):
> > https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
> >
> > On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > In case anyone is interested in the topic in general or DataFusion in
> > > particular, I plan a tech talk [1] next week about "Query Engine Design
> > > and the Rust based DataFusion in Apache Arrow."
> > >
> > > If you are curious how (SQL) query engines in general are structured, I
> > > plan to describe the typical high level architecture, using DataFusion
> > > as an exemplar.
> > >
> > > It will be held next Wednesday, March 10, 2021 at 8:00 am PST | 4:00 pm
> > > GMT, and posted publicly afterwards.
> > >
> > > Andrew
> > >
> > > [1] https://www.influxdata.com/community-showcase/influxdb-tech-talks/
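
For anyone following the thread, the multi-stage aggregation mentioned in the quoted message (count(x) -> sum(count(x)), average(x) -> sum(sum(x))/sum(count(x))) can be sketched roughly as below. This is only a minimal Python illustration under my own assumptions, not code from DataFusion or from the Java engine described; the names PartialAgg, partial_aggregate, merge, and finalize_average are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PartialAgg:
    """Partial aggregate state for one partition (hypothetical name)."""
    count: int = 0
    total: float = 0.0


def partial_aggregate(values):
    """Stage 1: compute count(x) and sum(x) over a single partition."""
    state = PartialAgg()
    for v in values:
        state.count += 1
        state.total += v
    return state


def merge(states):
    """Stage 2: combine partial states.

    count(x) becomes sum(count(x)); sum(x) becomes sum(sum(x)).
    """
    merged = PartialAgg()
    for s in states:
        merged.count += s.count
        merged.total += s.total
    return merged


def finalize_average(state):
    """average(x) = sum(sum(x)) / sum(count(x)); None if there were no rows."""
    return state.total / state.count if state.count else None


# Three partitions, one of them empty.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], []]
partials = [partial_aggregate(p) for p in partitions]
final = merge(partials)
```

The point of carrying (count, sum) rather than a per-partition average is that averages of averages are wrong when partitions differ in size, whereas counts and sums can be merged in any order.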