The system you describe sounds quite cool. I don't know what is going on in
the Java world -- as you say, I think there is work afoot on technologies
with a similar use case to DataFusion in C++ (though I suspect the
implementation will be fairly different).



On Wed, Mar 17, 2021 at 5:37 PM bobtins <bobti...@gmail.com> wrote:

> I missed the talk but watched the video, which was fascinating. It helped
> me get the whole picture of what DataFusion does, which is impressive. In
> my previous job, I built a data analysis engine on a smaller scale in Java,
> so some of the problems that DataFusion tackles are familiar to me.
>
> The initial implementation of my engine would load some data from a
> relational DB into a columnar memory store that I implemented (very much
> like Arrow); it would then perform various transformations analogous to the
> logical plan in DataFusion (sort, group, filter, aggregate, etc), but also
> supporting OLAP-like multi-level hierarchies and cubes. This query model
> didn't have a language itself; the UI manipulated an object model which
> contained the logical plan (although unfortunately the query model was
> tangled with other layers).
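
To make the idea above concrete, here is a minimal sketch of a columnar in-memory table driven by a plan that is an object model rather than a query language. Everything in it (the `table` layout, `filter_rows`, `group_sum`, `execute`) is a hypothetical illustration, not code from the engine described or from DataFusion.

```python
from collections import defaultdict

# Columnar store: one Python list per column (hypothetical layout).
table = {
    "region": ["east", "west", "east", "west", "east"],
    "sales": [10, 20, 30, 40, 50],
}


def filter_rows(cols, column, predicate):
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    keep = [i for i, v in enumerate(cols[column]) if predicate(v)]
    return {name: [vals[i] for i in keep] for name, vals in cols.items()}


def group_sum(cols, key, value):
    """Group by `key` and sum `value`; output groups are sorted by key."""
    totals = defaultdict(int)
    for k, v in zip(cols[key], cols[value]):
        totals[k] += v
    keys = sorted(totals)
    return {key: keys, value: [totals[k] for k in keys]}


# The "logical plan" is plain data: an ordered list of (operator, arguments).
# A UI could build or edit this list without any query language.
plan = [
    (filter_rows, {"column": "sales", "predicate": lambda v: v >= 20}),
    (group_sum, {"key": "region", "value": "sales"}),
]


def execute(cols, steps):
    """Apply each plan step to the output of the previous one."""
    for op, kwargs in steps:
        cols = op(cols, **kwargs)
    return cols


result = execute(table, plan)
print(result)
```

Keeping the plan as data separate from the operators is one way to avoid the tangling with other layers mentioned above.
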
>
> This was later enhanced to generate SQL queries so you wouldn't have to
> load everything into memory, but you could do in-memory operations on top
> of the SQL result. I came up with an expression language close to SQL which
> could be translated into in-memory or SQL operations. I had to do something
> like the merge operator in DataFusion to support multi-stage aggregation
> (e.g. implement count(x) -> sum(count(x)), average(x) ->
> sum(sum(x))/sum(count(x)), etc. ).
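
The multi-stage rewrite above can be sketched as follows. This is a minimal illustration under assumptions: the names `Partial`, `partial_aggregate`, `merge`, and `final_average` are made up for the example, and nothing here reflects how DataFusion actually implements its merge step.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Partial aggregate state for one partition (hypothetical structure)."""
    count: int = 0
    total: float = 0.0


def partial_aggregate(values):
    """Stage 1: compute count(x) and sum(x) over a single partition."""
    return Partial(count=len(values), total=float(sum(values)))


def merge(partials):
    """Stage 2: count(x) becomes sum(count(x)); sum(x) becomes sum(sum(x))."""
    merged = Partial()
    for p in partials:
        merged.count += p.count
        merged.total += p.total
    return merged


def final_average(state):
    """average(x) = sum(sum(x)) / sum(count(x)); None when there are no rows."""
    return state.total / state.count if state.count else None


# Three partitions, e.g. the results of three separate SQL queries.
partitions = [[1, 2, 3], [4, 5], [6]]
state = merge(partial_aggregate(p) for p in partitions)
print(state.count, final_average(state))
```

Averaging the per-partition averages would give the wrong answer when partitions differ in size, which is why the state carries the sum and the count separately.
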
>
> Like I said, my framework was nowhere near as heavy-duty as DataFusion +
> Arrow, but my familiarity with the power of in-memory columnar stores is
> what drew me to Arrow in the first place.
>
> I am curious about how the various language implementations in Arrow are
> evolving computation frameworks; for Rust, there is DataFusion, and I
> noticed that there has been a lot of work going on in C++/Python. For Java,
> it seems like this would be in the realm of Gandiva or the Dremio
> product... and of course there's Spark! I am still surveying the terrain,
> but any pointers to work people are doing in Java would be welcome.
>
> On 2021/03/12 19:39:16, Andrew Lamb <al...@influxdata.com> wrote:
> > Here are links to the content, should anyone be interested:
> >
> > Query Engine Design and the Rust-Based DataFusion in Apache Arrow
> > recording: https://www.youtube.com/watch?v=K6eCAVEk4kU
> > slides (DataFusion content starts on slide 6):
> > https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934
> >
> > On Thu, Mar 4, 2021 at 4:05 PM Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > In case anyone is interested in the topic in general or DataFusion in
> > > particular, I plan a tech talk [1] next week about "Query Engine
> > > Design and the Rust based DataFusion in Apache Arrow."
> > >
> > > If you are curious how (SQL) query engines in general are structured, I
> > > plan to describe the typical high level architecture, using DataFusion
> > > as an exemplar.
> > >
> > > It will be held next Wednesday, March 10, 2021 at 8:00 am PST | 4:00 pm
> > > GMT, and posted publicly afterwards.
> > >
> > > Andrew
> > >
> > > [1] https://www.influxdata.com/community-showcase/influxdb-tech-talks/
> > >
> > >
> >
>
