I just missed the call, but I watched the recording (thank you to Andrew
for posting [1]). Really interesting!
I'm diving into Arrow because I have some previous experience with
in-memory query engines. I'm following discussions around improving
performance and adding features so I can determine how best to contribute.

In particular, I was interested in some of the background for the JIT
implementation [2] and the row format [3] but I guess I'm missing context.
I saw the comment in #1708 [4] that "many pipeline-breaking operators are
inherently row-based".

My questions:
- By "pipeline-breaking" I assume you mean "very slow", but can you give me
details? Does this arise from some particular observation, or other
reported issues?
  - An example would help, such as "select a, b, c from blah order by d"
taking 5 minutes on a table "blah" with 1 million rows and 10 columns, or
even anecdotal evidence such as mailing list discussions.
- In general, what tools are you using to analyze DataFusion performance?
  - The criterion benchmarks are nice, but do you have anything higher-level
that exercises a broad range of workloads?
  - How much profiling have you done to identify bottlenecks?

To be honest, I was kind of surprised to see a row format used to solve a
performance problem, but I figured you must have good reasons, and I'm
still getting my head around DataFusion's query execution model. Thanks
for any illumination!

[1] https://youtu.be/5NJcqXm6uE0
[2] https://github.com/apache/arrow-datafusion/pull/1849
[3] https://github.com/apache/arrow-datafusion/pull/1782
[4] https://github.com/apache/arrow-datafusion/issues/1708

On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <al...@influxdata.com> wrote:

> I am not sure if everyone saw it in the agenda[1], but we plan to have a
> meeting tomorrow. I'll plan to record it for anyone who can not make this
> time.
>
> 15:00 UTC Wednesday March 9, 2022
> Meeting Location: (in agenda)
> Matthew Turner:  focused on JIT and row representation, next Wednesday,
> March 9th,
> @yijie: JIT  overview
>
> [1]
>
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
>
> On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <benson_mu...@emailplus.org>
> wrote:
>
> > Interested in learning more about this. Can work through the code and
> > discuss on 17 March either 4:00 or 16:00 UTC.
> >
> > Benson
> >
> > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > > through of the JIT code. I would be interested in this as well -- would
> > > anyone plan to be on the call and discuss it?
> > >
> > > I don't think I have time to prepare that content prior
> > >
> > > Andrew
> > >
> > > [1]
> > >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> > >
> >
>
