[DISCUSS][Rust] Performance Measurements (was Biweekly sync call for arrow/datafusion again?)

Andrew Lamb Sat, 12 Mar 2022 03:06:29 -0800

Hi Bob,

> - By "pipeline-breaking" I assume you mean "very slow", but can you give me
> details? Does this arise from some particular observation, or other
> reported issues?


In general, "pipeline breaking" means that an operator can't produce any
output until it has seen *ALL* of its input.

For example, a sort (ORDER BY x) is a pipeline breaker because the engine
has to see the entire input prior to being able to produce any output.

However, a filter (WHERE x > 500) is not a pipeline breaker because the
operator can produce output rows as soon as it sees any that pass the
filter criteria.
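
To make the distinction concrete, here is a minimal sketch in plain Rust
iterators (not DataFusion's actual operator API; every function name below
is made up for illustration). It counts how many input rows each kind of
operator has to pull before it can hand back its first output row.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical helper: wraps the input rows so we can count how many of
/// them the downstream operator has pulled so far.
fn counted(rows: Vec<i64>, pulled: Rc<Cell<usize>>) -> impl Iterator<Item = i64> {
    rows.into_iter().inspect(move |_| pulled.set(pulled.get() + 1))
}

/// Streaming operator, analogous to `WHERE x > threshold`: a row is
/// emitted as soon as it passes the predicate.
fn filter_gt(input: impl Iterator<Item = i64>, threshold: i64) -> impl Iterator<Item = i64> {
    input.filter(move |x| *x > threshold)
}

/// Pipeline breaker, analogous to `ORDER BY x`: the whole input is
/// buffered and sorted before any row can be returned.
fn sort_all(input: impl Iterator<Item = i64>) -> impl Iterator<Item = i64> {
    let mut buffered: Vec<i64> = input.collect();
    buffered.sort_unstable();
    buffered.into_iter()
}

/// Number of input rows pulled by the filter before its first output row.
fn pulled_before_first_filter_output(rows: Vec<i64>, threshold: i64) -> usize {
    let pulled = Rc::new(Cell::new(0));
    let mut out = filter_gt(counted(rows, Rc::clone(&pulled)), threshold);
    let _first = out.next();
    pulled.get()
}

/// Number of input rows pulled by the sort before its first output row.
fn pulled_before_first_sort_output(rows: Vec<i64>) -> usize {
    let pulled = Rc::new(Cell::new(0));
    let mut out = sort_all(counted(rows, Rc::clone(&pulled)));
    let _first = out.next();
    pulled.get()
}
```

With the input rows 900, 100, 700, 300 and a threshold of 500, the filter
needs only the first row before it can produce output, while the sort has
to pull all four.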

> - In general, what tools are you using to analyze datafusion performance?

The most commonly used tools are in the benchmarks directory [1]. There is
some other work

>  - How much profiling have you done to identify bottlenecks?

I would say it is done on an "as needed" basis -- namely, someone runs a
query that is important to them and then improves whatever hotspot they
find.

However, we don't have regular runs of the same queries, nor do we
automatically gather data over time. dianaclarke added integration for
conbench in [2] that I think would allow for such data collection, but no
one has hooked up the benchmarks to it yet.

Getting regular runs of the performance benchmark up and running would be
very valuable indeed, if you were looking to help.

Andrew


[1] https://github.com/apache/arrow-datafusion/tree/master/benchmarks
[2] https://github.com/apache/arrow-datafusion/pull/1791

On Fri, Mar 11, 2022 at 4:56 PM Bob Tinsman <[email protected]> wrote:

> I just missed the call, but I watched the recording (thank you to Andrew
> for posting [1]). Really interesting!
> I'm diving into Arrow because I have some previous experience with
> in-memory query engines. I'm following discussions around improving
> performance and adding features so I can determine how best to contribute.
>
> In particular, I was interested in some of the background for the JIT
> implementation [2] and the row format [3] but I guess I'm missing context.
> I saw the comment in #1708 [4] that "many pipeline-breaking operators are
> inherently row-based".
> My questions:
> - By "pipeline-breaking" I assume you mean "very slow", but can you give me
> details? Does this arise from some particular observation, or other
> reported issues?
>   - An example would be nice, like "select a, b, c from blah order by d"
> with table "blah" having 1 million rows and 10 columns takes 5 minutes, or
> even anecdotal evidence like mailing list discussions
> - In general, what tools are you using to analyze datafusion performance?
>   - The criterion benchmarks are nice but do you have anything higher-level
> which exercises a broad range of workloads?
>   - How much profiling have you done to identify bottlenecks?
>
> To be honest, I was kind of surprised to see using a row format to solve a
> performance problem, but I figured you must have good reasons, and I'm
> still getting my brain around datafusion's query execution model. Thanks
> for any illumination!
>
> [1] https://youtu.be/5NJcqXm6uE0
> [2] https://github.com/apache/arrow-datafusion/pull/1849
> [3] https://github.com/apache/arrow-datafusion/pull/1782
> [4] https://github.com/apache/arrow-datafusion/issues/1708
>
> On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <[email protected]> wrote:
>
> > I am not sure if everyone saw it in the agenda[1], but we plan to have a
> > meeting tomorrow. I'll plan to record it for anyone who can not make this
> > time.
> >
> > 15:00 UTC Wednesday March 9, 2022
> > Meeting Location: (in agenda)
> > Matthew Turner:  focused on JIT and row representation, next Wednesday,
> > March 9th,
> > @yijie: JIT  overview
> >
> > [1]
> >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> >
> > On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <[email protected]
> >
> > wrote:
> >
> > > Interested in learning more about this. Can work through the code and
> > > discuss on 17 March either 4:00 or 16:00 UTC.
> > >
> > > Benson
> > >
> > > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > > I noticed that Matthew Turner added a note to the agenda[1] for a
> walk
> > > > through of the JIT code. I would be interested in this as well --
> would
> > > > anyone plan to be on the call and discuss it?
> > > >
> > > > I don't think I have time to prepare that content prior
> > > >
> > > > Andrew
> > > >
> > > > [1]
> > > >
> > >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> > > >
> > >
> >
>
