Re: [DISCUSS][Rust] Performance Measurements (was Biweekly sync call for arrow/datafusion again?)

Bob Tinsman Mon, 14 Mar 2022 16:31:15 -0700

Thanks for pulling this out from the long thread...

On Sat, Mar 12, 2022 at 3:06 AM Andrew Lamb <[email protected]> wrote:


> Hi Bob,
>
> > - By "pipeline-breaking" I assume you mean "very slow", but can you give
> > me details? Does this arise from some particular observation, or other
> > reported issues?
>
> In general pipeline breaking means that the output of the operator can't be
> produced until it has seen *ALL* its input.
>
> For example, a sort (ORDER BY x) is a pipeline breaker because the engine
> has to see the entire input prior to being able to produce any output.
>
> However, a filter (WHERE x > 500) is not a pipeline breaker because the
> operator can produce output rows as soon as it sees any that pass the
> filter criteria.
>

Aha, I get it--so the goal is not necessarily to speed up the whole thing
but to be able to send output to the next processing stage sooner.
So IIRC besides sorts, the other types of queries mentioned were joins,
group by, and hash aggregates?
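
To check my understanding, here is a minimal Rust sketch of the distinction. It uses plain iterators over i64 rows rather than DataFusion's actual operators, and the names filter_gt, sort_asc, and counted are made up for illustration. The counter records how many input rows were pulled before the first output row appeared.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Streaming (non-breaking) operator, like `WHERE x > threshold`:
/// each row is passed on as soon as it satisfies the predicate.
fn filter_gt(input: impl Iterator<Item = i64>, threshold: i64) -> impl Iterator<Item = i64> {
    input.filter(move |&x| x > threshold)
}

/// Pipeline breaker, like `ORDER BY x`: the whole input is drained
/// and sorted before the first output row can be produced.
fn sort_asc(input: impl Iterator<Item = i64>) -> impl Iterator<Item = i64> {
    let mut rows: Vec<i64> = input.collect();
    rows.sort_unstable();
    rows.into_iter()
}

/// Hypothetical helper: wraps a row source so that `counter` tracks
/// how many rows downstream operators have pulled from it so far.
fn counted(rows: Vec<i64>, counter: Rc<Cell<usize>>) -> impl Iterator<Item = i64> {
    rows.into_iter()
        .inspect(move |_| counter.set(counter.get() + 1))
}
```

Pulling a single row from filter_gt(counted(rows, c.clone()), 500) only consumes input up to the first match, while pulling a single row from sort_asc(counted(rows, c.clone())) consumes every input row first.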

>
> > - In general, what tools are you using to analyze datafusion performance?
>
> The tools used most commonly are in the benchmark directory [1]. There is
> some other work
>
> >  - How much profiling have you done to identify bottlenecks?
>
> I would say it is done on an "as needed basis" -- namely someone runs a
> query that is important to them and then improves whatever hotspot they may
> find.
>
> However, we don't have regular runs of the same queries or automatically
> gather data over time. dianaclarke added integration for conbench in [2]
> that I think would allow for such data collection, but no one has hooked up
> the benchmarks to it yet.
>
> Getting regular runs of the performance benchmark up and running would be
> very valuable indeed, if you were looking to help.
>

Yes, I'm definitely looking to help, and maybe getting more perf
benchmarks up would be a good way of starting.
I noticed that matthewmturner was working on something to run benchmarks in
docker, which is pretty nice! [3]
Any suggestions for performance use cases would be welcome; I could add
them in.
One thing I like to do is to run the same benchmark and tweak the knobs,
such as number of rows, cardinality, etc. because the effects can vary A
LOT.
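
As a rough sketch of what I mean by tweaking the knobs, the following Rust code sweeps row count and key cardinality over a stand-in workload. Everything here is an assumption for illustration: make_keys, group_count, and sweep are hypothetical names, the workload is a HashMap-based count per key rather than a DataFusion query, and the data comes from a small fixed-seed generator so runs are repeatable.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Generates `rows` keys in the range 0..cardinality (cardinality must be > 0)
/// using a fixed-seed linear congruential generator.
fn make_keys(rows: usize, cardinality: u64) -> Vec<u64> {
    let mut state: u64 = 42;
    (0..rows)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 33) % cardinality
        })
        .collect()
}

/// Stand-in workload, roughly `SELECT key, COUNT(*) ... GROUP BY key`.
fn group_count(keys: &[u64]) -> HashMap<u64, u64> {
    let mut counts = HashMap::new();
    for &k in keys {
        *counts.entry(k).or_insert(0) += 1;
    }
    counts
}

/// Runs the workload for every (rows, cardinality) pair and records
/// (rows, cardinality, number of groups, elapsed wall-clock time).
fn sweep(row_counts: &[usize], cardinalities: &[u64]) -> Vec<(usize, u64, usize, Duration)> {
    let mut results = Vec::new();
    for &rows in row_counts {
        for &card in cardinalities {
            let keys = make_keys(rows, card);
            // Only the workload is timed, not the data generation.
            let start = Instant::now();
            let groups = group_count(&keys).len();
            results.push((rows, card, groups, start.elapsed()));
        }
    }
    results
}
```

Calling sweep(&[100_000, 1_000_000], &[10, 10_000]) would give one timing per combination. A single run per cell is noisy, so a real harness would repeat each cell and report a median or use criterion.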

I am tempted to venture opinions on how to do things based on my experience
building my own (closed-source) columnar query engine, but that one is an
entirely different beast, so I am not qualified to opine until I learn more.
I'm starting to follow history about various performance improvements, but
if anyone has any suggestion, like "I wish datafusion could complete X
query on 50 bazillion rows in less than 3 days", let me know. In
performance, there are so many variables that it's hard to know where to
start.

Thanks, Bob


>
> [1] https://github.com/apache/arrow-datafusion/tree/master/benchmarks
> [2] https://github.com/apache/arrow-datafusion/pull/1791
>
> [3] https://github.com/apache/arrow-datafusion/pull/1928



On Fri, Mar 11, 2022 at 4:56 PM Bob Tinsman <[email protected]> wrote:
>
> > I just missed the call, but I watched the recording (thank you to Andrew
> > for posting [1]). Really interesting!
> > I'm diving into Arrow because I have some previous experience with
> > in-memory query engines. I'm following discussions around improving
> > performance and adding features so I can determine how best to
> > contribute.
> >
> > In particular, I was interested in some of the background for the JIT
> > implementation [2] and the row format [3] but I guess I'm missing
> > context.
> > I saw the comment in #1708 [4] that "many pipeline-breaking operators are
> > inherently row-based".
> > My questions:
> > - By "pipeline-breaking" I assume you mean "very slow", but can you give
> > me details? Does this arise from some particular observation, or other
> > reported issues?
> >   - An example would be nice, like "select a, b, c from blah order by d"
> > with table "blah" having 1 million rows and 10 columns takes 5 minutes,
> > or even anecdotal evidence like mailing list discussions
> > - In general, what tools are you using to analyze datafusion performance?
> >   - The criterion benchmarks are nice but do you have anything
> > higher-level which exercises a broad range of workloads?
> >   - How much profiling have you done to identify bottlenecks?
> >
> > To be honest, I was kind of surprised to see using a row format to solve
> > a performance problem, but I figured you must have good reasons, and I'm
> > still getting my brain around datafusion's query execution model. Thanks
> > for any illumination!
> >
> > [1] https://youtu.be/5NJcqXm6uE0
> > [2] https://github.com/apache/arrow-datafusion/pull/1849
> > [3] https://github.com/apache/arrow-datafusion/pull/1782
> > [4] https://github.com/apache/arrow-datafusion/issues/1708
> >
> > On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <[email protected]>
> wrote:
> >
> > > I am not sure if everyone saw it in the agenda[1], but we plan to have
> a
> > > meeting tomorrow. I'll plan to record it for anyone who can not make
> this
> > > time.
> > >
> > > 15:00 UTC Wednesday March 9, 2022
> > > Meeting Location: (in agenda)
> > > Matthew Turner:  focused on JIT and row representation, next Wednesday,
> > > March 9th,
> > > @yijie: JIT  overview
> > >
> > > [1]
> > >
> > >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> > >
> > > On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <
> [email protected]
> > >
> > > wrote:
> > >
> > > > Interested in learning more about this. Can work through the code and
> > > > discuss on 17 March either 4:00 or 16:00 UTC.
> > > >
> > > > Benson
> > > >
> > > > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > > > I noticed that Matthew Turner added a note to the agenda[1] for a
> > walk
> > > > > through of the JIT code. I would be interested in this as well --
> > would
> > > > > anyone plan to be on the call and discuss it?
> > > > >
> > > > > I don't think I have time to prepare that content prior
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> > > > >
> > > >
> > >
> >
>
