Hi Bob,

> - By "pipeline-breaking" I assume you mean "very slow", but can you give me
> details? Does this arise from some particular observation, or other
> reported issues?

In general, "pipeline breaking" means that the operator can't produce any
output until it has seen *ALL* of its input. For example, a sort (ORDER BY x)
is a pipeline breaker because the engine has to see the entire input before
it can produce any output. However, a filter (WHERE x > 500) is not a
pipeline breaker because the operator can produce output rows as soon as it
sees any that pass the filter criteria.

> - In general, what tools are you using to analyze datafusion performance?

The tools used most commonly are in the benchmark directory [1]. There is
some other work

> - How much profiling have you done to identify bottlenecks?

I would say it is done on an "as needed" basis -- namely, someone runs a
query that is important to them and then improves whatever hotspot they may
find. However, we don't have regular runs of the same queries, nor do we
automatically gather data over time. dianaclarke added integration for
conbench in [2] that I think would allow for such data collection, but no one
has hooked up the benchmarks to it yet. Getting regular runs of the
performance benchmarks up and running would be very valuable indeed, if you
were looking to help.

Andrew

[1] https://github.com/apache/arrow-datafusion/tree/master/benchmarks
[2] https://github.com/apache/arrow-datafusion/pull/1791

On Fri, Mar 11, 2022 at 4:56 PM Bob Tinsman <bobti...@gmail.com> wrote:

> I just missed the call, but I watched the recording (thank you to Andrew
> for posting [1]). Really interesting!
> I'm diving into Arrow because I have some previous experience with
> in-memory query engines. I'm following discussions around improving
> performance and adding features so I can determine how best to contribute.
>
> In particular, I was interested in some of the background for the JIT
> implementation [2] and the row format [3] but I guess I'm missing context.
> I saw the comment in #1708 [4] that "many pipeline-breaking operators are
> inherently row-based".
>
> My questions:
> - By "pipeline-breaking" I assume you mean "very slow", but can you give me
> details? Does this arise from some particular observation, or other
> reported issues?
> - An example would be nice, like "select a, b, c from blah order by d"
> with table "blah" having 1 million rows and 10 columns takes 5 minutes, or
> even anecdotal evidence like mailing list discussions
> - In general, what tools are you using to analyze datafusion performance?
> - The criterion benchmarks are nice but do you have anything higher-level
> which exercises a broad range of workloads?
> - How much profiling have you done to identify bottlenecks?
>
> To be honest, I was kind of surprised to see using a row format to solve a
> performance problem, but I figured you must have good reasons, and I'm
> still getting my brain around datafusion's query execution model. Thanks
> for any illumination!
>
> [1] https://youtu.be/5NJcqXm6uE0
> [2] https://github.com/apache/arrow-datafusion/pull/1849
> [3] https://github.com/apache/arrow-datafusion/pull/1782
> [4] https://github.com/apache/arrow-datafusion/issues/1708
>
> On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <al...@influxdata.com> wrote:
>
> > I am not sure if everyone saw it in the agenda[1], but we plan to have a
> > meeting tomorrow. I'll plan to record it for anyone who can not make this
> > time.
> >
> > 15:00 UTC Wednesday March 9, 2022
> > Meeting Location: (in agenda)
> > Matthew Turner: focused on JIT and row representation, next Wednesday,
> > March 9th,
> > @yijie: JIT overview
> >
> > [1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
> >
> > On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <benson_mu...@emailplus.org> wrote:
> >
> > > Interested in learning more about this. Can work through the code and
> > > discuss on 17 March either 4:00 or 16:00 UTC.
> > >
> > > Benson
> > >
> > > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > > > through of the JIT code. I would be interested in this as well -- would
> > > > anyone plan to be on the call and discuss it?
> > > >
> > > > I don't think I have time to prepare that content prior
> > > >
> > > > Andrew
> > > >
> > > > [1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
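
To make the pipeline-breaker distinction from the reply at the top of this
thread concrete, here is a minimal Python sketch. It is not DataFusion code:
filter_gt, sort_rows, and reads_before_first_output are hypothetical names
made up for illustration, and plain Python iterators stand in for the
engine's streams of record batches. The sketch counts how many input rows
each operator has to pull before it can hand back its first output row.

```python
from typing import Callable, Iterable, Iterator, List


def filter_gt(rows: Iterable[int], threshold: int) -> Iterator[int]:
    """Streaming operator (like WHERE x > 500): emits a row as soon as it passes."""
    for row in rows:
        if row > threshold:
            yield row


def sort_rows(rows: Iterable[int]) -> Iterator[int]:
    """Pipeline breaker (like ORDER BY x): buffers all input before emitting."""
    buffered = list(rows)  # drains the entire input first
    for row in sorted(buffered):
        yield row


def reads_before_first_output(
    operator: Callable[[Iterator[int]], Iterator[int]],
    values: List[int],
) -> int:
    """Count how many input rows the operator pulls before its first output row."""
    reads = 0

    def source() -> Iterator[int]:
        nonlocal reads
        for value in values:
            reads += 1
            yield value

    next(operator(source()))  # ask the operator for exactly one output row
    return reads


if __name__ == "__main__":
    data = [600, 100, 700, 200]
    # The filter can answer after reading a single row...
    print(reads_before_first_output(lambda rows: filter_gt(rows, 500), data))  # 1
    # ...while the sort has to read every row first.
    print(reads_before_first_output(sort_rows, data))  # 4
```

The filter returns its first row after one read, while the sort only returns
anything once all four input rows have been consumed. That is the sense in
which a sort "breaks" the pipeline, independent of how fast either operator
is.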