> > > - By "pipeline-breaking" I assume you mean "very slow", but can you
> > > give me details? Does this arise from some particular observation, or
> > > other reported issues?
> >
> > In general pipeline breaking means that the output of the operator can't
> > be produced until it has seen *ALL* its input.
> >
> > For example, a sort (ORDER BY x) is a pipeline breaker because the engine
> > has to see the entire input prior to being able to produce any output.
> >
> > However, a filter (WHERE x > 500) is not a pipeline breaker because the
> > operator can produce output rows as soon as it sees any that pass the
> > filter criteria.
>
> Aha, I get it--so the goal is not necessarily to speed up the whole thing
> but to be able to send output to the next processing stage sooner.
> So IIRC besides sorts, the other types of queries mentioned were joins,
> group by, and hash aggregates?

Yes. DataFusion uses hash aggregates today to implement GROUP BY so I
probably wouldn't describe them as different types of queries.

> > > - In general, what tools are you using to analyze datafusion
> > > performance?
> >
> > The tools used most commonly are in the benchmark directory [1]. There is
> > some other work
> >
> > > - How much profiling have you done to identify bottlenecks?
> >
> > I would say it is done on an "as needed basis" -- namely someone runs a
> > query that is important to them and then improves whatever hotspot they
> > may find.
> >
> > However, we don't have regular runs of the same queries or automatically
> > gather data over time. dianaclarke added integration for condabench in
> > [2] that I think would allow for such data collection, but no one has
> > hooked up the benchmarks to it yet.
> >
> > Getting regular runs of the performance benchmark up and running would be
> > very valuable indeed, if you were looking to help.
>
> Yes, I'm definitely looking to help, and maybe getting more perf
> benchmarks up would be a good way of starting.
> I noticed that matthewmturner was working on something to run benchmarks in
> docker, which is pretty nice! [3]
> Any suggestions for performance use cases would be welcome; I could add
> them in.
> One thing I like to do is to run the same benchmark and tweak the knobs,
> such as number of rows, cardinality, etc. because the effects can vary A
> LOT.

I agree this is a great strategy. I don't think there is any reason we
can't do this to DataFusion, but no one has yet invested the time into a
systematic investigation.

> I am tempted to venture opinions on how to do things based on my experience
> building my own (closed-source) columnar query engine, but that one is an
> entirely different beast, so I am not qualified to opine until I learn
> more.
> I'm starting to follow history about various performance improvements, but
> if anyone has any suggestion, like "I wish datafusion could complete X
> query on 50 bazillion rows in less than 3 days", let me know. In
> performance, there are so many variables that it's hard to know where to
> start.

I think the TPCH queries would be a great place to start. Specifically,
getting to the point where we had a baseline number at Scale Factor 1 would
be great. DataFusion doesn't run some of them due to lack of various
features, but I can't remember how well they are tracked. Getting DF to the
point where all the queries complete in a "reasonable" amount of time would
be pretty awesome.

Andrew
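
For anyone following the thread, the pipeline-breaker distinction discussed
earlier (sort vs. filter) can be sketched with Python generators. This is a
toy illustration only, not DataFusion code: the names source, filter_op,
sort_op and first_output are made up for the example, and a real engine
works on batches of rows rather than single values.

```python
from typing import Iterable, Iterator, List


def source(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Yield rows one at a time, recording each read in `log`."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def filter_op(rows: Iterable[int], threshold: int) -> Iterator[int]:
    """Streaming operator: emits a row as soon as it passes the predicate."""
    for row in rows:
        if row > threshold:
            yield row


def sort_op(rows: Iterable[int]) -> Iterator[int]:
    """Pipeline breaker: sorted() must consume all input before any output."""
    yield from sorted(rows)


def first_output(op_output: Iterator[int], log: List[str]) -> int:
    """Pull a single output row and note it in the log."""
    row = next(op_output)
    log.append(f"emit {row}")
    return row


data = [700, 100, 900, 600]

# Filter (like WHERE x > 500): the first output appears after one read.
filter_log: List[str] = []
first_output(filter_op(source(data, filter_log), 500), filter_log)

# Sort (like ORDER BY x): every input row is read before the first output.
sort_log: List[str] = []
first_output(sort_op(source(data, sort_log)), sort_log)

print(filter_log)  # ['read 700', 'emit 700']
print(sort_log)    # ['read 700', 'read 100', 'read 900', 'read 600', 'emit 100']
```

The logs show the difference: the filter hands its first row downstream
after reading a single input row, while the sort reads the whole input
before it can emit anything.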