I just missed the call, but I watched the recording (thank you to Andrew for posting [1]). Really interesting! I'm diving into Arrow because I have some previous experience with in-memory query engines. I'm following discussions around improving performance and adding features so I can determine how best to contribute.
In particular, I was interested in some of the background for the JIT implementation [2] and the row format [3], but I guess I'm missing context. I saw the comment in #1708 [4] that "many pipeline-breaking operators are inherently row-based".

My questions:

- By "pipeline-breaking" I assume you mean "very slow", but can you give me details? Does this arise from some particular observation, or other reported issues?
- An example would be nice, like "select a, b, c from blah order by d" with table "blah" having 1 million rows and 10 columns takes 5 minutes, or even anecdotal evidence like mailing list discussions.
- In general, what tools are you using to analyze datafusion performance?
- The criterion benchmarks are nice, but do you have anything higher-level which exercises a broad range of workloads?
- How much profiling have you done to identify bottlenecks?

To be honest, I was kind of surprised to see a row format used to solve a performance problem, but I figured you must have good reasons, and I'm still getting my brain around datafusion's query execution model.

Thanks for any illumination!

[1] https://youtu.be/5NJcqXm6uE0
[2] https://github.com/apache/arrow-datafusion/pull/1849
[3] https://github.com/apache/arrow-datafusion/pull/1782
[4] https://github.com/apache/arrow-datafusion/issues/1708

On Tue, Mar 8, 2022 at 12:25 PM Andrew Lamb <al...@influxdata.com> wrote:

> I am not sure if everyone saw it in the agenda[1], but we plan to have a
> meeting tomorrow. I'll plan to record it for anyone who can not make this
> time.
>
> 15:00 UTC Wednesday March 9, 2022
> Meeting Location: (in agenda)
> Matthew Turner: focused on JIT and row representation, next Wednesday,
> March 9th,
> @yijie: JIT overview
>
> [1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#
>
> On Thu, Mar 3, 2022 at 12:50 AM Benson Muite <benson_mu...@emailplus.org>
> wrote:
>
> > Interested in learning more about this. Can work through the code and
> > discuss on 17 March either 4:00 or 16:00 UTC.
> >
> > Benson
> >
> > On 3/3/22 12:03 AM, Andrew Lamb wrote:
> > > I noticed that Matthew Turner added a note to the agenda[1] for a walk
> > > through of the JIT code. I would be interested in this as well -- would
> > > anyone plan to be on the call and discuss it?
> > >
> > > I don't think I have time to prepare that content prior
> > >
> > > Andrew
> > >
> > > [1] https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit#