+1 on this too When I implemented "group by all", I introduced at least two subtle bugs that many reviewers weren't able to catch and those two bugs would not have been possible to introduce if we had a single pass analyzer. Single pass can make the whole framework more robust.
On Tue, Aug 20, 2024 at 7:23 PM Xiao Li <gatorsm...@gmail.com> wrote: > This sounds like a good idea! > > The Analyzer is complex. The changes in the new Analyzer should not affect > the existing one. The users could add the QO rules and rely on the existing > structures and patterns of the logical plan trees generated by the current > one. The new Analyzer needs to generate the same logical plan trees as the > current one. > > Cheers, > > Xiao > > Vladimir Golubev <vvdr....@gmail.com> 于2024年8月14日周三 11:27写道: > >> - I think we can rely on the current tests. One possibility would be to >> dual-run both Analyzer implementations if `Utils.isTesting` and compare the >> (normalized) logical plans >> - We can implement the analyzer functionality by milestones (Milestone 0: >> Project, Filter, UnresolvedInlineTable, Milestone 2: Support main >> datasources, ...). Running both analyzers in mixed mode may lead to >> unexpected logical plan problems, because that would introduce a completely >> different chain of transformations >> >> On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <her...@databricks.com> >> wrote: >> >>> +1(000) on this! >>> >>> This should massively reduce allocations done in the analyzer, and it is >>> much more efficient. I also can't count the times that I had to increase >>> the number of iterations. This sounds like a no-brainer to me. >>> >>> I do have two questions: >>> >>> - How do we ensure that we don't accidentally break the analyzer? >>> Are existing tests enough? >>> - How can we introduce this incrementally? Can we run the analyzer >>> in mixed mode (both single pass rules and the existing tree traversal >>> rules) for a while? >>> >>> Cheers, >>> Herman >>> >>> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <vvdr....@gmail.com> >>> wrote: >>> >>>> Hello All, >>>> >>>> I recently did some research in the Catalyst Analyzer area to check if >>>> it’s possible to make it single-pass instead of fixed-point. Despite the >>>> flexibility of the current fixed-point approach (new functionality - new >>>> rule), it has some drawbacks. The dependencies between the rules are >>>> unobvious, so it’s hard to introduce changes without having the full >>>> knowledge. By modifying one rule, the whole chain of transformations can >>>> change in an unobvious way. Since we can hit the maximum number of >>>> iterations, there’s no guarantee that the plan is going to be resolved. And >>>> from a performance perspective the Analyzer can be quite slow for large >>>> logical plans and wide tables. >>>> >>>> The idea is to resolve the tree in a single-pass post-order traversal. >>>> This approach can be beneficial for the following reasons: >>>> >>>> - >>>> >>>> Code can rely on child nodes being resolved >>>> - >>>> >>>> Name visibility can be controlled by a scope, which would be >>>> maintained during the traversal >>>> - >>>> >>>> Random in-tree lookups are disallowed during the traversal, so the >>>> resolution algorithm would be linear with respect to the parsed tree >>>> size >>>> - >>>> >>>> Analyzer would deterministically produce resolved logical plan or a >>>> descriptive error >>>> >>>> >>>> The prototype shows that it’s possible to do a bottom-up Analyzer >>>> implementation and reuse a large chunk of node-processing code from rule >>>> bodies. I did the performance benchmark on two cases where the current >>>> Analyzer struggles the most - wide nested views and large logical plans and >>>> got 7-10x performance speedups. >>>> >>>> Is this something that would be interesting for the Spark community? >>>> What would be the best way to proceed with this idea? >>>> >>>> Regards, >>>> >>>> Vladimir Golubev. >>>> >>>