Re: [DISCUSS] [Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

Reynold Xin Tue, 20 Aug 2024 11:48:37 -0700

+1 on this too

When I implemented "group by all", I introduced at least two subtle bugs
that many reviewers weren't able to catch and those two bugs would not have
been possible to introduce if we had a single pass analyzer. Single pass
can make the whole framework more robust.







On Tue, Aug 20, 2024 at 7:23 PM Xiao Li <[email protected]> wrote:

> This sounds like a good idea!
>
> The Analyzer is complex. The changes in the new Analyzer should not affect
> the existing one. The users could add the QO rules and rely on the existing
> structures and patterns of the logical plan trees generated by the current
> one.  The new Analyzer needs to generate the same logical plan trees as the
> current one.
>
> Cheers,
>
> Xiao
>
> Vladimir Golubev <[email protected]> 于2024年8月14日周三 11:27写道：
>
>> - I think we can rely on the current tests. One possibility would be to
>> dual-run both Analyzer implementations if `Utils.isTesting` and compare the
>> (normalized) logical plans
>> - We can implement the analyzer functionality by milestones (Milestone 0:
>> Project, Filter, UnresolvedInlineTable, Milestone 2: Support main
>> datasources, ...). Running both analyzers in mixed mode may lead to
>> unexpected logical plan problems, because that would introduce a completely
>> different chain of transformations
>>
>> On Wed, Aug 14, 2024 at 3:58 PM Herman van Hovell <[email protected]>
>> wrote:
>>
>>> +1(000) on this!
>>>
>>> This should massively reduce allocations done in the analyzer, and it is
>>> much more efficient. I also can't count the times that I had to increase
>>> the number of iterations. This sounds like a no-brainer to me.
>>>
>>> I do have two questions:
>>>
>>>    - How do we ensure that we don't accidentally break the analyzer?
>>>    Are existing tests enough?
>>>    - How can we introduce this incrementally? Can we run the analyzer
>>>    in mixed mode (both single pass rules and the existing tree traversal
>>>    rules) for a while?
>>>
>>> Cheers,
>>> Herman
>>>
>>> On Fri, Aug 9, 2024 at 10:48 AM Vladimir Golubev <[email protected]>
>>> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I recently did some research in the Catalyst Analyzer area to check if
>>>> it’s possible to make it single-pass instead of fixed-point. Despite the
>>>> flexibility of the current fixed-point approach (new functionality - new
>>>> rule), it has some drawbacks. The dependencies between the rules are
>>>> unobvious, so it’s hard to introduce changes without having the full
>>>> knowledge. By modifying one rule, the whole chain of transformations can
>>>> change in an unobvious way. Since we can hit the maximum number of
>>>> iterations, there’s no guarantee that the plan is going to be resolved. And
>>>> from a performance perspective the Analyzer can be quite slow for large
>>>> logical plans and wide tables.
>>>>
>>>> The idea is to resolve the tree in a single-pass post-order traversal.
>>>> This approach can be beneficial for the following reasons:
>>>>
>>>>    -
>>>>
>>>>    Code can rely on child nodes being resolved
>>>>    -
>>>>
>>>>    Name visibility can be controlled by a scope, which would be
>>>>    maintained during the traversal
>>>>    -
>>>>
>>>>    Random in-tree lookups are disallowed during the traversal, so the
>>>>    resolution algorithm would be linear with respect to the parsed tree 
>>>> size
>>>>    -
>>>>
>>>>    Analyzer would deterministically produce resolved logical plan or a
>>>>    descriptive error
>>>>
>>>>
>>>> The prototype shows that it’s possible to do a bottom-up Analyzer
>>>> implementation and reuse a large chunk of node-processing code from rule
>>>> bodies. I did the performance benchmark on two cases where the current
>>>> Analyzer struggles the most - wide nested views and large logical plans and
>>>> got 7-10x performance speedups.
>>>>
>>>> Is this something that would be interesting for the Spark community?
>>>> What would be the best way to proceed with this idea?
>>>>
>>>> Regards,
>>>>
>>>> Vladimir Golubev.
>>>>
>>>

Re: [DISCUSS] [Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

Reply via email to