Hello All, I recently did some research in the Catalyst Analyzer area to check if it’s possible to make it single-pass instead of fixed-point. Despite the flexibility of the current fixed-point approach (new functionality - new rule), it has some drawbacks. The dependencies between the rules are unobvious, so it’s hard to introduce changes without having the full knowledge. By modifying one rule, the whole chain of transformations can change in an unobvious way. Since we can hit the maximum number of iterations, there’s no guarantee that the plan is going to be resolved. And from a performance perspective the Analyzer can be quite slow for large logical plans and wide tables.
The idea is to resolve the tree in a single-pass post-order traversal. This approach can be beneficial for the following reasons: - Code can rely on child nodes being resolved - Name visibility can be controlled by a scope, which would be maintained during the traversal - Random in-tree lookups are disallowed during the traversal, so the resolution algorithm would be linear with respect to the parsed tree size - Analyzer would deterministically produce resolved logical plan or a descriptive error The prototype shows that it’s possible to do a bottom-up Analyzer implementation and reuse a large chunk of node-processing code from rule bodies. I did the performance benchmark on two cases where the current Analyzer struggles the most - wide nested views and large logical plans and got 7-10x performance speedups. Is this something that would be interesting for the Spark community? What would be the best way to proceed with this idea? Regards, Vladimir Golubev.