A work-in-progress PR: https://github.com/apache/spark/pull/21822

The PR also adds infrastructure to throw exceptions in test mode when the
various transform methods are used as part of analysis. Unfortunately there
are a couple of edge cases that do need to call transform* during analysis,
and as a result there is this ugly bypassTransformAnalyzerCheck method.
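
For illustration, here is a rough, self-contained sketch of what such a
test-mode guard and bypass helper could look like (this is not the actual PR
code; the object name, the test-mode check, and the package prefix below are
made up for the example):

// Hypothetical sketch: transform* would call assertNotCalledFromAnalyzer(),
// and the few analyzer rules that genuinely need transform* would wrap the
// call in bypassTransformAnalyzerCheck.
object TransformGuard {
  // Only enforce the check in test builds (an approximation of Spark's own
  // "am I running under tests?" check; the exact mechanism may differ).
  private val testMode: Boolean = sys.props.contains("spark.testing")

  // Thread-local escape hatch flipped by bypassTransformAnalyzerCheck below.
  private val bypass = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  /** Fail fast if transform* is invoked from inside the analyzer. */
  def assertNotCalledFromAnalyzer(): Unit = {
    if (testMode && !bypass.get()) {
      val calledFromAnalyzer = Thread.currentThread().getStackTrace.exists(
        _.getClassName.startsWith("org.apache.spark.sql.catalyst.analysis"))
      if (calledFromAnalyzer) {
        throw new IllegalStateException(
          "transform* should not be used during analysis; use resolveOperators* instead")
      }
    }
  }

  /** Escape hatch for the edge cases that legitimately need transform*. */
  def bypassTransformAnalyzerCheck[T](body: => T): T = {
    bypass.set(true)
    try { body } finally { bypass.set(false) }
  }
}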




On Thu, Jul 19, 2018 at 2:52 PM Reynold Xin <r...@databricks.com> wrote:

> We have had multiple bugs introduced by AnalysisBarrier. In hindsight I
> think the original design before the analysis barrier was much simpler and
> required less developer knowledge of the infrastructure.
>
> As long as the analysis barrier is there, developers writing code in the
> analyzer will have to be aware of this special node, and we are bound to
> have more bugs in the future due to people not considering it.
>
>
> Filed this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-24865
>
>
>
> AnalysisBarrier was introduced in SPARK-20392
> <https://issues.apache.org/jira/browse/SPARK-20392> to improve analysis
> speed (don't re-analyze nodes that have already been analyzed).
>
> Before AnalysisBarrier, we already had some infrastructure in place, with
> analysis-specific functions (resolveOperators and resolveExpressions).
> These functions do not recursively traverse into subplans that are already
> analyzed (tracked with a mutable boolean flag, _analyzed). The issue with
> the old system was that developers started using transformDown, which does
> a top-down traversal of the plan tree, because there was no top-down
> resolution function, and as a result analyzer performance became pretty bad.
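>
> A minimal, self-contained sketch of that skip-if-analyzed behavior (a toy
> tree, not the actual Catalyst code; the class and field names are only
> illustrative):
>
> // Toy model of the pre-AnalysisBarrier infrastructure: each node keeps a
> // mutable _analyzed flag, and resolveOperators skips analyzed subtrees.
> case class Plan(name: String, children: Seq[Plan] = Nil) {
>   var _analyzed: Boolean = false
>
>   // Bottom-up resolution: recurse into children first, then apply the rule,
>   // but only if this subtree has not been analyzed yet.
>   def resolveOperators(rule: PartialFunction[Plan, Plan]): Plan = {
>     if (_analyzed) {
>       this  // already analyzed: do not traverse or re-resolve
>     } else {
>       val newChildren = children.map(_.resolveOperators(rule))
>       rule.applyOrElse(copy(children = newChildren), identity[Plan])
>     }
>   }
> }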
>
> In order to fix the issue in SPARK-20392
> <https://issues.apache.org/jira/browse/SPARK-20392>, AnalysisBarrier was
> introduced as a special node that transform/transformUp/transformDown do
> not traverse into. However, the introduction of this special node has
> caused a lot more trouble than it solved. This implicit node breaks
> assumptions and code in a few places, and it's hard to know when an
> analysis barrier would exist and when it wouldn't. A simple search for
> AnalysisBarrier in PR discussions demonstrates that it is a source of bugs
> and additional complexity.
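>
> Roughly, the barrier is a wrapper that exposes its child's output but
> presents itself as a leaf, so the generic traversals never descend into the
> analyzed subtree underneath it; something along these lines (simplified,
> not the exact definition in Spark):
>
> import org.apache.spark.sql.catalyst.expressions.Attribute
> import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
>
> // Because it declares no children (it is a LeafNode), the generic
> // transform/transformUp/transformDown stop here and never touch `child`.
> case class AnalysisBarrier(child: LogicalPlan) extends LeafNode {
>   override def output: Seq[Attribute] = child.output
> }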
>
> Instead, I think a much simpler fix to the original issue is to introduce
> resolveOperatorsDown and change all places in the analyzer that call
> transformDown to use it. We can also ban accidental uses of the various
> transform* methods with a linter (scoped to just the analyzer packages),
> or, in test mode, inspect the stack trace and fail explicitly if a
> transform* method is called from the analyzer.
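>
> For concreteness, the proposed top-down variant in the same toy model as
> the sketch above (a method added to that toy Plan class; the real version
> would live on the Catalyst plan nodes):
>
> // Top-down counterpart: apply the rule first, then recurse into children,
> // still skipping subtrees that are already marked as analyzed.
> def resolveOperatorsDown(rule: PartialFunction[Plan, Plan]): Plan = {
>   if (_analyzed) {
>     this
>   } else {
>     val afterRule = rule.applyOrElse(this, identity[Plan])
>     afterRule.copy(children = afterRule.children.map(_.resolveOperatorsDown(rule)))
>   }
> }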
>
>
>
>
>
> On Thu, Jul 19, 2018 at 11:41 AM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> dfWithUDF.cache()
>> dfWithUDF.write.saveAsTable("t")
>> dfWithUDF.write.saveAsTable("t1")
>>
>>
>> The cached data is not being used, which causes a big performance regression.
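>>
>> A hypothetical end-to-end reproduction of the shape of this workload
>> (illustrative only; the UDF, column names, and how dfWithUDF is built are
>> made up, since they are not shown above):
>>
>> import org.apache.spark.sql.SparkSession
>> import org.apache.spark.sql.functions.udf
>>
>> val spark = SparkSession.builder().master("local[*]").appName("cache-repro").getOrCreate()
>> import spark.implicits._
>>
>> val plusOne = udf((x: Long) => x + 1)  // arbitrary UDF, just to get one into the plan
>> val dfWithUDF = spark.range(0, 1000).toDF("id").withColumn("id_plus_one", plusOne($"id"))
>>
>> dfWithUDF.cache()
>> dfWithUDF.write.saveAsTable("t")   // materializes the cached data
>> dfWithUDF.write.saveAsTable("t1")  // expected to reuse the cache, not recompute the UDF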
>>
>>
>>
>>
>> 2018-07-19 11:32 GMT-07:00 Sean Owen <sro...@gmail.com>:
>>
>>> What regression are you referring to here? A -1 vote really needs a
>>> rationale.
>>>
>>> On Thu, Jul 19, 2018 at 1:27 PM Xiao Li <gatorsm...@gmail.com> wrote:
>>>
>>>> I would first vote -1.
>>>>
>>>> I might have found another regression caused by the analysis barrier. Will
>>>> keep you posted.
>>>>
>>>>
>>
