Please file a ticket and cc me. Thanks.

On Thu, Sep 17, 2015 at 11:20 PM, Mingyu Kim <m...@palantir.com> wrote:

> That sounds good. The optimizer should not change the behavior of
> execution, and reordering the filters can easily result in errors, as
> exemplified below. I agree that the optimizer should not reorder the
> filters, for correctness. Please correct me if I have an incorrect
> assumption about the guarantees the optimizer provides.
>
> Is there a bug filed that tracks the change you suggested below, btw? I’d
> like to follow the issue, if there’s one.
>
> Thanks,
> Mingyu
>
> From: Reynold Xin
> Date: Wednesday, September 16, 2015 at 1:17 PM
> To: Zack Sampson
> Cc: "dev@spark.apache.org", Mingyu Kim, Peter Faiman, Matt Cheah, Michael
> Armbrust
>
> Subject: Re: And.eval short circuiting
>
> This is "expected" in the sense that DataFrame operations can get
> re-ordered under the hood by the optimizer. For example, if the optimizer
> deems it is cheaper to apply the 2nd filter first, it might re-arrange the
> filters. In reality, it doesn't do that. I think this is too confusing and
> violates principle of least astonishment, so we should fix it.
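>
> To make that concrete, here is a hypothetical illustration (the column
> names and expensiveUdf are made up): the optimizer treats the two chained
> filters below as one merged predicate, so in principle it may pick either
> form.
>
> // The user writes:
> df.filter($"a" > 0).filter(expensiveUdf($"b"))
> // The optimizer is free to plan it as:
> df.filter(expensiveUdf($"b") && $"a" > 0)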
>
> I discussed more with Michael offline, and think we can add a rule for the
> physical filter operator to replace the general AND/OR/equality/etc with a
> special version that treats null as false. This rule needs to be carefully
> written because it should only apply to subtrees of AND/OR/equality/etc
> (e.g. it shouldn't rewrite children of isnull).
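>
> A rough sketch of the shape such a rule could take, purely illustrative
> (NullPredicateAsFalse is a made-up name, it is written against the
> logical Filter for brevity, and the blanket coalesce-to-false rewrite
> stands in for the finer-grained AND/OR rewrite described above):
>
> import org.apache.spark.sql.catalyst.expressions.{Coalesce, Literal}
> import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
> import org.apache.spark.sql.catalyst.rules.Rule
>
> // Sketch only: treat a null filter result as false by wrapping the whole
> // condition in COALESCE(cond, false). The real rule must recurse only
> // into AND/OR/equality subtrees and leave children of isnull untouched.
> object NullPredicateAsFalse extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
>     case Filter(cond, child) =>
>       Filter(Coalesce(Seq(cond, Literal(false))), child)
>   }
> }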
>
>
> On Tue, Sep 15, 2015 at 1:09 PM, Zack Sampson <zsamp...@palantir.com>
> wrote:
>
>> I see. We're having problems with code like this (forgive my noob scala):
>>
>> import sqlContext.implicits._
>> import org.apache.spark.sql.functions.callUDF
>> import org.apache.spark.sql.types.BooleanType
>>
>> val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
>> df
>>   .filter($"animals".rlike(".*"))
>>   .filter(callUDF((value: String) => value.length > 2, BooleanType, $"animals"))
>>   .collect()
>>
>> This code throws an NPE because:
>> * Catalyst combines the two filters into a single AND
>> * the first filter's predicate returns null (not false) on the row where
>> animals is null
>> * the second filter's UDF then tries to read the length of that null
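>>
>> For now we can work around it by null-guarding up front (a sketch; note
>> it relies on And evaluating its left child first, as the current
>> And.eval does):
>>
>> df
>>   .filter($"animals".isNotNull && $"animals".rlike(".*"))
>>   .filter(callUDF((value: String) => value.length > 2, BooleanType, $"animals"))
>>   .collect()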
>>
>> This feels weird. Reading that code, I wouldn't expect null to be passed
>> to the second filter. Even weirder: if you call collect() after the first
>> filter, you won't see nulls, and if you write the data to disk and
>> re-read it, the NPE won't happen.
>>
>> It's bewildering! Is this the intended behavior?
>> ------------------------------
>> From: Reynold Xin [r...@databricks.com]
>> Sent: Monday, September 14, 2015 10:14 PM
>> To: Zack Sampson
>> Cc: dev@spark.apache.org
>> Subject: Re: And.eval short circuiting
>>
>> rxin=# select null and true;
>>  ?column?
>> ----------
>>
>> (1 row)
>>
>> rxin=# select null and false;
>>  ?column?
>> ----------
>>  f
>> (1 row)
>>
>>
>> null and false should return false.
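>>
>> The same check at the DataFrame level (a sketch; assumes the usual
>> sqlContext.implicits._ import for toDF):
>>
>> import org.apache.spark.sql.functions.lit
>>
>> Seq(1).toDF("x").select(
>>   (lit(null).cast("boolean") && lit(true)).as("null_and_true"),   // null
>>   (lit(null).cast("boolean") && lit(false)).as("null_and_false")  // false
>> ).show()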
>>
>>
>> On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com>
>> wrote:
>>
>>> It seems like And.eval could avoid evaluating right.eval when left.eval
>>> returns null. Is there a reason it's written the way it is?
>>>
>>> override def eval(input: Row): Any = {
>>>   val l = left.eval(input)
>>>   if (l == false) {
>>>     false
>>>   } else {
>>>     val r = right.eval(input)
>>>     if (r == false) {
>>>       false
>>>     } else {
>>>       if (l != null && r != null) {
>>>         true
>>>       } else {
>>>         null
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>>
>>
>
