This is "expected" in the sense that DataFrame operations can get
re-ordered under the hood by the optimizer. For example, if the optimizer
deems it is cheaper to apply the 2nd filter first, it might re-arrange the
filters. In reality, it doesn't do that. I think this is too confusing and
violates principle of least astonishment, so we should fix it.

I discussed this more with Michael offline, and I think we can add a rule for
the physical filter operator to replace the general AND/OR/equality/etc. with
a special version that treats null as false. This rule needs to be written
carefully, because it should only apply to subtrees of AND/OR/equality/etc.
(e.g. it shouldn't rewrite the children of isnull).
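
To make the shape of that rule a bit more concrete, here is a rough,
standalone sketch over a toy expression tree (all of the names below are made
up for illustration -- they are not the actual Catalyst classes or the real
rule):

// Toy expression tree standing in for Catalyst expressions; every name here
// is illustrative only.
sealed trait Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class IsNull(child: Expr) extends Expr
case class Leaf(name: String) extends Expr // columns, UDF calls, etc.

// Hypothetical null-as-false variants that only the physical filter would use.
case class AndNullAsFalse(left: Expr, right: Expr) extends Expr
case class OrNullAsFalse(left: Expr, right: Expr) extends Expr
case class EqualToNullAsFalse(left: Expr, right: Expr) extends Expr

// Rewrite only subtrees built from AND/OR/equality. Anything else (IsNull,
// UDFs, ...) is left untouched, so IsNull's child is never rewritten.
def nullAsFalse(e: Expr): Expr = e match {
  case And(l, r)     => AndNullAsFalse(nullAsFalse(l), nullAsFalse(r))
  case Or(l, r)      => OrNullAsFalse(nullAsFalse(l), nullAsFalse(r))
  case EqualTo(l, r) => EqualToNullAsFalse(l, r) // compare as-is, treat a null result as false
  case other         => other
}

The idea is that the physical filter evaluates the null-as-false variants,
which are free to short-circuit (e.g. AND can return false as soon as either
side is false or null) without changing which rows survive the filter.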


On Tue, Sep 15, 2015 at 1:09 PM, Zack Sampson <zsamp...@palantir.com> wrote:

> I see. We're having problems with code like this (forgive my noob scala):
>
> val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
> df
>   .filter($"animals".rlike(".*"))
>   .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType, $"animals"))
>   .collect()
>
> This code throws an NPE because:
> * Catalyst combines the filters with an AND
> * the first filter returns null on the row where animals is null
> * the second filter tries to read the length of that null
>
> This feels weird. Reading that code, I wouldn't expect null to be passed
> to the second filter. Even weirder is that if you call collect() after the
> first filter you won't see nulls, and if you write the data to disk and
> reread it, the NPE won't happen.
>
> It's bewildering! Is this the intended behavior?
> ------------------------------
> *From:* Reynold Xin [r...@databricks.com]
> *Sent:* Monday, September 14, 2015 10:14 PM
> *To:* Zack Sampson
> *Cc:* dev@spark.apache.org
> *Subject:* Re: And.eval short circuiting
>
> rxin=# select null and true;
>  ?column?
> ----------
>
> (1 row)
>
> rxin=# select null and false;
>  ?column?
> ----------
>  f
> (1 row)
>
>
> null and false should return false.
>
>
> On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com>
> wrote:
>
>> It seems like And.eval can avoid calculating right.eval if left.eval
>> returns null. Is there a reason it's written like it is?
>>
>> override def eval(input: Row): Any = {
>>   val l = left.eval(input)
>>   if (l == false) {
>>     false
>>   } else {
>>     val r = right.eval(input)
>>     if (r == false) {
>>       false
>>     } else {
>>       if (l != null && r != null) {
>>         true
>>       } else {
>>         null
>>       }
>>     }
>>   }
>> }
>>
>>
>
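
For anyone skimming the thread, here is a tiny standalone sketch of the
three-valued AND semantics discussed above, with Option[Boolean] standing in
for a nullable boolean (just an illustration, not the Catalyst code):

// Toy model of SQL's three-valued AND; None stands in for NULL.
def sqlAnd(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] = (l, r) match {
  case (Some(false), _) | (_, Some(false)) => Some(false) // false AND anything = false
  case (Some(true), Some(true))            => Some(true)
  case _                                   => None        // null AND true = null, null AND null = null
}

assert(sqlAnd(None, Some(true))  == None)        // matches "select null and true"  -> null
assert(sqlAnd(None, Some(false)) == Some(false)) // matches "select null and false" -> f

This is why And.eval above can't skip right.eval when left.eval returns null:
null AND false still has to come out false.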
