That sounds good. I think the optimizer should not change the observable behavior of 
execution, and reordering the filters can easily result in errors, as exemplified 
below. So I agree that, for correctness, the optimizer should not reorder the 
filters. Please correct me if I have an incorrect assumption about the 
guarantees of the optimizer.

Is there a bug filed that tracks the change you suggested below, btw? I’d like 
to follow the issue, if there’s one.

Thanks,
Mingyu

From:  Reynold Xin
Date:  Wednesday, September 16, 2015 at 1:17 PM
To:  Zack Sampson
Cc:  "dev@spark.apache.org", Mingyu Kim, Peter Faiman, Matt Cheah, Michael 
Armbrust
Subject:  Re: And.eval short circuiting

This is "expected" in the sense that DataFrame operations can get re-ordered 
under the hood by the optimizer. For example, if the optimizer deems it is 
cheaper to apply the 2nd filter first, it might re-arrange the filters. In 
reality, it doesn't do that. I think this is too confusing and violates the 
principle of least astonishment, so we should fix it.

I discussed this more with Michael offline, and I think we can add a rule for the 
physical filter operator to replace the general AND/OR/equality/etc with a 
special version that treats null as false. This rule needs to be carefully 
written because it should only apply to subtrees of AND/OR/equality/etc (e.g. 
it shouldn't rewrite children of isnull).
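
A rough, self-contained sketch of the idea (the names here are made up and this is 
not actual Catalyst code -- the real rule would be written against the Expression 
classes and restricted to the AND/OR/equality subtrees described above):

sealed trait Expr { def eval(row: Map[String, Any]): Any }

case class Literal(value: Any) extends Expr {
  def eval(row: Map[String, Any]): Any = value
}

// Hypothetical "null as false" AND, only valid at the root of a filter condition.
case class FilterAnd(left: Expr, right: Expr) extends Expr {
  def eval(row: Map[String, Any]): Any = {
    val l = left.eval(row)
    if (l == null || l == false) {
      false                    // null is treated as false, so we can short circuit here
    } else {
      right.eval(row) == true  // a null on the right also becomes false
    }
  }
}

object NullAsFalseDemo extends App {
  println(FilterAnd(Literal(null), Literal(true)).eval(Map.empty))  // false; right side never evaluated
  println(FilterAnd(Literal(true), Literal(null)).eval(Map.empty))  // false
}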


On Tue, Sep 15, 2015 at 1:09 PM, Zack Sampson <zsamp...@palantir.com> wrote:
I see. We're having problems with code like this (forgive my noob scala):
val df = Seq(("moose","ice"), (null,"fire")).toDF("animals", "elements")
df
  .filter($"animals".rlike(".*"))
  .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType, $"animals"))
  .collect()
This code throws an NPE because:
* Catalyst combines the two filters into a single filter with an AND
* the rlike filter returns null (not false) for the row where animals is null
* since the AND doesn't short circuit on null, the UDF in the second filter still 
runs and tries to read the length of that null value
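
As a stopgap on our side (just a sketch against the snippet above, assuming it's 
acceptable to drop the null rows explicitly), guarding both predicates so that 
neither can produce a null avoids the NPE no matter how the filters get combined:

df
  .filter($"animals".isNotNull && $"animals".rlike(".*"))
  .filter(callUDF({ (value: String) => value != null && value.length > 2 },
    BooleanType, $"animals"))
  .collect()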

This feels weird. Reading that code, I wouldn't expect null to be passed to the 
second filter. Even weirder is that if you call collect() after the first 
filter you won't see nulls, and if you write the data to disk and reread it, 
the NPE won't happen.

It's bewildering! Is this the intended behavior?
From: Reynold Xin [r...@databricks.com]
Sent: Monday, September 14, 2015 10:14 PM
To: Zack Sampson
Cc: dev@spark.apache.org
Subject: Re: And.eval short circuiting

rxin=# select null and true;
 ?column? 
----------
 
(1 row)

rxin=# select null and false;
 ?column? 
----------
 f
(1 row)


null and false should return false.
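
To spell out the truth table behind that, here is a tiny standalone Scala sketch of 
SQL's three-valued AND (None standing in for NULL; illustration only, not Spark code):

object ThreeValuedAnd extends App {
  def and3(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] = (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // false wins, even against NULL
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None        // TRUE AND NULL, NULL AND NULL => NULL
  }

  println(and3(None, Some(true)))   // None        (matches the first psql result: NULL)
  println(and3(None, Some(false)))  // Some(false) (matches the second: f)
}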


On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com> wrote:
It seems like And.eval can avoid calculating right.eval if left.eval returns 
null. Is there a reason it's written like it is? 

override def eval(input: Row): Any = {
  val l = left.eval(input)
  if (l == false) {
    false                      // short circuits only when the left side is false
  } else {
    val r = right.eval(input)  // evaluated even when l is null
    if (r == false) {
      false
    } else {
      if (l != null && r != null) {
        true
      } else {
        null                   // null AND true, true AND null, null AND null
      }
    }
  }
}


