This is "expected" in the sense that DataFrame operations can get re-ordered under the hood by the optimizer. For example, if the optimizer deems it is cheaper to apply the 2nd filter first, it might re-arrange the filters. In reality, it doesn't do that. I think this is too confusing and violates principle of least astonishment, so we should fix it.
I discussed more with Michael offline, and think we can add a rule for the physical filter operator to replace the general AND/OR/equality/etc with a special version that treats null as false. This rule needs to be carefully written because it should only apply to subtrees of AND/OR/equality/etc (e.g. it shouldn't rewrite children of isnull).

On Tue, Sep 15, 2015 at 1:09 PM, Zack Sampson <zsamp...@palantir.com> wrote:

> I see. We're having problems with code like this (forgive my noob Scala):
>
>   val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
>   df
>     .filter($"animals".rlike(".*"))
>     .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType,
>       $"animals"))
>     .collect()
>
> This code throws an NPE because:
> * Catalyst combines the filters with an AND
> * the first filter returns null on the null input
> * the second filter tries to read the length of that null
>
> This feels weird. Reading that code, I wouldn't expect null to be passed
> to the second filter. Even weirder is that if you call collect() after the
> first filter you won't see nulls, and if you write the data to disk and
> reread it, the NPE won't happen.
>
> It's bewildering! Is this the intended behavior?
> ------------------------------
> *From:* Reynold Xin [r...@databricks.com]
> *Sent:* Monday, September 14, 2015 10:14 PM
> *To:* Zack Sampson
> *Cc:* dev@spark.apache.org
> *Subject:* Re: And.eval short circuiting
>
> rxin=# select null and true;
>  ?column?
> ----------
>
> (1 row)
>
> rxin=# select null and false;
>  ?column?
> ----------
>  f
> (1 row)
>
> null and false should return false.
>
> On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com>
> wrote:
>
>> It seems like And.eval can avoid calculating right.eval if left.eval
>> returns null. Is there a reason it's written like it is?
>>
>>   override def eval(input: Row): Any = {
>>     val l = left.eval(input)
>>     if (l == false) {
>>       false
>>     } else {
>>       val r = right.eval(input)
>>       if (r == false) {
>>         false
>>       } else {
>>         if (l != null && r != null) {
>>           true
>>         } else {
>>           null
>>         }
>>       }
>>     }
>>   }
>>
>
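[Editor's note: a small plain-Scala sketch — not Spark source — of the two semantics in play, with `Option[Boolean]` and `None` standing in for SQL NULL. `sqlAnd` shows why the eval above cannot skip the right side when the left is null (null AND false = false), and `filterAnd` is one possible shape for the null-as-false variant Reynold proposes for the physical filter operator; both names are made up for illustration.]

```scala
// SQL three-valued AND: false dominates, otherwise any NULL gives NULL.
def sqlAnd(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] =
  (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // false AND x = false
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None        // NULL otherwise
  }

// Null-as-false variant for a WHERE clause: a NULL predicate rejects the
// row anyway, so a non-true left operand can short-circuit to false and
// the right side (by-name) is never evaluated.
def filterAnd(l: Option[Boolean], r: => Option[Boolean]): Boolean =
  l.contains(true) && r.contains(true)
```

Note how `sqlAnd(None, Some(false))` must still inspect the right operand to produce false, while `filterAnd(None, ...)` can bail out immediately — which is exactly the behavioral difference that makes a filter-only rewrite rule attractive.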