I filed SPARK-10703. Thanks!

Mingyu
From: Reynold Xin
Date: Thursday, September 17, 2015 at 11:22 PM
To: Mingyu Kim
Cc: Zack Sampson, "dev@spark.apache.org", Peter Faiman, Matt Cheah, Michael Armbrust
Subject: Re: And.eval short circuiting

Please file a ticket and cc me. Thanks.

On Thu, Sep 17, 2015 at 11:20 PM, Mingyu Kim <m...@palantir.com> wrote:

That sounds good. The optimizer should not change the behavior of execution, and reordering the filters can easily result in errors, as exemplified below, so I agree that the optimizer should not reorder the filters, for correctness. Please correct me if I have an incorrect assumption about the guarantees of the optimizer.

Is there a bug filed that tracks the change you suggested below, by the way? I'd like to follow the issue, if there is one.

Thanks,
Mingyu

From: Reynold Xin
Date: Wednesday, September 16, 2015 at 1:17 PM
To: Zack Sampson
Cc: "dev@spark.apache.org", Mingyu Kim, Peter Faiman, Matt Cheah, Michael Armbrust
Subject: Re: And.eval short circuiting

This is "expected" in the sense that DataFrame operations can get reordered under the hood by the optimizer. For example, if the optimizer deems it cheaper to apply the second filter first, it might rearrange the filters. In reality, it doesn't do that. I think this is too confusing and violates the principle of least astonishment, so we should fix it.

I discussed this more with Michael offline, and we think we can add a rule for the physical filter operator that replaces the general AND/OR/equality/etc. with a special version that treats null as false. This rule needs to be written carefully, because it should only apply to subtrees of AND/OR/equality/etc. (e.g. it shouldn't rewrite the children of isnull).

On Tue, Sep 15, 2015 at 1:09 PM, Zack Sampson <zsamp...@palantir.com> wrote:

I see. We're having problems with code like this (forgive my noob Scala):

val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")
df
  .filter($"animals".rlike(".*"))
  .filter(callUDF({ (value: String) => value.length > 2 }, BooleanType, $"animals"))
  .collect()

This code throws an NPE because:

* Catalyst combines the filters with an AND
* the first filter returns null on the row where "animals" is null
* the second filter tries to read the length of that null

This feels weird. Reading that code, I wouldn't expect null to be passed to the second filter. Even weirder: if you call collect() after the first filter you won't see nulls, and if you write the data to disk and reread it, the NPE won't happen. It's bewildering! Is this the intended behavior?

From: Reynold Xin [r...@databricks.com]
Sent: Monday, September 14, 2015 10:14 PM
To: Zack Sampson
Cc: dev@spark.apache.org
Subject: Re: And.eval short circuiting

rxin=# select null and true;
 ?column?
----------

(1 row)

rxin=# select null and false;
 ?column?
----------
 f
(1 row)

null and false should return false.

On Mon, Sep 14, 2015 at 9:12 PM, Zack Sampson <zsamp...@palantir.com> wrote:

It seems like And.eval could avoid calculating right.eval if left.eval returns null. Is there a reason it's written the way it is?

override def eval(input: Row): Any = {
  val l = left.eval(input)
  if (l == false) {
    false
  } else {
    val r = right.eval(input)
    if (r == false) {
      false
    } else {
      if (l != null && r != null) {
        true
      } else {
        null
      }
    }
  }
}
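For reference, the behavior Reynold quotes from Postgres is SQL's three-valued logic, and it explains why And.eval must still evaluate the right side when the left is null: a false on either side dominates a NULL. Below is a minimal standalone sketch of that truth table (my own illustration, not Spark's implementation), modeling SQL NULL as None:

object ThreeValuedAnd {
  // SQL three-valued AND, with None standing in for SQL NULL.
  // false dominates NULL, which is why the right operand must be
  // evaluated even when the left operand is NULL.
  def and(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] =
    (l, r) match {
      case (Some(false), _)         => Some(false) // false AND x = false
      case (_, Some(false))         => Some(false) // x AND false = false, even for NULL
      case (Some(true), Some(true)) => Some(true)
      case _                        => None        // every remaining case involves NULL
    }

  def main(args: Array[String]): Unit = {
    println(and(None, Some(true)))  // None:        NULL AND true  = NULL
    println(and(None, Some(false))) // Some(false): NULL AND false = false
    println(and(None, None))        // None:        NULL AND NULL  = NULL
  }
}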
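Until a fix like the null-as-false rewrite lands, one way to avoid the NPE in Zack's example is to make the predicate itself null-safe, so it stays correct however the optimizer orders or merges the filters. A sketch under that assumption (the SparkSession setup and the longEnough name are illustrative, not from the thread, and use the newer API rather than the 1.5-era SQLContext):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Illustrative setup; any existing session would do.
val spark = SparkSession.builder().master("local[*]").appName("null-safe-filter").getOrCreate()
import spark.implicits._

val df = Seq(("moose", "ice"), (null, "fire")).toDF("animals", "elements")

// A null-tolerant UDF plus an explicit isNotNull guard: the combined
// predicate never dereferences null, no matter how Catalyst combines
// or reorders the two conditions.
val longEnough = udf((value: String) => value != null && value.length > 2)

df.filter($"animals".isNotNull && longEnough($"animals")).show()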