Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
Hi guys, Another point: if this is unsupported, shouldn't it throw an exception instead of giving the wrong answer? I mean, if d1.join(d2, "id").select(d2("label")) should not work at all, the proper behaviour is to throw an AnalysisException. It now returns a wrong answer though. As I sai…
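
A minimal sketch of the behaviour being described, reusing the base/d1/d2 setup from Jerry's earlier message further down the thread (Spark 1.5/1.6-era shell assumed):

    // d1 and d2 both derive from `base`, so d2("label") carries the same
    // attribute id as d1's label column.
    d1.join(d2, "id").select(d2("label")).show()
    // Per the thread: this resolves to d1's label and silently returns a
    // wrong answer, rather than failing with an AnalysisException.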

Re: [Spark SQL] Unexpected Behaviour

2016-03-29 Thread Jerry Lam
Hi Divya, This is not a self-join. d1 and d2 contain totally different rows. They are derived from the same table. The transformations that are applied to generate d1 and d2 should be able to disambiguate the labels in question. Best Regards, Jerry On Tue, Mar 29, 2016 at 2:43 AM, Divya Ge…

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Hi guys, I have another example to illustrate the issue. I think the problem is pretty nasty. val base = sc.parallelize((0 to 49).zip(0 to 49) ++ (30 to 79).zip(50 to 99)).toDF("id", "label") val d1 = base.where($"label" < 60) val d2 = base.where($"label" === 60) d1.join(d2, "id").show +---+---…
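
The truncated snippet above, reconstructed as a runnable spark-shell (Spark 1.5.x) session; the cut-off predicate on d2 is kept as === 60 exactly as it appears in the archive:

    val base = sc.parallelize(
      (0 to 49).zip(0 to 49) ++ (30 to 79).zip(50 to 99)
    ).toDF("id", "label")

    val d1 = base.where($"label" < 60)
    val d2 = base.where($"label" === 60)

    d1.join(d2, "id").show()
    // Only id 40 survives the join: d1 contributes (40, 40) and d2
    // contributes (40, 60), so the joined row carries two `label` columns
    // distinguishable only by which parent DataFrame they came from --
    // exactly where the ambiguity bites.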

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Hi Sunitha, Thank you for the Jira reference. It looks like this is the bug I'm hitting. Most of the bugs related to this seem to be associated with DataFrames derived from the same DataFrame (base in this case). In SQL, this is a self-join, and dropping d2.label should not affect d1.label. There are o…
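
For contrast, a sketch of the SQL analogue Jerry describes, run through sqlContext (table name and column aliases assumed, not from the original post): explicit aliases keep the two sides of the self-join apart, which is what the DataFrame API fails to do when d1 and d2 share lineage.

    // Assumes the `base` DataFrame from the message above.
    base.registerTempTable("base")  // Spark 1.5.x API

    sqlContext.sql("""
      SELECT d1.label AS d1_label, d2.label AS d2_label
      FROM base d1 JOIN base d2 ON d1.id = d2.id
      WHERE d1.label < 60 AND d2.label = 60
    """).show()
    // d1.label and d2.label are unambiguous here; dropping one cannot
    // affect the other.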

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Sunitha Kambhampati
Hi Jerry, I think you are running into an issue similar to SPARK-14040 https://issues.apache.org/jira/browse/SPARK-14040 One way to resolve it is to use an alias. Here is an example that I tried on trunk and I do not see any exceptions. v…
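
Sunitha's example is cut off in the archive; a minimal sketch of the alias workaround she describes (variable names assumed, not from her post):

    // Aliasing gives Catalyst distinct qualifiers to resolve against,
    // even though d1 and d2 share lineage.
    val d1a = d1.as("d1")
    val d2a = d2.as("d2")

    d1a.join(d2a, $"d1.id" === $"d2.id")
       .select($"d2.label")
       .show()
    // $"d2.label" now unambiguously refers to the right-hand side.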

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Alexander Krasnukhin
You drop the label column and later you try to select it. It won't find it, indeed. -- Alexander aka Six-Hat-Thinker > On 28 Mar 2016, at 23:34, Jerry Lam wrote: > Hi spark users and developers, > I'm using Spark 1.5.1 (I have no choice because this is what we used). I ran into some very un…
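
A hypothetical sketch of the reading Alexander offers (not Jerry's actual code, which is truncated below; the drop(Column) overload is assumed available in the 1.5.x API):

    val joined = d1.join(d2, d1("id") === d2("id")).drop(d2("label"))
    joined.select(d2("label")).show()
    // With shared lineage, drop(d2("label")) can remove the wrong `label`
    // column, and the later select may then resolve against a column that
    // is no longer there -- or against d1's.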

Re: [Spark SQL] Unexpected Behaviour

2016-03-28 Thread Mich Talebzadeh
Hi Jerry, What do you expect the outcome to be? This is Spark 1.6.1; I see this without dropping d2! scala> d1.join(d2, d1("id") === d2("id"), "left_outer").select(d1("label")).collect res15: Array[org.apache.spark.sql.Row] = Array([0], [0], [0], …

[Spark SQL] Unexpected Behaviour

2016-03-28 Thread Jerry Lam
Hi spark users and developers, I'm using Spark 1.5.1 (I have no choice because this is what we used). I ran into some very unexpected behaviour when I did some join operations lately. I cannot post my actual code here, and the following code is not for practical purposes, but it should demonstrate th…