Silly question? If you change the predicate to (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)
What do you get? Sorry if the syntax isn't 100% correct. The idea is to not drop null values from the query. I would imagine that this shouldn't kill performance since it's most likely a post-join filter on the result set? (Or is that just a Hive thing?)

-Mike

> On May 3, 2016, at 12:42 PM, Davies Liu <[email protected]> wrote:
>
> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >= '2016-01-03' are the root cause,
> which will filter out all the nulls from the outer join, giving the same result as an inner join.
>
> In Spark 2.0, we actually turn these joins into inner joins.
>
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <[email protected]> wrote:
>> Hi,
>>
>> Have you tried the joins without the where clause? When you use it, you are
>> filtering out all the rows with null columns in those fields. In other words,
>> you are doing an inner join in all your queries.
>>
>> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <[email protected]> wrote:
>>>
>>> Hi Kevin,
>>>
>>> Having given it a first look, I do think that you have hit something here,
>>> and this does not look quite right. I have to work on the multiple AND
>>> conditions in ON and see whether that is causing any issues.
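[Editorial note] Davies' point can be sketched without Spark: in a full outer join, unmatched rows carry NULLs, a plain comparison against NULL is never true, so a WHERE filter on both sides silently drops the outer rows. A minimal plain-Scala sketch with made-up keys and dates (`Option` standing in for a nullable SQL column; names here are illustrative, not from the thread):

```scala
// Plain-Scala sketch (no Spark; made-up data) of outer-join + filter semantics.
// Option[String] stands in for a nullable SQL column: None plays the role of NULL.
object OuterJoinFilterDemo {
  // (key, date) rows; "c" exists only on the left, "d" only on the right
  val left  = Seq(("a", "2016-01-04"), ("b", "2016-01-04"), ("c", "2016-01-05"))
  val right = Seq(("a", "2016-01-04"), ("b", "2016-01-04"), ("d", "2016-01-06"))

  // Full outer join on the key: an unmatched side comes back as None (NULL)
  val keys = (left.map(_._1) ++ right.map(_._1)).distinct
  val full: Seq[(Option[String], Option[String])] = keys.map { k =>
    (left.find(_._1 == k).map(_._2), right.find(_._1 == k).map(_._2))
  }

  // WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03':
  // exists is false on None, just as NULL >= x is never true in SQL,
  // so the outer (unmatched) rows are dropped
  val filtered = full.filter { case (s, d) =>
    s.exists(_.compareTo("2016-01-03") >= 0) && d.exists(_.compareTo("2016-01-03") >= 0)
  }

  // Null-tolerant variant, like (s.date >= '...' OR s.date IS NULL) AND
  // (d.date >= '...' OR d.date IS NULL): forall is true on None
  val nullSafe = full.filter { case (s, d) =>
    s.forall(_.compareTo("2016-01-03") >= 0) && d.forall(_.compareTo("2016-01-03") >= 0)
  }

  def main(args: Array[String]): Unit = {
    println(s"full outer rows: ${full.size}")    // 4: a, b, c, d
    println(s"plain filter:    ${filtered.size}") // 2: only matched keys survive (inner join)
    println(s"null-safe:       ${nullSafe.size}") // 4: outer rows kept
  }
}
```

The null-safe variant is exactly Mike's suggested predicate: it lets the unmatched rows (whose filter column is NULL) survive the WHERE clause.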
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <[email protected]> wrote:
>>>>
>>>> Davies,
>>>>
>>>> Here is the code that I am typing into the spark-shell, along with the
>>>> results (my question is at the bottom):
>>>>
>>>> val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
>>>> val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")
>>>>
>>>> dps.count
>>>> res0: Long = 42694
>>>>
>>>> swig.count
>>>> res1: Long = 42034
>>>>
>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>>
>>>> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
>>>> res4: Long = 42666
>>>>
>>>> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
>>>> res5: Long = 34131
>>>>
>>>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
>>>> res6: Long = 42533
>>>>
>>>> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
>>>> res7: Long = 34131
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res9: Long = 23809
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res10: Long = 23809
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res11: Long = 23809
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res12: Long = 23809
>>>>
>>>> From my results above, note that the counts of distinct values for the join and
>>>> filter criteria of each individual table are at res6 and res7. My question is:
>>>> why is the outer join producing fewer rows than the smaller table? If there is no
>>>> match, the outer join should still bring in that row. For the full and right
>>>> outer joins I am expecting to see a minimum of res6 rows, but I get fewer. Is
>>>> there something specific that I am missing here? I am expecting the full outer
>>>> join to give me the union of the two table sets, so I expect at least 42533
>>>> rows, not 23809.
>>>>
>>>> Gourav,
>>>>
>>>> I just ran this result set on a new session with slightly newer data...
>>>> still seeing those results.
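[Editorial note] Kevin's expectation rests on a row-count identity: with the join keys distinct on each side, a full outer join returns matched rows plus left-only rows plus right-only rows, so its count can never fall below the larger side. A tiny sketch with hypothetical key sets (the data here is made up, not Kevin's):

```scala
// Sketch (made-up key sets) of the row-count identity behind the expectation:
// with distinct join keys, full outer = matched + left-only + right-only,
// so the full outer count can never be below the larger side.
object OuterJoinCounts {
  val leftKeys  = Set("a", "b", "c", "e") // e.g. distinct (date, account, ad) on the left
  val rightKeys = Set("a", "b", "d")      // distinct keys on the right

  val matched   = (leftKeys intersect rightKeys).size // rows present on both sides
  val leftOnly  = (leftKeys diff rightKeys).size      // join produces NULLs on the right
  val rightOnly = (rightKeys diff leftKeys).size      // join produces NULLs on the left
  val fullOuter = matched + leftOnly + rightOnly

  def main(args: Array[String]): Unit = {
    println(s"full outer  = $fullOuter")                           // 5
    println(s"larger side = ${leftKeys.size max rightKeys.size}")  // 4
  }
}
```

Getting 23809 rows from a full outer join when one side alone has 42533 distinct keys is only possible if a later filter (here the WHERE clause) removes rows after the join, which is exactly what the date predicates do.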
>>>>
>>>> Thanks,
>>>>
>>>> KP
>>>>
>>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <[email protected]> wrote:
>>>>>
>>>>> As @Gourav said, all the joins with different join types show the same results,
>>>>> which means that every row from the left matched at least one row from the right,
>>>>> and every row from the right matched at least one row from the left, even though
>>>>> the number of rows on the left does not equal that on the right.
>>>>>
>>>>> This is a correct result.
>>>>>
>>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <[email protected]> wrote:
>>>>>> Yong,
>>>>>>
>>>>>> Sorry, let me explain my deduction; it is going to be difficult to get sample
>>>>>> data out since the dataset I am using is proprietary.
>>>>>>
>>>>>> From the above set of queries (the ones mentioned in the comments above), both
>>>>>> the inner and outer joins are producing the same counts. They are basically
>>>>>> pulling out selected columns, but there is no roll-up happening or anything
>>>>>> else that would possibly make it suspicious that there is any difference
>>>>>> besides the type of join. The tables are matched on three keys that are present
>>>>>> in both tables (ad, account, and date); on top of this they are filtered by
>>>>>> date being on or after 2016-01-03.
>>>>>> Since all the joins are producing the same counts, the natural suspicion is
>>>>>> that the tables are identical, but when I run the following two queries:
>>>>>>
>>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count
>>>>>> res14: Long = 34158
>>>>>>
>>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count
>>>>>> res15: Long = 42693
>>>>>>
>>>>>> The above two queries filter the data on the same date the joins use,
>>>>>> 2016-01-03, and you can see the row counts of the two tables are different.
>>>>>> That is why I suspect something is wrong with the outer joins in Spark SQL:
>>>>>> in this situation the right and full outer joins may produce the same results,
>>>>>> but they should not be equal to the left join, and definitely not to the inner
>>>>>> join; unless I am missing something.
>>>>>>
>>>>>> Side note: in my hasty response above I posted the wrong count for dps.count;
>>>>>> the real value is res16: Long = 42694.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> KP
>>>>>>
>>>>>> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>> We are still not sure what the problem is, since you cannot show us some
>>>>>>> example data.
>>>>>>>
>>>>>>> For dps with 42632 rows and swig with 42034 rows: dps full outer join with
>>>>>>> swig on 3 columns, with additional filters, gets the same result-set row
>>>>>>> count as dps left outer join with swig on 3 columns, with additional filters,
>>>>>>> which again gets the same result-set row count as dps right outer join with
>>>>>>> swig on 3 columns, with the same additional filters.
>>>>>>>
>>>>>>> Without knowing your data, I cannot see a reason why this has to be a bug in
>>>>>>> Spark.
>>>>>>>
>>>>>>> Am I misunderstanding your bug?
>>>>>>>
>>>>>>> Yong
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: [email protected]
>>>>>>> Date: Mon, 2 May 2016 12:11:18 -0700
>>>>>>> Subject: Re: Weird results with Spark SQL Outer joins
>>>>>>> To: [email protected]
>>>>>>> CC: [email protected]
>>>>>>>
>>>>>>> Gourav,
>>>>>>>
>>>>>>> I wish that was the case, but I have done a select count on each of the two
>>>>>>> tables individually and they return different row counts:
>>>>>>>
>>>>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>>>>>
>>>>>>> dps.count()
>>>>>>> RESULT: 42632
>>>>>>>
>>>>>>> swig.count()
>>>>>>> RESULT: 42034
>>>>>>>
>>>>>>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <[email protected]> wrote:
>>>>>>>
>>>>>>> This shows that both tables have matching records and no mismatches.
>>>>>>> Therefore you obviously get the same results irrespective of whether you use
>>>>>>> a right or left join.
>>>>>>>
>>>>>>> I think that there is no problem here, unless I am missing something.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gourav
>>>>>>>
>>>>>>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <[email protected]> wrote:
>>>>>>>
>>>>>>> Also, the inner query produced the same result:
>>>>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>>>>> RESULT: 23747
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>
>> --
>> Cesar Flores
