Silly question? If you change the predicate to (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >= '2016-01-03' OR d.date IS NULL)
What do you get? Sorry if the syntax isn't 100% correct. The idea is to not drop null values from the query. I would imagine that this shouldn't kill performance since it's most likely a post-join filter on the result set? (Or is that just a Hive thing?)

-Mike

> On May 3, 2016, at 12:42 PM, Davies Liu <[email protected]> wrote:
>
> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >= '2016-01-03' are the root cause,
> which will filter out all the nulls from the outer join, giving the same result as an inner join.
>
> In Spark 2.0, we actually turn these joins into inner joins.
>
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <[email protected]> wrote:
>> Hi,
>>
>> Have you tried the joins without the where clause? When you use it, you are
>> filtering out all the rows with null columns in those fields. In other words,
>> you are doing an inner join in all your queries.
>>
>> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <[email protected]> wrote:
>>>
>>> Hi Kevin,
>>>
>>> Having given it a first look, I do think that you have hit something here,
>>> and this does not look quite right. I have to work on the multiple AND
>>> conditions in ON and see whether that is causing any issues.
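[Editorial note] Davies' point can be sketched without Spark: in a full outer join, unmatched rows carry NULLs, a plain comparison against NULL is never true, so a WHERE filter on both sides silently drops the outer rows. A minimal plain-Scala sketch with made-up keys and dates (`Option` standing in for a nullable SQL column; names here are illustrative, not from the thread):

```scala
// Plain-Scala sketch (no Spark; made-up data) of outer-join + filter semantics.
// Option[String] stands in for a nullable SQL column: None plays the role of NULL.
object OuterJoinFilterDemo {
  // (key, date) rows; "c" exists only on the left, "d" only on the right
  val left  = Seq(("a", "2016-01-04"), ("b", "2016-01-04"), ("c", "2016-01-05"))
  val right = Seq(("a", "2016-01-04"), ("b", "2016-01-04"), ("d", "2016-01-06"))

  // Full outer join on the key: an unmatched side comes back as None (NULL)
  val keys = (left.map(_._1) ++ right.map(_._1)).distinct
  val full: Seq[(Option[String], Option[String])] = keys.map { k =>
    (left.find(_._1 == k).map(_._2), right.find(_._1 == k).map(_._2))
  }

  // WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03':
  // exists is false on None, just as NULL >= x is never true in SQL,
  // so the outer (unmatched) rows are dropped
  val filtered = full.filter { case (s, d) =>
    s.exists(_.compareTo("2016-01-03") >= 0) && d.exists(_.compareTo("2016-01-03") >= 0)
  }

  // Null-tolerant variant, like (s.date >= '...' OR s.date IS NULL) AND
  // (d.date >= '...' OR d.date IS NULL): forall is true on None
  val nullSafe = full.filter { case (s, d) =>
    s.forall(_.compareTo("2016-01-03") >= 0) && d.forall(_.compareTo("2016-01-03") >= 0)
  }

  def main(args: Array[String]): Unit = {
    println(s"full outer rows: ${full.size}")    // 4: a, b, c, d
    println(s"plain filter:    ${filtered.size}") // 2: only matched keys survive (inner join)
    println(s"null-safe:       ${nullSafe.size}") // 4: outer rows kept
  }
}
```

The null-safe variant is exactly Mike's suggested predicate: it lets the unmatched rows (whose filter column is NULL) survive the WHERE clause.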
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <[email protected]> wrote:
>>>>
>>>> Davies,
>>>>
>>>> Here is the code that I am typing into the spark-shell, along with the
>>>> results (my question is at the bottom):
>>>>
>>>> val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/")
>>>> val swig = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/swig_csv/")
>>>>
>>>> dps.count
>>>> res0: Long = 42694
>>>>
>>>> swig.count
>>>> res1: Long = 42034
>>>>
>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>>
>>>> sqlContext.sql("select * from dps_pin_promo_lt where date > '2016-01-03'").count
>>>> res4: Long = 42666
>>>>
>>>> sqlContext.sql("select * from swig_pin_promo_lt where date > '2016-01-03'").count
>>>> res5: Long = 34131
>>>>
>>>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt where date > '2016-01-03'").count
>>>> res6: Long = 42533
>>>>
>>>> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt where date > '2016-01-03'").count
>>>> res7: Long = 34131
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res9: Long = 23809
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res10: Long = 23809
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res11: Long = 23809
>>>>
>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res12: Long = 23809
>>>>
>>>> From my results above, note that the counts of distinct values for the join and
>>>> filter criteria of each individual table are at res6 and res7. My question is:
>>>> why is the outer join producing fewer rows than the smaller table? If there is no
>>>> match, the outer join should still bring in that row. For the full and right
>>>> outer joins I am expecting to see a minimum of res6 rows, but I get fewer. Is
>>>> there something specific that I am missing here? I am expecting the full outer
>>>> join to give me the union of the two table sets, so I expect at least 42533
>>>> rows, not 23809.
>>>>
>>>> Gourav,
>>>>
>>>> I just ran this result set on a new session with slightly newer data...
>>>> still seeing those results.
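[Editorial note] Kevin's expectation rests on a row-count identity: with the join keys distinct on each side, a full outer join returns matched rows plus left-only rows plus right-only rows, so its count can never fall below the larger side. A tiny sketch with hypothetical key sets (the data here is made up, not Kevin's):

```scala
// Sketch (made-up key sets) of the row-count identity behind the expectation:
// with distinct join keys, full outer = matched + left-only + right-only,
// so the full outer count can never be below the larger side.
object OuterJoinCounts {
  val leftKeys  = Set("a", "b", "c", "e") // e.g. distinct (date, account, ad) on the left
  val rightKeys = Set("a", "b", "d")      // distinct keys on the right

  val matched   = (leftKeys intersect rightKeys).size // rows present on both sides
  val leftOnly  = (leftKeys diff rightKeys).size      // join produces NULLs on the right
  val rightOnly = (rightKeys diff leftKeys).size      // join produces NULLs on the left
  val fullOuter = matched + leftOnly + rightOnly

  def main(args: Array[String]): Unit = {
    println(s"full outer  = $fullOuter")                           // 5
    println(s"larger side = ${leftKeys.size max rightKeys.size}")  // 4
  }
}
```

Getting 23809 rows from a full outer join when one side alone has 42533 distinct keys is only possible if a later filter (here the WHERE clause) removes rows after the join, which is exactly what the date predicates do.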
>>>>
>>>> Thanks,
>>>>
>>>> KP
>>>>
>>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <[email protected]> wrote:
>>>>>
>>>>> As @Gourav said, all the joins with different join types show the same results,
>>>>> which means that every row from the left matched at least one row from the right,
>>>>> and every row from the right matched at least one row from the left, even though
>>>>> the number of rows on the left does not equal that on the right.
>>>>>
>>>>> This is a correct result.
>>>>>
>>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <[email protected]> wrote:
>>>>>> Yong,
>>>>>>
>>>>>> Sorry, let me explain my deduction; it is going to be difficult to get sample
>>>>>> data out since the dataset I am using is proprietary.
>>>>>>
>>>>>> From the above set of queries (the ones mentioned in the comments above), both
>>>>>> the inner and outer joins are producing the same counts. They are basically
>>>>>> pulling out selected columns, but there is no roll-up happening or anything
>>>>>> else that would possibly make it suspicious that there is any difference
>>>>>> besides the type of join. The tables are matched on three keys that are present
>>>>>> in both tables (ad, account, and date); on top of this they are filtered by
>>>>>> date being on or after 2016-01-03.
>>>>>> Since all the joins are producing the same counts, the natural suspicion is
>>>>>> that the tables are identical, but when I run the following two queries:
>>>>>>
>>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date >= '2016-01-03'").count
>>>>>> res14: Long = 34158
>>>>>>
>>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date >= '2016-01-03'").count
>>>>>> res15: Long = 42693
>>>>>>
>>>>>> The above two queries filter the data on the same date the joins use,
>>>>>> 2016-01-03, and you can see the row counts of the two tables are different.
>>>>>> That is why I suspect something is wrong with the outer joins in Spark SQL:
>>>>>> in this situation the right and full outer joins may produce the same results,
>>>>>> but they should not be equal to the left join, and definitely not to the inner
>>>>>> join; unless I am missing something.
>>>>>>
>>>>>> Side note: in my hasty response above I posted the wrong count for dps.count;
>>>>>> the real value is res16: Long = 42694.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> KP
>>>>>>
>>>>>> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>> We are still not sure what the problem is, since you cannot show us some
>>>>>>> example data.
>>>>>>>
>>>>>>> For dps with 42632 rows and swig with 42034 rows: dps full outer join with
>>>>>>> swig on 3 columns, with additional filters, gets the same result-set row
>>>>>>> count as dps left outer join with swig on 3 columns, with additional filters,
>>>>>>> which again gets the same result-set row count as dps right outer join with
>>>>>>> swig on 3 columns, with the same additional filters.
>>>>>>>
>>>>>>> Without knowing your data, I cannot see a reason why this has to be a bug in
>>>>>>> Spark.
>>>>>>>
>>>>>>> Am I misunderstanding your bug?
>>>>>>>
>>>>>>> Yong
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: [email protected]
>>>>>>> Date: Mon, 2 May 2016 12:11:18 -0700
>>>>>>> Subject: Re: Weird results with Spark SQL Outer joins
>>>>>>> To: [email protected]
>>>>>>> CC: [email protected]
>>>>>>>
>>>>>>> Gourav,
>>>>>>>
>>>>>>> I wish that was the case, but I have done a select count on each of the two
>>>>>>> tables individually and they return different row counts:
>>>>>>>
>>>>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>>>>>
>>>>>>> dps.count()
>>>>>>> RESULT: 42632
>>>>>>>
>>>>>>> swig.count()
>>>>>>> RESULT: 42034
>>>>>>>
>>>>>>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <[email protected]> wrote:
>>>>>>>
>>>>>>> This shows that both tables have matching records and no mismatches.
>>>>>>> Therefore you obviously get the same results irrespective of whether you use
>>>>>>> a right or left join.
>>>>>>>
>>>>>>> I think that there is no problem here, unless I am missing something.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gourav
>>>>>>>
>>>>>>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <[email protected]> wrote:
>>>>>>>
>>>>>>> Also, the inner query produced the same result:
>>>>>>> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend, d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>>>>> RESULT: 23747
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>
>>
>> --
>> Cesar Flores
