Unfortunately you are now hitting a bug (it is fixed in master and should
be released in 1.3.1, hopefully next week).  However, even with that fix
your query is still ambiguous, so you will need to use aliases:

val df_1 = df.filter(df("event") === 0)
             .select("country", "cnt").as("a")
val df_2 = df.filter(df("event") === 3)
             .select("country", "cnt").as("b")
val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
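
One more note: the $"col" syntax needs the SQLContext implicits in scope,
and once the frames are aliased you can qualify columns by alias to tell
the two "cnt" columns apart after the join. A minimal sketch (assuming your
SQLContext is named sqlContext; the cnt_event* output names are just for
illustration):

import sqlContext.implicits._  // enables the $"col" syntax

// qualify by alias to disambiguate the two "cnt" columns in the result
val result = both.select(
  $"b.country",
  $"b.cnt".as("cnt_event3"),   // count where event === 3
  $"a.cnt".as("cnt_event0"))   // count where event === 0 (null when no match)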



On Tue, Mar 24, 2015 at 11:57 PM, S Krishna <skrishna...@gmail.com> wrote:

> Hi,
>
> Thanks for your response. I modified my code as per your suggestion, but
> now I am getting a runtime error. Here's my code:
>
> val df_1 = df.filter(df("event") === 0)
>              .select("country", "cnt")
>
> val df_2 = df.filter(df("event") === 3)
>              .select("country", "cnt")
>
> df_1.show()
> // produces the following output:
> // country   cnt
> // tw        3000
> // uk        2000
> // us        1000
>
> df_2.show()
> // produces the following output:
> // country   cnt
> // tw        25
> // uk        200
> // us        95
>
> val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer")
>
> I am getting the following error when executing the join statement:
>
> java.util.NoSuchElementException: next on empty iterator.
>
> This error seems to be originating at DataFrame.join (line 133 in
> DataFrame.scala).
>
> The show() output confirms that both dataframes have a column named
> "country" and that they are non-empty. I also tried the simpler join (i.e.
> df_2.join(df_1)) and got the same error as above.
>
> I would like to know what is wrong with the join statement above.
>
> thanks
>
> On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> You need to use `===` so that you construct a column expression instead
>> of evaluating the standard Scala equality method. Accessing columns as
>> attributes (i.e. df.country) is only supported in Python.
>>
>> val join_df =  df1.join( df2, df1("country") === df2("country"),
>> "left_outer")
>>
>> On Tue, Mar 24, 2015 at 5:50 PM, SK <skrishna...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to port some code that was working in Spark 1.2.0 to the
>>> latest version, Spark 1.3.0. The code involves a left outer join between
>>> two SchemaRDDs, which I am now trying to change to a left outer join
>>> between two DataFrames. I followed the example for a left outer join of
>>> DataFrames at
>>>
>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>
>>> Here's my code, where df1 and df2 are the two dataframes I am joining
>>> on the "country" field:
>>>
>>>  val join_df =  df1.join( df2,  df1.country == df2.country, "left_outer")
>>>
>>> But I got a compilation error saying that value country is not a member
>>> of sql.DataFrame.
>>>
>>> I also tried the following:
>>>  val join_df =  df1.join( df2, df1("country") == df2("country"),
>>> "left_outer")
>>>
>>> I got a compilation error saying that a Boolean was found where a
>>> Column is required.
>>>
>>> So what is the correct Column expression I need to provide for joining
>>> the two dataframes on a specific field?
>>>
>>> thanks
>>>
>>
>
