Re: How to distinguish columns when joining DataFrames with shared parent?

Michael Armbrust Wed, 21 Oct 2015 13:55:55 -0700

Yeah, I was suggesting that you avoid using
org.apache.spark.sql.DataFrame.apply(colName:
String) when you are working with selfjoins as it eagerly binds to a
specific column in a what that breaks when we do the rewrite of one side of
the query.  Using the apply method constructs a resolved column eagerly
(which looses the alias information).


On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <[email protected]> wrote:

> Thanks Michael and Ali for the reply!
>
> I'll make sure to use unresolved columns when working with self joins then.
>
> As pointed by Ali, isn't there still an issue with the aliasing? It works
> when using org.apache.spark.sql.functions.col(colName: String) method, but
> not when using org.apache.spark.sql.DataFrame.apply(colName: String):
>
> scala> j.select(col("lv.value")).show
> +-----+
> |value|
> +-----+
> |   10|
> |   20|
> +-----+
>
>
> scala> j.select(largeValues("lv.value")).show
> +-----+
> |value|
> +-----+
> |    1|
> |    5|
> +-----+
>
> Or does this behavior have the same root cause as detailed in Michael's
> email?
>
>
> -Isabelle
>
>
>
>
> On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust <[email protected]
> > wrote:
>
>> Unfortunately, the mechanisms that we use to differentiate columns
>> automatically don't work particularly well in the presence of self joins.
>> However, you can get it work if you use the $"column" syntax
>> consistently:
>>
>> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", 
>> "value")val smallValues = df.filter('value < 10).as("sv")val largeValues = 
>> df.filter('value >= 10).as("lv")
>> 
>> smallValues
>>   .join(largeValues, $"sv.key" === $"lv.key")
>>   .select($"sv.key".as("key"), $"sv.value".as("small_value"), 
>> $"lv.value".as("large_value"))
>>   .withColumn("diff", $"small_value" - $"large_value")
>>   .show()
>> +---+-----------+-----------+----+|key|small_value|large_value|diff|+---+-----------+-----------+----+|
>>   1|          1|         10|  -9||  3|          5|         20| 
>> -15|+---+-----------+-----------+----+
>>
>>
>> The problem with the other cases is that calling
>> smallValues("columnName") or largeValues("columnName") is eagerly
>> resolving the attribute to the same column (since the data is actually
>> coming from the same place).  By the time we realize that you are joining
>> the data with itself (at which point we rewrite one side of the join to use
>> different expression ids) its too late.  At the core the problem is that in
>> Scala we have no easy way to differentiate largeValues("columnName")
>> from smallValues("columnName").  This is because the data is coming from
>> the same DataFrame and we don't actually know which variable name you are
>> using.  There are things we can change here, but its pretty hard to change
>> the semantics without breaking other use cases.
>>
>> So, this isn't a straight forward "bug", but its definitely a usability
>> issue.  For now, my advice would be: only use unresolved columns (i.e.
>> $"[alias.]column" or col("[alias.]column")) when working with self joins.
>>
>> Michael
>>
>
>

Re: How to distinguish columns when joining DataFrames with shared parent?

Reply via email to