Yeah, I was suggesting that you avoid using org.apache.spark.sql.DataFrame.apply(colName: String) when you are working with selfjoins as it eagerly binds to a specific column in a what that breaks when we do the rewrite of one side of the query. Using the apply method constructs a resolved column eagerly (which looses the alias information).
On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan <[email protected]> wrote: > Thanks Michael and Ali for the reply! > > I'll make sure to use unresolved columns when working with self joins then. > > As pointed by Ali, isn't there still an issue with the aliasing? It works > when using org.apache.spark.sql.functions.col(colName: String) method, but > not when using org.apache.spark.sql.DataFrame.apply(colName: String): > > scala> j.select(col("lv.value")).show > +-----+ > |value| > +-----+ > | 10| > | 20| > +-----+ > > > scala> j.select(largeValues("lv.value")).show > +-----+ > |value| > +-----+ > | 1| > | 5| > +-----+ > > Or does this behavior have the same root cause as detailed in Michael's > email? > > > -Isabelle > > > > > On Wed, Oct 21, 2015 at 11:46 AM, Michael Armbrust <[email protected] > > wrote: > >> Unfortunately, the mechanisms that we use to differentiate columns >> automatically don't work particularly well in the presence of self joins. >> However, you can get it work if you use the $"column" syntax >> consistently: >> >> val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", >> "value")val smallValues = df.filter('value < 10).as("sv")val largeValues = >> df.filter('value >= 10).as("lv") >> >> smallValues >> .join(largeValues, $"sv.key" === $"lv.key") >> .select($"sv.key".as("key"), $"sv.value".as("small_value"), >> $"lv.value".as("large_value")) >> .withColumn("diff", $"small_value" - $"large_value") >> .show() >> +---+-----------+-----------+----+|key|small_value|large_value|diff|+---+-----------+-----------+----+| >> 1| 1| 10| -9|| 3| 5| 20| >> -15|+---+-----------+-----------+----+ >> >> >> The problem with the other cases is that calling >> smallValues("columnName") or largeValues("columnName") is eagerly >> resolving the attribute to the same column (since the data is actually >> coming from the same place). By the time we realize that you are joining >> the data with itself (at which point we rewrite one side of the join to use >> different expression ids) its too late. At the core the problem is that in >> Scala we have no easy way to differentiate largeValues("columnName") >> from smallValues("columnName"). This is because the data is coming from >> the same DataFrame and we don't actually know which variable name you are >> using. There are things we can change here, but its pretty hard to change >> the semantics without breaking other use cases. >> >> So, this isn't a straight forward "bug", but its definitely a usability >> issue. For now, my advice would be: only use unresolved columns (i.e. >> $"[alias.]column" or col("[alias.]column")) when working with self joins. >> >> Michael >> > >
