Re: How to distinguish columns when joining DataFrames with shared parent?

Michael Armbrust Wed, 21 Oct 2015 11:48:12 -0700

Unfortunately, the mechanisms that we use to differentiate columns
automatically don't work particularly well in the presence of self joins.
However, you can get it to work if you use the $"column" syntax consistently:


val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")
val smallValues = df.filter('value < 10).as("sv")
val largeValues = df.filter('value >= 10).as("lv")

smallValues
  .join(largeValues, $"sv.key" === $"lv.key")
  .select($"sv.key".as("key"), $"sv.value".as("small_value"),
    $"lv.value".as("large_value"))
  .withColumn("diff", $"small_value" - $"large_value")
  .show()
+---+-----------+-----------+----+
|key|small_value|large_value|diff|
+---+-----------+-----------+----+
|  1|          1|         10|  -9|
|  3|          5|         20| -15|
+---+-----------+-----------+----+


The problem with the other cases is that calling smallValues("columnName")
or largeValues("columnName") eagerly resolves the attribute to the same
column (since the data is actually coming from the same place).  By the
time we realize that you are joining the data with itself (at which point
we rewrite one side of the join to use different expression ids), it's too
late.  At its core, the problem is that in Scala we have no easy way to
differentiate largeValues("columnName") from smallValues("columnName").
This is because the data is coming from the same DataFrame and we don't
actually know which variable name you are using.  There are things we can
change here, but it's pretty hard to change the semantics without breaking
other use cases.
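
As a rough illustration of the pattern described above (a sketch, not code
from the original question): the snippet assumes a spark-shell session where
sqlContext.implicits._ is already in scope, and it reuses the example data
from earlier in this message. Both column references are built from
DataFrames derived from the same df, so they are resolved eagerly against the
same underlying attribute.

```scala
// Assumption: running inside spark-shell, so toDF and $"..." are available.
val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")

// No aliases here: both DataFrames are plain filters over the same parent.
val smallValues = df.filter($"value" < 10)
val largeValues = df.filter($"value" >= 10)

// Each apply("key") call resolves immediately against its DataFrame, and
// both DataFrames expose the key attribute inherited from df.
val joinCondition = smallValues("key") === largeValues("key")

// Printing the condition is a quick way to inspect what each side resolved
// to; the exact text depends on the Spark version.
println(joinCondition)
```

Passing a condition built this way to join is the case to avoid for self
joins; how it misbehaves may differ between Spark versions.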

So, this isn't a straightforward "bug", but it's definitely a usability
issue.  For now, my advice would be: only use unresolved columns (i.e.
$"[alias.]column" or col("[alias.]column")) when working with self joins.
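
For reference, here is a minimal sketch of the same self join written with
col("alias.column") instead of the $ syntax. It assumes a spark-shell session
(so that toDF is available through sqlContext.implicits._) and reuses the
sv and lv aliases from the example above; it is intended to be equivalent to
that example, not a different technique.

```scala
// Assumption: running inside spark-shell, where the implicits needed by
// toDF are already imported.
import org.apache.spark.sql.functions.col

val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")

// Alias each side so that columns can be referenced by qualified name.
val smallValues = df.filter(col("value") < 10).as("sv")
val largeValues = df.filter(col("value") >= 10).as("lv")

smallValues
  // col("sv.key") and col("lv.key") stay unresolved until the join is
  // analyzed, so each one is matched to its own side of the join.
  .join(largeValues, col("sv.key") === col("lv.key"))
  .select(
    col("sv.key").as("key"),
    col("sv.value").as("small_value"),
    col("lv.value").as("large_value"))
  .withColumn("diff", col("small_value") - col("large_value"))
  .show()
```

The important part is that every column reference goes through an alias-
qualified name rather than through smallValues(...) or largeValues(...).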

Michael
