Unfortunately, the mechanisms that we use to differentiate columns automatically don't work particularly well in the presence of self joins. However, you can get it to work if you use the $"column" syntax consistently:

    // Assumes a spark-shell session, where the implicits behind toDF and $"..." are already in scope.
    val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")
    val smallValues = df.filter($"value" < 10).as("sv")
    val largeValues = df.filter($"value" >= 10).as("lv")

    // Refer to every column through its alias so that neither side is resolved eagerly.
    smallValues
      .join(largeValues, $"sv.key" === $"lv.key")
      .select($"sv.key".as("key"), $"sv.value".as("small_value"), $"lv.value".as("large_value"))
      .withColumn("diff", $"small_value" - $"large_value")
      .show()

    +---+-----------+-----------+----+
    |key|small_value|large_value|diff|
    +---+-----------+-----------+----+
    |  1|          1|         10|  -9|
    |  3|          5|         20| -15|
    +---+-----------+-----------+----+

The problem with the other cases is that calling smallValues("columnName") or largeValues("columnName") eagerly resolves the attribute to the same column (since the data is actually coming from the same place). By the time we realize that you are joining the data with itself (at which point we rewrite one side of the join to use different expression ids), it's too late.

At the core, the problem is that in Scala we have no easy way to differentiate largeValues("columnName") from smallValues("columnName"). This is because the data is coming from the same DataFrame and we don't actually know which variable name you are using. There are things we can change here, but it's pretty hard to change the semantics without breaking other use cases.

So, this isn't a straightforward "bug", but it's definitely a usability issue. For now, my advice would be: only use unresolved columns (i.e. $"[alias.]column" or col("[alias.]column")) when working with self joins.

Michael
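
For the col("[alias.]column") form mentioned in the advice above, here is a minimal sketch of the same self join written as a standalone program. It is an illustration under stated assumptions, not part of the original answer: the object name SelfJoinWithCol and the helper smallVsLarge are hypothetical, and the SparkSession entry point assumes Spark 2.0 or later running in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object, used only to make the sketch self-contained.
object SelfJoinWithCol {

  // Builds the self join using only unresolved, alias-qualified columns,
  // so neither side is tied to the shared underlying DataFrame too early.
  def smallVsLarge(df: DataFrame): DataFrame = {
    val smallValues = df.filter(col("value") < 10).as("sv")
    val largeValues = df.filter(col("value") >= 10).as("lv")

    smallValues
      .join(largeValues, col("sv.key") === col("lv.key"))
      .select(
        col("sv.key").as("key"),
        col("sv.value").as("small_value"),
        col("lv.value").as("large_value"))
      .withColumn("diff", col("small_value") - col("large_value"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.0+ in local mode; change the master for a real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("self-join-col")
      .getOrCreate()
    import spark.implicits._ // needed here only for toDF

    val df = Seq((1, 1), (1, 10), (2, 3), (3, 20), (3, 5), (4, 10)).toDF("key", "value")
    smallVsLarge(df).show()

    spark.stop()
  }
}
```

With the same input data this should yield the same two joined rows as the $"..." version, although show() does not guarantee row order. Using col(...) rather than $"..." means the join logic itself needs no implicits, which can be convenient when it lives in a helper outside the shell.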