Currently DataFrame doesn't seem to enforce the uniqueness of field names, so
it is possible to have duplicate field names in a DataFrame. This usually
happens after a join, especially a self-join. The user can rename the columns
before the join, or rename them after the join, although
DataFrame#withColumnRenamed is not sufficient for that for now. In Hive, an
ambiguous name can be resolved by using the table name as a prefix, but
DataFrame doesn't seem to support this (I mean the DataFrame API rather than
SparkSQL). I think we have 2 options here:
1. Enforce the uniqueness of field names in DataFrame, so that subsequent
operations cannot cause an ambiguous column reference.
2. Provide DataFrame#withColumnsRenamed(oldColumns: Seq[String],
newColumns: Seq[String]) to allow changing the column names in the schema.

For now, I would prefer option 2, which is easier to implement and keeps
compatibility.


val df = ...        // schema (name, age)
val df2 = df.join(df, "name")   // schema (name, age, age)
df2.select("age")   // ambiguous column reference.
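
As a rough illustration of the renaming idea, here is a minimal sketch in plain
Scala that makes a list of column names unique by position. The helper name
dedupColumnNames and the "_1", "_2" suffix scheme are my own assumptions for
illustration and are not part of the DataFrame API. It does not guard against a
generated name clashing with an existing one (e.g. a column already called
"age_1").

```scala
// Hypothetical helper (not part of Spark): appends a numeric suffix to every
// repeated occurrence of a column name, leaving the first occurrence as-is.
def dedupColumnNames(names: Seq[String]): Seq[String] = {
  // Tracks how many times each name has been seen so far.
  val seen = scala.collection.mutable.Map.empty[String, Int]
  names.map { n =>
    val count = seen.getOrElse(n, 0)
    seen(n) = count + 1
    if (count == 0) n else s"${n}_$count"
  }
}

// Usage with the schema from the example above.
val unique = dedupColumnNames(Seq("name", "age", "age"))
println(unique.mkString(", "))
```

Since the names are handled positionally, the result could presumably be
applied with the existing toDF(colNames: String*) method, e.g.
df2.toDF(dedupColumnNames(df2.columns.toSeq): _*), though I have not checked
how that behaves in every version.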

-- 
Best Regards

Jeff Zhang
