To add to the above discussion, Pandas allows suffixing and prefixing
overlapping column names to solve this issue:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
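For example, a rough sketch (the frame and suffix names here are
illustrative, not from the thread):

    import pandas as pd

    # Two frames mirroring Nick's example; both share the column
    # names 'a' and 'other'.
    left = pd.DataFrame({'a': [4], 'other': ['I know']})
    right = pd.DataFrame({'a': [4], 'other': ['I dunno']})

    # join() is index-based by default; lsuffix/rsuffix rename the
    # overlapping columns instead of raising an error.
    joined = left.join(right, lsuffix='_df1', rsuffix='_df2')
    print(joined.columns.tolist())
    # ['a_df1', 'other_df1', 'a_df2', 'other_df2']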
Rakesh

On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
> aliases.
>
> This is essential when joining dataframes that have identically named
> columns.
>
> >>> # PySpark 1.3.1
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12
> DataFrame[a: bigint, other: string, a: bigint, other: string]
> >>> df12.printSchema()
> root
>  |-- a: long (nullable = true)
>  |-- other: string (nullable = true)
>  |-- a: long (nullable = true)
>  |-- other: string (nullable = true)
>
> Now, trying any one of the following:
>
>     df12.select('a')
>     df12['a']
>     df12.a
>
> yields this:
>
>     org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>     could be: a#360L, a#358L.;
>
> Same goes for accessing the other field.
>
> This is good, but what are we supposed to do in this case?
>
> SQL solves this by fully qualifying the column name with the table name,
> and also offering table aliasing
> <http://dba.stackexchange.com/a/5991/2660> in the case where you are
> joining a table to itself.
>
> If we translate this directly into DataFrames lingo, perhaps it would look
> something like:
>
>     df12['df1.a']
>     df12['df2.other']
>
> But I’m not sure how this fits into the larger API. This certainly isn’t
> backwards compatible with how joins are done now.
>
> So what’s the recommended course of action here?
>
> Having to unique-ify all your column names before joining doesn’t sound
> like a nice solution.
>
> Nick
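On the Spark side, DataFrame.alias() looks like it can play the role of a
SQL table alias. A rough sketch, reusing the df1/df2 from the quoted
snippet (the 'd1'/'d2' alias names are just illustrative):

    from pyspark.sql.functions import col

    # Alias each side of the join, then qualify column references
    # through the alias, much like SQL table aliases.
    d1 = df1.alias('d1')
    d2 = df2.alias('d2')
    df12 = d1.join(d2, col('d1.a') == col('d2.a'))

    # Unambiguous now: each reference names its source frame.
    df12.select(col('d1.a'), col('d2.other')).show()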