To add to the above discussion: Pandas allows suffixing and prefixing of
overlapping column names to solve this issue:

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
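
For example, DataFrame.join() takes lsuffix/rsuffix arguments that rename
the columns the two frames share. A minimal sketch (the frame and suffix
names here are made up for illustration):

>>> import pandas as pd
>>> df1 = pd.DataFrame({'a': [4], 'other': ['I know']})
>>> df2 = pd.DataFrame({'a': [4], 'other': ['I dunno']})
>>> # join() is index-based; lsuffix/rsuffix disambiguate overlapping names
>>> list(df1.join(df2, lsuffix='_df1', rsuffix='_df2').columns)
['a_df1', 'other_df1', 'a_df2', 'other_df2']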

Rakesh

On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
> aliases.
>
> This is essential when joining dataframes that have identically named
> columns.
>
> >>> # PySpark 1.3.1
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12
> DataFrame[a: bigint, other: string, a: bigint, other: string]
> >>> df12.printSchema()
> root
>  |-- a: long (nullable = true)
>  |-- other: string (nullable = true)
>  |-- a: long (nullable = true)
>  |-- other: string (nullable = true)
>
> Now, trying any one of the following:
>
> df12.select('a')
> df12['a']
> df12.a
>
> yields this:
>
> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
> could be: a#360L, a#358L.;
>
> Same goes for accessing the other field.
>
> This is good, but what are we supposed to do in this case?
>
> SQL solves this by fully qualifying the column name with the table name,
> and also offering table aliasing
> <http://dba.stackexchange.com/a/5991/2660> in the case where you are
> joining a table to itself.
>
> If we translate this directly into DataFrames lingo, perhaps it would look
> something like:
>
> df12['df1.a']
> df12['df2.other']
>
> But I’m not sure how this fits into the larger API. This certainly isn’t
> backwards compatible with how joins are done now.
>
> So what’s the recommended course of action here?
>
> Having to unique-ify all your column names before joining doesn’t sound
> like a nice solution.
>
> Nick
>
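
A workaround that already works today, for what it's worth, is to hold on
to the parent DataFrames, or alias them, and qualify columns that way. A
minimal sketch, assuming Spark 1.3+ (where DataFrame.alias() and
pyspark.sql.functions.col() are available); the alias names 'l' and 'r'
are arbitrary:

>>> # disambiguate through the parent DataFrames
>>> df12.select(df1['a'], df2['other'])
>>> # or alias each side and qualify columns by alias, as in SQL
>>> from pyspark.sql.functions import col
>>> l, r = df1.alias('l'), df2.alias('r')
>>> l.join(r, col('l.a') == col('r.a')).select(col('l.a'), col('r.other'))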
