Hi,
Can anyone confirm (and, if so, elaborate on) the following problem?
When I join two DataFrames that originate from the same source
DataFrame, the resulting DataFrame explodes to a huge number of rows.
A quick example:
I load a DataFrame with n rows from disk:
df = sql_context.parquetFile('data.parquet')
Then I create two DataFrames from that source:
df_one = df.select(['col1', 'col2'])
df_two = df.select(['col1', 'col3'])
Finally I want to (inner) join them back together:
df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')
The key in col1 is unique, so the resulting DataFrame should have n
rows; instead it has n*n rows.
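In case it helps with reproducing, here is a self-contained sketch
along the same lines (the toy data and column names are made up, and I
would expect it to show the same n*n behaviour):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local', 'self-join-repro')
sql_context = SQLContext(sc)

# Three rows with a unique key in col1 (made-up toy data).
df = sql_context.createDataFrame(
    [(1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')],
    ['col1', 'col2', 'col3'])

df_one = df.select(['col1', 'col2'])
df_two = df.select(['col1', 'col3'])

df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')

# Should print 3, but if this is the same issue it prints 9 (n*n),
# as if the join condition resolved to the same column on both sides
# and were therefore always true.
print(df_joined.count())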
That does not happen when I load df_one and df_two from disk directly.
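By loading directly from disk I mean roughly the following
(saveAsParquetFile is the 1.3 API; the file names are just
placeholders). Presumably the round trip breaks whatever the two
DataFrames share through their common lineage:

# Round-trip both DataFrames through parquet before the join.
df_one.saveAsParquetFile('df_one.parquet')
df_two.saveAsParquetFile('df_two.parquet')

df_one = sql_context.parquetFile('df_one.parquet')
df_two = sql_context.parquetFile('df_two.parquet')

# Joining these yields the expected n rows.
df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')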
I am on Spark 1.3.0, but this also happens on the current 1.4.0
snapshot.