Your code should be:

    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')
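If that alone doesn't fix it, note that Spark can struggle to tell the two 'col1' columns apart when both frames come from the same parent. Here is a minimal sketch of the full flow using explicit aliases to disambiguate the self-join (the alias names 'one' and 'two' are just for illustration, and I haven't run this against your data):

    from pyspark.sql.functions import col

    # sql_context is assumed to already exist, as in your snippet
    df = sql_context.parquetFile('data.parquet')

    # Pass the column names directly instead of wrapping them in a list
    df_one = df.select('col1', 'col2').alias('one')
    df_two = df.select('col1', 'col3').alias('two')

    # Aliasing both sides gives the join condition two distinct column
    # references, so it cannot degenerate into a Cartesian product
    df_joined = df_one.join(df_two, col('one.col1') == col('two.col1'), 'inner')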
Your current code is generating a tuple, and of course df_one and df_two are different, so the join is yielding a Cartesian product.

Best
Ayan

On Wed, Apr 22, 2015 at 12:42 AM, Karlson <ksonsp...@siberie.de> wrote:
> Hi,
>
> can anyone confirm (and if so elaborate on) the following problem?
>
> When I join two DataFrames that originate from the same source DataFrame,
> the resulting DF will explode to a huge number of rows. A quick example:
>
> I load a DataFrame with n rows from disk:
>
>     df = sql_context.parquetFile('data.parquet')
>
> Then I create two DataFrames from that source:
>
>     df_one = df.select(['col1', 'col2'])
>     df_two = df.select(['col1', 'col3'])
>
> Finally I want to (inner) join them back together:
>
>     df_joined = df_one.join(df_two, df_one['col1'] == df_two['col2'], 'inner')
>
> The key in col1 is unique. The resulting DataFrame should have n rows,
> however it does have n*n rows.
>
> That does not happen when I load df_one and df_two from disk directly. I
> am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

--
Best Regards,
Ayan Guha