fixed in master: https://github.com/apache/spark/commit/2d010f7afe6ac8e67e07da6bea700e9e8c9e6cc2
On Wed, Apr 22, 2015 at 12:19 AM, Karlson <ksonsp...@siberie.de> wrote:

> DataFrames do not have the attributes 'alias' or 'as' in the Python API.
>
> On 2015-04-21 20:41, Michael Armbrust wrote:
>
>> This is https://issues.apache.org/jira/browse/SPARK-6231
>>
>> Unfortunately this is pretty hard to fix, as it's hard for us to
>> differentiate these without aliases. However, you can add an alias as
>> follows:
>>
>>     from pyspark.sql.functions import *
>>     df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
>>
>> On Tue, Apr 21, 2015 at 8:10 AM, Karlson <ksonsp...@siberie.de> wrote:
>>
>>> Sorry, my code actually was
>>>
>>>     df_one = df.select('col1', 'col2')
>>>     df_two = df.select('col1', 'col3')
>>>
>>> But in Spark 1.4.0 this does not seem to make any difference anyway, and
>>> the problem is the same with both versions.
>>>
>>> On 2015-04-21 17:04, ayan guha wrote:
>>>
>>>> Your code should be
>>>>
>>>>     df_one = df.select('col1', 'col2')
>>>>     df_two = df.select('col1', 'col3')
>>>>
>>>> Your current code is generating a tuple, and of course df_1 and df_2
>>>> are different, so the join is yielding a cartesian product.
>>>>
>>>> Best
>>>> Ayan
>>>>
>>>> On Wed, Apr 22, 2015 at 12:42 AM, Karlson <ksonsp...@siberie.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> can anyone confirm (and if so elaborate on) the following problem?
>>>>>
>>>>> When I join two DataFrames that originate from the same source
>>>>> DataFrame, the resulting DF will explode to a huge number of rows.
>>>>> A quick example:
>>>>>
>>>>> I load a DataFrame with n rows from disk:
>>>>>
>>>>>     df = sql_context.parquetFile('data.parquet')
>>>>>
>>>>> Then I create two DataFrames from that source:
>>>>>
>>>>>     df_one = df.select(['col1', 'col2'])
>>>>>     df_two = df.select(['col1', 'col3'])
>>>>>
>>>>> Finally I want to (inner) join them back together:
>>>>>
>>>>>     df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')
>>>>>
>>>>> The key in col1 is unique. The resulting DataFrame should have n rows,
>>>>> but it has n*n rows instead.
>>>>>
>>>>> That does not happen when I load df_one and df_two from disk directly.
>>>>> I am on Spark 1.3.0, but this also happens on the current 1.4.0
>>>>> snapshot.
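For anyone hitting the same issue, here is a minimal, self-contained sketch of the alias workaround Michael describes above. It assumes a PySpark build where DataFrame.alias exists in the Python API (which the commit linked at the top appears to add, i.e. 1.4+); the file and column names are only illustrative:

    # Sketch of the alias workaround for self-joins (SPARK-6231).
    # Assumes Spark 1.4+ so that DataFrame.alias is available in Python;
    # 'data.parquet', col1/col2/col3 are placeholder names.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import col

    sc = SparkContext(appName="self-join-alias-example")
    sql_context = SQLContext(sc)

    df = sql_context.read.parquet('data.parquet')  # 1.4+ reader API

    # Two projections of the same source DataFrame.
    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')

    # Without aliases, df_one['col1'] == df_two['col1'] resolves both sides
    # to the same underlying attribute, the predicate is trivially true, and
    # the join degenerates into a cartesian product. Aliasing each side gives
    # the analyzer two distinct qualifiers to tell the columns apart.
    df_joined = (df_one.alias("a")
                 .join(df_two.alias("b"), col("a.col1") == col("b.col1"), 'inner')
                 .select(col("a.col1"), col("a.col2"), col("b.col3")))

    df_joined.show()

With unique keys in col1, this should return n rows rather than n*n.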