Generating that row_number() has to be done through a window call, and I don't think there is any way around it without an orderBy():
df1 = df1.select(
    F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),
    "amount_6m")

The problem is that without a partitionBy() clause the data will be skewed towards one executor:

WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Cheers
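Applied to both dataframes from the original question further down, a minimal sketch of this approach could look like the following. Ordering the window by monotonically_increasing_id() rather than by the amount column is an assumption made here only to keep the existing row order as the pairing key; the single-partition warning above still applies.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Empty partitionBy(): every row goes to a single partition, hence the warning.
w = Window.partitionBy().orderBy(F.monotonically_increasing_id())

df1_idx = df1.withColumn("row_num", F.row_number().over(w))
df2_idx = df2.withColumn("row_num", F.row_number().over(w))

# Join on the generated row number and drop the helper column.
df3 = df1_idx.join(df2_idx, "row_num", "inner").drop("row_num")
df3.show()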
On Wed, 12 May 2021 at 17:33, Andrew Melo <andrew.m...@gmail.com> wrote:
> Hi,
>
> In the case where the left and right hand side share a common parent like:
>
> df = spark.read.someDataframe().withColumn('rownum', row_number())
> df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
> df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
> df_joined = df1.join(df2, 'rownum', 'inner')
>
> (or maybe replacing row_number() with monotonically_increasing_id()....)
>
> Is there some hint/optimization that can be done to let Spark know
> that the left and right hand-sides of the join share the same
> ordering, and a sort/hash merge doesn't need to be done?
>
> Thanks
> Andrew
>
> On Wed, May 12, 2021 at 11:07 AM Sean Owen <sro...@gmail.com> wrote:
> >
> > Yeah I don't think that's going to work - you aren't guaranteed to get
> > 1, 2, 3, etc. I think row_number() might be what you need to generate a
> > join ID.
> >
> > RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not.
> > You could .zip two RDDs you get from DataFrames and manually convert the
> > Rows back to a single Row and back to DataFrame.
> >
> > On Wed, May 12, 2021 at 10:47 AM kushagra deep <kushagra94d...@gmail.com> wrote:
> >>
> >> Thanks Raghvendra
> >>
> >> Will the ids for corresponding columns be same always? Since
> >> monotonic_increasing_id() returns a number based on partitionId and the row
> >> number of the partition, will it be same for corresponding columns? Also
> >> is it guaranteed that the two dataframes will be divided into logical spark
> >> partitions with the same cardinality for each partition?
> >>
> >> Reg,
> >> Kushagra Deep
> >>
> >> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <raghavendr...@gmail.com> wrote:
> >>>
> >>> You can add an extra id column and perform an inner join.
> >>>
> >>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
> >>>
> >>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
> >>>
> >>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
> >>>
> >>> +---------+---------+
> >>> |amount_6m|amount_9m|
> >>> +---------+---------+
> >>> |      100|      500|
> >>> |      200|      600|
> >>> |      300|      700|
> >>> |      400|      800|
> >>> |      500|      900|
> >>> +---------+---------+
> >>>
> >>> --
> >>> Raghavendra
> >>>
> >>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <kushagra94d...@gmail.com> wrote:
> >>>>
> >>>> Hi All,
> >>>>
> >>>> I have two dataframes
> >>>>
> >>>> df1
> >>>>
> >>>> amount_6m
> >>>> 100
> >>>> 200
> >>>> 300
> >>>> 400
> >>>> 500
> >>>>
> >>>> And a second data df2 below
> >>>>
> >>>> amount_9m
> >>>> 500
> >>>> 600
> >>>> 700
> >>>> 800
> >>>> 900
> >>>>
> >>>> The number of rows is same in both dataframes.
> >>>>
> >>>> Can I merge the two dataframes to achieve below df
> >>>>
> >>>> df3
> >>>>
> >>>> amount_6m | amount_9m
> >>>> 100         500
> >>>> 200         600
> >>>> 300         700
> >>>> 400         800
> >>>> 500         900
> >>>>
> >>>> Thanks in advance
> >>>>
> >>>> Reg,
> >>>> Kushagra Deep
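Sean's suggestion above of zipping the underlying RDDs could look roughly like the sketch below for the two dataframes in the original question. This is only an illustration: it assumes both DataFrames end up with the same number of partitions and the same number of rows per partition (which RDD.zip requires), and an existing SparkSession named spark.

from pyspark.sql import Row

# Zip the underlying RDDs pairwise; zip() requires matching partitioning
# (same number of partitions, same number of rows per partition).
pairs = df1.rdd.zip(df2.rdd)

# Merge each pair of Rows into a single Row and rebuild a DataFrame.
merged = pairs.map(lambda p: Row(amount_6m=p[0]['amount_6m'],
                                 amount_9m=p[1]['amount_9m']))
df3 = spark.createDataFrame(merged)
df3.show()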