Hi Kushagra,

I still think this is a bad idea. By definition, data in a DataFrame or RDD is unordered; you are imposing an order where there is none, and if it works, it will be by chance. For example, a simple repartition may disrupt the row ordering (see the sketch below). It is just too unpredictable.
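A minimal PySpark sketch of that failure mode, assuming a local session (the column name mirrors the example later in this thread; the concrete ids you see will vary with the partitioning):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[4]").getOrCreate()

# Five rows, mirroring the amount_6m example from the thread.
df = spark.createDataFrame([(100,), (200,), (300,), (400,), (500,)], ["amount_6m"])

# Ids assigned under the current partitioning ...
before = df.withColumn("id", F.monotonically_increasing_id())

# ... and ids assigned to the very same data after a repartition.
# The shuffle moves rows across partitions, so these ids (and any
# implicit "row order") need not line up with the ones above.
after = df.repartition(3).withColumn("id", F.monotonically_increasing_id())

before.show()
after.show()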
I would suggest you fix this upstream and add a correct identifier to each of the streams. It will for sure be a much better solution.

On Wed, 19 May 2021 at 7:21 pm, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> That generation of row_number() has to be performed through a window call, and I don't think there is any way around it without orderBy():
>
> df1 = df1.select(
>     F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),
>     "amount_6m")
>
> The problem is that without a partitionBy() clause the data will be skewed towards one executor:
>
> WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
>
> Cheers
>
> On Wed, 12 May 2021 at 17:33, Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> Hi,
>>
>> In the case where the left and right hand side share a common parent, like:
>>
>> df = spark.read.someDataframe().withColumn('rownum', row_number())
>> df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
>> df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
>> df_joined = df1.join(df2, 'rownum', 'inner')
>>
>> (or maybe replacing row_number() with monotonically_increasing_id()...)
>>
>> Is there some hint/optimization that can be done to let Spark know that the left and right hand sides of the join share the same ordering, so that a sort/hash merge doesn't need to be done?
>>
>> Thanks
>> Andrew
>>
>> On Wed, May 12, 2021 at 11:07 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>> Yeah, I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID.
>>>
>>> RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames, manually convert each pair of Rows into a single Row, and go back to a DataFrame.
>>>
>>> On Wed, May 12, 2021 at 10:47 AM kushagra deep <kushagra94d...@gmail.com> wrote:
>>>>
>>>> Thanks Raghavendra
>>>>
>>>> Will the ids for corresponding rows always be the same? Since monotonically_increasing_id() returns a number based on the partition ID and the row number within the partition, will it be the same for corresponding rows? Also, is it guaranteed that the two dataframes will be divided into logical Spark partitions with the same cardinality for each partition?
>>>>
>>>> Reg,
>>>> Kushagra Deep
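The short answer is: only under identical partitioning. Per the Spark documentation, monotonically_increasing_id() packs the partition index into the upper 31 bits of the 64-bit id and the record position within the partition into the lower 33 bits, so two DataFrames produce matching ids only if their rows land in the same partitions, in the same order, with the same per-partition counts. A small sketch that decodes the id (assuming a local session):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[4]").getOrCreate()

df = (spark.range(8)
      .repartition(4)
      .withColumn("mono_id", F.monotonically_increasing_id()))

# Decode the 64-bit id: the upper 31 bits hold the partition index,
# the lower 33 bits hold the row position within that partition.
df.select(
    "mono_id",
    F.shiftRight("mono_id", 33).alias("partition_index"),
    F.col("mono_id").bitwiseAND((1 << 33) - 1).alias("row_in_partition"),
).show()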
>>>> On Wed, May 12, 2021, 21:00 Raghavendra Ganesh <raghavendr...@gmail.com> wrote:
>>>>>
>>>>> You can add an extra id column and perform an inner join:
>>>>>
>>>>> val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
>>>>> val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
>>>>>
>>>>> df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
>>>>>
>>>>> +---------+---------+
>>>>> |amount_6m|amount_9m|
>>>>> +---------+---------+
>>>>> |      100|      500|
>>>>> |      200|      600|
>>>>> |      300|      700|
>>>>> |      400|      800|
>>>>> |      500|      900|
>>>>> +---------+---------+
>>>>>
>>>>> --
>>>>> Raghavendra
>>>>>
>>>>> On Wed, May 12, 2021 at 6:20 PM kushagra deep <kushagra94d...@gmail.com> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have two dataframes,
>>>>>>
>>>>>> df1
>>>>>>
>>>>>> amount_6m
>>>>>> 100
>>>>>> 200
>>>>>> 300
>>>>>> 400
>>>>>> 500
>>>>>>
>>>>>> and a second dataframe, df2, below:
>>>>>>
>>>>>> amount_9m
>>>>>> 500
>>>>>> 600
>>>>>> 700
>>>>>> 800
>>>>>> 900
>>>>>>
>>>>>> The number of rows is the same in both dataframes.
>>>>>>
>>>>>> Can I merge the two dataframes to achieve the df below?
>>>>>>
>>>>>> df3
>>>>>>
>>>>>> amount_6m | amount_9m
>>>>>>       100 |       500
>>>>>>       200 |       600
>>>>>>       300 |       700
>>>>>>       400 |       800
>>>>>>       500 |       900
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> Reg,
>>>>>> Kushagra Deep

--
Best Regards,
Ayan Guha
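For completeness, a minimal sketch of the RDD .zip route Sean mentions. It leans on the loud assumption that coalescing each DataFrame to a single partition preserves the row order you expect, which is exactly the fragility Ayan warns about; zip() itself also requires both RDDs to have the same number of partitions and the same number of elements per partition:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

df1 = spark.createDataFrame([(100,), (200,), (300,), (400,), (500,)], ["amount_6m"])
df2 = spark.createDataFrame([(500,), (600,), (700,), (800,), (900,)], ["amount_9m"])

# zip() pairs elements positionally, so force both sides into a single
# partition; coalesce() avoids the shuffle a repartition would trigger.
rdd1 = df1.coalesce(1).rdd
rdd2 = df2.coalesce(1).rdd

# Merge each pair of Rows into one Row, then rebuild a DataFrame.
merged = rdd1.zip(rdd2).map(
    lambda rows: Row(amount_6m=rows[0]["amount_6m"], amount_9m=rows[1]["amount_9m"])
)

spark.createDataFrame(merged).show()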