Thanks a lot Mich, this works, though I still have to test it for scalability. One question though: if we don't specify any column in partitionBy, will Spark shuffle all the records onto one executor? Because that is what seems to be happening.
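In case it's useful, here is a rough sketch of an alternative I am considering to avoid the empty-window shuffle (untested at scale; with_row_num is just a helper name for this sketch, and it assumes the row order produced by orderBy is the order zipWithIndex sees, which Spark's range-partitioned sort should preserve). RDD.zipWithIndex() only runs a small job to count rows per partition, so it does not need to move all the data to a single executor:

from pyspark.sql import Row

def with_row_num(df):
    # zipWithIndex pairs each Row with its global position; indexes are
    # assigned per partition after a lightweight counting job, with no
    # single-partition shuffle
    return df.rdd.zipWithIndex().map(
        lambda pair: Row(row_num=pair[1], **pair[0].asDict())
    ).toDF()

df3 = with_row_num(df1.orderBy("amount_6m")) \
    .join(with_row_num(df2.orderBy("amount_9m")), "row_num", "inner") \
    .select("amount_6m", "amount_9m")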
Thanks once again!

Regards,
Kushagra Deep

On Tue, May 18, 2021 at 10:48 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Ok, this should hopefully work, as it uses row_number.
>
> from pyspark.sql import SparkSession
> from pyspark.sql.window import Window
> import pyspark.sql.functions as F
>
> def spark_session(appName):
>     return SparkSession.builder \
>         .appName(appName) \
>         .enableHiveSupport() \
>         .getOrCreate()
>
> appName = "test"
> spark = spark_session(appName)
>
> ##
> ## Get a DF first from csv files
> ##
> d1location = "hdfs://rhes75:9000/tmp/df1.csv"
> d2location = "hdfs://rhes75:9000/tmp/df2.csv"
>
> df1 = spark.read.csv(d1location, header="true")
> df1.printSchema()
> df1.show()
> df2 = spark.read.csv(d2location, header="true")
> df2.printSchema()
> df2.show()
>
> df1 = df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"), "amount_6m")
> df1.show()
> df2 = df2.select(F.row_number().over(Window.partitionBy().orderBy(df2['amount_9m'])).alias("row_num"), "amount_9m")
> df2.show()
> df1.join(df2, "row_num", "inner").select("amount_6m", "amount_9m").show()
>
> root
>  |-- amount_6m: string (nullable = true)
>
> +---------+
> |amount_6m|
> +---------+
> |      100|
> |      200|
> |      300|
> |      400|
> |     500 |
> +---------+
>
> root
>  |-- amount_9m: string (nullable = true)
>
> +---------+
> |amount_9m|
> +---------+
> |      500|
> |      600|
> |      700|
> |      800|
> |      900|
> +---------+
>
> +-------+---------+
> |row_num|amount_6m|
> +-------+---------+
> |      1|      100|
> |      2|      200|
> |      3|      300|
> |      4|      400|
> |      5|     500 |
> +-------+---------+
>
> +-------+---------+
> |row_num|amount_9m|
> +-------+---------+
> |      1|      500|
> |      2|      600|
> |      3|      700|
> |      4|      800|
> |      5|      900|
> +-------+---------+
>
> +---------+---------+
> |amount_6m|amount_9m|
> +---------+---------+
> |      100|      500|
> |      200|      600|
> |      300|      700|
> |      400|      800|
> |     500 |      900|
> +---------+---------+
>
> HTH
>
> On Tue, 18 May 2021 at 16:39, kushagra deep <kushagra94d...@gmail.com> wrote:
>
>> The use case is to calculate PSI/CSI values. And yes, the union is one-to-one by row, as you showed.
>>
>> On Tue, May 18, 2021, 20:39 Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Hi Kushagra,
>>>
>>> A bit late on this, but what is the business use case for this merge?
>>>
>>> You have two dataframes, each with one column, and you want to UNION them in a certain way, but the correlation is not known. In other words, this UNION is as is?
>>>
>>> amount_6m | amount_9m
>>> 100         500
>>> 200         600
>>>
>>> HTH
>>>
>>> On Wed, 12 May 2021 at 13:51, kushagra deep <kushagra94d...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have two dataframes.
>>>>
>>>> df1
>>>>
>>>> amount_6m
>>>> 100
>>>> 200
>>>> 300
>>>> 400
>>>> 500
>>>>
>>>> And a second dataframe df2 below:
>>>>
>>>> amount_9m
>>>> 500
>>>> 600
>>>> 700
>>>> 800
>>>> 900
>>>>
>>>> The number of rows is the same in both dataframes.
>>>>
>>>> Can I merge the two dataframes to achieve the df3 below?
>>>>
>>>> df3
>>>>
>>>> amount_6m | amount_9m
>>>> 100         500
>>>> 200         600
>>>> 300         700
>>>> 400         800
>>>> 500         900
>>>>
>>>> Thanks in advance.
>>>>
>>>> Reg,
>>>> Kushagra Deep