Re: Merge two dataframes

2021-05-19 Thread ayan guha
Hi Kushagra I still think this is a bad idea. By definition data in a dataframe or rdd is unordered, you are imposing an order where there is none, and if it works it will be by chance. For example a simple repartition may disrupt the row ordering. It is just too unpredictable. I would suggest y

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
That generation of row_number() has to be performed through a window call and I don't think there is any way around it without orderBy() df1 = df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),"amount_6m") The problem is that without partitionBy() cla

Re: Merge two dataframes

2021-05-19 Thread Mich Talebzadeh
Hi Kushagra, I believe you are referring to this warning below WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation. I don't know an easy way around it. If the operation is only once you may be

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
Thanks a lot Mich , this works though I have to test for scalability. I have one question though . If we dont specify any column in partitionBy will it shuffle all the records in one executor ? Because this is what seems to be happening. Thanks once again ! Regards Kushagra Deep On Tue, May 18,

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Ok, this should hopefully work as it uses row_number. from pyspark.sql.window import Window import pyspark.sql.functions as F from pyspark.sql.functions import row_number def spark_session(appName): return SparkSession.builder \ .appName(appName) \ .enableHiveSupport() \

Re: Merge two dataframes

2021-05-18 Thread kushagra deep
The use case is to calculate PSI/CSI values . And yes the union is one to one row as you showed. On Tue, May 18, 2021, 20:39 Mich Talebzadeh wrote: > > Hi Kushagra, > > A bit late on this but what is the business use case for this merge? > > You have two data frames each with one column and you

Re: Merge two dataframes

2021-05-18 Thread Mich Talebzadeh
Hi Kushagra, A bit late on this but what is the business use case for this merge? You have two data frames each with one column and you want to UNION them in a certain way but the correlation is not known. In other words this UNION is as is? amount_6m | amount_9m 100 50

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
On Mon, May 17, 2021 at 2:31 PM Lalwani, Jayesh wrote: > > If the UDFs are computationally expensive, I wouldn't solve this problem with > UDFs at all. If they are working in an iterative manner, and assuming each > iteration is independent of other iterations (yes, I know that's a big > assum

Re: Merge two dataframes

2021-05-17 Thread Lalwani, Jayesh
If the UDFs are computationally expensive, I wouldn't solve this problem with UDFs at all. If they are working in an iterative manner, and assuming each iteration is independent of other iterations (yes, I know that's a big assumptiuon), I would think about exploding your dataframe to have a ro

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
In our case, these UDFs are quite expensive and worked on in an iterative manner, so being able to cache the two "sides" of the graphs independently will speed up the development cycle. Otherwise, if you modify foo() here, then you have to recompute bar and baz, even though they're unchanged. df.w

Re: Merge two dataframes

2021-05-17 Thread Sean Owen
Why join here - just add two columns to the DataFrame directly? On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote: > Anyone have ideas about the below Q? > > It seems to me that given that "diamond" DAG, that spark could see > that the rows haven't been shuffled/filtered, it could do some type o

Re: Merge two dataframes

2021-05-17 Thread Andrew Melo
Anyone have ideas about the below Q? It seems to me that given that "diamond" DAG, that spark could see that the rows haven't been shuffled/filtered, it could do some type of "zip join" to push them together, but I've not been able to get a plan that doesn't do a hash/sort merge join Cheers Andre

Re: Merge two dataframes

2021-05-12 Thread Andrew Melo
Hi, In the case where the left and right hand side share a common parent like: df = spark.read.someDataframe().withColumn('rownum', row_number()) df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum') df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum') df_join

Re: Merge two dataframes

2021-05-12 Thread Sean Owen
Yeah I don't think that's going to work - you aren't guaranteed to get 1, 2, 3, etc. I think row_number() might be what you need to generate a join ID. RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You could .zip two RDDs you get from DataFrames and manually convert the R

Re: Merge two dataframes

2021-05-12 Thread kushagra deep
Thanks Raghvendra Will the ids for corresponding columns be same always ? Since monotonic_increasing_id() returns a number based on partitionId and the row number of the partition ,will it be same for corresponding columns? Also is it guaranteed that the two dataframes will be divided into logic

Re: Merge two dataframes

2021-05-12 Thread Raghavendra Ganesh
You can add an extra id column and perform an inner join. val df1_with_id = df1.withColumn("id", monotonically_increasing_id()) val df2_with_id = df2.withColumn("id", monotonically_increasing_id()) df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show() +-+-+ |amoun