Hi Kushagra,
I still think this is a bad idea. By definition the data in a DataFrame or RDD
is unordered; you are imposing an order where there is none, and if it
works it will be by chance. For example, a simple repartition may disrupt
the row ordering. It is just too unpredictable.
I would suggest y
The generation of row_number() has to be performed through a window call,
and I don't think there is any way around it without orderBy():

df1 = df1.select(
    F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"),
    "amount_6m"
)
The problem is that without a partitionBy() clause, all the data is moved to a single partition.
Hi Kushagra,
I believe you are referring to this warning below:
WARN window.WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation.
I don't know an easy way around it. If the operation is only done once you may
be
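As a quick sanity check, the partition count of the result can be inspected directly; a minimal sketch, assuming df1 is the frame produced by the row_number() select above:

# With Window.partitionBy() and no columns, this is expected to print 1
# (everything has been moved into a single partition).
print(df1.rdd.getNumPartitions())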
Thanks a lot Mich, this works, though I have to test it for scalability.
I have one question though. If we don't specify any column in partitionBy,
will it shuffle all the records onto one executor? Because this is what
seems to be happening.
Thanks once again!
Regards,
Kushagra Deep
On Tue, May 18,
Ok, this should hopefully work as it uses row_number.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.functions import row_number

def spark_session(appName):
    # Build (or reuse) a Hive-enabled SparkSession
    return SparkSession.builder \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()
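The snippet is cut off here in the archive. As a hedged sketch (not necessarily the original code) of how the row_number() idea could be completed for the two frames in this thread, assuming df1 holds amount_6m and df2 holds amount_9m:

spark = spark_session("merge_by_row_number")

# Window.partitionBy() with no columns triggers the single-partition warning above
w1 = Window.partitionBy().orderBy("amount_6m")
w2 = Window.partitionBy().orderBy("amount_9m")
df1_n = df1.select(F.row_number().over(w1).alias("row_num"), "amount_6m")
df2_n = df2.select(F.row_number().over(w2).alias("row_num"), "amount_9m")

# Pair rows by their generated row_num and drop the helper column
merged = df1_n.join(df2_n, "row_num").drop("row_num")
merged.show()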
The use case is to calculate PSI/CSI values. And yes, the union is one-to-one
row-wise, as you showed.
On Tue, May 18, 2021, 20:39 Mich Talebzadeh wrote:
Hi Kushagra,
A bit late on this, but what is the business use case for this merge?
You have two data frames, each with one column, and you want to UNION them in
a certain way, but the correlation is not known. In other words, this UNION
is as is?
amount_6m | amount_9m
100       | 50
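As an illustration, a toy version of that example (values taken from the table above, everything else assumed); note that a plain df1.union(df2) would stack the values vertically rather than pair them side by side:

# Hypothetical single-row frames using the values from the example above
df1 = spark.createDataFrame([(100,)], ["amount_6m"])
df2 = spark.createDataFrame([(50,)], ["amount_9m"])
# Desired result pairs the rows positionally:
#   amount_6m | amount_9m
#   100       | 50
# df1.union(df2) would instead give two rows in a single column.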
On Mon, May 17, 2021 at 2:31 PM Lalwani, Jayesh wrote:
If the UDFs are computationally expensive, I wouldn't solve this problem with
UDFs at all. If they are working in an iterative manner, and assuming each
iteration is independent of other iterations (yes, I know that's a big
assumption), I would think about exploding your dataframe to have a ro
In our case, these UDFs are quite expensive and worked on in an
iterative manner, so being able to cache the two "sides" of the graphs
independently will speed up the development cycle. Otherwise, if you
modify foo() here, then you have to recompute bar and baz, even though
they're unchanged.
df.w
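A minimal sketch of that caching idea, reusing the hypothetical df / expensive_udf1 / expensive_udf2 names from the pseudocode quoted later in the thread:

# Cache each branch so that editing one UDF does not force recomputing the other
df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum').cache()
df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum').cache()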
Why join here - just add two columns to the DataFrame directly?
On Mon, May 17, 2021 at 1:04 PM Andrew Melo wrote:
Anyone have ideas about the below Q?
It seems to me that, given that "diamond" DAG, Spark could see that the rows
haven't been shuffled/filtered and could do some type of "zip join" to push
them together, but I've not been able to get a plan that doesn't do a
hash/sort-merge join.
Cheers
Andre
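As an aside, the join strategy the planner picked can be checked with explain(); a one-liner sketch, assuming df_join is the joined frame from the pseudocode below:

# Prints the physical plan; look for SortMergeJoin / BroadcastHashJoin nodes
df_join.explain()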
Hi,
In the case where the left- and right-hand sides share a common parent, like:
df = spark.read.someDataframe().withColumn('rownum', row_number())
df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')
df_join
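The snippet above is pseudocode (row_number() needs a window, and the last line is cut off). A hedged sketch of how the same diamond might be written so it runs, with the input path, the 'foo'/'bar' columns and the stand-in UDFs all assumed placeholders:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

# Stand-ins for the expensive UDFs in the pseudocode (hypothetical)
expensive_udf1 = F.udf(lambda v: float(len(str(v))), DoubleType())
expensive_udf2 = F.udf(lambda v: float(len(str(v))) * 2.0, DoubleType())

# Placeholder source; the original pseudocode reads "spark.read.someDataframe()"
df = (spark.read.parquet("/path/to/input")     # hypothetical path
          .withColumn('rownum', F.row_number().over(Window.orderBy(F.lit(1)))))

df1 = df.withColumn('c1', expensive_udf1('foo')).select('c1', 'rownum')
df2 = df.withColumn('c2', expensive_udf2('bar')).select('c2', 'rownum')

# The truncated "df_join" line presumably joins the branches back on rownum
df_join = df1.join(df2, on='rownum')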
Yeah I don't think that's going to work - you aren't guaranteed to get 1,
2, 3, etc. I think row_number() might be what you need to generate a join
ID.
RDD has a .zip method, but (unless I'm forgetting!) DataFrame does not. You
could .zip two RDDs you get from DataFrames and manually convert the R
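A minimal sketch of that .zip idea, assuming df1 and df2 are the two single-column frames from this thread; note RDD.zip requires both RDDs to have the same number of partitions and the same number of rows in each partition:

from pyspark.sql.types import StructType

# Pair rows positionally; Row is a tuple subclass, so + concatenates the fields
zipped = df1.rdd.zip(df2.rdd).map(lambda pair: pair[0] + pair[1])

# Rebuild a DataFrame with the combined schema (amount_6m, amount_9m)
combined_schema = StructType(df1.schema.fields + df2.schema.fields)
df_zipped = spark.createDataFrame(zipped, combined_schema)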
Thanks Raghvendra,
Will the ids for corresponding rows always be the same? Since
monotonically_increasing_id() returns a number based on the partition ID and the
row number within the partition, will it be the same for corresponding rows? Also,
is it guaranteed that the two dataframes will be divided into logic
You can add an extra id column and perform an inner join.
import org.apache.spark.sql.functions.monotonically_increasing_id

val df1_with_id = df1.withColumn("id", monotonically_increasing_id())
val df2_with_id = df2.withColumn("id", monotonically_increasing_id())
df1_with_id.join(df2_with_id, Seq("id"), "inner").drop("id").show()
+---------+---------+
|amount_6m|amount_9m|
+---------+---------+
...
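For reference, the same idea in PySpark might look like the sketch below; as discussed above, monotonically_increasing_id() values are not guaranteed to line up across two independently built DataFrames, so the pairing is best-effort:

from pyspark.sql import functions as F

df1_with_id = df1.withColumn("id", F.monotonically_increasing_id())
df2_with_id = df2.withColumn("id", F.monotonically_increasing_id())
df1_with_id.join(df2_with_id, "id", "inner").drop("id").show()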