Hi Amit,

The only approach I can think of is to create two copies of df1, one partitioned on key1 and the other on key2, and then use those copies for the joins. (A rough sketch of this idea follows below the quoted message.)

________________________________
From: Amit Joshi <mailtojoshia...@gmail.com>
Sent: 04 October 2021 19:13
To: spark-user <user@spark.apache.org>
Subject: [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on different keys for same data frame
Hi spark users,

Can anyone please share their views on the topic?

Regards
Amit Joshi

On Sunday, October 3, 2021, Amit Joshi <mailtojoshia...@gmail.com> wrote:

Hi Spark-Users,

Hope you are doing well. I have been working on cases where one data frame is joined with more than one other data frame separately, on different columns, and these joins run frequently. I was wondering how to optimize the joins to make them faster. The datasets are large, so a broadcast join is not an option.

For example:

import org.apache.spark.sql.types._

val schema_df1 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df2 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df3 = new StructType()
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

Now we want to compute:

val join1 = df1.join(df2, "key1")
val join2 = df1.join(df3, "key2")

I was thinking of bucketing as a way to speed up the joins. But if I bucket df1 on key1, then join2 may not benefit, and vice versa (if I bucket df1 on key2). Should we bucket df1 twice, once on key1 and once on key2? Is there a strategy that makes both joins faster?

Regards
Amit Joshi
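Following up on the suggestion at the top: a minimal sketch of the "two copies of df1" / "bucket df1 twice" idea, assuming Spark 3.x and the Scala DataFrame API. The table names, the bucket count of 200, and the choice to also bucket df2 and df3 are illustrative assumptions, not something from this thread:

    // Sketch only: df1, df2, df3 are the data frames described above.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("double-bucketed-join").getOrCreate()

    // Persist two bucketed copies of df1, one clustered on each join key.
    df1.write.bucketBy(200, "key1").sortBy("key1").saveAsTable("df1_by_key1")
    df1.write.bucketBy(200, "key2").sortBy("key2").saveAsTable("df1_by_key2")

    // Bucket the other sides on the same keys with the same bucket count,
    // so each sort-merge join can avoid shuffling either side.
    df2.write.bucketBy(200, "key1").sortBy("key1").saveAsTable("df2_by_key1")
    df3.write.bucketBy(200, "key2").sortBy("key2").saveAsTable("df3_by_key2")

    // Each join now reads a copy of df1 already clustered on its own key.
    val join1 = spark.table("df1_by_key1").join(spark.table("df2_by_key1"), "key1")
    val join2 = spark.table("df1_by_key2").join(spark.table("df3_by_key2"), "key2")

The trade-off is that df1 is written and stored twice (plus the bucketed copies of df2 and df3), so this only pays off when the same joins are run repeatedly. For a one-off job, simply repartitioning df1 on each key (and caching the copies) before the joins may be the cheaper option.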