Hi Gurunandan,

Thanks for the reply!
I do see the exchange operator in the SQL tab, but I can see it in both experiments:
1. Using repartitioned dataframes
2. Using the initial dataframes

Does that mean the repartitioned dataframes are not actually "co-partitioned"? If so, I have two more questions:
1. Why is the job with repartitioned dataframes at least 3x faster than the job using the initial dataframes?
2. How do I ensure co-partitioning for PySpark dataframes?

Thanks,
Shivam

On Wed, Dec 14, 2022 at 5:58 PM Gurunandan <gurunandan....@gmail.com> wrote:
> Hi,
> One option for validation is to navigate to the SQL tab in the Spark UI
> and click on a query of interest to view detailed information about it.
> We need to check whether an Exchange operator is present, indicating a
> shuffle, as shown in the attachment.
>
> Alternatively, we can print the executed plan and look for an Exchange
> operator in the physical plan.
>
> On Wed, Dec 14, 2022 at 10:56 AM Shivam Verma <raj.shivam...@gmail.com> wrote:
> >
> > Hello folks,
> >
> > I have a use case where I save two PySpark dataframes as parquet files
> > and then use them later to join with each other or with other tables
> > and perform multiple aggregations.
> >
> > Since I know the column being used in the downstream joins and
> > groupbys, I was hoping I could use co-partitioning for the two
> > dataframes when saving them and avoid a shuffle later.
> >
> > I repartitioned the two dataframes, providing the same number of
> > partitions and the same column for repartitioning.
> >
> > While I'm seeing an improvement in execution time with the above
> > approach, how do I confirm that a shuffle is actually NOT happening
> > (maybe through the Spark UI)?
> > The Spark plan and shuffle read/write are the same in the two scenarios:
> > 1. Using repartitioned dataframes to perform the join + aggregation
> > 2. Using the base dataframes themselves (without explicit
> > repartitioning) to perform the join + aggregation
> >
> > I have a StackOverflow post with more details:
> > https://stackoverflow.com/q/74771971/14741697
> >
> > Thanks in advance, I appreciate your help!
> >
> > Regards,
> > Shivam
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org