This may not be good advice but... could you sort by the partition key to ensure the partitions match up? Thinking of olden times :)
On Fri, Dec 23, 2022 at 4:42 AM Shivam Verma <raj.shivam...@gmail.com> wrote:

> Hi Gurunandan,
>
> Thanks for the reply!
>
> I do see the Exchange operator in the SQL tab, but I can see it in both
> experiments:
> 1. Using repartitioned dataframes
> 2. Using initial dataframes
>
> Does that mean the repartitioned dataframes are not actually
> "co-partitioned"?
> If that's the case, I have two more questions:
>
> 1. Why is the job with repartitioned dataframes at least 3x faster than
> the job using the initial dataframes?
> 2. How do I ensure co-partitioning for PySpark dataframes?
>
> Thanks,
> Shivam
>
> On Wed, Dec 14, 2022 at 5:58 PM Gurunandan <gurunandan....@gmail.com> wrote:
>
>> Hi,
>> One option for validation is to open the SQL tab in the Spark UI and
>> click on a query of interest to view detailed information about it. We
>> need to check whether an Exchange operator is present for the shuffle,
>> as shown in the attachment.
>>
>> Alternatively, we can print the executed plan and look for an Exchange
>> operator in the physical plan.
>>
>> On Wed, Dec 14, 2022 at 10:56 AM Shivam Verma <raj.shivam...@gmail.com> wrote:
>> >
>> > Hello folks,
>> >
>> > I have a use case where I save two PySpark dataframes as parquet
>> > files and then use them later to join with each other or with other
>> > tables and perform multiple aggregations.
>> >
>> > Since I know the column being used in the downstream joins and
>> > group-bys, I was hoping I could use co-partitioning for the two
>> > dataframes when saving them and avoid a shuffle later.
>> >
>> > I repartitioned the two dataframes, providing the same number of
>> > partitions and the same column for repartitioning.
>> >
>> > While I'm seeing an improvement in execution time with the above
>> > approach, how do I confirm that a shuffle is actually NOT happening
>> > (maybe through the Spark UI)?
>> > The Spark plan and shuffle read/write are the same in the two
>> > scenarios:
>> > 1. Using repartitioned dataframes to perform join+aggregation
>> > 2. Using the base dataframes themselves (without explicit
>> > repartitioning) to perform join+aggregation
>> >
>> > I have a StackOverflow post with more details:
>> > https://stackoverflow.com/q/74771971/14741697
>> >
>> > Thanks in advance, appreciate your help!
>> >
>> > Regards,
>> > Shivam
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com
LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>
datasyndrome.com
Book a time on Calendly <https://calendly.com/rjurney_personal/30min>