Hi Gurunandan,

Thanks for the reply!

I do see the Exchange operator in the SQL tab, but it appears in both
experiments:
1. Using the repartitioned dataframes
2. Using the initial dataframes

Does that mean that the repartitioned dataframes are not actually
"co-partitioned"?
If that's the case, I have two more questions:

1. Why is the job with repartitioned dataframes at least 3x faster than the
job using the initial dataframes?
2. How do I ensure co-partitioning for pyspark dataframes?

Thanks,
Shivam



On Wed, Dec 14, 2022 at 5:58 PM Gurunandan <gurunandan....@gmail.com> wrote:

> Hi,
> One of the options for validation is to navigate `SQL TAB` in Spark UI
> and click on a Query of interest to view detailed information of each
> Query. We need to validate if the Exchange Operator is present for
> shuffle, like shared in the attachment.
>
> Alternatively, we can print the executed plan and check for an Exchange
> operator in the physical plan.
>
> On Wed, Dec 14, 2022 at 10:56 AM Shivam Verma <raj.shivam...@gmail.com>
> wrote:
> >
> > Hello folks,
> >
> > I have a use case where I save two pyspark dataframes as parquet files
> and then use them later to join with each other or with other tables and
> perform multiple aggregations.
> >
> > Since I know the column being used in the downstream joins and groupby,
> I was hoping I could use co-partitioning for the two dataframes when saving
> them and avoid shuffle later.
> >
> > I repartitioned the two dataframes (providing the same number of
> partitions and the same column for repartitioning).
> >
> > While I'm seeing an improvement in execution time with the above
> approach, how do I confirm that a shuffle is actually NOT happening (maybe
> through SparkUI)?
> > The Spark plan and shuffle read/write are the same in the two scenarios:
> > 1. Using repartitioned dataframes to perform the join+aggregation
> > 2. Using the base dataframes themselves (without explicit repartitioning)
> to perform the join+aggregation
> >
> > I have a StackOverflow post with more details regarding the same:
> > https://stackoverflow.com/q/74771971/14741697
> >
> > Thanks in advance, appreciate your help!
> >
> > Regards,
> > Shivam
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
