This may not be good advice but... could you sort by the partition key to ensure the partitions match up? Thinking of olden times :)
On Fri, Dec 23, 2022 at 4:42 AM Shivam Verma <raj.shivam...@gmail.com> wrote:

> Hi Gurunandan,
>
> Thanks for the reply!
>
> I do see the Exchange operator in the SQL tab, but I can see it in both
> experiments:
> 1. Using repartitioned dataframes
> 2. Using initial dataframes
>
> Does that mean the repartitioned dataframes are not actually
> "co-partitioned"?
> If that's the case, I have two more questions:
>
> 1. Why is the job with repartitioned dataframes at least 3x faster than
> the job using the initial dataframes?
> 2. How do I ensure co-partitioning for PySpark dataframes?
>
> Thanks,
> Shivam
>
> On Wed, Dec 14, 2022 at 5:58 PM Gurunandan <gurunandan....@gmail.com> wrote:
>
>> Hi,
>> One option for validation is to open the SQL tab in the Spark UI and
>> click on a query of interest to view detailed information about it. We
>> need to check whether an Exchange operator is present for the shuffle,
>> as shown in the attachment.
>>
>> Alternatively, we can print the executed plan and look for an Exchange
>> operator in the physical plan.
>>
>> On Wed, Dec 14, 2022 at 10:56 AM Shivam Verma <raj.shivam...@gmail.com> wrote:
>> >
>> > Hello folks,
>> >
>> > I have a use case where I save two PySpark dataframes as parquet
>> > files and then use them later to join with each other or with other
>> > tables and perform multiple aggregations.
>> >
>> > Since I know the column being used in the downstream joins and
>> > group-bys, I was hoping I could use co-partitioning for the two
>> > dataframes when saving them and avoid a shuffle later.
>> >
>> > I repartitioned the two dataframes, providing the same number of
>> > partitions and the same column for repartitioning.
>> >
>> > While I'm seeing an improvement in execution time with the above
>> > approach, how do I confirm that a shuffle is actually NOT happening
>> > (maybe through the Spark UI)?
>> > The Spark plan and shuffle read/write are the same in the two
>> > scenarios:
>> > 1. Using repartitioned dataframes to perform join+aggregation
>> > 2. Using the base dataframes themselves (without explicit
>> > repartitioning) to perform join+aggregation
>> >
>> > I have a StackOverflow post with more details:
>> > https://stackoverflow.com/q/74771971/14741697
>> >
>> > Thanks in advance, appreciate your help!
>> >
>> > Regards,
>> > Shivam
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com
LI <http://linkedin.com/in/russelljurney> FB <http://facebook.com/jurney>
datasyndrome.com
Book a time on Calendly <https://calendly.com/rjurney_personal/30min>