Can you share the stages as seen in the spark ui for the count and coalesce
jobs

My suggestion of moving things around was just for troubleshooting rather
than a solution of that wasn't clear before

On Mon, 31 Jan 2022, 08:07 Benjamin Du, <[email protected]> wrote:

> Remvoing coalesce didn't help either.
>
>
>
> Best,
>
> ----
>
> Ben Du
>
> Personal Blog <http://www.legendu.net/> | GitHub
> <https://github.com/dclong/> | Bitbucket <https://bitbucket.org/dclong/>
> | Docker Hub <https://hub.docker.com/r/dclong/>
> ------------------------------
> *From:* Deepak Sharma <[email protected]>
> *Sent:* Sunday, January 30, 2022 12:45 AM
> *To:* Benjamin Du <[email protected]>
> *Cc:* [email protected] <[email protected]>
> *Subject:* Re: A Persisted Spark DataFrame is computed twice
>
> coalesce returns a new dataset.
> That will cause the recomputation.
>
> Thanks
> Deepak
>
> On Sun, 30 Jan 2022 at 14:06, Benjamin Du <[email protected]> wrote:
>
> I have some PySpark code like below. Basically, I persist a DataFrame
> (which is time-consuming to compute) to disk, call the method
> DataFrame.count to trigger the caching/persist immediately, and then I
> coalesce the DataFrame to reduce the number of partitions (the original
> DataFrame has 30,000 partitions) and output it to HDFS. Based on the
> execution time of job stages and the execution plan, it seems to me that
> the DataFrame is recomputed at df.coalesce(300). Does anyone know why
> this happens?
>
> df = spark.read.parquet("/input/hdfs/path") \
>     .filter(...) \
>     .withColumn("new_col", my_pandas_udf("col0", "col1")) \
>     .persist(StorageLevel.DISK_ONLY)
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>
>
> BTW, it works well if I manually write the DataFrame to HDFS, read it
> back, coalesce it and write it back to HDFS.
> Originally post at
> https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.
> <https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice>
>
> Best,
>
> ----
>
> Ben Du
>
> Personal Blog <http://www.legendu.net/> | GitHub
> <https://github.com/dclong/> | Bitbucket <https://bitbucket.org/dclong/>
> | Docker Hub <https://hub.docker.com/r/dclong/>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>

Reply via email to