Can you share the stages as seen in the Spark UI for the count and coalesce
jobs?

My suggestion of moving things around was just for troubleshooting rather
than a solution, if that wasn't clear before.
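
If it helps, the physical plan will also show whether the write is reading
from the persisted data or recomputing it. A quick sketch: after the count,
a scan served from the persisted data should show up as InMemoryTableScan
rather than a fresh FileScan.

df.count()
df.coalesce(300).explain()  # look for InMemoryTableScan vs. FileScan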

On Mon, 31 Jan 2022, 08:07 Benjamin Du, <legendu....@outlook.com> wrote:

> Removing coalesce didn't help either.
>
>
>
> Best,
>
> ----
>
> Ben Du
>
> Personal Blog <http://www.legendu.net/> | GitHub
> <https://github.com/dclong/> | Bitbucket <https://bitbucket.org/dclong/>
> | Docker Hub <https://hub.docker.com/r/dclong/>
> ------------------------------
> *From:* Deepak Sharma <deepakmc...@gmail.com>
> *Sent:* Sunday, January 30, 2022 12:45 AM
> *To:* Benjamin Du <legendu....@outlook.com>
> *Cc:* u...@spark.incubator.apache.org <u...@spark.incubator.apache.org>
> *Subject:* Re: A Persisted Spark DataFrame is computed twice
>
> coalesce returns a new dataset.
> That will cause the recomputation.
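>
> For instance (a minimal sketch), coalesce does not modify the persisted
> DataFrame in place; it returns a new DataFrame with its own plan:
>
> df2 = df.coalesce(300)
> print(df is df2)  # False: df is unchanged, df2 is a separate object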
>
> Thanks
> Deepak
>
> On Sun, 30 Jan 2022 at 14:06, Benjamin Du <legendu....@outlook.com> wrote:
>
> I have some PySpark code like below. Basically, I persist a DataFrame
> (which is time-consuming to compute) to disk, call the method
> DataFrame.count to trigger the caching/persist immediately, and then I
> coalesce the DataFrame to reduce the number of partitions (the original
> DataFrame has 30,000 partitions) and output it to HDFS. Based on the
> execution time of job stages and the execution plan, it seems to me that
> the DataFrame is recomputed at df.coalesce(300). Does anyone know why
> this happens?
>
> from pyspark import StorageLevel
>
> df = spark.read.parquet("/input/hdfs/path") \
>     .filter(...) \
>     .withColumn("new_col", my_pandas_udf("col0", "col1")) \
>     .persist(StorageLevel.DISK_ONLY)
> df.count()  # trigger the persist immediately
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>
>
> BTW, it works well if I manually write the DataFrame to HDFS, read it
> back, coalesce it, and write it back to HDFS.
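>
> Roughly, that workaround looks like this (a sketch; the intermediate path
> "/tmp/df_checkpoint" is just a hypothetical placeholder):
>
> df.write.mode("overwrite").parquet("/tmp/df_checkpoint")
> df2 = spark.read.parquet("/tmp/df_checkpoint")
> df2.coalesce(300).write.mode("overwrite").parquet(output_mod)
>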
> Originally posted at
> https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice
>
> Best,
>
> ----
>
> Ben Du
>
> Personal Blog <http://www.legendu.net/> | GitHub
> <https://github.com/dclong/> | Bitbucket <https://bitbucket.org/dclong/>
> | Docker Hub <https://hub.docker.com/r/dclong/>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
