Can you share the stages as seen in the Spark UI for the count and coalesce jobs?
My suggestion of moving things around was just for troubleshooting rather than a solution, if that wasn't clear before.

On Mon, 31 Jan 2022, 08:07 Benjamin Du, <legendu....@outlook.com> wrote:

> Removing coalesce didn't help either.
>
> Best,
>
> ----
>
> Ben Du
>
> Personal Blog <http://www.legendu.net/> | GitHub <https://github.com/dclong/> |
> Bitbucket <https://bitbucket.org/dclong/> | Docker Hub <https://hub.docker.com/r/dclong/>
> ------------------------------
> *From:* Deepak Sharma <deepakmc...@gmail.com>
> *Sent:* Sunday, January 30, 2022 12:45 AM
> *To:* Benjamin Du <legendu....@outlook.com>
> *Cc:* u...@spark.incubator.apache.org <u...@spark.incubator.apache.org>
> *Subject:* Re: A Persisted Spark DataFrame is computed twice
>
> coalesce returns a new dataset.
> That will cause the recomputation.
>
> Thanks
> Deepak
>
> On Sun, 30 Jan 2022 at 14:06, Benjamin Du <legendu....@outlook.com> wrote:
>
> I have some PySpark code like below. Basically, I persist a DataFrame
> (which is time-consuming to compute) to disk, call the method
> DataFrame.count to trigger the caching/persist immediately, and then I
> coalesce the DataFrame to reduce the number of partitions (the original
> DataFrame has 30,000 partitions) and output it to HDFS. Based on the
> execution time of job stages and the execution plan, it seems to me that
> the DataFrame is recomputed at df.coalesce(300). Does anyone know why
> this happens?
>
> df = spark.read.parquet("/input/hdfs/path") \
>     .filter(...) \
>     .withColumn("new_col", my_pandas_udf("col0", "col1")) \
>     .persist(StorageLevel.DISK_ONLY)
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>
> BTW, it works well if I manually write the DataFrame to HDFS, read it
> back, coalesce it, and write it back to HDFS.
> Originally posted at
> https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice
>
> Best,
>
> ----
>
> Ben Du
>
> Personal Blog <http://www.legendu.net/> | GitHub <https://github.com/dclong/> |
> Bitbucket <https://bitbucket.org/dclong/> | Docker Hub <https://hub.docker.com/r/dclong/>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
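
For reference, a rough sketch of the write/read-back workaround mentioned at the end of the thread (write the expensive result to HDFS, read it back, then coalesce and write). It assumes my_pandas_udf and output_mod are defined as in the original snippet; the intermediate path /tmp/df_intermediate is just a placeholder, and the filter(...) condition is left elided as in the original:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Compute the expensive DataFrame once and materialize it to HDFS
    # instead of persisting it.
    intermediate = "/tmp/df_intermediate"  # placeholder path
    df = spark.read.parquet("/input/hdfs/path") \
        .filter(...) \
        .withColumn("new_col", my_pandas_udf("col0", "col1"))
    df.write.mode("overwrite").parquet(intermediate)

    # Read the materialized result back; coalesce then only reduces the
    # partition count of data already on disk rather than re-running
    # the filter and the pandas UDF.
    spark.read.parquet(intermediate) \
        .coalesce(300) \
        .write.mode("overwrite").parquet(output_mod)

This costs an extra round trip to HDFS, but as Ben reported it avoids the second computation that shows up with persist + coalesce.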