Hi,
as always, I would like to identify the problem before trying to solve it.
So, to isolate the problem, first write the data out to a storage location
without coalesce and check the time.
Then coalesce to one partition, write the data out again, and check the time.
If the difference between the two write times is very large, then the issue
is coalesce. Otherwise the issue is the chain of transformations before
coalesce.
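
A rough sketch of that comparison, assuming PySpark (the thread does not say
which language is in use). The helper names `timed` and `compare_writes` and
the output sub-paths are made up for illustration; only `coalesce` and
`saveAsTextFile` are actual RDD methods.

```python
import time


def timed(label, action):
    """Run a zero-argument callable and return (label, elapsed seconds)."""
    start = time.perf_counter()
    action()
    return label, time.perf_counter() - start


def compare_writes(rdd, base_path):
    """Write the same RDD with and without coalesce(1) and time both runs.

    `rdd` is expected to be a pyspark RDD; `base_path` is a hypothetical
    output prefix. saveAsTextFile fails if the target directory already
    exists, so both sub-paths must be fresh.
    """
    runs = [
        timed("no coalesce",
              lambda: rdd.saveAsTextFile(base_path + "/plain")),
        timed("coalesce(1)",
              lambda: rdd.coalesce(1).saveAsTextFile(base_path + "/single")),
    ]
    for label, seconds in runs:
        print(f"{label}: {seconds:.1f}s")
    return dict(runs)
```

Within one application the second run may reuse shuffle output from the
first, so running each variant as a separate job probably gives a fairer
comparison.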
Anyway, it's 2021, and I am always puzzled when people use RDDs. Is there
any particular reason why DataFrames would not work?
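
For the "stage boundary" question quoted below, here is a minimal sketch with
the DataFrame API, again assuming PySpark. The function name
`write_single_file`, its `shuffle_first` flag, and the Parquet output format
are illustrative choices of mine, not something from the original job.

```python
def write_single_file(df, path, shuffle_first=True):
    """Write `df` as a single output file under `path` (hypothetical path).

    shuffle_first=True  -> repartition(1): adds a shuffle, so the upstream
                           stage keeps its parallelism and one extra task
                           reads the shuffled rows and writes the file.
    shuffle_first=False -> coalesce(1): no shuffle, so the final stage,
                           including the transformations in it, runs as a
                           single task.
    """
    single = df.repartition(1) if shuffle_first else df.coalesce(1)
    # mode("overwrite") replaces any existing output at `path`.
    single.write.mode("overwrite").parquet(path)
    return single
```

If you stay on RDDs, `rdd.repartition(1)` or `rdd.coalesce(1, shuffle=True)`
should introduce the same kind of boundary.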


Regards,
Gourav Sengupta

On Wed, Feb 3, 2021 at 7:20 PM James Yu <ja...@ispot.tv> wrote:

> Hi Silvio,
>
> The result file is less than 50 MB in size so I think it is small and
> acceptable enough for one task to write.
>
> Your suggestion sounds interesting. Could you guide us further on how to
> easily "add a stage boundary"?
>
> Thanks
> ------------------------------
> *From:* Silvio Fiorito <silvio.fior...@granturing.com>
> *Sent:* Wednesday, February 3, 2021 11:05 AM
> *To:* James Yu <ja...@ispot.tv>; user <user@spark.apache.org>
> *Subject:* Re: Poor performance caused by coalesce to 1
>
>
> Coalesce is reducing the parallelization of your last stage, in your case
> to 1 task. So, it’s natural it will give poor performance especially with
> large data. If you absolutely need a single file output, you can instead
> add a stage boundary and use repartition(1). This will give your query full
> parallelization during processing while at the end giving you a single task
> that writes data out. Note that if the file is large (e.g. 1 GB or more)
> you’ll probably still notice slowness while writing. You may want to
> reconsider the 1-file requirement for larger datasets.
>
>
>
> *From: *James Yu <ja...@ispot.tv>
> *Date: *Wednesday, February 3, 2021 at 1:54 PM
> *To: *user <user@spark.apache.org>
> *Subject: *Poor performance caused by coalesce to 1
>
>
>
> Hi Team,
>
>
>
> We are running into this poor performance issue and seeking your
> suggestion on how to improve it:
>
>
>
> We have a particular dataset which we aggregate from other datasets and
> would like to write out to one single file (because it is small enough).
> After a series of transformations (GROUP BYs, FLATMAPs), we coalesced the
> final RDD to 1 partition before writing it out, and we found that this
> coalesce degrades performance. It is not that the coalesce operation
> itself takes additional runtime; rather, it somehow dictates the number of
> partitions used in the upstream transformations.
>
>
>
> We hope there is a simple and useful way to solve this kind of issue which
> we believe is quite common for many people.
>
>
>
>
>
> Thanks
>
>
>
> James
>
