That sounds like a plan, as Sean suggested. I have also seen that caching the result before the coalesce provides benefits, especially for a mere 50MB of data. Check the Spark UI Storage tab for its effect.
HTH

Mich

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

*Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Wed, 3 Feb 2021 at 19:08, Sean Owen <sro...@gmail.com> wrote:

> Probably could also be because that coalesce can cause some upstream
> transformations to also have parallelism of 1. I think (?) an OK solution
> is to cache the result, then coalesce and write. Or combine the files after
> the fact, or do what Silvio said.
>
> On Wed, Feb 3, 2021 at 12:55 PM James Yu <ja...@ispot.tv> wrote:
>
>> Hi Team,
>>
>> We are running into a poor performance issue and are seeking your
>> suggestions on how to improve it:
>>
>> We have a particular dataset which we aggregate from other datasets and
>> would like to write out to one single file (because it is small enough).
>> We found that after a series of transformations (GROUP BYs, FLATMAPs),
>> we coalesced the final RDD to 1 partition before writing it out, and
>> this coalesce degraded performance. It was not that the coalesce
>> operation itself took additional runtime, but rather that it somehow
>> dictated the number of partitions used in the upstream transformations.
>>
>> We hope there is a simple and useful way to solve this kind of issue,
>> which we believe is quite common for many people.
>>
>> Thanks
>>
>> James
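For the archives, the cache-then-coalesce pattern Sean describes can be sketched roughly as below (a minimal Scala sketch; the input path, column name, and output path are illustrative, not from James's job):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-demo").getOrCreate()

// Hypothetical aggregation standing in for the GROUP BYs / FLATMAPs.
val aggregated = spark.read.parquet("/data/in")
  .groupBy("key")
  .count()

// Materialize the result with full parallelism BEFORE coalescing.
// Without this, coalesce(1) can collapse the upstream stages to a
// single task as well, which is the slowdown James observed.
aggregated.cache()
aggregated.count()  // force evaluation; check the Spark UI Storage tab

// Now only the final write runs on one partition.
aggregated.coalesce(1).write.parquet("/data/out")
```

Alternatively, `repartition(1)` instead of `coalesce(1)` inserts a shuffle boundary, which also keeps the upstream stages parallel, at the cost of that extra shuffle.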