Interesting point. As I understand it, the ShuffleManager ensures that a reduce task reads exactly one map output file per map partition, even when multiple speculative attempts succeed; it is not a random selection. As for which copy, if I am correct, much like other classical cases, Spark keeps the output of the attempt that completes first: the first successful attempt's output is registered and used, and the output of any later speculative attempt is ignored. This makes sense, as the reduce stage can then proceed with the earliest available data, minimizing the impact of speculative execution on job completion time, which is another important factor.
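To make that bookkeeping concrete, here is a minimal Scala sketch of the "first successful attempt wins" idea. The names (MapOutputRegistry, MapOutputInfo, registerIfFirst) are hypothetical illustrations, not Spark's actual internals; in Spark itself this tracking involves the scheduler and MapOutputTracker:

    import scala.collection.concurrent.TrieMap

    // Hypothetical record of where one map partition's shuffle output lives.
    case class MapOutputInfo(attemptId: Int, shuffleFile: String)

    class MapOutputRegistry {
      // partitionId -> output of the first attempt to finish successfully
      private val outputs = TrieMap.empty[Int, MapOutputInfo]

      // Register an attempt's output only if no earlier attempt has already
      // completed this partition; returns true if this attempt was registered.
      def registerIfFirst(partitionId: Int, info: MapOutputInfo): Boolean =
        outputs.putIfAbsent(partitionId, info).isEmpty

      // The reduce side resolves exactly one output per map partition.
      def outputFor(partitionId: Int): Option[MapOutputInfo] =
        outputs.get(partitionId)
    }

    // Two attempts of the same map partition both succeed:
    val registry = new MapOutputRegistry
    registry.registerIfFirst(0, MapOutputInfo(1, "shuffle_0_map0_attempt1"))  // true: used
    registry.registerIfFirst(0, MapOutputInfo(2, "shuffle_0_map0_attempt2"))  // false: ignored
    registry.outputFor(0)  // Some(MapOutputInfo(1, ...)): reducers read attempt 1's file

For reference, speculation itself is enabled with spark.speculation=true (tuned via spark.speculation.interval, spark.speculation.multiplier and spark.speculation.quantile).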
HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh


On Thu, 21 Dec 2023 at 17:51, Enrico Minack <i...@enrico.minack.dev> wrote:

> Hi Spark devs,
>
> I have a question around ShuffleManager: With speculative execution, one
> map output file is being created multiple times (by multiple task
> attempts). If both attempts succeed, which is to be read by the reduce
> task in the next stage? Is any map output as good as any other?
>
> Thanks for clarification,
> Enrico