You are right -- Spark can't do this with its current architecture. My question was: if there were a new implementation supporting pipelined execution, what kinds of Spark jobs would benefit significantly from it?
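
For concreteness, here is a minimal sketch (Scala, plain RDD API; the HDFS paths are just placeholders) of the kind of two-stage map-reduce job I have in mind. The reduceByKey introduces a shuffle, so with the current scheduler no reduce task is launched until every map task has finished:

    import org.apache.spark.sql.SparkSession

    object TwoStageExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("two-stage-map-reduce")
          .getOrCreate()
        val sc = spark.sparkContext

        // Stage 1 (map side): parse lines into (word, 1) pairs.
        val pairs = sc.textFile("hdfs:///tmp/input")   // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))

        // Stage 2 (reduce side): reduceByKey forces a shuffle, i.e. a stage boundary.
        // Today the DAGScheduler submits this stage only after the map stage completes;
        // pipelined execution would let these tasks start (and pre-fetch map output) earlier.
        val counts = pairs.reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///tmp/output")    // placeholder path
        spark.stop()
      }
    }

Even a job this small shows the stage barrier in the Spark UI: the map stage must report all tasks complete before the reduce stage is submitted.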
Thanks,

--- Sungwoo

On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com> wrote:

> I don't think Spark can do this with its current architecture. It has to
> wait for the step to be done; speculative execution isn't possible. Others
> probably know more about why that is.
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>
>> Hello Spark users,
>>
>> I have a question on the architecture of Spark (which could lead to a
>> research problem). In its current implementation, Spark finishes executing
>> all the tasks in a stage before proceeding to child stages. For example,
>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>> tasks before scheduling reduce tasks.
>>
>> We can think of another 'pipelined execution' strategy in which tasks in
>> child stages can be scheduled and executed concurrently with tasks in
>> parent stages. For example, for the two-stage map-reduce DAG, while map
>> tasks are being executed, we could schedule and execute reduce tasks in
>> advance if the cluster has enough resources. These reduce tasks could also
>> pre-fetch the output of map tasks.
>>
>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>> would be desirable while the current implementation is not quite adequate?
>> Since Spark tasks usually run for a short period of time, I guess the new
>> strategy would not bring a major performance improvement. However, there
>> might be some category of Spark jobs for which this new strategy would
>> clearly be a better choice.
>>
>> Thanks,
>>
>> --- Sungwoo