You are right -- Spark can't do this with its current architecture. My
question was: if there were a new implementation supporting pipelined
execution, what kinds of Spark jobs would benefit significantly from it?

Thanks,

--- Sungwoo

On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com>
wrote:

> I don't think Spark can do this with its current architecture. It has to
> wait for the stage to be done; speculative execution isn't possible. Others
> probably know more about why that is.
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>
>> Hello Spark users,
>>
>> I have a question on the architecture of Spark (which could lead to a
>> research problem). In its current implementation, Spark finishes executing
>> all the tasks in a stage before proceeding to child stages. For example,
>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>> tasks before scheduling reduce tasks.
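>>
>> For concreteness, here is a minimal two-stage job in Scala (the paths and
>> names are purely illustrative); the shuffle introduced by reduceByKey
>> marks the stage boundary described above:
>>
>>   import org.apache.spark.sql.SparkSession
>>
>>   object StageBarrierExample {
>>     def main(args: Array[String]): Unit = {
>>       val spark = SparkSession.builder.appName("stage-barrier").getOrCreate()
>>       val sc = spark.sparkContext
>>
>>       // Stage 0 (map tasks): read, split, and tag each word.
>>       val pairs = sc.textFile("hdfs:///input")
>>         .flatMap(_.split("\\s+"))
>>         .map(w => (w, 1))
>>
>>       // Stage 1 (reduce tasks): scheduled only after every map task
>>       // above has finished and written its shuffle output.
>>       val counts = pairs.reduceByKey(_ + _)
>>       counts.saveAsTextFile("hdfs:///output")
>>
>>       spark.stop()
>>     }
>>   }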
>>
>> We can think of another 'pipelined execution' strategy in which tasks in
>> child stages can be scheduled and executed concurrently with tasks in
>> parent stages. For example, for the two-stage map-reduce DAG, while map
>> tasks are being executed, we could schedule and execute reduce tasks in
>> advance if the cluster has enough resources. These reduce tasks can also
>> pre-fetch the output of map tasks.
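>>
>> To make the idea concrete, here is a toy simulation in plain Scala (no
>> Spark APIs -- everything below is a hypothetical sketch): a 'reduce'
>> task is scheduled concurrently with the 'map' tasks and consumes each
>> map output as soon as it is published:
>>
>>   import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch}
>>   import scala.concurrent.ExecutionContext.Implicits.global
>>   import scala.concurrent.duration._
>>   import scala.concurrent.{Await, Future}
>>
>>   object PipelinedSketch {
>>     def main(args: Array[String]): Unit = {
>>       val numMaps    = 4
>>       val mapOutputs = new ConcurrentLinkedQueue[Integer]()
>>       val mapsDone   = new CountDownLatch(numMaps)
>>
>>       // 'Map stage': each task publishes its partial output when done.
>>       (1 to numMaps).foreach { i =>
>>         Future { Thread.sleep(100L * i); mapOutputs.add(i); mapsDone.countDown() }
>>       }
>>
>>       // 'Reduce task': launched immediately, pre-fetching map outputs
>>       // as they arrive instead of waiting for the whole map stage.
>>       val reduce = Future {
>>         var sum = 0
>>         while (mapsDone.getCount > 0 || !mapOutputs.isEmpty) {
>>           val v = mapOutputs.poll()
>>           if (v != null) sum += v else Thread.sleep(10)
>>         }
>>         sum
>>       }
>>
>>       println(s"reduced value: ${Await.result(reduce, 5.seconds)}")
>>     }
>>   }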
>>
>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>> would be desirable and the current implementation is inadequate? Since
>> Spark tasks usually run for a short period of time, I guess the new
>> strategy would not yield a major performance improvement in general.
>> However, there might be some categories of Spark jobs for which the new
>> strategy would clearly be a better choice.
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>>
