The FetchFailed error from Task B is forwarded to the DAGScheduler as well. A FetchFailed already means the map output of the upstream stage is missing, so the DAGScheduler will resubmit that upstream stage, which in the end reschedules the upstream task of Task B.
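To make that control flow concrete, below is a minimal, self-contained Scala sketch of the decision. It is a toy model for illustration only, not actual DAGScheduler code; the MapStage/FetchFailed types and the executor/partition names are assumptions made up for the example.

    // Toy model of FetchFailed handling, not Spark source code.
    sealed trait TaskEndReason
    case class FetchFailed(mapStageId: Int, execId: String) extends TaskEndReason
    case object Success extends TaskEndReason

    // A map stage and which shuffle partitions each executor hosts output for.
    case class MapStage(id: Int, outputsByExecutor: Map[String, Set[Int]])

    object FetchFailedSketch {
      def handleTaskEnd(reason: TaskEndReason, stages: Map[Int, MapStage]): Unit = reason match {
        case FetchFailed(mapStageId, execId) =>
          val stage = stages(mapStageId)
          // The failed fetch tells the scheduler that the map outputs hosted
          // on execId are gone, so they are treated as missing ...
          val lost = stage.outputsByExecutor.getOrElse(execId, Set.empty)
          println(s"Map outputs $lost of stage $mapStageId on $execId are missing")
          // ... and the upstream (map) stage is resubmitted, which is what
          // ends up re-running the upstream task of Task B.
          println(s"Resubmitting stage $mapStageId to recompute ${lost.size} partitions")
        case Success =>
          () // a successful task changes nothing here
      }

      def main(args: Array[String]): Unit = {
        val stages = Map(1 -> MapStage(1, Map("executorA" -> Set(0, 1), "executorB" -> Set(2))))
        handleTaskEnd(FetchFailed(mapStageId = 1, execId = "executorA"), stages)
      }
    }

Running it prints the two steps: the map outputs hosted on executor A are marked missing, and the upstream stage is resubmitted so those partitions (and therefore the upstream task of Task B) are recomputed.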
On Mon, Sep 14, 2020 at 10:39 AM 陈晓宇 <xychen0...@gmail.com> wrote:

> Thanks Yi Wu and Sean. Here I mean shuffle data, and without the shuffle
> service.
>
> spark.blacklist.application.fetchFailure.enabled=true seems to be the
> answer; I was not aware of it, thanks for pointing it out, and I will
> give it a try.
>
> However, I doubt how it would work: when Task B reports FetchFailed,
> this blacklist flag can be used to identify executor A, and no more
> tasks will be scheduled on executor A. But would the upstream task of
> Task B (which was previously running on executor A) be re-scheduled by
> the DAG scheduler? The DAG scheduler only reschedules a task if it
> thinks the output of the task is missing (please correct me if I am
> wrong). And unless executor A fails to report a heartbeat for the
> timeout period, the driver still believes the output is there on
> executor A.
>
> Thanks again.
>
> On Fri, Sep 11, 2020 at 9:24 PM Yi Wu <yi...@databricks.com> wrote:
>
>> What do you mean by "read from executor A"? I can think of several
>> paths for an executor to read something from another remote executor:
>>
>> 1. Shuffle data
>> If the executor fails to fetch the shuffle data, I think it will
>> result in a FetchFailed for the task. For this case, the blacklist can
>> identify the problematic executor A if
>> spark.blacklist.application.fetchFailure.enabled=true.
>>
>> 2. RDD block
>> If the executor fails to fetch RDD blocks, I think the task would just
>> do the computation by itself instead of failing.
>>
>> 3. Broadcast block
>> If the executor fails to fetch the broadcast block, the task seems to
>> fail in this case and the blacklist doesn't handle it well.
>>
>> Thanks,
>> Yi
>>
>> On Fri, Sep 11, 2020 at 8:43 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> -dev, +user
>>> Executors do not communicate directly, so I don't think that's quite
>>> what you are seeing. You'd have to clarify.
>>>
>>> On Fri, Sep 11, 2020 at 12:08 AM 陈晓宇 <xychen0...@gmail.com> wrote:
>>> >
>>> > Hello all,
>>> >
>>> > We've been using Spark 2.3 with the blacklist enabled and often hit
>>> > the problem that when executor A has some problem (like a connection
>>> > issue), tasks on executor B and executor C fail saying they cannot
>>> > read from executor A. Finally the job fails because a task on
>>> > executor B failed 4 times.
>>> >
>>> > I wonder whether there is any existing fix or discussion on how to
>>> > identify executor A as the problem node.
>>> >
>>> > Thanks
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
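For anyone landing on this thread later, here is how the setting discussed above can be turned on. This is only a minimal sketch under assumptions: the application name is a placeholder, the master/deploy mode are assumed to come from spark-submit, and the same keys can equally be passed with spark-submit --conf.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: enable application-level blacklisting of executors that
    // cause fetch failures (the flag discussed in this thread). The app name
    // is a placeholder; master/deploy mode are assumed to come from spark-submit.
    object BlacklistFetchFailureExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("blacklist-fetch-failure-example")
          .config("spark.blacklist.enabled", "true")
          .config("spark.blacklist.application.fetchFailure.enabled", "true")
          .getOrCreate()

        // ... run the job as usual; once a task hits FetchFailed, the executor
        // serving that shuffle data is blacklisted for the rest of the application.

        spark.stop()
      }
    }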