Hi Bo,

In the current Blink implementation, the failover strategy restarts the upstream task region only for certain special exceptions reported by a failed downstream task. As you said, once a partition has been consumed by a downstream task, it is removed and cannot be consumed again, even though the data is still available on disk.
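To make the consume-once behavior concrete, here is a minimal toy simulation. It is not Blink or Flink code; the class and method names are hypothetical. It assumes that a Reduce attempt reads its inputs in a fixed order (Map 1 to Map n), that reading a partition removes it, that the first missing partition fails the attempt (standing in for PartitionNotFoundException), and that only the producer of that missing partition is restarted. Under these assumptions it reproduces the restart counts from Bo's word count timeline quoted further down in this thread.

```java
import java.util.Arrays;

/** Toy model of the consume-once retry cascade; not Flink or Blink code. */
final class ConsumeOnceRetrySimulation {

    /**
     * Simulates the recovery of one Reduce task that reads from n Map tasks.
     * Returns {mapRestarts, reduceRestarts}.
     */
    static int[] simulate(int n) {
        // available[i] is true while the partition of Map i can still be read.
        // Assumption: the failed first Reduce attempt already consumed everything.
        boolean[] available = new boolean[n];
        int mapRestarts = 0;
        int reduceRestarts = 0;
        while (true) {
            reduceRestarts++; // a new Reduce attempt starts
            int missing = -1;
            for (int i = 0; i < n; i++) {
                if (!available[i]) {
                    missing = i; // stands in for PartitionNotFoundException
                    break;
                }
                available[i] = false; // reading removes the partition
            }
            if (missing < 0) {
                // Every input was read in this attempt, so the Reduce task finishes.
                return new int[] {mapRestarts, reduceRestarts};
            }
            available[missing] = true; // the producer restarts and re-produces
            mapRestarts++;
        }
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++) {
            System.out.println("n=" + n + " -> " + Arrays.toString(simulate(n)));
        }
    }
}
```

For n = 2 the simulation returns 3 Map restarts and 4 Reduce restarts. The availability flags behave like a binary counter that is incremented once per failed Reduce attempt, which is where the 2^n - 1 and 2^n counts come from in this model.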
The key problem, as I mentioned in my last email, is the lack of partition management on the JM side. The JM is currently aware only of execution states, not of the corresponding partition states, when deciding which regions should be restarted. This results in unnecessary restarts after a partition is removed from the TaskManager shuffle service. As I mentioned, the newly proposed shuffle manager architecture already covers partition management in the ShuffleMaster component on the JM side. That means the failover strategy could get the correct partition state via ShuffleMaster#getFeature on the JM side, so it can make the proper decision about which regions to restart. The partition state might be updated via communication from the ShuffleService on the TM side to the ShuffleMaster on the JM side. @Andrey Zagrebin and I are working on this feature now, and I think this architecture would address all your concerns.

Best,
Zhijiang

------------------------------------------------------------------
From: Bo WANG <wbeaglewatc...@gmail.com>
Send Time: Monday, January 28, 2019, 15:11
To: dev <dev@flink.apache.org>
Cc: wangzhijiang999 <wangzhijiang...@aliyun.com>; Kurt Young <k...@apache.org>
Subject: Re: [DISCUSS] Shall we make SpillableSubpartition repeatedly readable to support fine grained recovery

Thanks all for the replies.

Though the current implementation in Blink can handle some cases of the problem, the solution is neither straightforward nor efficient enough, and it can result in unnecessary restarts of terminated upstream tasks. The following word count job with parallelism 2 demonstrates the retry process:

T0: Map(1/2)#1, Map(2/2)#1 - both terminate and produce partitions for Reduce(1/2) and Reduce(2/2)
T1: Reduce(1/2)#1, Reduce(2/2)#1 - Reduce(1/2)#1 throws a RuntimeException and fails due to error injection, Reduce(2/2)#1 terminates.
T3: Reduce(1/2)#2 restarts and throws PartitionNotFoundException of Map(1/2)
T4: Map(1/2)#2 restarts, produces partitions, and terminates
T5: Reduce(1/2)#3 restarts and successfully consumes partitions produced by Map(1/2)#2, but throws PartitionNotFoundException of Map(2/2)
T6: Map(2/2)#2 restarts, produces partitions, and terminates
T7: Reduce(1/2)#4 restarts and throws PartitionNotFoundException of Map(1/2) (since the partition produced by Map(1/2)#2 has been consumed by Reduce(1/2)#3)
T8: Map(1/2)#3 restarts, produces partitions, and terminates
T9: Reduce(1/2)#5 restarts, consumes partitions from Map(1/2)#3 and Map(2/2)#2, and terminates.

Thus, in total, Map restarts 3 times and Reduce restarts 4 times in this failover. We can conclude that with a Map vertex of parallelism n, Map will restart 2^n - 1 times and Reduce will restart 2^n times.

Each time Reduce(1/2) retries, all subpartitions produced by Map(1/2) are consumed and thus cannot be reused for a possible later retry of the downstream task. In fact, Map(1/2) would not need to restart every time, since its output would still be usable if partitions were repeatedly readable.

Based on this observation, we propose to improve the current implementation in Blink by keeping the output for reuse after possible later failures and restarting a task only if its output is missing or corrupted. A repeatedly readable SpillableSubpartition and restarting the upstream producer are complementary rather than exclusive features. Combining the advantages of the two methods would make failure recovery more efficient and effective.

On Sat, Jan 26, 2019 at 11:54 PM Stephan Ewen <ewenstep...@gmail.com> wrote:

> Let's make sure that this is on the list of patches we merge from the blink
> branch...
>
> On Fri, Jan 25, 2019, 07:56 Guowei Ma <guowei....@gmail.com wrote:
>
> > Thanks to zhijiang for the detailed explanation. I would add one thing:
> > Blink has indeed solved this particular problem.
> > This problem can be identified in Blink, and the upstream will be
> > restarted by Blink.
> > Thanks
> >
> > zhijiang <wangzhijiang...@aliyun.com.invalid> wrote on Fri, Jan 25, 2019 at 12:04 PM:
> >
> > > Hi Bo,
> > >
> > > The problems you mentioned can be summarized into two issues:
> > >
> > > 1. The failover strategy should consider whether the partition produced
> > > by the upstream is still available when the downstream fails. If the
> > > produced partition is available, then only the downstream region needs
> > > to be restarted; otherwise the upstream region should also be restarted
> > > to re-produce the partition data.
> > > 2. The lifecycle of the partition: currently, once the partition data has
> > > been completely transferred over the network, the partition and view are
> > > released on the producer side, no matter whether the data has actually
> > > been processed by the consumer or not. The TaskManager may even be
> > > released earlier, when the partition data has not been transferred yet.
> > >
> > > Both issues are already considered in my proposed pluggable shuffle
> > > manager architecture, which would introduce the ShuffleMaster component
> > > to manage partitions globally on the JobManager side; it is then natural
> > > to solve the above problems based on this architecture. You can refer to
> > > the FLIP [1] if interested.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-31%3A+Pluggable+Shuffle+Manager
> > >
> > > Best,
> > > Zhijiang
> > > ------------------------------------------------------------------
> > > From: Stephan Ewen <se...@apache.org>
> > > Send Time: Thursday, January 24, 2019, 22:17
> > > To: dev <dev@flink.apache.org>; Kurt Young <k...@apache.org>
> > > Subject: Re: [DISCUSS] Shall we make SpillableSubpartition repeatedly
> > > readable to support fine grained recovery
> > >
> > > The SpillableSubpartition can also be used during the execution of
> > > bounded DataStream programs. I think this is largely independent of
> > > deprecating the DataSet API.
> > >
> > > I am wondering if this particular issue is one that has been addressed in
> > > the Blink code already (we are looking to merge much of that
> > > functionality), because the proposed extension is actually necessary for
> > > proper batch fault tolerance (independent of the DataSet or Query
> > > Processor stack).
> > >
> > > I am adding Kurt to this thread; maybe he can help us find that out.
> > >
> > > On Thu, Jan 24, 2019 at 2:36 PM Piotr Nowojski <pi...@da-platform.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm not sure how much effort we will be willing to invest in the
> > > > existing batch stack. We are currently focusing on the support of
> > > > bounded DataStreams (already done in Blink and will be merged to Flink
> > > > soon) and unifying batch & stream under the DataStream API.
> > > >
> > > > Piotrek
> > > >
> > > > > On 23 Jan 2019, at 04:45, Bo WANG <wbeaglewatc...@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > When running the batch WordCount example, I configured the job
> > > > > execution mode as BATCH_FORCED and the failover-strategy as region,
> > > > > and manually injected some errors to let the execution fail in
> > > > > different phases. In some cases, the job recovered from the failure
> > > > > and succeeded, but in other cases, the job retried several times
> > > > > and failed.
> > > > >
> > > > > Example:
> > > > > - If the failure occurred before the task read data, e.g., failed
> > > > > before invokable.invoke() in Task.java, failover could succeed.
> > > > > - If the failure occurred after the task had read data, failover did
> > > > > not work.
> > > > >
> > > > > Problem diagnosis:
> > > > > Running the example described above, each ExecutionVertex is defined
> > > > > as a restart region, and the ResultPartitionType between executions
> > > > > is BLOCKING.
> > > > > Thus, SpillableSubpartition and SpillableSubpartitionView are used
> > > > > to write/read shuffle data, and data blocks are described as
> > > > > BufferConsumers stored in a list called buffers. When a task requests
> > > > > input data from a SpillableSubpartitionView, BufferConsumers are
> > > > > REMOVED from buffers. Thus, when a failure occurs after data has been
> > > > > read, some BufferConsumers have already been released. Although the
> > > > > task retries, the input data is incomplete.
> > > > >
> > > > > Fix Proposal:
> > > > > - A BufferConsumer should not be removed from buffers until the
> > > > > consuming ExecutionVertex is terminal.
> > > > > - A SpillableSubpartition should not be released until the consuming
> > > > > ExecutionVertex is terminal.
> > > > > - A SpillableSubpartition could create multiple
> > > > > SpillableSubpartitionViews, each of which corresponds to an
> > > > > ExecutionAttempt.
> > > > >
> > > > > Best,
> > > > > Bo
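As a rough illustration of the three points in the fix proposal quoted above, the sketch below keeps buffers in place while they are read, hands out one independent view per consumer attempt, and releases everything only when the consumer is reported terminal. It is a simplified in-memory model under stated assumptions, not the actual SpillableSubpartition: the class name, the method names, and the integer attempt number are all hypothetical, and spilling to disk, buffer reference counting, and thread safety are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical re-readable subpartition; not the Flink SpillableSubpartition. */
final class ReReadableSubpartition<T> {

    private final List<T> buffers = new ArrayList<>();
    // One view per consumer attempt, keyed by a hypothetical attempt number.
    private final Map<Integer, View> views = new HashMap<>();
    private boolean released;

    /** Producer side: appends one buffer to the subpartition. */
    void add(T buffer) {
        if (released) {
            throw new IllegalStateException("subpartition already released");
        }
        buffers.add(buffer);
    }

    /** Creates a fresh view for one consumer attempt, starting at the first buffer. */
    View createView(int attemptNumber) {
        if (released) {
            throw new IllegalStateException("subpartition already released");
        }
        View view = new View();
        views.put(attemptNumber, view);
        return view;
    }

    /** To be called only once the consuming vertex has reached a terminal state. */
    void onConsumerTerminal() {
        released = true;
        buffers.clear();
        views.clear();
    }

    boolean isReleased() {
        return released;
    }

    /** Read cursor of a single consumer attempt. */
    final class View {
        private int position;

        /** Returns the next buffer, or null at the end; the buffer stays in the list. */
        T poll() {
            return position < buffers.size() ? buffers.get(position++) : null;
        }
    }
}
```

With this shape, a retried consumer attempt simply asks for a new view and reads the same data again from the start, so the producer has to be restarted only when the data itself is missing or corrupted.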