Hi Zhijiang,

Thank you for your analysis; I agree with it. The solution may be, as you mentioned, to let the TM exit when any type of OOM occurs, because Flink has no control over a TM once an OOM has occurred.
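Just to illustrate what "let the TM exit" could look like, here is a rough sketch of the idea (not actual Flink code; OomGuard/exitOnOom are made-up names): any OutOfMemoryError that escapes task work is escalated to killing the whole JVM, so YARN can restart the container instead of the TM limping on in an undefined state. An alternative at the JVM level would be passing -XX:+ExitOnOutOfMemoryError (or -XX:+CrashOnOutOfMemoryError) via env.java.opts.taskmanager, though I am not sure whether that also covers the "unable to create new native thread" case on every JVM version, so that would need to be verified.

    // Sketch only, not Flink's actual code: escalate any OOM to process death.
    // Runtime.halt() is used instead of System.exit() because shutdown hooks
    // may need threads/memory that are no longer available after an OOM.
    public final class OomGuard {

        /** Wrap task work so that an escaping OutOfMemoryError kills the JVM. */
        public static Runnable exitOnOom(Runnable inner) {
            return () -> {
                try {
                    inner.run();
                } catch (OutOfMemoryError oom) {
                    // Best-effort logging; even this allocation may fail.
                    System.err.println("Fatal OOM, terminating TaskManager: " + oom);
                    Runtime.getRuntime().halt(-1);
                }
            };
        }
    }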
I filed a JIRA for this before: https://issues.apache.org/jira/browse/FLINK-12889. I don't know whether it is worth fixing. Thank you all.

Yours sincerely
Joshua

On Fri, Jun 21, 2019 at 5:32 PM zhijiang <wangzhijiang...@aliyun.com> wrote:
> Thanks for the reminder, @Chesnay Schepler.
>
> I just looked through the related logs. Actually, all five "Source: ServiceLog" tasks are not in a terminal state in the JM's view. The relevant process is as follows:
>
> 1. The checkpoint in the task causes the OOM issue, which calls `Task#failExternally`; we can see the log "Attempting to fail task externally" in the TM.
> 2. The source task transitions from RUNNING to FAILED and then starts a canceler thread for canceling the task; we can see the log "Triggering cancellation of task" in the TM.
> 3. When the JM starts to cancel the source tasks, the RPC call `Task#cancelExecution` finds the task already in FAILED state from step 2; we can see the log "Attempting to cancel task" in the TM.
>
> In the end, all five source tasks are not in terminal states in the JM log. I guess step 2 might not have created the canceler thread successfully, because the root failover was caused by an OOM while creating a native thread in step 1, so it is possible that creating the canceler thread also failed under this unstable OOM condition. If so, the source task would not be interrupted at all and would not report to the JM either, even though its state had already changed to FAILED.
>
> For the other vertex tasks, `Task#failExternally` is not triggered in step 1; they only receive the cancel RPC from the JM in step 3. I guess that by then, later than the source tasks' failure, the canceler thread could be created successfully after some GCs, so these tasks could be canceled and reported to the JM side.
>
> I think the key problem is that under OOM some behaviors are not within expectations, which might bring problems. Maybe we should handle OOM errors in an extreme way, such as making the TM exit, to solve the potential issue.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Chesnay Schepler <ches...@apache.org>
> Send Time: Friday, June 21, 2019, 16:34
> To: zhijiang <wangzhijiang...@aliyun.com>; Joshua Fan <joshuafat...@gmail.com>
> Cc: user <user@flink.apache.org>; Till Rohrmann <trohrm...@apache.org>
> Subject: Re: Maybe a flink bug. Job keeps in FAILING state
>
> The logs are attached to the initial mail.
>
> Echoing my thoughts from earlier: from the logs it looks as if the TM never even submits the terminal-state RPC calls for several tasks to the JM.
>
> On 21/06/2019 10:30, zhijiang wrote:
> Hi Joshua,
>
> If the tasks (subtask 1/5, subtask 2/5, subtask 3/5, subtask 5/5) were really in CANCELED state on the TM side but in CANCELING state on the JM side, then it might indicate that the terminal-state RPC was not received by the JM. I am not sure whether the OOM could cause this, resulting in the unexpected behavior.
>
> In addition, you mentioned these tasks were still active after the OOM and after being asked to cancel, so I am not sure which specific period your attached TM stack covers. It would help if you could provide the corresponding TM log and JM log. From the TM log it is easy to check the task's final state.
>
> Best,
> Zhijiang
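(Inline note on step 2 above: a minimal sketch of the suspected failure mode, with made-up names; this is not Flink's actual Task code. The point is that starting the canceler thread can itself fail with the same "unable to create new native thread" error, leaving the task un-interrupted and the JM without any terminal-state update.)

    // Sketch, hypothetical names: the canceler thread from step 2.
    class CancelerSketch {
        void triggerCancellation(Runnable cancelAction) {
            // The task state has already been flipped to FAILED before this point.
            Thread canceler = new Thread(cancelAction, "Task canceler");
            // start() can itself throw
            // java.lang.OutOfMemoryError: unable to create new native thread;
            // if that happens, cancelAction never runs and nothing is reported to the JM.
            canceler.start();
        }
    }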
> ------------------------------------------------------------------
> From: Joshua Fan <joshuafat...@gmail.com>
> Send Time: Thursday, June 20, 2019, 11:55
> To: zhijiang <wangzhijiang...@aliyun.com>
> Cc: user <user@flink.apache.org>; Till Rohrmann <trohrm...@apache.org>; Chesnay Schepler <ches...@apache.org>
> Subject: Re: Maybe a flink bug. Job keeps in FAILING state
>
> zhijiang
>
> I did not capture the job UI. The topology is in FAILING state, but the persistentbolt subtasks, as can be seen in the picture attached to the first mail, were all CANCELED; the parsebolt subtasks, as described before, had one subtask FAILED and the other subtasks CANCELED; and the source subtasks had one subtask (subtask 4/5) CANCELED and the other subtasks (subtask 1/5, subtask 2/5, subtask 3/5, subtask 5/5) CANCELING, i.e. not in a terminal state.
>
> The subtask status described above is the JM view; in the TM view, all of the source subtasks were FAILED. I do not know why the JM was not notified about this.
>
> As all of the failed statuses were triggered by an OOM because the subtask could not create a native thread when checkpointing, I also dumped the stack of the JVM. It shows that the four subtasks (subtask 1/5, subtask 2/5, subtask 3/5, subtask 5/5) are still active after they threw an OOM and were asked to cancel. I attached the jstack file to this email.
>
> Yours sincerely
> Joshua
>
> On Wed, Jun 19, 2019 at 4:40 PM zhijiang <wangzhijiang...@aliyun.com> wrote:
> As long as one task is in canceling state, the job status might still be in canceling state.
>
> @Joshua Do you confirm that all of the tasks in the topology were already in a terminal state such as FAILED or CANCELED?
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From: Chesnay Schepler <ches...@apache.org>
> Send Time: Wednesday, June 19, 2019, 16:32
> To: Joshua Fan <joshuafat...@gmail.com>; user <user@flink.apache.org>; Till Rohrmann <trohrm...@apache.org>
> Subject: Re: Maybe a flink bug. Job keeps in FAILING state
>
> @Till have you seen something like this before? Despite all source tasks reaching a terminal state on a TM (FAILED), it does not send updates to the JM for all of them, but only for a single one.
>
> On 18/06/2019 12:14, Joshua Fan wrote:
> > Hi All,
> >
> > There is a topology of 3 operators: source, parser, and persist. Occasionally, 5 subtasks of the source encounter an exception and turn to FAILED; at the same time, one subtask of the parser runs into an exception and turns to FAILED too. The jobmaster gets a message about the parser's failure. The jobmaster then tries to cancel all the subtasks; most of the subtasks of the three operators turn to CANCELED, except the 5 subtasks of the source, because the state of those 5 is already FAILED before the jobmaster tries to cancel them. Then the job cannot reach a final state but keeps in FAILING state, while the source subtasks keep in CANCELING state.
> >
> > The job runs on a Flink 1.7 cluster on YARN, and there is only one TM with 10 slots.
> >
> > The attached files contain a JM log, a TM log, and the UI picture.
> >
> > The exception timestamp is about 2019-06-16 13:42:28.
> >
> > Yours
> > Joshua
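P.S. In case anyone wants to check locally whether FLINK-12889 is worth fixing, a stand-alone way to provoke the same error (this is not Flink code, and it is best run under a low OS thread limit, e.g. a reduced `ulimit -u` on Linux) is to keep starting sleeping threads until Thread.start() throws:

    // Hypothetical repro, not Flink code: exhaust native threads until
    // java.lang.OutOfMemoryError: unable to create new native thread is thrown.
    public class NativeThreadOomRepro {
        public static void main(String[] args) {
            long started = 0;
            try {
                while (true) {
                    Thread t = new Thread(() -> {
                        try {
                            Thread.sleep(Long.MAX_VALUE); // keep the thread alive
                        } catch (InterruptedException ignored) {
                        }
                    });
                    t.setDaemon(true);
                    t.start();
                    started++;
                }
            } catch (OutOfMemoryError oom) {
                System.err.println("OOM after starting " + started + " threads: " + oom);
            }
        }
    }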