Hi Zhijiang

Thank you for your analysis. I agree with it. The solution may be to let
the TM exit, as you mentioned, when any type of OOM occurs, because Flink
has no control over a TM once an OOM has occurred.
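
(As a side note, and purely as an untested idea from my side: on a recent
enough JVM (JDK 8u92+), adding -XX:+ExitOnOutOfMemoryError to the TM's JVM
options, e.g. via env.java.opts.taskmanager, might already give this
"exit on OOM" behavior at the JVM level, although I am not sure it covers
the "unable to create new native thread" case on every JDK version.)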

I filed a JIRA for this before: https://issues.apache.org/jira/browse/FLINK-12889.

I don't know whether it is worth fixing.

Thank you all.

Yours sincerely
Joshua

On Fri, Jun 21, 2019 at 5:32 PM zhijiang <wangzhijiang...@aliyun.com> wrote:

> Thanks for the reminder, @Chesnay Schepler.
>
> I just looked through the related logs. Actually, all five
> "Source: ServiceLog" tasks are not in a terminal state from the JM's
> view; the relevant sequence of events is as follows:
>
> 1. The checkpoint in the task causes an OOM, which in turn calls
> `Task#failExternally`; we can see the log "Attempting to fail task
> externally" in the TM.
> 2. The source task transitions from RUNNING to FAILED and then starts a
> canceler thread to cancel the task; we can see the log "Triggering
> cancellation of task" in the TM.
> 3. When the JM starts to cancel the source tasks, the RPC call
> `Task#cancelExecution` finds the task already in FAILED state (from step
> 2); we can see the log "Attempting to cancel task" in the TM.
>
> In the end, all five source tasks are in a non-terminal state in the JM
> log. My guess is that step 2 might not have created the canceler thread
> successfully: the root failure was an OOM while creating a native thread
> in step 1, so creating the canceler thread may well fail for the same
> reason while the OOM condition persists. If so, the source task is never
> interrupted and therefore never reports to the JM, even though its state
> was already changed to FAILED locally.
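>
> To illustrate the suspected pattern (this is only a simplified,
> self-contained sketch of the ordering I mean, not Flink's actual Task
> code; the names below are made up):
>
> // Sketch: the state is flipped to FAILED before the canceler thread exists.
> public class CancelerSketch {
>     enum State { RUNNING, FAILED }
>
>     private volatile State state = State.RUNNING;
>
>     void failExternally() {
>         // 1. The task is marked FAILED first ...
>         state = State.FAILED;
>         // 2. ... and only then is a canceler thread spawned. Under
>         //    "unable to create new native thread" conditions, start()
>         //    itself throws OutOfMemoryError, so the task is never
>         //    interrupted and never sends a terminal-state update to the
>         //    JM, although it is already FAILED on the TM side.
>         Thread canceler = new Thread(() -> { /* interrupt and clean up */ });
>         canceler.start();
>     }
> }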
>
> For the other vertex tasks, `Task#failExternally` is not triggered in
> step 1; they only receive the cancel RPC from the JM in step 3. My guess
> is that by then, later than when the sources failed, the canceler threads
> could be created successfully after some GC had freed resources, so these
> tasks could be canceled and reported to the JM side.
>
> I think the key problem is that under OOM, some behaviors are no longer
> within expectations, which can lead to situations like this. Maybe we
> should handle OOM errors in an extreme way, such as making the TM exit,
> to avoid this potential issue.
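>
> As a sketch of that idea (just to illustrate the direction, not an
> actual proposal of Flink code): install a default uncaught-exception
> handler that halts the process as soon as any OutOfMemoryError escapes a
> thread, so the TM fails fast and gets restarted instead of limping along
> with tasks stuck in non-terminal states.
>
> // Sketch: kill the whole process on any uncaught OutOfMemoryError.
> public final class ExitOnOomHandler implements Thread.UncaughtExceptionHandler {
>     @Override
>     public void uncaughtException(Thread t, Throwable e) {
>         if (e instanceof OutOfMemoryError) {
>             // halt() skips shutdown hooks, which may themselves need
>             // memory or threads we no longer have.
>             Runtime.getRuntime().halt(-1);
>         }
>     }
>
>     public static void install() {
>         Thread.setDefaultUncaughtExceptionHandler(new ExitOnOomHandler());
>     }
> }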
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Chesnay Schepler <ches...@apache.org>
> Send Time: Fri, Jun 21, 2019 16:34
> To: zhijiang <wangzhijiang...@aliyun.com>; Joshua Fan <joshuafat...@gmail.com>
> Cc: user <user@flink.apache.org>; Till Rohrmann <trohrm...@apache.org>
> Subject:Re: Maybe a flink bug. Job keeps in FAILING state
>
> The logs are attached to the initial mail.
>
> Echoing my thoughts from earlier: from the logs it looks as if the TM
> never even submits the terminal state RPC calls for several tasks to the JM.
>
> On 21/06/2019 10:30, zhijiang wrote:
> Hi Joshua,
>
> If the tasks (subtask 1/5, subtask 2/5, subtask 3/5, subtask 5/5) were
> really in CANCELED state on the TM side but in CANCELING state on the JM
> side, then it might indicate that the terminal-state RPC was not received
> by the JM. I am not sure whether the OOM could cause this issue,
> resulting in such unexpected behavior.
>
> In addition, you mentioned these tasks were still active after the OOM
> and after they were asked to cancel, so I am not sure at what point your
> attached TM stack was taken. It would help if you could provide the
> corresponding TM log and JM log; from the TM log it is easy to check each
> task's final state.
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From: Joshua Fan <joshuafat...@gmail.com>
> Send Time: Thu, Jun 20, 2019 11:55
> To: zhijiang <wangzhijiang...@aliyun.com>
> Cc: user <user@flink.apache.org>; Till Rohrmann <trohrm...@apache.org>;
> Chesnay Schepler <ches...@apache.org>
> Subject:Re: Maybe a flink bug. Job keeps in FAILING state
>
> zhijiang
>
> I did not capture the job UI. The topology is in FAILING state, but the
> persistentbolt subtasks, as can be seen in the picture attached to the
> first mail, were all CANCELED; the parsebolt subtasks, as described
> before, had one subtask FAILED and the others CANCELED; and the source
> subtasks had one subtask (subtask 4/5) CANCELED while the other subtasks
> (subtask 1/5, subtask 2/5, subtask 3/5, subtask 5/5) were CANCELING,
> i.e. not in a terminal state.
>
> The subtask statuses described above are from the JM's view; from the
> TM's view, all of the source subtasks were FAILED. I do not know why the
> JM was not notified about this.
>
> As all of the failures were triggered by an OOM (the subtask could not
> create a native thread while checkpointing), I also dumped the stack of
> the JVM. It shows that the four subtasks (subtask 1/5, subtask 2/5,
> subtask 3/5, subtask 5/5) are still active after they threw an OOM and
> were asked to cancel. I attached the jstack file to this email.
>
> Yours sincerely
> Joshua
>
> On Wed, Jun 19, 2019 at 4:40 PM zhijiang <wangzhijiang...@aliyun.com>
> wrote:
> As long as one task is in CANCELING state, the job status might still be
> in a canceling state.
>
> @Joshua Could you confirm whether all of the tasks in the topology were
> already in a terminal state, such as FAILED or CANCELED?
>
> Best,
> Zhijiang
> ------------------------------------------------------------------
> From: Chesnay Schepler <ches...@apache.org>
> Send Time: Wed, Jun 19, 2019 16:32
> To: Joshua Fan <joshuafat...@gmail.com>; user <user@flink.apache.org>;
> Till Rohrmann <trohrm...@apache.org>
> Subject:Re: Maybe a flink bug. Job keeps in FAILING state
>
> @Till have you seen something like this before? Despite all source tasks
> reaching a terminal state on the TM (FAILED), it does not send updates to
> the JM for all of them, but only for a single one.
>
> On 18/06/2019 12:14, Joshua Fan wrote:
> > Hi All,
> > There is a topology of 3 operator, such as, source, parser, and
> > persist. Occasionally, 5 subtasks of the source encounters exception
> > and turns to failed, at the same time, one subtask of the parser runs
> > into exception and turns to failed too. The jobmaster gets a message
> > of the parser's failed. The jobmaster then try to cancel all the
> > subtask, most of the subtasks of the three operator turns to canceled
> > except the 5 subtasks of the source, because the state of the 5 ones
> > is already FAILED before jobmaster try to cancel it. Then the
> > jobmaster can not reach a final state but keeps in  Failing state
> > meanwhile the subtask of the source kees in canceling state.
> >
> > The job runs on a Flink 1.7 cluster on YARN, and there is only one TM
> > with 10 slots.
> >
> > The attached files contain a JM log, a TM log, and the UI picture.
> >
> > The exception timestamp is about 2019-06-16 13:42:28.
> >
> > Yours
> > Joshua
>
>
>
>
>
