The logs are attached to the initial mail.
Echoing my thoughts from earlier: from the logs it looks as if the TM
never even submits the terminal state RPC calls for several tasks to the JM.
On 21/06/2019 10:30, zhijiang wrote:
Hi Joshua,
If the tasks (subtask 1/5, subtask 2/5, subtask 3/5, subtask 5/5) were
really in CANCELED state on the TM side but in CANCELING state on the
JM side, that would indicate the terminal state RPC was not received
by the JM. I am not sure whether the OOM could cause this and lead to
the unexpected behavior.
In addition, you mentioned these tasks were still active after the OOM
and after cancel was called, so I am not sure which period your
attached TM stack covers. It would help if you could provide the
corresponding TM log and JM log.
From the TM log it is easy to check each task's final state.
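As an illustration of checking final states from the TM log, a minimal
sketch (not Flink code), assuming the usual "switched from X to Y"
task-state log lines; the exact message format may vary between versions:

// Print task state transitions into terminal states from a TaskManager log.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class TmLogStateScan {
    public static void main(String[] args) throws IOException {
        // args[0]: path to the taskmanager log file
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.filter(line -> line.contains("switched from"))
                 // keep only transitions into terminal states
                 .filter(line -> line.contains("to FAILED") || line.contains("to CANCELED"))
                 .forEach(System.out::println);
        }
    }
}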
Best,
Zhijiang
------------------------------------------------------------------
From:Joshua Fan <joshuafat...@gmail.com>
Send Time: Thursday, June 20, 2019, 11:55
To:zhijiang <wangzhijiang...@aliyun.com>
Cc:user <user@flink.apache.org>; Till Rohrmann
<trohrm...@apache.org>; Chesnay Schepler <ches...@apache.org>
Subject:Re: Maybe a flink bug. Job keeps in FAILING state
zhijiang
I did not capture the job UI. The topology is in FAILING state, but
the persistentbolt subtasks, as can be seen in the picture attached
to the first mail, were all CANCELED; the parsebolt subtasks, as
described before, had one subtask FAILED and the other subtasks
CANCELED; but the source subtasks had one subtask (subtask 4/5)
CANCELED, while the other subtasks (subtask 1/5, subtask 2/5,
subtask 3/5, subtask 5/5) were CANCELING, i.e. not in a terminal state.
The subtask status described above is the JM's view; in the TM's
view, all of the source subtasks were FAILED. I do not know why the
JM was not notified about this.
As all of the FAILED states were triggered by an OOM (the subtask
could not create a native thread while checkpointing), I also dumped
the stack of the JVM. It shows the four subtasks (subtask 1/5,
subtask 2/5, subtask 3/5, subtask 5/5) are still active after they
threw the OOM and were asked to cancel. I attached the jstack file
to this email.
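For readers without the attachment, a minimal sketch of how such a
thread dump can be summarized (hypothetical helper, assuming the
standard HotSpot jstack format with quoted thread-name headers
followed by "java.lang.Thread.State:" lines, and assuming the task
threads are named after the operator, e.g. "Source: ... (1/5)"):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class JstackSummary {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> threadsPerState = new TreeMap<>();
        String currentThread = null;
        for (String line : Files.readAllLines(Paths.get(args[0]))) { // args[0]: jstack file
            if (line.startsWith("\"")) {
                // Thread header, e.g. "Source: ... (1/5)" #42 prio=5 ...
                int end = line.indexOf('"', 1);
                currentThread = end > 0 ? line.substring(1, end) : line;
            } else if (line.trim().startsWith("java.lang.Thread.State:")) {
                String state = line.trim().substring("java.lang.Thread.State:".length()).trim();
                threadsPerState.merge(state, 1, Integer::sum);
                // print the state of the source subtask threads in question
                if (currentThread != null && currentThread.contains("Source")) {
                    System.out.println(currentThread + " -> " + state);
                }
            }
        }
        System.out.println(threadsPerState);
    }
}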
Yours sincerely
Joshua
On Wed, Jun 19, 2019 at 4:40 PM zhijiang
<wangzhijiang...@aliyun.com <mailto:wangzhijiang...@aliyun.com>>
wrote:
As long as one task is still in CANCELING state, the job status
might stay in CANCELING state as well.
@Joshua Could you confirm whether all of the tasks in the topology
were already in a terminal state such as FAILED or CANCELED?
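To illustrate the point, a simplified sketch with hypothetical types
(not Flink's actual ExecutionGraph code): a job cannot leave a
non-terminal status such as FAILING or CANCELLING until every task has
reached a terminal state, so one subtask stuck in CANCELING is enough
to hold the whole job there.

import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class TerminalStateCheck {
    enum TaskState { RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

    private static final Set<TaskState> TERMINAL =
            EnumSet.of(TaskState.CANCELED, TaskState.FAILED, TaskState.FINISHED);

    static boolean allTasksTerminal(List<TaskState> taskStates) {
        return taskStates.stream().allMatch(TERMINAL::contains);
    }

    public static void main(String[] args) {
        // JM view of the source subtasks from this report:
        // one CANCELED, four still CANCELING
        List<TaskState> jmView = List.of(
                TaskState.CANCELED, TaskState.CANCELING, TaskState.CANCELING,
                TaskState.CANCELING, TaskState.CANCELING);
        System.out.println(allTasksTerminal(jmView)); // false -> job stays in FAILING
    }
}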
Best,
Zhijiang
------------------------------------------------------------------
From:Chesnay Schepler <ches...@apache.org <mailto:ches...@apache.org>>
Send Time: Wednesday, June 19, 2019, 16:32
To:Joshua Fan <joshuafat...@gmail.com
<mailto:joshuafat...@gmail.com>>; user <user@flink.apache.org
<mailto:user@flink.apache.org>>; Till Rohrmann
<trohrm...@apache.org <mailto:trohrm...@apache.org>>
Subject:Re: Maybe a flink bug. Job keeps in FAILING state
@Till have you seen something like this before? Despite all source tasks
reaching a terminal state on the TM (FAILED), it does not send updates to
the JM for all of them, but only for a single one.
On 18/06/2019 12:14, Joshua Fan wrote:
> Hi All,
> There is a topology of 3 operators: source, parser, and
> persist. Occasionally, the 5 subtasks of the source encounter an
> exception and turn to FAILED; at the same time, one subtask of the
> parser runs into an exception and turns to FAILED too. The jobmaster
> gets a message about the parser's failure. The jobmaster then tries
> to cancel all the subtasks; most of the subtasks of the three
> operators turn to CANCELED, except the 5 subtasks of the source,
> because the state of those 5 was already FAILED before the jobmaster
> tried to cancel them. Then the jobmaster cannot reach a final state
> but stays in FAILING state, while the subtasks of the source stay in
> CANCELING state.
>
> The job runs on a Flink 1.7 cluster on YARN, and there is only one TM
> with 10 slots.
>
> The attached files contain a JM log, a TM log, and the UI picture.
>
> The exception timestamp is about 2019-06-16 13:42:28.
>
> Yours
> Joshua