> -Bruce
>
> *From: *Zhu Zhu
> *Date: *Monday, April 13, 2020 at 9:29 PM
> *To: *Till Rohrmann
> *Cc: *Aljoscha Krettek, user, Gary Yao
> *Subject: *Re: Flink job didn't restart when a task failed
>
Sorry for not following this ML earlier.
I think the cause might be that the final state ('FAILED') update message
to the JM was lost. In that case the TaskExecutor will simply fail the task
(which has no effect here, since the task is already FAILED) and will not
send the task state again.
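To make the suspected sequence concrete, here is a minimal toy model in Python. It is not Flink code: `JobManagerView`, `LossyChannel`, and `ToyTaskExecutor` are hypothetical names invented for illustration, and the only behavior modeled is the one described above (a lost final-state update, followed by a "fail the task" reaction that is a no-op because the task is already FAILED).

```python
from dataclasses import dataclass, field


@dataclass
class JobManagerView:
    """Hypothetical stand-in for the JM's view of task states."""
    states: dict = field(default_factory=dict)

    def update_task_state(self, task_id: str, state: str) -> None:
        self.states[task_id] = state


class LossyChannel:
    """Hypothetical transport that can be told to drop messages to the JM."""

    def __init__(self, jm: JobManagerView, drop: bool = False) -> None:
        self.jm = jm
        self.drop = drop

    def send(self, task_id: str, state: str) -> bool:
        if self.drop:
            return False  # message lost; the sender only sees a failure
        self.jm.update_task_state(task_id, state)
        return True


class ToyTaskExecutor:
    """Hypothetical executor that reacts to a failed update by failing the task."""

    def __init__(self, channel: LossyChannel) -> None:
        self.channel = channel
        self.local_states: dict = {}

    def deploy(self, task_id: str) -> None:
        self.local_states[task_id] = "RUNNING"
        self.channel.send(task_id, "RUNNING")

    def fail_task(self, task_id: str) -> None:
        if self.local_states[task_id] == "FAILED":
            return  # already terminal: nothing changes and nothing is re-sent
        self.local_states[task_id] = "FAILED"
        delivered = self.channel.send(task_id, "FAILED")
        if not delivered:
            # The only reaction to a lost update is to fail the task again,
            # which hits the early return above and sends no further update.
            self.fail_task(task_id)


jm = JobManagerView()
channel = LossyChannel(jm)
executor = ToyTaskExecutor(channel)

executor.deploy("task-1")
channel.drop = True          # the final-state update will be lost
executor.fail_task("task-1")

print(executor.local_states["task-1"], jm.states["task-1"])
```

In this model the executor ends up with the task in FAILED while the JM view still says RUNNING, so nothing on the JM side would trigger a restart.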
For future reference, here is the issue to track the reconciliation logic
[1].
[1] https://issues.apache.org/jira/browse/FLINK-17075
Cheers,
Till
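As a rough illustration of what reconciliation between the two sides could look like, here is a small Python sketch. It is an assumption-laden toy and not the design tracked in the issue above: the `reconcile` function, the plain dictionaries, and the rule "treat anything the executor does not report as RUNNING as FAILED" are all hypothetical.

```python
def reconcile(jm_view: dict, tm_report: dict) -> dict:
    """Return a corrected copy of the JM view, given the states a TM reports.

    Hypothetical rule: a task the JM believes is RUNNING but that the TM
    reports in any other state (or does not know at all) is marked FAILED.
    """
    corrected = dict(jm_view)
    for task_id, jm_state in jm_view.items():
        if jm_state != "RUNNING":
            continue  # only tasks the JM still expects to be running matter
        if tm_report.get(task_id) != "RUNNING":
            corrected[task_id] = "FAILED"
    return corrected


jm_view = {"task-1": "RUNNING", "task-2": "RUNNING", "task-3": "FINISHED"}
tm_report = {"task-1": "FAILED", "task-2": "RUNNING"}

corrected_view = reconcile(jm_view, tm_report)
print(corrected_view)
```

With a periodic check of this kind, a single lost state update would no longer leave the JM view stale indefinitely, because the mismatch would be noticed at the next report.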
On Thu, Apr 9, 2020 at 6:47 PM Till Rohrmann wrote:
Hi Bruce,
what you are describing does indeed sound quite bad. It is hard to say
whether we fixed such an issue in 1.10. It is definitely worth a try to
upgrade, though.
In order to further debug the problem, it would be really great if you
could provide us with the log files of the JobMaster and the
Hi,
this indeed seems very strange!
@Gary Could you maybe have a look at this since you work/worked quite a
bit on the scheduler?
Best,
Aljoscha
On 09.04.20 05:46, Hanson, Bruce wrote:
Hello Flink folks:
We had a problem with a Flink job the other day that I haven’t seen before. One
task encountered a failure and switched to FAILED (see the full exception
below). After the failure, the task said it was notifying the Job Manager:
2020-04-06 08:21:04.329 [flink-akka.actor.defau