Hi Bruno, could you also share the job master logs?
Thanks, Andrey

On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:

> Hi,
>
> This is causing serious instability and data loss in our production
> environment. Any help figuring out what's going on here would be really
> appreciated.
>
> We recently updated our two EMR clusters from Flink 1.6.1 to Flink 1.7.2
> (running on AWS EMR). The road to the upgrade was fairly rocky, but we felt
> it was working sufficiently well in our pre-production environments that we
> rolled it out to prod.
>
> However, we're now seeing the JobManager crash spontaneously several times
> a day. There doesn't seem to be any pattern to when this happens - it
> doesn't coincide with an increase in the data flowing through the system,
> nor does it happen at the same time of day.
>
> The big problem is that when it recovers, sometimes a lot of the jobs fail
> to resume, with the following exception:
>
> org.apache.flink.util.FlinkException: JobManager responsible for
> 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>     //...
> Caused by: java.util.concurrent.TimeoutException: The heartbeat of
> JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>     ... 16 more
>
> Starting the jobs manually afterwards does not resume from the last
> checkpoint, which for most jobs means they start from the end of the source
> Kafka topic. So whenever this surprise JobManager restart happens, we have
> a ticking clock during which we are losing data.
>
> We speculate that those jobs fail first and that, while they are waiting to
> be restarted (they have a 30-second delay restart strategy), the JobManager
> itself restarts and does not recover them. In any case, we have never seen
> so many job failures and JobManager restarts with exactly the same EMR
> configuration.
>
> Some functionality we are building relies on the StreamingFileSink-over-S3
> bugfixes in 1.7.2, so rolling back isn't an ideal option.
>
> Looking through the mailing list, we found
> https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible
> that this might be related?
>
> Best regards,
>
> Bruno
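
On the point about manual restarts not resuming from the last checkpoint: with externalized (retained) checkpoints enabled, a manually resubmitted job can be pointed at the latest checkpoint metadata instead of falling back to the end of the Kafka topic. A minimal sketch against the Flink 1.7 Java API, not taken from Bruno's setup - the job name, intervals and attempt count below are placeholders:

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointRetentionSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every minute and keep the latest completed checkpoint
            // on the checkpoint storage even after failure or cancellation,
            // so it can be used for a manual restart.
            env.enableCheckpointing(60_000);
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Roughly the failure handling described above: a fixed delay
            // between restart attempts (attempt count is illustrative).
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    10,                  // restart attempts
                    Time.seconds(30)));  // delay between attempts

            // ... build the actual pipeline here ...

            env.execute("checkpoint-retention-sketch");
        }
    }

Retained checkpoints also need state.checkpoints.dir set in flink-conf.yaml; the retained metadata path can then be passed to "flink run -s <checkpoint-path>" when resubmitting by hand. Whether the jobs should have been recovered automatically after the leadership loss is a separate question - the heartbeat.interval / heartbeat.timeout settings and the job master logs should tell us more there.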
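
For reference on the StreamingFileSink-over-S3 dependency mentioned above, a minimal row-format sink of the kind the 1.7.2 fixes apply to might look like the sketch below (the bucket path, source and encoder are illustrative only, not Bruno's code). Note that the sink only finalizes part files when a checkpoint completes, so its output is also tied to the checkpointing behaviour discussed in this thread:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    public class S3SinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Part files are only committed when a checkpoint completes.
            env.enableCheckpointing(60_000);

            // Stand-in for the real Kafka source.
            DataStream<String> records = env.fromElements("a", "b", "c");

            // Row-format StreamingFileSink writing UTF-8 strings to S3
            // (the bucket/path here is a placeholder).
            StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path("s3://example-bucket/output"),
                                  new SimpleStringEncoder<String>("UTF-8"))
                    .build();
            records.addSink(sink);

            env.execute("s3-sink-sketch");
        }
    }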