Hi Bruno, could you also share the job master logs?
Thanks, Andrey

On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:

> Hi,
>
> This is causing serious instability and data loss in our production
> environment. Any help figuring out what's going on here would be really
> appreciated.
>
> We recently updated our two EMR clusters from Flink 1.6.1 to Flink 1.7.2
> (running on AWS EMR). The road to the upgrade was fairly rocky, but we felt
> it was working sufficiently well in our pre-production environments that we
> rolled it out to prod.
>
> However, we're now seeing the JobManager crash spontaneously several times
> a day. There doesn't seem to be any pattern to when this happens - it
> doesn't coincide with an increase in the data flowing through the system,
> nor does it happen at the same time of day.
>
> The big problem is that when it recovers, sometimes a lot of the jobs fail
> to resume, with the following exception:
>
> org.apache.flink.util.FlinkException: JobManager responsible for
> 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>     //...
> Caused by: java.util.concurrent.TimeoutException: The heartbeat of
> JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>     ... 16 more
>
> Starting the jobs manually afterwards does not resume from the last
> checkpoint, which for most jobs means they start from the end of the source
> Kafka topic. So whenever this surprise JobManager restart happens, we have
> a ticking clock during which we are losing data.
>
> We speculate that those jobs fail first and that, while they are waiting to
> be restarted (they have a 30-second delay restart strategy), the JobManager
> itself restarts and does not recover them. In any case, we have never seen
> so many job failures and JobManager restarts with exactly the same EMR
> configuration.
>
> Some functionality we are building relies on the StreamingFileSink-over-S3
> bugfixes in 1.7.2, so rolling back isn't an ideal option.
>
> Looking through the mailing list, we found
> https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible
> that this might be related?
>
> Best regards,
>
> Bruno
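
On the point about manual restarts not resuming from the last checkpoint: with externalized (retained) checkpoints enabled, a manually resubmitted job can be pointed at the latest checkpoint metadata instead of falling back to the end of the Kafka topic. A minimal sketch against the Flink 1.7 Java API, not taken from Bruno's setup - the job name, intervals and attempt count below are placeholders:

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointRetentionSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every minute and keep the latest completed checkpoint
            // on the checkpoint storage even after failure or cancellation,
            // so it can be used for a manual restart.
            env.enableCheckpointing(60_000);
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Roughly the failure handling described above: a fixed delay
            // between restart attempts (attempt count is illustrative).
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    10,                  // restart attempts
                    Time.seconds(30)));  // delay between attempts

            // ... build the actual pipeline here ...

            env.execute("checkpoint-retention-sketch");
        }
    }

Retained checkpoints also need state.checkpoints.dir set in flink-conf.yaml; the retained metadata path can then be passed to "flink run -s <checkpoint-path>" when resubmitting by hand. Whether the jobs should have been recovered automatically after the leadership loss is a separate question - the heartbeat.interval / heartbeat.timeout settings and the job master logs should tell us more there.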
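
For reference on the StreamingFileSink-over-S3 dependency mentioned above, a minimal row-format sink of the kind the 1.7.2 fixes apply to might look like the sketch below (the bucket path, source and encoder are illustrative only, not Bruno's code). Note that the sink only finalizes part files when a checkpoint completes, so its output is also tied to the checkpointing behaviour discussed in this thread:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    public class S3SinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Part files are only committed when a checkpoint completes.
            env.enableCheckpointing(60_000);

            // Stand-in for the real Kafka source.
            DataStream<String> records = env.fromElements("a", "b", "c");

            // Row-format StreamingFileSink writing UTF-8 strings to S3
            // (the bucket/path here is a placeholder).
            StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path("s3://example-bucket/output"),
                                  new SimpleStringEncoder<String>("UTF-8"))
                    .build();
            records.addSink(sink);

            env.execute("s3-sink-sketch");
        }
    }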