Ok, here it goes: https://transfer.sh/12qMre/jobmanager-debug.log
In an attempt to make it smaller, I removed the noisy "http wire" entries and masked a couple of things. Not sure this covers everything you would like to see. Thanks!

Bruno

On Thu, 21 Mar 2019 at 15:24, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Bruno,
>
> could you upload the logs to https://transfer.sh/ or https://gist.github.com/ and then post a link? For further debugging this will be crucial. It would be really good if you could set the log level to DEBUG.
>
> Concerning the number of registered TMs: the new mode (not the legacy mode) no longer respects the `-n` setting when you start a yarn session. Instead, it dynamically starts as many containers as are needed to run the submitted jobs. That's why you don't see the spare TM, and this is the expected behaviour.
>
> The community intends to add support for ranges of how many TMs must be active at any given time [1].
>
> [1] https://issues.apache.org/jira/browse/FLINK-11078
>
> Cheers,
> Till
>
> On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda <bara...@apache.org> wrote:
>
>> Hi Andrey,
>>
>> Thanks for your response. I was trying to get the logs somewhere, but they are biggish (~4 MB). Do you suggest somewhere I could put them?
>>
>> In any case, I can see exceptions like this:
>>
>> 2019/03/18 10:11:50,763 DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing slot [SlotRequestId{ab89ff271ebf317a13a9e773aca4e9fb}] because: null
>> 2019/03/18 10:11:50,807 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job alert-event-beeTrap-notifier (2ff941926e6ad80ba441d9cfcd7d689d) switched from state RUNNING to FAILING.
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:991)
>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>     ...
>>
>> It looks like a TM may crash, and then the JM, and then the JM is not able to find slots for the tasks in a reasonable time frame? Weirdly, we are running 13 TMs with 6 slots each (we used legacy mode in 1.6), and we always try to keep an extra TM's worth of free slots just in case. Looking at the dashboard, I see 12 TMs and 2 free slots, but we tell Flink that 13 are available when we start the session in YARN.
>>
>> Any ideas? It is way less stable for us these days, even though we have barely changed our settings since we started using Flink around 1.2 some time back.
>>
>> Thanks,
>>
>> Bruno
>>
>> On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin <and...@ververica.com> wrote:
>>
>>> Hi Bruno,
>>>
>>> could you also share the job master logs?
>>>
>>> Thanks,
>>> Andrey
>>>
>>> On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is causing serious instability and data loss in our production environment. Any help figuring out what's going on here would be really appreciated.
>>>>
>>>> We recently upgraded our two AWS EMR clusters from Flink 1.6.1 to Flink 1.7.2. The road to the upgrade was fairly rocky, but we felt it was working sufficiently well in our pre-production environments that we rolled it out to prod.
>>>>
>>>> However, we're now seeing the jobmanager crash spontaneously several times a day.
>>>> There doesn't seem to be any pattern to when this happens: it doesn't coincide with an increase in the data flowing through the system, nor does it happen at the same time of day.
>>>>
>>>> The big problem is that when it recovers, sometimes a lot of the jobs fail to resume with the following exception:
>>>>
>>>> org.apache.flink.util.FlinkException: JobManager responsible for 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>>>>     //...
>>>> Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>>>>     ... 16 more
>>>>
>>>> Starting them manually afterwards doesn't resume from checkpoint, which for most jobs means they start from the end of the source Kafka topic. So whenever this surprise jobmanager restart happens, we have a ticking clock during which we're losing data.
>>>>
>>>> We speculate that those jobs die first and, while they wait to be restarted (they have a 30-second delay restart strategy), the jobmanager restarts and does not recover them? In any case, we have never seen so many job failures and JM restarts with exactly the same EMR config.
>>>>
>>>> We've got some functionality in the works that relies on the StreamingFileSink-over-S3 bug fixes in 1.7.2, so rolling back isn't an ideal option.
>>>>
>>>> Looking through the mailing list, we found https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible this might be related?
>>>>
>>>> Best regards,
>>>>
>>>> Bruno
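
PS: in case it helps whoever picks this up, the two timeouts in the stack traces above correspond to settings in flink-conf.yaml. The values below are just the Flink 1.7 defaults, shown only to map the numbers in the exceptions to their keys, not a tuning recommendation:

  # illustrative flink-conf.yaml excerpt (Flink 1.7 defaults)
  slot.request.timeout: 300000   # "Could not allocate all requires slots within timeout of 300000 ms"
  heartbeat.interval: 10000      # how often JM/TM heartbeats are sent
  heartbeat.timeout: 50000       # "The heartbeat of JobManager ... timed out"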
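For reference, a session start matching what is described in the thread (13 TMs with 6 slots each) would look roughly like the line below; the memory values are made up for illustration, and as Till explained, in the new mode the -n 13 is effectively ignored and containers are started on demand:

  # illustrative only; memory settings are placeholders
  ./bin/yarn-session.sh -n 13 -s 6 -jm 2048 -tm 8192 -d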
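And on the "doesn't resume from checkpoint" problem: one thing worth checking is whether checkpoints are retained externally, so that a job which fails to recover can be resubmitted from its latest checkpoint by hand. A rough sketch, assuming the job enables externalized checkpoints (e.g. RETAIN_ON_CANCELLATION on its CheckpointConfig) and that state.checkpoints.dir points somewhere durable; the bucket, job id and jar names are placeholders:

  # flink-conf.yaml: where retained checkpoint metadata is written
  state.checkpoints.dir: s3://<bucket>/flink/checkpoints

  # resubmit a job from its last retained checkpoint
  ./bin/flink run -s s3://<bucket>/flink/checkpoints/<job-id>/chk-<n>/ -d <our-job>.jar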