Ok, here it goes: https://transfer.sh/12qMre/jobmanager-debug.log
In an attempt to make it smaller, I removed the noisy "http wire" entries and masked a couple of things. Not sure this covers everything you would like to see. Thanks!

Bruno

On Thu, 21 Mar 2019 at 15:24, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Bruno,
>
> could you upload the logs to https://transfer.sh/ or https://gist.github.com/ and then post a link? For further debugging this will be crucial. It would be really good if you could set the log level to DEBUG.
>
> Concerning the number of registered TMs: the new mode (not the legacy mode) no longer respects the `-n` setting when you start a yarn session. Instead, it dynamically starts as many containers as are needed to run the submitted jobs. That's why you don't see the spare TM, and this is the expected behaviour.
>
> The community intends to add support for ranges of how many TMs must be active at any given time [1].
>
> [1] https://issues.apache.org/jira/browse/FLINK-11078
>
> Cheers,
> Till
>
> On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda <bara...@apache.org> wrote:
>
>> Hi Andrey,
>>
>> Thanks for your response. I was trying to get the logs somewhere, but they are biggish (~4 MB). Do you suggest somewhere I could put them?
>>
>> In any case, I can see exceptions like this:
>>
>> 2019/03/18 10:11:50,763 DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing slot [SlotRequestId{ab89ff271ebf317a13a9e773aca4e9fb}] because: null
>> 2019/03/18 10:11:50,807 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job alert-event-beeTrap-notifier (2ff941926e6ad80ba441d9cfcd7d689d) switched from state RUNNING to FAILING.
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:991)
>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>     ...
>>
>> It looks like a TM may crash, and then the JM, and then the JM is not able to find slots for the tasks in a reasonable time frame? Weirdly, we are running 13 TMs with 6 slots each (we used legacy mode in 1.6), and we always try to keep an extra TM's worth of free slots just in case. Looking at the dashboard, I see 12 TMs and 2 free slots, but we tell Flink that 13 are available when we start the session in YARN.
>>
>> Any ideas? It is way less stable for us these days, even though we have barely changed our settings since we started using Flink around 1.2 some time back.
>>
>> Thanks,
>>
>> Bruno
>>
>> On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin <and...@ververica.com> wrote:
>>
>>> Hi Bruno,
>>>
>>> could you also share the job master logs?
>>>
>>> Thanks,
>>> Andrey
>>>
>>> On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is causing serious instability and data loss in our production environment. Any help figuring out what's going on here would be really appreciated.
>>>>
>>>> We recently upgraded our two AWS EMR clusters from Flink 1.6.1 to Flink 1.7.2. The road to the upgrade was fairly rocky, but we felt it was working sufficiently well in our pre-production environments that we rolled it out to prod.
>>>>
>>>> However, we're now seeing the jobmanager crash spontaneously several times a day.
>>>> There doesn't seem to be any pattern to when this happens: it doesn't coincide with an increase in the data flowing through the system, nor does it happen at the same time of day.
>>>>
>>>> The big problem is that when it recovers, sometimes a lot of the jobs fail to resume with the following exception:
>>>>
>>>> org.apache.flink.util.FlinkException: JobManager responsible for 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>>>>     //...
>>>> Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>>>>     ... 16 more
>>>>
>>>> Starting them manually afterwards doesn't resume from checkpoint, which for most jobs means they start from the end of the source Kafka topic. So whenever this surprise jobmanager restart happens, we have a ticking clock during which we're losing data.
>>>>
>>>> We speculate that those jobs die first and, while they wait to be restarted (they have a 30-second delay restart strategy), the jobmanager restarts and does not recover them? In any case, we have never seen so many job failures and JM restarts with exactly the same EMR config.
>>>>
>>>> We've got some functionality in the works that relies on the StreamingFileSink-over-S3 bug fixes in 1.7.2, so rolling back isn't an ideal option.
>>>>
>>>> Looking through the mailing list, we found https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible this might be related?
>>>>
>>>> Best regards,
>>>>
>>>> Bruno
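
PS: in case it helps whoever picks this up, the two timeouts in the stack traces above correspond to settings in flink-conf.yaml. The values below are just the Flink 1.7 defaults, shown only to map the numbers in the exceptions to their keys, not a tuning recommendation:

  # illustrative flink-conf.yaml excerpt (Flink 1.7 defaults)
  slot.request.timeout: 300000   # "Could not allocate all requires slots within timeout of 300000 ms"
  heartbeat.interval: 10000      # how often JM/TM heartbeats are sent
  heartbeat.timeout: 50000       # "The heartbeat of JobManager ... timed out"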
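For reference, a session start matching what is described in the thread (13 TMs with 6 slots each) would look roughly like the line below; the memory values are made up for illustration, and as Till explained, in the new mode the -n 13 is effectively ignored and containers are started on demand:

  # illustrative only; memory settings are placeholders
  ./bin/yarn-session.sh -n 13 -s 6 -jm 2048 -tm 8192 -d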
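And on the "doesn't resume from checkpoint" problem: one thing worth checking is whether checkpoints are retained externally, so that a job which fails to recover can be resubmitted from its latest checkpoint by hand. A rough sketch, assuming the job enables externalized checkpoints (e.g. RETAIN_ON_CANCELLATION on its CheckpointConfig) and that state.checkpoints.dir points somewhere durable; the bucket, job id and jar names are placeholders:

  # flink-conf.yaml: where retained checkpoint metadata is written
  state.checkpoints.dir: s3://<bucket>/flink/checkpoints

  # resubmit a job from its last retained checkpoint
  ./bin/flink run -s s3://<bucket>/flink/checkpoints/<job-id>/chk-<n>/ -d <our-job>.jar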