Hi Bruno,

Could you upload the logs to https://transfer.sh/ or
https://gist.github.com/ and then post the link? This will be crucial for
further debugging. It would also be really helpful if you could set the log
level to DEBUG.

Concerning the number of registered TMs: the new mode (unlike the legacy
mode) no longer respects the `-n` setting when you start a YARN session.
Instead, it dynamically starts as many containers as are needed to run the
submitted jobs. That's why you don't see the spare TM; this is the expected
behaviour.
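
To make that concrete, here is a back-of-envelope sketch (plain Java, not
Flink's actual allocation code; the total slot demand of 70 is only an
assumption chosen to mirror your numbers) of how the cluster size follows
from the slot demand of the submitted jobs rather than from `-n`:

// Back-of-envelope sketch; the demand figure is an illustrative assumption.
public class SlotDemandSketch {
    public static void main(String[] args) {
        int slotsPerTaskManager = 6;    // taskmanager.numberOfTaskSlots
        int slotsRequestedByJobs = 70;  // assumed total parallelism of all running jobs

        // The new mode brings up just enough containers to cover the demand,
        // rounding up to whole TaskManagers; the -n hint plays no role.
        int taskManagers =
            (slotsRequestedByJobs + slotsPerTaskManager - 1) / slotsPerTaskManager;
        int freeSlots = taskManagers * slotsPerTaskManager - slotsRequestedByJobs;

        System.out.println(taskManagers + " TMs, " + freeSlots + " free slots");
        // prints: 12 TMs, 2 free slots
    }
}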

The community intends to add support for specifying a range for how many TMs
must be active at any given time [1].

[1] https://issues.apache.org/jira/browse/FLINK-11078

Cheers,
Till

On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda <bara...@apache.org> wrote:

> Hi Andrey,
>
> Thanks for your response. I was trying to put the logs somewhere, but they
> are biggish (~4 MB). Can you suggest somewhere I could put them?
>
> In any case, I can see exceptions like this:
>
> 2019/03/18 10:11:50,763 DEBUG
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Releasing
> slot [SlotRequestId{ab89ff271ebf317a13a9e773aca4e9fb}] because: null
> 2019/03/18 10:11:50,807 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
> alert-event-beeTrap-notifier (2ff941926e6ad80ba441d9cfcd7d689d) switched
> from state RUNNING to FAILING.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 2, slots allocated: 0
> at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:991)
> at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> ...
>
> It looks like a TM may crash, then the JM, and then the JM is not able to
> find slots for the tasks within a reasonable time frame? Weirdly, we are
> running 13 TMs with 6 slots each (we used legacy mode in 1.6), and we always
> try to keep an extra TM's worth of free slots just in case. Looking at the
> dashboard, I see 12 TMs and 2 free slots, but we tell Flink 13 are available
> when we start the session in YARN.
>
> Any ideas? It is way less stable for us these days, even though we haven't
> changed our settings much since we started using Flink around 1.2 some time
> back.
>
> Thanks,
>
> Bruno
>
>
>
> On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin <and...@ververica.com>
> wrote:
>
>> Hi Bruno,
>>
>> could you also share the job master logs?
>>
>> Thanks,
>> Andrey
>>
>> On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> This is causing serious instability and data loss in our production
>>> environment. Any help figuring out what's going on here would be really
>>> appreciated.
>>>
>>> We recently updated our two AWS EMR clusters from Flink 1.6.1 to Flink
>>> 1.7.2. The road to the upgrade was fairly rocky, but we felt it was working
>>> sufficiently well in our pre-production environments that we rolled it out
>>> to prod.
>>>
>>> However, we're now seeing the JobManager crash spontaneously several times
>>> a day. There doesn't seem to be any pattern to when this happens: it doesn't
>>> coincide with an increase in the data flowing through the system, nor does
>>> it happen at the same time of day.
>>>
>>> The big problem is that when it recovers, sometimes a lot of the jobs
>>> fail to resume with the following exception:
>>>
>>> org.apache.flink.util.FlinkException: JobManager responsible for
>>> 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>>>     at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>>>     at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>>>     at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>>> //...
>>> Caused by: java.util.concurrent.TimeoutException: The heartbeat of
>>> JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>>>     at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>>>     ... 16 more
>>>
>>> Starting them manually afterwards doesn't resume from the checkpoint, which
>>> for most jobs means they start from the end of the source Kafka topic. This
>>> means that whenever this surprise JobManager restart happens, we have a
>>> ticking clock during which we're losing data.
>>>
>>> We speculate that those jobs die first and, while they wait to be restarted
>>> (they have a 30-second delay restart strategy), the JobManager restarts and
>>> does not recover them? In any case, we have never seen so many job failures
>>> and JM restarts with exactly the same EMR config.
>>>
>>> We're building some functionality that relies on the 1.7.2 bugfixes for the
>>> StreamingFileSink over S3, so rolling back isn't an ideal option.
>>>
>>> Looking through the mailing list, we found
>>> https://issues.apache.org/jira/browse/FLINK-11843 - does it seem
>>> possible this might be related?
>>>
>>> Best regards,
>>>
>>> Bruno
>>>
>>
