Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-04-09 Thread Bruno Aranda
Thanks Till, I will start separate threads for the two issues we are experiencing. Cheers, Bruno On Mon, 8 Apr 2019 at 15:27, Till Rohrmann wrote: > Hi Bruno, > > first of all good to hear that you could resolve some of the problems. > > Slots get removed if a TaskManager gets unregistered fro

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-04-08 Thread Till Rohrmann
Hi Bruno, first of all good to hear that you could resolve some of the problems. Slots get removed if a TaskManager gets unregistered from the SlotPool. This usually happens if a TaskManager closes its connection or its heartbeat with the ResourceManager times out. So you could look for messages
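A minimal sketch of the kind of log search suggested above, assuming the JobManager or ResourceManager log is available as a plain-text file. The search terms ("heartbeat", "timed out") are a guess at what a heartbeat timeout might look like in the log, not the messages the reply goes on to list, and the helper name `find_lines` is hypothetical.

```python
def find_lines(lines, keywords):
    """Return (line_number, text) for every line containing all keywords.

    Matching is case-insensitive. The keywords are supplied by the caller;
    nothing here encodes an actual Flink log message.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if all(k in lowered for k in wanted):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It could be used along the lines of `find_lines(open("jobmanager.log"), ["heartbeat", "timed out"])`, where the file name is only a placeholder.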

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-04-08 Thread Bruno Aranda
Hi Till, Many thanks for your reply and don't worry. We understand this is tricky and you are busy. We have been experiencing some issues, and a couple of them have been addressed, so the logs probably were not relevant anymore. About losing jobs on restart -> it seems that YARN was killing the

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-04-08 Thread Till Rohrmann
Hi Bruno, sorry for getting back to you so late. I just tried to access your logs to investigate the problem, but transfer.sh tells me that they are no longer there. Could you maybe re-upload them or send them directly to my mail address? Sorry for not taking a look at your problem sooner and the i

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-21 Thread Bruno Aranda
Ok, here it goes: https://transfer.sh/12qMre/jobmanager-debug.log In an attempt to make it smaller, I removed the noisy "http wire" entries and masked a couple of things. Not sure this covers everything you would like to see. Thanks! Bruno On Thu, 21 Mar 2019 at 15:24, Till Rohrmann wrote: > H

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-21 Thread Till Rohrmann
Hi Bruno, could you upload the logs to https://transfer.sh/ or https://gist.github.com/ and then post a link? This will be crucial for further debugging. It would be really good if you could set the log level to DEBUG. Concerning the number of registered TMs, the new mode (not the legacy mode), n

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-21 Thread Bruno Aranda
Hi Andrey, Thanks for your response. I was trying to get the logs somewhere but they are biggish (~4 MB). Can you suggest somewhere I could put them? In any case, I can see exceptions like this: 2019/03/18 10:11:50,763 DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing
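Since the log is too large to post whole, one option is to filter it down to the relevant logger first. The following is a rough sketch that assumes every entry follows the layout of the line quoted above (timestamp, level, logger name, a dash, then the message). The function names `parse_line` and `slot_pool_entries` are hypothetical, and multi-line entries such as stack traces are simply skipped.

```python
import re
from datetime import datetime

# Layout inferred from the quoted line:
# "2019/03/18 10:11:50,763 DEBUG <logger> - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)\s+-\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        # %f reads the three millisecond digits as a fraction of a second.
        "timestamp": datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S,%f"),
        "level": match.group("level"),
        "logger": match.group("logger"),
        "message": match.group("message"),
    }


def slot_pool_entries(lines):
    """Keep only parsed entries whose logger name ends with '.SlotPool'."""
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["logger"].endswith(".SlotPool"):
            entries.append(entry)
    return entries
```

Filtering on the logger name rather than on message text avoids guessing at the wording of individual messages.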

Re: Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-19 Thread Andrey Zagrebin
Hi Bruno, could you also share the job master logs? Thanks, Andrey On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda wrote: > Hi, > > This is causing serious instability and data loss in our production > environment. Any help figuring out what's going on here would be really > appreciated. > > We

Flink 1.7.2 extremely unstable and losing jobs in prod

2019-03-19 Thread Bruno Aranda
Hi, This is causing serious instability and data loss in our production environment. Any help figuring out what's going on here would be really appreciated. We recently updated our two EMR clusters from Flink 1.6.1 to Flink 1.7.2 (running on AWS EMR). The road to the upgrade was fairly rocky, but