Thanks Till, I will start separate threads for the two issues we are
experiencing.
Cheers,
Bruno
On Mon, 8 Apr 2019 at 15:27, Till Rohrmann wrote:
Hi Bruno,
first of all good to hear that you could resolve some of the problems.
Slots get removed if a TaskManager gets unregistered from the SlotPool.
This usually happens if a TaskManager closes its connection or its
heartbeat with the ResourceManager times out. So you could look for
messages
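Till's suggestion to search the logs for signs of a TaskManager being unregistered can be scripted. This is a minimal sketch, assuming the JobManager log is a plain-text file. The keywords below (`heartbeat`, `timed out`, `disconnect`, `unregister`) are illustrative guesses, not the exact wording Flink logs, which varies between versions, so adjust them to what actually appears in your logs.

```python
import re
import sys
from typing import Iterable, List, Tuple

# Hypothetical keywords: adjust them to the real messages in your Flink version's logs.
DEFAULT_KEYWORDS = ("heartbeat", "timed out", "disconnect", "unregister")


def find_tm_loss_candidates(
    lines: Iterable[str], keywords: Iterable[str] = DEFAULT_KEYWORDS
) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs containing any keyword, case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_tm_loss.py jobmanager.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for number, line in find_tm_loss_candidates(log):
            print(f"{number}: {line}")
```

Matching on plain keywords keeps the script independent of the exact log layout, at the cost of some false positives that still need reading by hand.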
Hi Till,
Many thanks for your reply and don't worry. We understand this is tricky
and you are busy.
We have been experiencing some issues, and a couple of them have been
addressed, so the logs are probably no longer relevant.
About losing jobs on restart -> it seems that YARN was killing the
Hi Bruno,
sorry for getting back to you so late. I just tried to access your logs to
investigate the problem but transfer.sh tells me that they are no longer
there. Could you maybe re-upload them or send them directly to my email
address? Sorry for not taking a look at your problem sooner, and the
i
Ok, here it goes:
https://transfer.sh/12qMre/jobmanager-debug.log
In an attempt to make it smaller, I removed the noisy "http wire" lines and
masked a couple of things. I am not sure this covers everything you would
like to see.
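Shrinking a log before uploading it, as described above, can be scripted. A minimal sketch, assuming the noisy lines can be recognised by the substring `http wire` or by the logger name `org.apache.http.wire` (the latter is an assumption about where such wire logging comes from). The masking rule, replacing anything that looks like an IPv4 address, is a hypothetical example; the message does not say what was masked.

```python
import re
import sys
from typing import Iterable, Iterator

# Lines containing any of these markers are dropped (assumed to identify "http wire" noise).
NOISE_MARKERS = ("http wire", "org.apache.http.wire")
# Hypothetical masking rule: hide anything that looks like an IPv4 address.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def shrink_log(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines without the wire-logging noise, with IPv4 addresses masked."""
    for line in lines:
        if any(marker in line for marker in NOISE_MARKERS):
            continue
        yield IPV4.sub("x.x.x.x", line)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python shrink_log.py jobmanager.log jobmanager-debug.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as src, \
            open(sys.argv[2], "w", encoding="utf-8") as dst:
        dst.writelines(shrink_log(src))
```

Streaming line by line keeps memory use flat, so the same script works on logs much larger than a few megabytes.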
Thanks!
Bruno
On Thu, 21 Mar 2019 at 15:24, Till Rohrmann wrote:
Hi Bruno,
could you upload the logs to https://transfer.sh/ or
https://gist.github.com/ and then post a link? This will be crucial for
further debugging. It would be really good if you could set the log level
to DEBUG.
Concerning the number of registered TMs, the new mode (not the legacy
mode), n
Hi Andrey,
Thanks for your response. I was trying to upload the logs somewhere, but
they are fairly large (~4 MB). Can you suggest somewhere I could put them?
In any case, I can see exceptions like this:
2019/03/18 10:11:50,763 DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing
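The DEBUG entry quoted above follows a `date time,millis LEVEL logger - message` layout. A minimal sketch for splitting such lines into fields, for example to count how many "Releasing" messages each logger emits, assuming every entry sits on one line in exactly that layout. The pattern is inferred from this single example, not from Flink's logging configuration, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Layout inferred from the quoted line:
# "2019/03/18 10:11:50,763 DEBUG org.apache.flink...SlotPool - Releasing ..."
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s+(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    ts: str
    level: str
    logger: str
    msg: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into fields, or return None if it does not match the layout."""
    m = LINE_RE.match(line.rstrip("\n"))
    return LogEntry(**m.groupdict()) if m else None


def count_by_logger(lines: Iterable[str], msg_prefix: str = "Releasing") -> Counter:
    """Count entries whose message starts with msg_prefix, grouped by logger name."""
    counts: Counter = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry.msg.startswith(msg_prefix):
            counts[entry.logger] += 1
    return counts
```

Lines that do not match (stack-trace continuations, for instance) are skipped rather than guessed at, so counts reflect only well-formed entries.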
Hi Bruno,
could you also share the job master logs?
Thanks,
Andrey
On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda wrote:
Hi,
This is causing serious instability and data loss in our production
environment. Any help figuring out what's going on here would be really
appreciated.
We recently updated our two EMR clusters from Flink 1.6.1 to Flink 1.7.2
(running on AWS EMR). The road to the upgrade was fairly rocky, but