Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-15 Thread Till Rohrmann
Hi Abdul, in Flink 1.4 we use Akka's death watch to detect hosts that are no longer reachable. The downside of the death watch mechanism is that hosts which were detected to be dead are quarantined. Once in this state you need to restart the ActorSystem in order to receive messages again. The idea be

Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-12 Thread Abdul Qadeer
We were able to fix it by passing the IP address instead of the hostname as the actor system listen address when starting the taskmanager: def runTaskManager( taskManagerHostname: String, resourceID: ResourceID, actorSystemPort: Int,

Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-12 Thread Till Rohrmann
It is hard to tell without all logs but it could easily be a K8s setup problem. Also problematic is that you are running a Flink version which is no longer actively supported. Try at least to use the latest bug fix release for 1.4. Cheers, Till On Fri, Oct 12, 2018, 09:43 Abdul Qadeer wrote: >

Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-12 Thread Abdul Qadeer
Hi Till, A few more data points: In a rerun of the same versions with fresh deployment, I see log.debug(s"RegisterTaskManager: $msg") in JobManager, however the *AcknowledgeRegistration/AlreadyRegistered* messages are never sent, I have taken tcpdump for the taskmanager which doesn't recove

Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-11 Thread Abdul Qadeer
Hi Till, I didn't try with newer versions as it is not possible to update the Flink version atm. If you could give any pointers for debugging that would be great. On Thu, Oct 11, 2018 at 2:44 AM Till Rohrmann wrote: > Hi Abdul, > > have you tried whether this problem also occurs with newer Flin

Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-11 Thread Till Rohrmann
Hi Abdul, have you tried whether this problem also occurs with newer Flink versions (1.5.4 or 1.6.1)? Cheers, Till On Thu, Oct 11, 2018 at 9:24 AM Dawid Wysakowicz wrote: > Hi Abdul, > > I've added Till and Gary to cc, who might be able to help you. > > Best, > > Dawid > > On 11/10/18 03:05, A

Re: Taskmanager times out continuously for registration with Jobmanager

2018-10-11 Thread Dawid Wysakowicz
Hi Abdul, I've added Till and Gary to cc, who might be able to help you. Best, Dawid On 11/10/18 03:05, Abdul Qadeer wrote: > > Hi, > > > We are facing an issue in standalone HA mode in Flink 1.4.0 where > Taskmanager restarts and is not able to register with the Jobmanager. > It times out awa

Taskmanager times out continuously for registration with Jobmanager

2018-10-10 Thread Abdul Qadeer
Hi, We are facing an issue in standalone HA mode in Flink 1.4.0 where Taskmanager restarts and is not able to register with the Jobmanager. It times out awaiting *AcknowledgeRegistration/AlreadyRegistered* message from Jobmanager Actor and keeps sending *RegisterTaskManager* message. The logs at