It is hard to tell without all logs but it could easily be a K8s setup problem. Also problematic is that you are running a Flink version which is no longer actively supported. Try at least to use the latest bug fix release for 1.4.
Cheers, Till On Fri, Oct 12, 2018, 09:43 Abdul Qadeer <quadeer....@gmail.com> wrote: > Hi Till, > > A few more data points: > > In a rerun of the same versions with fresh deployment, I see > *log*.debug(*s"RegisterTaskManager: > $*msg*"*) in JobManager, however the > *AcknowledgeRegistration/AlreadyRegistered *messages are never sent, I > have taken tcpdump for the taskmanager which doesn't recover and compared > it with another taskmanager which recovers after restart (i.e. receives > *AcknowledgeRegistration *message). > > Restarting the docker container of bad taskmanager doesn't work. The only > workaround right now is to delete the kubernetes pod holding the bad > taskmanager container. Does it have to do something with the akka address > the jobmanager stores for a taskmanager? The only variable I see between > restarting container vs pod is the change in the akka address. > > Also, the infinite retries for registration start after the taskmanager > container restarts with Jobmanager actor system quarantined: > > {"timeMillis":1539282282329,"thread":"flink-akka.actor.default-dispatcher-3","level":"ERROR","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"The > actor system akka.tcp://flink@taskmgr-6b59f97748-fmgwn:8070 has > quarantined the remote actor system akka.tcp://flink@192.168.83.52:6123. > Shutting the actor system down to be able to reestablish a > connection!","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":49,"threadPriority":5} > > > A manual restart by docker restart or killing the JVM doesn't reproduce > this problem. > > On Thu, Oct 11, 2018 at 11:15 AM Abdul Qadeer <quadeer....@gmail.com> > wrote: > >> Hi Till, >> >> I didn't try with newer versions as it is not possible to update the >> Flink version atm. >> If you could give any pointers for debugging that would be great. >> >> On Thu, Oct 11, 2018 at 2:44 AM Till Rohrmann <trohrm...@apache.org> >> wrote: >> >>> Hi Abdul, >>> >>> have you tried whether this problem also occurs with newer Flink >>> versions (1.5.4 or 1.6.1)? >>> >>> Cheers, >>> Till >>> >>> On Thu, Oct 11, 2018 at 9:24 AM Dawid Wysakowicz <dwysakow...@apache.org> >>> wrote: >>> >>>> Hi Abdul, >>>> >>>> I've added Till and Gary to cc, who might be able to help you. >>>> >>>> Best, >>>> >>>> Dawid >>>> >>>> On 11/10/18 03:05, Abdul Qadeer wrote: >>>> >>>> Hi, >>>> >>>> >>>> We are facing an issue in standalone HA mode in Flink 1.4.0 where >>>> Taskmanager restarts and is not able to register with the Jobmanager. It >>>> times out awaiting *AcknowledgeRegistration/AlreadyRegistered* message >>>> from Jobmanager Actor and keeps sending *RegisterTaskManager *message. >>>> The logs at Jobmanager don’t show anything about registration >>>> failure/request. It doesn’t print *log*.debug(*s"RegisterTaskManager: >>>> $*msg*"*) (from JobManager.scala) either. The network connection >>>> between taskmanager and jobmanager seems fine; tcpdump shows message sent >>>> to jobmanager and TCP ACK received from jobmanager. Note that the >>>> communication is happening between docker containers. >>>> >>>> >>>> Following are the logs from Taskmanager: >>>> >>>> >>>> >>>> {"timeMillis":1539189572438,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying >>>> to register at JobManager akka.tcp:// >>>> flink@192.168.83.51:6123/user/jobmanager (attempt 1400, timeout: 30000 >>>> milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5} >>>> >>>> {"timeMillis":1539189580229,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got >>>> ping response for sessionid: 0x10000260ea5002d after >>>> 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5} >>>> >>>> {"timeMillis":1539189600247,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got >>>> ping response for sessionid: 0x10000260ea5002d after >>>> 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5} >>>> >>>> {"timeMillis":1539189602458,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying >>>> to register at JobManager akka.tcp:// >>>> flink@192.168.83.51:6123/user/jobmanager (attempt 1401, timeout: 30000 >>>> milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5} >>>> >>>> {"timeMillis":1539189620251,"thread":"Curator-Framework-0-SendThread(zookeeper.maglev-system.svc.cluster.local:2181)","level":"DEBUG","loggerName":"org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn","message":"Got >>>> ping response for sessionid: 0x10000260ea5002d after >>>> 0ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":101,"threadPriority":5} >>>> >>>> {"timeMillis":1539189632478,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying >>>> to register at JobManager akka.tcp:// >>>> flink@192.168.83.51:6123/user/jobmanager (attempt 1402, timeout: 30000 >>>> milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":48,"threadPriority":5} >>>> >>>> >>>>