Abdul Qadeer created FLINK-12437:
------------------------------------
Summary: Taskmanager doesn't initiate registration after
jobmanager marks it terminated
Key: FLINK-12437
URL: https://issues.apache.org/jira/browse/FLINK-12437
Project: Flink
Issue Type: Bug
Reporter: Abdul Qadeer
This issue was observed in Standalone cluster deployment mode with ZooKeeper HA
enabled in Flink 1.4.0. A few taskmanagers restarted after running out of
Metaspace. The offending taskmanager `pipelineruntime-taskmgr-6789dd578b-dcp4r`
first registers successfully with the jobmanager, and the remote watcher marks it
terminated soon after, as seen in the logs. Other taskmanagers were terminated
around the same time, but the jobmanager had quarantined them with a message
similar to:
{noformat}
Association to [akka.tcp://[email protected]:8070] having UID [864976677] is
irrecoverably failed. UID is now quarantined and all messages to this UID will
be delivered to dead letters. Remote actorsystem must be restarted to recover
from this situation.
{noformat}
Those taskmanagers came back up and registered successfully with the
jobmanager. The offending taskmanager did not:
At JobManager:
{noformat}
{"timeMillis":1557073368155,"thread":"flink-akka.actor.default-dispatcher-49","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Registered
TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r
(akka.tcp://flink@10.60.5.85:8070/user/taskmanager) as
ae61ac607f0ab35ab5066f7dc221e654. Current number of registered hosts is 8.
Current number of alive task slots is
51.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":125,"threadPriority":5}
...
...
{"timeMillis":1557073391386,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered
task manager /10.60.5.85. Number of registered task managers 7. Number of
available slots
45.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
...
...
{"timeMillis":1557073391483,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered
task manager /10.60.5.85. Number of registered task managers 6. Number of
available slots
39.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
...
...
{"timeMillis":1557073370389,"thread":"flink-akka.actor.default-dispatcher-35","level":"INFO","loggerName":"akka.actor.LocalActorRef","message":"Message
[akka.remote.ReliableDeliverySupervisor$Ungate$] from
Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260]
to
Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260]
was not delivered. [22] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":98,"threadPriority":5}
{noformat}
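To line up the jobmanager events with the taskmanager log, it can help to convert the `timeMillis` fields into readable timestamps and compute the gap between registration and unregistration. A minimal Python sketch, assuming each JSON record has first been rejoined onto a single line (the excerpt above is wrapped) and trimmed to the two fields used; the helper names are illustrative and not part of Flink:

```python
import json
from datetime import datetime, timedelta, timezone

# Two records from the jobmanager excerpt, rejoined onto one line each and
# trimmed to the fields used here (the message text is shortened).
RECORDS = [
    '{"timeMillis":1557073368155,"message":"Registered TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r"}',
    '{"timeMillis":1557073391386,"message":"Unregistered task manager /10.60.5.85."}',
]

def to_utc(millis):
    """Convert epoch milliseconds to an ISO-8601 UTC timestamp."""
    # Integer arithmetic avoids floating-point rounding of the millisecond part.
    base = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    stamp = base + timedelta(milliseconds=millis % 1000)
    return stamp.isoformat(timespec="milliseconds")

def gap_millis(lines):
    """Return the time between the first and last record, in milliseconds."""
    times = [json.loads(line)["timeMillis"] for line in lines]
    return times[-1] - times[0]

for line in RECORDS:
    record = json.loads(line)
    print(to_utc(record["timeMillis"]), record["message"])

print("gap in ms:", gap_millis(RECORDS))
```

For these two records the gap is 23231 ms, that is, the taskmanager was unregistered roughly 23 seconds after it registered.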
At TaskManager:
{noformat}
{"timeMillis":1557073366068,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
TaskManager","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073366073,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
TaskManager actor system at
10.60.5.85:8070.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073366077,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying
to start actor system at
10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073366510,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.event.slf4j.Slf4jLogger","message":"Slf4jLogger
started","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073366694,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Starting
remoting","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073367049,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting
started; listening on addresses
:[akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073367051,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting
now listens on addresses:
[akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073367089,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Actor
system started at
akka.tcp://flink@10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367138,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Configuring
FlinkMetricsReporter with
{class=com.cisco.ndp.pipeline.processor.flink.metrics.FlinkMetricsReporter}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"com.cisco.ndp.pipeline.processor.flink.metrics.FlinkMetricsReporter","message":"Metrics
Reporter
Open","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Reporting
metrics for reporter ndp of type
com.cisco.ndp.pipeline.processor.flink.metrics.FlinkMetricsReporter.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367142,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
TaskManager
actor","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367176,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyConfig","message":"NettyConfig
[server address: /10.60.5.85, server port: 0, ssl enabled: false, memory
segment size (bytes): 32768, transport type: NIO, number of server threads: 3
(manual), number of client threads: 3 (manual), server connect backlog: 0 (use
Netty's default), client connect timeout (sec): 120, send/receive buffer size
(bytes): 0 (use Netty's
default)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367187,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration","message":"Messages
have a max timeout of 100000
ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367198,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Temporary
file directory '/tmp': total 373 GB, usable 295 GB (79.09%
usable)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367608,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.buffer.NetworkBufferPool","message":"Allocated
639 MB for network buffer pool (number of memory segments: 20467, bytes per
segment:
32768).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367710,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could
not load Queryable State Client Proxy. Probable reason:
flink-queryable-state-runtime is not in the classpath. Please put the
corresponding jar from the opt to the lib
folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367711,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could
not load Queryable State Server. Probable reason:
flink-queryable-state-runtime is not in the classpath. Please put the
corresponding jar from the opt to the lib
folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367712,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.NetworkEnvironment","message":"Starting
the network environment and its
components.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367753,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyClient","message":"Successful
initialization (took 34
ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367805,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyServer","message":"Successful
initialization (took 51 ms). Listening on SocketAddress
/10.60.5.85:38873.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367808,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Limiting
managed memory to 0.7 of the currently free heap space (4005 MB), memory will
be allocated
lazily.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367819,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.disk.iomanager.IOManager","message":"I/O
manager uses directory /tmp/flink-io-5f657721-13dd-40aa-9c00-2a15d5666280 for
spill
files.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367826,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User
file cache uses directory
/tmp/flink-dist-cache-30b1f2fd-9457-435b-a601-ae0b4e37dc6d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367862,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User
file cache uses directory
/tmp/flink-dist-cache-3dfb3cd5-b261-4df3-a662-a1cd91047c72","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367888,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
TaskManager actor at
akka://flink/user/taskmanager#1157564383.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367889,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager
data connection information:
pipelineruntime-taskmgr-6789dd578b-dcp4r-57b5f60d8144eb16425ec5bd9666768f @
pipelineruntime-taskmgr-6789dd578b-dcp4r
(dataPort=38873)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367890,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager
has 6 task
slot(s).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Memory
usage stats: [HEAP: 842/6554/6554 MB, NON HEAP: 62/64/1776 MB
(used/committed/max)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Starting
ZooKeeperLeaderRetrievalService.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367965,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Leader
node has changed with Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager,
session
ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
{"timeMillis":1557073367966,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"New
leader information: Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager,
session
ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
{"timeMillis":1557073367975,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying
to register at JobManager akka.tcp://flink@10.60.5.53:6123/user/jobmanager
(attempt 1, timeout: 500
milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368168,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Successful
registration at JobManager (akka.tcp://flink@10.60.5.53:6123/user/jobmanager),
starting network stack and library
cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368177,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Determined
BLOB server address to be /10.60.5.53:43987. Starting BLOB
cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368184,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.PermanentBlobCache","message":"Created
BLOB cache storage directory
/tmp/blobStore-ffdc49ba-e86f-4240-93ad-7566c43e9b0d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368189,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.TransientBlobCache","message":"Created
BLOB cache storage directory
/tmp/blobStore-764277b6-6e46-4c8f-b7ee-80f746edefab","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391398,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$R4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [1] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$S4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [2] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$T4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [3] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$U4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [4] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$V4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [5] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$W4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [6] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$X4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [7] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391474,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$Y4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [8] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391475,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$Z4] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [9] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391477,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$04] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [10] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
...
...
...
{"timeMillis":1557073691534,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
[org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
from Actor[akka.tcp://[email protected]:6123/temp/$sab] to
Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [316] dead
letters encountered. This logging can be turned off or adjusted with
configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":49,"threadPriority":5}
{noformat}
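The taskmanager excerpt above contains two different UIDs for `akka://flink/user/taskmanager`: the one logged at actor startup and the one addressed by the undelivered `LeaderSessionMessage`s. A small Python sketch for pulling the actor UIDs out of log messages so they can be compared; the regular expression and function name are illustrative, and the sample strings are shortened copies of the messages above:

```python
import re

# Matches the local taskmanager actor path followed by '#<uid>'.
# The UID may be negative, as in the dead-letter messages.
ACTOR_UID = re.compile(r"akka://flink/user/taskmanager#(-?\d+)")

def actor_uids(messages):
    """Return the distinct taskmanager actor UIDs, in order of first appearance."""
    seen = []
    for message in messages:
        for uid in ACTOR_UID.findall(message):
            if uid not in seen:
                seen.append(uid)
    return seen

# Shortened copies of two messages from the taskmanager log.
MESSAGES = [
    "Starting TaskManager actor at akka://flink/user/taskmanager#1157564383.",
    "Message [...LeaderSessionMessage] to "
    "Actor[akka://flink/user/taskmanager#-1883282689] was not delivered.",
]

uids = actor_uids(MESSAGES)
print(uids)
```

The sketch only surfaces that the two values differ in this excerpt; it does not establish a cause.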
TCP dump at taskmanager:
{noformat}
19:55:58.214944 IP 10.60.5.85.45008 > 10.60.5.53.6123: tcp 715
0x0000: 4500 02ff 2809 4000 4006 f0ee 0a3c 0555 E...(.@.@....<.U
0x0010: 0a3c 0535 afd0 17eb a107 10ac 0270 79da .<.5.........py.
0x0020: 8018 ce96 21f3 0000 0101 080a f2c0 c93f ....!..........?
0x0030: b74c ec05 0000 02c7 0ac4 0512 c105 0a3d .L.............=
0x0040: 0a3b 616b 6b61 2e74 6370 3a2f 2f66 6c69 .;akka.tcp://fli
0x0050: 6e6b 4031 302e 3630 2e35 2e35 333a 3631 nk@10.60.5.53:61
0x0060: 3233 2f75 7365 722f 6a6f 626d 616e 6167 23/user/jobmanag
0x0070: 6572 2331 3231 3433 3237 3831 3312 bf04 er#1214327813...
0x0080: 0aba 04ac ed00 0573 7200 3f6f 7267 2e61 .......sr.?org.a
0x0090: 7061 6368 652e 666c 696e 6b2e 7275 6e74 pache.flink.runt
0x00a0: 696d 652e 6d65 7373 6167 6573 2e54 6173 ime.messages.Tas
0x00b0: 6b4d 616e 6167 6572 4d65 7373 6167 6573 kManagerMessages
0x00c0: 2448 6561 7274 6265 6174 1fb7 fffd 259b $Heartbeat....%.
0x00d0: c539 0200 024c 000c 6163 6375 6d75 6c61 .9...L..accumula
0x00e0: 746f 7273 7400 164c 7363 616c 612f 636f torst..Lscala/co
0x00f0: 6c6c 6563 7469 6f6e 2f53 6571 3b4c 000a llection/Seq;L..
0x0100: 696e 7374 616e 6365 4944 7400 2e4c 6f72 instanceIDt..Lor
0x0110: 672f 6170 6163 6865 2f66 6c69 6e6b 2f72 g/apache/flink/r
0x0120: 756e 7469 6d65 2f69 6e73 7461 6e63 652f untime/instance/
0x0130: 496e 7374 616e 6365 4944 3b78 7073 7200 InstanceID;xpsr.
0x0140: 2473 6361 6c61 2e63 6f6c 6c65 6374 696f $scala.collectio
0x0150: 6e2e 6d75 7461 626c 652e 4172 7261 7942 n.mutable.ArrayB
0x0160: 7566 6665 7215 38b0 5383 828e 7302 0003 uffer.8.S...s...
0x0170: 4900 0b69 6e69 7469 616c 5369 7a65 4900 I..initialSizeI.
0x0180: 0573 697a 6530 5b00 0561 7272 6179 7400 .size0[..arrayt.
0x0190: 135b 4c6a 6176 612f 6c61 6e67 2f4f 626a .[Ljava/lang/Obj
0x01a0: 6563 743b 7870 0000 0010 0000 0000 7572 ect;xp........ur
0x01b0: 0013 5b4c 6a61 7661 2e6c 616e 672e 4f62 ..[Ljava.lang.Ob
0x01c0: 6a65 6374 3b90 ce58 9f10 7329 6c02 0000 ject;..X..s)l...
0x01d0: 7870 0000 0010 7070 7070 7070 7070 7070 xp....pppppppppp
0x01e0: 7070 7070 7070 7372 002c 6f72 672e 6170 ppppppsr.,org.ap
0x01f0: 6163 6865 2e66 6c69 6e6b 2e72 756e 7469 ache.flink.runti
0x0200: 6d65 2e69 6e73 7461 6e63 652e 496e 7374 me.instance.Inst
0x0210: 616e 6365 4944 0000 0000 0000 0001 0200 anceID..........
0x0220: 0078 7200 206f 7267 2e61 7061 6368 652e .xr..org.apache.
0x0230: 666c 696e 6b2e 7574 696c 2e41 6273 7472 flink.util.Abstr
0x0240: 6163 7449 4400 0000 0000 0000 0102 0003 actID...........
0x0250: 4a00 096c 6f77 6572 5061 7274 4a00 0975 J..lowerPartJ..u
0x0260: 7070 6572 5061 7274 4c00 0874 6f53 7472 pperPartL..toStr
0x0270: 696e 6774 0012 4c6a 6176 612f 6c61 6e67 ingt..Ljava/lang
0x0280: 2f53 7472 696e 673b 7870 ae61 ac60 7f0a /String;xp.a.`..
0x0290: b35a b506 6f7d c221 e654 7400 2061 6536 .Z..o}.!.Tt..ae6
0x02a0: 3161 6336 3037 6630 6162 3335 6162 3530 1ac607f0ab35ab50
0x02b0: 3636 6637 6463 3232 3165 3635 3410 0122 66f7dc221e654.."
0x02c0: 3e0a 3c61 6b6b 612e 7463 703a 2f2f 666c >.<akka.tcp://fl
0x02d0: 696e 6b40 3130 2e36 302e 352e 3835 3a38 ink@10.60.5.85:8
0x02e0: 3037 302f 7573 6572 2f74 6173 6b6d 616e 070/user/taskman
0x02f0: 6167 6572 2331 3135 3735 3634 3338 33 ager#1157564383
19:55:58.214996 IP 10.60.5.53.6123 > 10.60.5.85.45008: tcp 0
0x0000: 4500 0034 c1fe 4000 3f06 5ac4 0a3c 0535 E..4..@.?.Z..<.5
0x0010: 0a3c 0555 17eb afd0 0270 79da a107 1377 .<.U.....py....w
0x0020: 8010 ce93 1f28 0000 0101 080a b74c ff8d .....(.......L..
0x0030: f2c0 c93f ...?
{noformat}
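The hex column of the dump can be decoded directly to check which actor paths the packet carries. A minimal Python sketch, using two rows copied from the dump above (hex groups only, without the offset and ASCII columns); the function name is illustrative:

```python
def decode_hex_row(hex_groups):
    """Decode space-separated hex groups, showing non-printable bytes as '.'."""
    raw = bytes.fromhex(hex_groups)  # spaces between groups are ignored
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# Row 0x0050: part of the jobmanager path near the start of the payload.
jobmanager_part = decode_hex_row("6e6b 4031 302e 3630 2e35 2e35 333a 3631")

# Row 0x02f0: the end of the taskmanager path, including the actor UID.
taskmanager_part = decode_hex_row("6167 6572 2331 3135 3735 3634 3338 33")

print(jobmanager_part)
print(taskmanager_part)
```

Decoded this way, the first row reads `nk@10.60.5.53:61` and the last row reads `ager#1157564383`, which matches the UID the taskmanager logged when its actor started.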
After the events shown above, the taskmanager never registers with the jobmanager again.
This run had the following akka configuration:
{noformat}
akka.watch.heartbeat.pause: 60 s
akka.ask.timeout: 100 s
{noformat}