Hi all,

We run Flink 1.5 in batch mode and start thousands of jobs per day. 
Occasionally we see jobs get stuck because a TM hangs in the “DEPLOYING” state. 

The TM log shows that the task is stuck downloading jars in BlobClient:

————
...
INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task 
DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) 
(184/2000).
INFO  org.apache.flink.runtime.taskmanager.Task                     - 
DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) 
(184/2000) switched from CREATED to DEPLOYING.
INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating 
FileSystem stream leak safety net for task DataSource (at 
createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading 
JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) 
(our.code)) (184/2000) [DEPLOYING].
INFO  org.apache.flink.runtime.blob.BlobClient                          - 
Downloading 
19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280
 from some-host-ip-port

no more logs...
————

It seems that the TM calls BlobClient to download the jars from the 
JM/BlobServer. Under the hood, BlobClient calls Socket.connect() and then reads 
from the socket to retrieve the data. If neither call has a timeout, a stalled 
connection would block indefinitely. 

Should we add timeouts to the socket operations in BlobClient to resolve this 
issue?
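
For illustration, here is a minimal sketch of the kind of timeouts in question, 
using plain java.net.Socket. It is not taken from Flink's BlobClient: the class 
name BlobSocketTimeoutSketch, the helper openWithTimeouts, and the timeout 
values are all hypothetical. The demo talks to a local server that accepts the 
connection but never sends anything, which roughly mimics the hang above.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class BlobSocketTimeoutSketch {

    // Hypothetical helper (not a Flink API): opens a socket whose connect
    // and read operations are both bounded by a timeout.
    static Socket openWithTimeouts(InetSocketAddress address,
                                   int connectTimeoutMs,
                                   int readTimeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            // Bounds every blocking read on the socket's input stream.
            socket.setSoTimeout(readTimeoutMs);
            // Bounds connection establishment.
            socket.connect(address, connectTimeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    // Connects to a local server that never writes any data and reports
    // whether the client read gave up with a SocketTimeoutException.
    static boolean readTimesOut(int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback)) {
            InetSocketAddress address =
                    new InetSocketAddress(loopback, silentServer.getLocalPort());
            try (Socket client = openWithTimeouts(address, 1000, readTimeoutMs);
                 Socket accepted = silentServer.accept()) {
                // Without SO_TIMEOUT this read would block forever.
                client.getInputStream().read();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With something like this, a stalled download would fail with an exception 
rather than hang, so the task could fail and be retried instead of staying in 
DEPLOYING. Suitable timeout values would depend on jar size and network 
conditions.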

Thanks,
Qi
