Hello,

This issue occurred again, and this time we took a thread dump of the TM. The task thread is indeed hanging on a socket read while downloading a jar from the Blob server:

"DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
        at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
        at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
        at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        - locked <0x000000078ab60ba8> (a java.lang.Object)
        at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
        at java.lang.Thread.run(Thread.java:748)

I checked the latest master code. There’s still no socket timeout in the Blob client. Should I create an issue to add this timeout?

Regards,
Qi

> On Apr 19, 2019, at 7:49 PM, qi luo <luoqi...@gmail.com> wrote:
>
> Hi all,
>
> We use Flink 1.5 for batch jobs and start thousands of jobs per day.
> Occasionally we observe stuck jobs, caused by a TM hanging in the
> “DEPLOYING” state.
>
> The TM log shows that it is stuck downloading jars in BlobClient:
>
> ————
> ...
> INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
> INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
> INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
> INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
> INFO org.apache.flink.runtime.blob.BlobClient - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
>
> no more logs...
> ————
>
> It seems that the TM is calling BlobClient to download jars from the
> JM/BlobServer. Under the hood it calls Socket.connect() and then
> Socket.read() to retrieve the results.
>
> Should we add a timeout to the socket operations in BlobClient to resolve
> this issue?
>
> Thanks,
> Qi
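
To illustrate the kind of timeout being discussed, here is a minimal sketch using plain java.net.Socket. It is not Flink's BlobClient code and not a proposed patch: the class name, the method names and the timeout values are all made up for the example. It only shows that a connect timeout plus SO_TIMEOUT turns an indefinite block in read() into a SocketTimeoutException, here against a local server that accepts the connection at the TCP level but never sends anything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SocketTimeoutSketch {

    // Hypothetical values for illustration; these are not existing Flink settings.
    static final int CONNECT_TIMEOUT_MS = 1_000;
    static final int READ_TIMEOUT_MS = 500;

    /** Opens a socket whose connect and read calls are both bounded in time. */
    static Socket openWithTimeouts(InetSocketAddress address,
                                   int connectTimeoutMs,
                                   int readTimeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            // Bounded connect: throws SocketTimeoutException if not connected in time.
            socket.connect(address, connectTimeoutMs);
            // SO_TIMEOUT: any later read() blocks for at most readTimeoutMs.
            socket.setSoTimeout(readTimeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    /**
     * Connects to a local server that never writes a byte and reports whether
     * the read gave up with a SocketTimeoutException instead of hanging.
     */
    static boolean readTimesOutOnSilentServer() {
        // The server socket is bound but accept() is never called; the OS backlog
        // still completes the TCP handshake, so the client connects and then waits.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress address = new InetSocketAddress(
                    InetAddress.getLoopbackAddress(), silentServer.getLocalPort());
            try (Socket client =
                         openWithTimeouts(address, CONNECT_TIMEOUT_MS, READ_TIMEOUT_MS)) {
                client.getInputStream().read(); // would block forever without SO_TIMEOUT
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOutOnSilentServer());
    }
}
```

With a timeout like this, the caller gets an exception it can handle, for example by retrying the download or failing the task, rather than leaving the task in DEPLOYING indefinitely. Which of those is appropriate for the Blob client is a separate design question.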