[ https://issues.apache.org/jira/browse/FLINK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847441#comment-16847441 ]
Qi commented on FLINK-12426:
----------------------------

[~till.rohrmann] thanks for looking into this. This issue happens rarely, with a large number of TMs (250 in our case). I'm wondering whether it would be appropriate to add a timeout to _BlobClient.downloadFromBlobServer_?

> TM occasionally hang in deploying state
> ---------------------------------------
>
>                 Key: FLINK-12426
>                 URL: https://issues.apache.org/jira/browse/FLINK-12426
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Qi
>            Priority: Major
>
> Hi all,
>
> We use Flink batch and start thousands of jobs per day. Occasionally we
> observed some stuck jobs, caused by some TMs hanging in the “DEPLOYING” state.
>
> It seems that the TM is calling BlobClient to download jars from
> JM/BlobServer. Under the hood it calls Socket.connect() and then
> Socket.read() to retrieve the results.
>
> These jobs usually have many TM slots (1~2k). We checked the TM log and
> dumped the TM threads. It indeed hung on a socket read while downloading the
> jar from the Blob server.
>
> We're using Flink 1.5, but this may also affect later versions since the
> related code has not changed much. We tried adding a socket timeout in
> BlobClient, but still no luck.
>
> ————————
> TM log
> ————————
> ...
> INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
> INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
> INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
> INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
> INFO org.apache.flink.runtime.blob.BlobClient - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
> no more logs...
>
> ————————
> TM thread dump:
> ————————
> "DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
>         at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
>         at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>         at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>         at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>         - locked <0x000000078ab60ba8> (a java.lang.Object)
>         at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>         at java.lang.Thread.run(Thread.java:748)
> ————————
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
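
For reference, the kind of timeout asked about in the comment can be sketched with plain java.net.Socket calls. This is a minimal sketch and not Flink's BlobClient code: the class ReadTimeoutSketch, the method readTimesOut, and the timeout values are hypothetical, and the silent local server only stands in for a Blob server that stops sending data. The reporter notes that a socket timeout in BlobClient did not resolve the hang in their setup, so this illustrates the mechanism only, not a confirmed fix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Connects with a connect timeout, then attempts one blocking read with a
     * read timeout. Returns true if the read timed out, false if a byte or
     * end-of-stream arrived first.
     */
    static boolean readTimesOut(InetSocketAddress server, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the time spent establishing the connection.
            socket.connect(server, connectTimeoutMs);
            // Bound each subsequent blocking read on this socket.
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            try {
                in.read();
                return false;
            } catch (SocketTimeoutException e) {
                // Without setSoTimeout, this read would block indefinitely.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a stalled server: the OS completes the TCP
        // handshake via the listen backlog, but nothing is ever written.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress address =
                    new InetSocketAddress(silentServer.getInetAddress(), silentServer.getLocalPort());
            System.out.println("read timed out: " + readTimesOut(address, 1000, 200));
        }
    }
}
```

Socket.setSoTimeout bounds each individual blocking read rather than the download as a whole, so a peer that keeps trickling bytes would not trigger it; an overall deadline would have to be tracked separately.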