[ https://issues.apache.org/jira/browse/FLINK-12547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848633#comment-16848633 ]
Haibo Sun commented on FLINK-12547:
-----------------------------------

[~QiLuo], because the blob client has a retry mechanism, I understand that "The TM hangs for over an hour (longer than the SO_TIMEOUT)" is possible, but that does not mean SO_TIMEOUT is not working. In addition, `30 minutes` is too long; I think you should set SO_TIMEOUT to a smaller value.

> Deadlock when the task thread downloads jars using BlobClient
> -------------------------------------------------------------
>
>                 Key: FLINK-12547
>                 URL: https://issues.apache.org/jira/browse/FLINK-12547
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>            Reporter: Haibo Sun
>            Assignee: Haibo Sun
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The jstack is as follows (this jstack is from an old Flink version, but the
> master branch has the same problem).
> {code:java}
> "Source: Custom Source (76/400)" #68 prio=5 os_prio=0 tid=0x00007f8139cd3000 nid=0xe2 runnable [0x00007f80da5fd000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:170)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
>         at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:164)
>         at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>         at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>         at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>         - locked <0x000000062cf2a188> (a java.lang.Object)
>         at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:968)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:604)
>         at java.lang.Thread.run(Thread.java:834)
>    Locked ownable synchronizers:
>         - None
> {code}
>
> The reason is that SO_TIMEOUT is not set on the socket connection of the blob
> client. When the network suffers serious packet loss due to high CPU load on
> the machine, the blob client connection fails to detect that the server has
> disconnected, so the read blocks indefinitely in the native method.
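
To make the two points above concrete (a read with no SO_TIMEOUT can block forever, and retries can stretch the total hang well beyond a single SO_TIMEOUT), here is a minimal sketch using only `java.net`. It is not the BlobClient code: the class `SoTimeoutSketch`, the methods `readTimesOut` and `blockedMillisWithRetries`, the silent local server, and the timeout and retry values are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration; not part of Flink.
class SoTimeoutSketch {

    /**
     * Connects to a local server that never sends data and tries to read one byte.
     * Returns true if the blocking read was aborted by SO_TIMEOUT.
     */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        // Stand-in for an unresponsive server: the OS completes the TCP handshake
        // from the listen backlog, but nothing is ever accepted or written.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {
            // With the default of 0, the read below would block indefinitely,
            // which is the situation shown in the jstack.
            client.setSoTimeout(soTimeoutMillis);
            try {
                client.getInputStream().read();
                return false; // data or EOF arrived; not expected with a silent server
            } catch (SocketTimeoutException e) {
                return true; // the read gave up after roughly soTimeoutMillis
            }
        }
    }

    /**
     * Retries the timed-out read up to maxAttempts times and returns the total
     * time spent blocked, in milliseconds (roughly maxAttempts * soTimeoutMillis).
     */
    static long blockedMillisWithRetries(int soTimeoutMillis, int maxAttempts) throws IOException {
        long start = System.nanoTime();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!readTimesOut(soTimeoutMillis)) {
                break; // the read completed, so no further retry is needed
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("single read timed out: " + readTimesOut(200));
        System.out.println("blocked for ~" + blockedMillisWithRetries(200, 3) + " ms over 3 attempts");
    }
}
```

With a 30-minute SO_TIMEOUT and a few retries, the same arithmetic would account for a hang of more than an hour, so a smaller timeout bounds the worst case much more tightly.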