[ 
https://issues.apache.org/jira/browse/FLINK-12547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848313#comment-16848313
 ] 

Qi commented on FLINK-12547:
----------------------------

Hi Haibo,

 

We had added SO_TIMEOUT to the blob socket (SO_TIMEOUT = 30min) in our internal 
Flink release, but this deadlock issue still occurs as described in 
https://issues.apache.org/jira/browse/FLINK-12426. The TM hangs for over an 
hour (longer than the SO_TIMEOUT).

 

We've observed that the JM is under heavy CPU load during the deadlock issue, 
but recovers after a few minutes. This looks pretty strange to me.

 

Thanks,

Qi

> Deadlock when the task thread downloads jars using BlobClient
> -------------------------------------------------------------
>
>                 Key: FLINK-12547
>                 URL: https://issues.apache.org/jira/browse/FLINK-12547
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>            Reporter: Haibo Sun
>            Assignee: Haibo Sun
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The jstack is as follows (this jstack is from an old Flink version, but the 
> master branch has the same problem).
> {code:java}
> "Source: Custom Source (76/400)" #68 prio=5 os_prio=0 tid=0x00007f8139cd3000 
> nid=0xe2 runnable [0x00007f80da5fd000]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at 
> org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
> at 
> org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
> at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:164)
> at 
> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
> at 
> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
> at 
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
> - locked <0x000000062cf2a188> (a java.lang.Object)
> at 
> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:968)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:604)
> at java.lang.Thread.run(Thread.java:834)
> Locked ownable synchronizers:
> - None
> {code}
>  
> The reason is that SO_TIMEOUT is not set in the socket connection of the blob 
> client. When the network packet loss seriously due to the high CPU load of 
> the machine, the blob client connection fails to perceive that the server has 
> been disconnected, which results in blocking in the native method. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to