Hello,

This issue occurred again, and we took a thread dump of the TM. It was indeed 
hanging on a socket read while downloading a jar from the Blob server:

"DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
        at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
        at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
        at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        - locked <0x000000078ab60ba8> (a java.lang.Object)
        at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
        at java.lang.Thread.run(Thread.java:748)

I checked the latest code on master, and there is still no socket timeout in 
the Blob client. Should I create an issue to add one?
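
For illustration, here is a minimal sketch of the kind of timeout I mean, using 
plain java.net.Socket. It is not the actual BlobClient code: the class name, the 
method names, and the timeout values are all hypothetical, and a local server 
that never writes stands in for an unresponsive Blob server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch, not Flink code: shows a connect timeout plus a read timeout.
class BlobSocketTimeoutSketch {

    /** Opens a socket with a connect timeout and a read timeout (SO_TIMEOUT), both in ms. */
    static Socket openWithTimeouts(InetSocketAddress address,
                                   int connectTimeoutMs,
                                   int readTimeoutMs) throws IOException {
        Socket socket = new Socket();
        try {
            // Fails with SocketTimeoutException if the connection is not established in time.
            socket.connect(address, connectTimeoutMs);
            // Makes every blocking read() fail with SocketTimeoutException after this delay.
            socket.setSoTimeout(readTimeoutMs);
            return socket;
        } catch (IOException e) {
            socket.close();
            throw e;
        }
    }

    /** Returns true if reading from a peer that never sends data ends in a timeout. */
    static boolean readTimesOut(int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server socket is never accept()ed or written to; the OS backlog still
        // completes the TCP handshake, so the client connects and then waits for data.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = openWithTimeouts(
                     new InetSocketAddress(loopback, silentServer.getLocalPort()),
                     1000,
                     readTimeoutMs)) {
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or EOF arrived, so no timeout happened
            } catch (SocketTimeoutException expected) {
                return true; // the read gave up instead of blocking forever
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Without the setSoTimeout() call, the read() in this sketch would block for as 
long as the peer stays silent, which matches the socketRead0 frame in the dump 
above. The caller could then fail the download or retry it.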

Regards,
Qi 

> On Apr 19, 2019, at 7:49 PM, qi luo <luoqi...@gmail.com> wrote:
> 
> Hi all,
> 
> We use Flink 1.5 in batch mode and start thousands of jobs per day. Occasionally 
> we observe stuck jobs, caused by a TM hanging in the “DEPLOYING” state. 
> 
> The TM log shows that it is stuck downloading jars in BlobClient:
> 
> ————
> ...
> INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
> INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
> INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
> INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
> INFO  org.apache.flink.runtime.blob.BlobClient                          - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
> 
> no more logs...
> ————
> 
> It seems that the TM is calling BlobClient to download jars from the 
> JM/BlobServer. Under the hood it calls Socket.connect() and then reads from 
> the socket's input stream to retrieve the results. 
> 
> Should we add timeouts to the socket operations in BlobClient to resolve this 
> issue?
> 
> Thanks,
> Qi
