Thanks Dawid, I’ve created an issue for this https://jira.apache.org/jira/browse/FLINK-12426 <https://jira.apache.org/jira/browse/FLINK-12426>. Though we’re using 1.5 but this may affect later versions.
I’m still investigating the root case but no result yet. This happens occasionally and isn't easy to reproduce. > On Apr 25, 2019, at 6:40 PM, Dawid Wysakowicz <dwysakow...@apache.org> wrote: > > Hi, > > Feel free to open a JIRA for this issue. By the way have you investigated > what is the root cause for it hanging? > > Best, > > Dawid > > On 25/04/2019 08:55, qi luo wrote: >> Hello, >> >> This issue occurred again and we dumped the TM thread. It indeed hung on >> socket read to download jar from Blob server: >> >> "DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) >> (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 >> runnable [0x00007fb97cfbf000] >> java.lang.Thread.State: RUNNABLE >> at java.net.SocketInputStream.socketRead0(Native Method) >> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) >> at java.net.SocketInputStream.read(SocketInputStream.java:171) >> at java.net.SocketInputStream.read(SocketInputStream.java:141) >> at >> org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152) >> at >> org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140) >> at >> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170) >> at >> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181) >> at >> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206) >> at >> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120) >> - locked <0x000000078ab60ba8> (a java.lang.Object) >> at >> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893) >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585) >> at java.lang.Thread.run(Thread.java:748) >> >> I checked the latest master code. There’s still no socket timeout in Blob >> client. Should I create an issue to add this timeout? >> >> Regards, >> Qi >> >>> On Apr 19, 2019, at 7:49 PM, qi luo <luoqi...@gmail.com >>> <mailto:luoqi...@gmail.com>> wrote: >>> >>> Hi all, >>> >>> We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we >>> observed some stuck jobs, due to some TM hang in “DEPLOYING” state. >>> >>> On checking TM log, it shows that it stuck in downloading jars in >>> BlobClient: >>> >>> ———— >>> ... >>> INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received >>> task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) >>> (184/2000). >>> INFO org.apache.flink.runtime.taskmanager.Task - >>> DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) >>> (184/2000) switched from CREATED to DEPLOYING. >>> INFO org.apache.flink.runtime.taskmanager.Task - >>> Creating FileSystem stream leak safety net for task DataSource (at >>> createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) >>> [DEPLOYING] >>> INFO org.apache.flink.runtime.taskmanager.Task - >>> Loading JAR files for task DataSource (at >>> createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) >>> [DEPLOYING]. >>> INFO org.apache.flink.runtime.blob.BlobClient - >>> Downloading >>> 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 >>> from some-host-ip-port >>> >>> no more logs... >>> ———— >>> >>> It seems that the TM is calling BlobClient to download jars from >>> JM/BlobServer. Under hood it’s calling Socket.connect() and then >>> Socket.read() to retrieve results. >>> >>> Should we add timeout in socket operations in BlobClient to resolve this >>> issue? >>> >>> Thanks, >>> Qi >>