Hi,

Feel free to open a JIRA issue for this. By the way, have you investigated the root cause of the hang?

Best,
Dawid

On 25/04/2019 08:55, qi luo wrote:
> Hello,
>
> This issue occurred again and we dumped the TM thread. It indeed hung
> on socket read to download jar from Blob server:
>
> "DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
>         at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
>         at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
>         at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>         at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>         at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>         - locked <0x000000078ab60ba8> (a java.lang.Object)
>         at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>         at java.lang.Thread.run(Thread.java:748)
>
> I checked the latest master code. There’s still no socket timeout in
> Blob client. Should I create an issue to add this timeout?
>
> Regards,
> Qi
>
>> On Apr 19, 2019, at 7:49 PM, qi luo <luoqi...@gmail.com> wrote:
>>
>> Hi all,
>>
>> We use Flink 1.5 batch and start thousands of jobs per day.
>> Occasionally we observed some stuck jobs, due to some TM hang in
>> “DEPLOYING” state.
>>
>> On checking TM log, it shows that it stuck in downloading jars in
>> BlobClient:
>>
>> ----
>> ...
>> INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
>> INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
>> INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
>> INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
>> INFO org.apache.flink.runtime.blob.BlobClient - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
>> no more logs...
>> ----
>>
>> It seems that the TM is calling BlobClient to download jars from
>> JM/BlobServer. Under hood it’s calling Socket.connect() and then
>> Socket.read() to retrieve results.
>>
>> Should we add timeout in socket operations in BlobClient to resolve
>> this issue?
>>
>> Thanks,
>> Qi
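
For illustration, here is a minimal sketch of how connect and read timeouts can be set on a plain java.net.Socket, which is the kind of change being proposed above. This is not Flink's BlobClient code: the class name TimedSocketRead, the method readFirstByte, and the timeout parameters are hypothetical and only show the standard JDK calls involved. Reads on a socket go through its InputStream, and the default read timeout of 0 means a read can block indefinitely, which would match the thread dump above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Flink: shows where timeouts attach to a socket.
class TimedSocketRead {

    /**
     * Connects to host:port and reads one byte, failing fast instead of
     * blocking forever when the peer accepts the connection but stays silent.
     * Returns the byte read, or -1 if the peer closed the stream.
     */
    static int readFirstByte(String host, int port,
                             int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the time spent establishing the connection.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Bound every blocking read on this socket; 0 (the default) waits forever.
            socket.setSoTimeout(readTimeoutMs);
            InputStream in = socket.getInputStream();
            // Throws java.net.SocketTimeoutException if no data arrives in time.
            return in.read();
        }
    }
}
```

With a read timeout in place, a stalled download would surface as a SocketTimeoutException that the caller can log and retry, rather than leaving the task in DEPLOYING with no further output.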