Re: TM occasionally hang in deploying state in Flink 1.5

qi luo Tue, 07 May 2019 02:17:23 -0700

Thanks Dawid, I’ve created an issue for this 
https://jira.apache.org/jira/browse/FLINK-12426 
<https://jira.apache.org/jira/browse/FLINK-12426>. Though we’re using 1.5 but 
this may affect later versions.


I’m still investigating the root case but no result yet. This happens 
occasionally and isn't easy to reproduce.

> On Apr 25, 2019, at 6:40 PM, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> 
> Hi,
> 
> Feel free to open a JIRA for this issue. By the way have you investigated 
> what is the root cause for it hanging?
> 
> Best,
> 
> Dawid
> 
> On 25/04/2019 08:55, qi luo wrote:
>> Hello,
>> 
>> This issue occurred again and we dumped the TM thread. It indeed hung on 
>> socket read to download jar from Blob server:
>> 
>> "DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) 
>> (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 
>> runnable [0x00007fb97cfbf000]
>>    java.lang.Thread.State: RUNNABLE
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>>         at 
>> org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
>>         at 
>> org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
>>         at 
>> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
>>         at 
>> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>>         at 
>> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>>         at 
>> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>>         - locked <0x000000078ab60ba8> (a java.lang.Object)
>>         at 
>> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
>>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
>>         at java.lang.Thread.run(Thread.java:748)
>> 
>> I checked the latest master code. There’s still no socket timeout in Blob 
>> client. Should I create an issue to add this timeout?
>> 
>> Regards,
>> Qi 
>> 
>>> On Apr 19, 2019, at 7:49 PM, qi luo <luoqi...@gmail.com 
>>> <mailto:luoqi...@gmail.com>> wrote:
>>> 
>>> Hi all,
>>> 
>>> We use Flink 1.5 batch and start thousands of jobs per day. Occasionally we 
>>> observed some stuck jobs, due to some TM hang in “DEPLOYING” state. 
>>> 
>>> On checking TM log, it shows that it stuck in downloading jars in 
>>> BlobClient:
>>> 
>>> ————
>>> ...
>>> INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor       - Received 
>>> task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) 
>>> (184/2000).
>>> INFO  org.apache.flink.runtime.taskmanager.Task                     - 
>>> DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) 
>>> (184/2000) switched from CREATED to DEPLOYING.
>>> INFO  org.apache.flink.runtime.taskmanager.Task                     - 
>>> Creating FileSystem stream leak safety net for task DataSource (at 
>>> createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) 
>>> [DEPLOYING]
>>> INFO  org.apache.flink.runtime.taskmanager.Task                     - 
>>> Loading JAR files for task DataSource (at 
>>> createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) 
>>> [DEPLOYING].
>>> INFO  org.apache.flink.runtime.blob.BlobClient                          - 
>>> Downloading 
>>> 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280
>>>  from some-host-ip-port
>>> 
>>> no more logs...
>>> ————
>>> 
>>> It seems that the TM is calling BlobClient to download jars from 
>>> JM/BlobServer. Under hood it’s calling Socket.connect() and then 
>>> Socket.read() to retrieve results. 
>>> 
>>> Should we add timeout in socket operations in BlobClient to resolve this 
>>> issue?
>>> 
>>> Thanks,
>>> Qi
>>

Re: TM occasionally hang in deploying state in Flink 1.5

Reply via email to