[
https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061487#comment-17061487
]
Gary Yao edited comment on FLINK-16468 at 3/18/20, 8:16 AM:
------------------------------------------------------------
[~longtimer] We could say a few words about the implications of a low restart
delay in the user docs. Since the implications are very extensive, I would keep
the description general and not about the BlobClient specifically. If you want
to improve the user docs, feel free to open a new JIRA issue and cc me.

Introducing a backoff time makes sense since we currently just exhaust all
retry attempts without giving the network/services time to recover. However, I
would still keep the retry delays low (i.e., a few seconds) because otherwise
the user is left without feedback about the state of the deployment. If you
want to work on this issue, let me know and I will assign it to you.
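
To illustrate the kind of backoff meant here, a minimal Java sketch follows. It is not Flink's implementation: the class BackoffRetry, its methods, and the parameter values are hypothetical, and it assumes the failing operation reports failure with an IOException. The delay doubles with each attempt and is capped so that it stays within a few seconds.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Flink): retries an operation with capped exponential backoff. */
final class BackoffRetry {

    /** Delay before the retry that follows 0-based attempt {@code attempt}: initial * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Calls {@code operation} up to maxAttempts times, sleeping between failed attempts. */
    static <T> T callWithBackoff(
            Callable<T> operation, int maxAttempts, long initialMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    // Give the network/services time to recover before the next attempt.
                    Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

With initialMillis = 1000 and maxMillis = 4000, the waits between attempts in this sketch would be 1 s, 2 s, 4 s, 4 s, and so on, which keeps the delays low enough that the user still gets timely feedback.
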

The current 1 second restart delay probably already mitigates the issue. With
at most 60 restarts per minute and 5 BlobClient retries per restart, there will
be at most 300 (60*5) retries per minute, and a socket in the TIME-WAIT state
is released after [1
minute|https://github.com/torvalds/linux/blob/bd2463ac7d7ec51d432f23bf0e893fb371a908cd/include/net/tcp.h#L121].
Therefore, the current retry mechanism holds at most 300 sockets per TM at any
time.
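
The bound above can be reproduced with a small calculation. This is only a sketch under assumptions added for illustration: that 60*5 means 60 restarts per minute (one per second) times 5 fetch attempts per restart, that each attempt leaves exactly one socket in TIME-WAIT, and that TIME-WAIT lasts 60 seconds. TimeWaitBound is a hypothetical name.

```java
/** Hypothetical calculation (not Flink code): upper bound on sockets lingering in TIME-WAIT. */
final class TimeWaitBound {

    /**
     * Assumes one restart every restartDelayMillis, retriesPerRestart connection
     * attempts per restart, and one TIME-WAIT socket per attempt.
     */
    static long maxSocketsInTimeWait(
            long restartDelayMillis, int retriesPerRestart, long timeWaitMillis) {
        // Number of restarts that fit into one TIME-WAIT window, rounded up.
        long restartsInWindow = (timeWaitMillis + restartDelayMillis - 1) / restartDelayMillis;
        return restartsInWindow * retriesPerRestart;
    }

    public static void main(String[] args) {
        // 1 s restart delay, 5 retries, 60 s TIME-WAIT
        System.out.println(maxSocketsInTimeWait(1_000, 5, 60_000));
    }
}
```

Under the same assumptions, a longer restart delay lowers the bound proportionally; for example, a 10 second delay gives 6 * 5 = 30 sockets.
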
> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
> Key: FLINK-16468
> URL: https://issues.apache.org/jira/browse/FLINK-16468
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.3, 1.9.2, 1.10.0
> Environment: Ubuntu Linux servers, current with the latest Ubuntu patch
> release, running a Java 8 JRE
> Reporter: Jason Kania
> Priority: Major
> Fix For: 1.11.0
>
>
> In situations where the BlobClient retrieval fails as in the following log,
> rapid retries will exhaust the open sockets. All the retries happen within a
> few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004 Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
> at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
> at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
> at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
> at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
> at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
> at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
> at java.net.Socket.createImpl(Socket.java:478)
> at java.net.Socket.connect(Socket.java:605)
> at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
> ... 8 more
> {noformat}
> The retries should have some form of backoff in this situation to avoid
> flooding the logs and exhausting other resources on the server.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)