[
https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064757#comment-17064757
]
Gary Yao edited comment on FLINK-16468 at 3/23/20, 12:04 PM:
-------------------------------------------------------------
{quote}
I will happily update the user docs, but would appreciate some input on what
the implications might be since my lack of experience on the implications was
the part of the reason why this issue and
https://issues.apache.org/jira/browse/FLINK-16470 were raised in the first
place.
{quote}
We changed the default restart delay from 0s to 1s to mitigate restart storms
(see
[FLIP-62|https://cwiki.apache.org/confluence/display/FLINK/FLIP-62%3A+Set+default+restart+delay+for+FixedDelay-+and+FailureRateRestartStrategy+to+1s]).
For example, if your job fails because the data source is overloaded,
frequent restarts only worsen the situation by further increasing the load on
the data source.
{quote}
Given the option, I would go with a backoff algorithm going something like
1,2,4,8,16... seconds which provides both user feedback and some chance for
network recovery.
{quote}
I think waiting for 16s or more would be quite drastic. If a job restarts, the
exception will be visible on the Web UI. However, if we sleep for (1 + 2 + 4 +
8 + 16) seconds on the TaskManager, the only feedback we provide to the user is
through log files. I would be OK with introducing a configurable, short delay
between BlobClient retries (e.g., 1s by default). Note, however, that this change would
[require a
FLIP|https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals#FlinkImprovementProposals-Whatisconsidereda%22majorchange%22thatneedsaFLIP?]
since we would introduce a new public interface. All in all, I think this
issue has low priority at the moment.
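To make the trade-off between the two proposals concrete, here is a minimal sketch in plain Java. It is not Flink code: the class RetryBackoffSketch and its methods delayMs, totalDelayMs, and callWithRetries are hypothetical names invented for illustration, and the BlobClient itself is not touched. It assumes a retry schedule that starts at an initial delay, doubles on every retry, and is capped at a maximum; setting the cap equal to the initial delay gives the fixed 1s delay mentioned above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Flink or its BlobClient. */
final class RetryBackoffSketch {

    /** Delay before retry number {@code retry} (0-based): initial * 2^retry, capped at maxDelayMs. */
    static long delayMs(int retry, long initialDelayMs, long maxDelayMs) {
        if (retry < 0 || initialDelayMs < 0 || maxDelayMs < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        long delay = Math.min(initialDelayMs, maxDelayMs);
        for (int i = 0; i < retry; i++) {
            if (delay > maxDelayMs / 2) {
                return maxDelayMs; // doubling would exceed the cap (also avoids overflow)
            }
            delay *= 2;
        }
        return delay;
    }

    /** Total time spent sleeping if all {@code retries} retries are needed. */
    static long totalDelayMs(int retries, long initialDelayMs, long maxDelayMs) {
        long total = 0;
        for (int retry = 0; retry < retries; retry++) {
            total += delayMs(retry, initialDelayMs, maxDelayMs);
        }
        return total;
    }

    /** Runs the operation, sleeping between attempts that fail with an IOException. */
    static <T> T callWithRetries(
            Callable<T> operation, int maxRetries, long initialDelayMs, long maxDelayMs)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted: surface the last failure
                }
                Thread.sleep(delayMs(attempt, initialDelayMs, maxDelayMs));
            }
        }
    }

    public static void main(String[] args) {
        // Exponential schedule 1, 2, 4, 8, 16 seconds versus a fixed 1 second delay.
        System.out.println("exponential total ms: " + totalDelayMs(5, 1000, 16000));
        System.out.println("fixed total ms:       " + totalDelayMs(5, 1000, 1000));
    }
}
```

With five retries, the doubling schedule sleeps for 31 seconds in total, while the fixed 1s delay sleeps for 5 seconds. Either variant spaces the retries out instead of issuing them within a few milliseconds.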
> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
> Key: FLINK-16468
> URL: https://issues.apache.org/jira/browse/FLINK-16468
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.3, 1.9.2, 1.10.0
> Environment: Ubuntu Linux servers at the latest Ubuntu patch level,
> running the current Java 8 JRE release
> Reporter: Jason Kania
> Priority: Major
> Fix For: 1.11.0
>
>
> In situations where the BlobClient retrieval fails as in the following log,
> rapid retries will exhaust the open sockets. All the retries happen within a
> few milliseconds.
> {noformat}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004 Retrying...
> {noformat}
> The above is output repeatedly until the following error occurs:
> {noformat}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
>     at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
>     at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>     at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>     at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
>     at java.net.Socket.createImpl(Socket.java:478)
>     at java.net.Socket.connect(Socket.java:605)
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
>     ... 8 more
> {noformat}
> The retries should have some form of backoff in this situation to avoid
> flooding the logs and exhausting other resources on the server.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)