Kevin Wikant created YARN-11249:
-----------------------------------
Summary: YARN Client Retry Policy inadvertently changed several
years ago
Key: YARN-11249
URL: https://issues.apache.org/jira/browse/YARN-11249
Project: Hadoop YARN
Issue Type: Bug
Components: client
Reporter: Kevin Wikant
A non-HA & non-Federation YARN client should be using
[OtherThanRemoteAndSaslExceptionDependentRetry|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L303]]
(previously named OtherThanRemoteExceptionDependentRetry).
An HA or Federation YARN client should be using
[FailoverOnNetworkExceptionRetry|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L267]].
However, the logic surrounding which Retry Policy gets used (by YARN client)
has been inconsistent/unexpected due to past changes in the code.
[This
PR|[https://github.com/apache/hadoop/commit/bdfad4523f1a5a776e20773c371ce99d0c538ac1]]
introduced what is arguably a bug where FailoverOnNetworkExceptionRetry would
get used for non-HA & non-Federation YARN clusters because of the default
yarn-site configuration
["yarn.federation.failover.enabled=true"|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102]].
This was arguably a bug because "yarn.federation.failover.enabled=true" even
if YARN federation is not enabled at all. The consequence of this change is
that all over the world non-HA & non-Federation YARN clusters using a Hadoop
release from after 2017-09 were unexpectedly using
FailoverOnNetworkExceptionRetry when in reality they should be using
OtherThanRemoteAndSaslExceptionDependentRetry. The only YARN clusters (using a
Hadoop release from after 2017-09) that would not be impacted by the retry
policy change would be those with a custom yarn-site configuration of
"yarn.federation.failover.enabled=false"; I assume this is probably a fairly
small fraction of YARN clusters around the world and so I suspect this retry
policy change affected the majority of clusters around the world using an
affected Hadoop release.
The code was recently corrected to change the Retry Policy used for non-HA &
non-Federation YARN clusters; in 2022-07 this
[change|[https://github.com/apache/hadoop/commit/213ea037589d8a69c34e1f98413ddbc06dfd42bf]]
was committed which causes the correct Retry Policy of
OtherThanRemoteAndSaslExceptionDependentRetry to be used for non-HA &
non-Federation YARN clusters. The change was to return that [federation
failover is not enabled unless federation is
enabled|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java#L70]].
Hadoop clusters based on Hadoop releases created between the following
approximate time ranges will have a different "default" YARN client retry
policy (i.e. "default" retry policy in the sense that you will get it with a
default yarn-site file containing "yarn.federation.failover.enabled=true"):
- before 2017-09 ; YARN uses OtherThanRemoteAndSaslExceptionDependentRetry by
default
- 2017-09 to 2022-07 ; YARN uses FailoverOnNetworkExceptionRetry by default
- 2022-07 into future ; YARN uses
OtherThanRemoteAndSaslExceptionDependentRetry by default
Part of the problem here is that
[FailoverOnNetworkExceptionRetry|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L698]]
is arguably a superior retry policy than
[OtherThanRemoteAndSaslExceptionDependentRetry|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L610]]
because:
- FailoverOnNetworkExceptionRetry will not retry on certain non-retryable
exceptions such as
[InvalidToken|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L729]]
&
[AccessControlException|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L732]]
- OtherThanRemoteAndSaslExceptionDependentRetry will retry for 15 minutes on
certain non-retryable exceptions such as [InvalidToken &
AccessControlException|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L291]]
My concern is that YARN users who were previously seeing their YARN client not
retry on InvalidToken or AccessControlException will now start to see the YARN
client retry for 15 minutes on these exceptions because the Retry Policy for
non-HA & non-Federation YARN client has been reverted back to
OtherThanRemoteAndSaslExceptionDependentRetry.
I see a few options to correct this:
- remove OtherThanRemoteAndSaslExceptionDependentRetry entirely & always use
FailoverOnNetworkExceptionRetry
-- the benefit of this approach is that it reduces lines of code & helps avoid
cases of exception handling code divergence by maintaining multiple different
retry policies
-- I don't see any major risk to doing this because all Hadoop releases since
after 2017-09 have been using FailoverOnNetworkExceptionRetry without the
sysadmins even being aware of it & without any issues reported
- bring OtherThanRemoteAndSaslExceptionDependentRetry to parity with
FailoverOnNetworkExceptionRetry in terms of not retrying on InvalidToken or
AccessControlException
-- downside is that the exception handling in these retry policies may diverge
again in the future
- refactor so that both retry policies share the same exception handling code
but may differ in other areas such as the RetryAction being returned
-- may be as simple as using FailoverOnNetworkExceptionRetry's exception
handling logic in a parent class but to return different RetryAction depending
on subclass (i.e. FAILOVER_AND_RETRY for HA/federation vs RETRY for
non-HA/non-federation)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]