Figured out the answer, eventually. The magic property name, in this case, is yarn.client.failover-max-attempts (prefixed with spark.hadoop. when passed through Spark, of course). For a full explanation, see the StackOverflow answer I just added: <https://stackoverflow.com/a/60011708/375670>.
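For anyone hitting the same thing, a minimal sketch of the invocation that worked for me: cap the YARN client failover attempts via the spark.hadoop. prefix on spark-submit. The jar path, master/deploy-mode values, and the value of 2 below are just illustrative placeholders; adjust them for your own environment.

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.hadoop.yarn.client.failover-max-attempts=2 \
      --class org.apache.spark.examples.SparkPi \
      /path/to/spark-examples.jar 100

With that in place, the launcher gives up after the configured number of failover attempts instead of the default 30 retries described below.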
On Wed, Jan 22, 2020 at 5:02 PM Jeff Evans <jeffrey.wayne.ev...@gmail.com> wrote:

> Greetings,
>
> Is it possible to limit the number of times the IPC client retries upon a
> spark-submit invocation? For context, see this StackOverflow post
> <https://stackoverflow.com/questions/59863850/how-to-control-the-number-of-hadoop-ipc-retry-attempts-for-a-spark-job-submissio>.
> In essence, I am trying to call spark-submit on a Kerberized cluster,
> without having valid Kerberos tickets available. This is deliberate, and
> I'm not truly facing a Kerberos issue. Rather, this is the
> easiest reproducible case of "long IPC retry" I have been able to trigger.
>
> In this particular case, the following errors are printed (presumably by
> the launcher):
>
> 20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException:
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate
> via:[TOKEN, KERBEROS]; Host Details : local host is:
> "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; ,
> while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over
> null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
>
> This continues for 30 times before the launcher finally gives up.
>
> As indicated in the answer on that StackOverflow post, the relevant Hadoop
> properties should be ipc.client.connect.max.retries and/or
> ipc.client.connect.max.retries.on.sasl. However, in testing on Spark
> 2.4.0 (on CDH 6.1), I am not able to get either of these to take effect (it
> still retries 30 times regardless). I am trying the SparkPi example, and
> specifying them with --conf spark.hadoop.ipc.client.connect.max.retries
> and/or --conf spark.hadoop.ipc.client.connect.max.retries.on.sasl.
>
> Any ideas on what I could be doing wrong, or why I can't get these
> properties to take effect?