[ https://issues.apache.org/jira/browse/FLINK-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben La Monica updated FLINK-10278:
----------------------------------
    Fix Version/s: 1.5.3

> Flink in YARN cluster uses wrong path when looking for Kerberos Keytab
> ----------------------------------------------------------------------
>
>                 Key: FLINK-10278
>                 URL: https://issues.apache.org/jira/browse/FLINK-10278
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.2
>            Reporter: Ben La Monica
>            Priority: Major
>             Fix For: 1.5.3
>
>
> While trying to run Flink in a yarn cluster with more than 1 physical
> computer in the cluster, the first task manager will start fine, but the
> second task manager fails to start because it is looking for the kerberos
> keytab in the location that is on the *FIRST* taskmanager. See below log
> lines (unrelated lines removed for clarity):
> {code:java}
> 2018-09-01 23:00:34,322 INFO class=o.a.f.yarn.YarnTaskExecutorRunner thread=main Current working/local Directory: /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005
> 2018-09-01 23:00:34,339 INFO class=o.a.f.r.c.BootstrapTools thread=main Setting directories for temporary files to: /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005
> 2018-09-01 23:00:34,339 INFO class=o.a.f.yarn.YarnTaskExecutorRunner thread=main keytab path: /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_000319/krb5.keytab
> 2018-09-01 23:00:34,339 INFO class=o.a.f.yarn.YarnTaskExecutorRunner thread=main YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
> 2018-09-01 23:00:34,343 ERROR class=o.a.f.yarn.YarnTaskExecutorRunner thread=main YARN TaskManager initialization failed.
> org.apache.flink.configuration.IllegalConfigurationException: Kerberos login configuration is invalid; keytab '/mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_000001/krb5.keytab' does not exist
>     at org.apache.flink.runtime.security.SecurityConfiguration.validate(SecurityConfiguration.java:139)
>     at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:90)
>     at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:71)
>     at org.apache.flink.yarn.YarnTaskExecutorRunner.run(YarnTaskExecutorRunner.java:120)
>     at org.apache.flink.yarn.YarnTaskExecutorRunner.main(YarnTaskExecutorRunner.java:73){code}
>
> You'll notice that the log statement says that the keytab should be located
> in container 000319:
> /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_{color:#14892c}*000319*{color}/krb5.keytab
> But after I changed the code so that it would show the file that it's
> actually checking when doing the SecurityConfiguration init it is actually
> checking container 000001, which is not on the host:
> /mnt/yarn/usercache/hadoop/appcache/application_1535833786616_0005/container_1535833786616_0005_01_{color:#d04437}*000001*{color}/krb5.keytab
> This causes the YARN task managers to restart over and over again (which is
> why we're up to container 319!)
> I'll submit a PR for this fix, though basically it's just moving the
> initialization of the SecurityConfiguration down 2 lines.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
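
The report attributes the failure to the order of two steps: the SecurityConfiguration is initialized (and its keytab path validated) before the configuration has been pointed at the keytab in the local container directory. The following is a minimal, self-contained sketch of that ordering problem, not Flink's actual code. It assumes the local keytab path is written into the configuration shortly after the point where the validation currently runs; the class SecurityConfigSketch, the methods buggyOrder and fixedOrder, and the use of a plain Map as the configuration are all hypothetical stand-ins chosen for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class KeytabOrderingSketch {

    // Config key used for illustration in this sketch.
    static final String KEYTAB_KEY = "security.kerberos.login.keytab";

    /** Hypothetical stand-in: validates the keytab path when constructed. */
    static final class SecurityConfigSketch {
        final String keytab;

        SecurityConfigSketch(Map<String, String> config) {
            this.keytab = config.get(KEYTAB_KEY);
            if (keytab != null && !Files.exists(Paths.get(keytab))) {
                throw new IllegalStateException("keytab '" + keytab + "' does not exist");
            }
        }
    }

    /** Assumed buggy order: validate first, then switch to the local keytab. */
    static SecurityConfigSketch buggyOrder(Map<String, String> config, Path localKeytab) {
        // At this point the config still holds the path from the first container.
        SecurityConfigSketch sc = new SecurityConfigSketch(config);
        config.put(KEYTAB_KEY, localKeytab.toString());
        return sc;
    }

    /** Assumed fixed order: switch to the local keytab, then validate. */
    static SecurityConfigSketch fixedOrder(Map<String, String> config, Path localKeytab) {
        config.put(KEYTAB_KEY, localKeytab.toString());
        return new SecurityConfigSketch(config);
    }

    /** Runs both orderings against a temporary directory and summarizes the result. */
    static String runDemo() {
        try {
            // Stands in for this container's working directory, which holds the keytab.
            Path containerDir = Files.createTempDirectory("container_local");
            Path localKeytab = Files.createFile(containerDir.resolve("krb5.keytab"));
            // Stands in for the first container's path, which does not exist on this host.
            String remoteKeytab =
                    containerDir.resolve("no-such-container").resolve("krb5.keytab").toString();

            String buggy;
            try {
                buggyOrder(configWith(remoteKeytab), localKeytab);
                buggy = "ok";
            } catch (IllegalStateException e) {
                buggy = "failed";
            }

            String fixed;
            try {
                fixedOrder(configWith(remoteKeytab), localKeytab);
                fixed = "ok";
            } catch (IllegalStateException e) {
                fixed = "failed";
            }
            return "buggy=" + buggy + " fixed=" + fixed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Map<String, String> configWith(String keytab) {
        Map<String, String> config = new HashMap<>();
        config.put(KEYTAB_KEY, keytab);
        return config;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

In this sketch the only difference between the two methods is the order of the two statements, which matches the report's description of the fix as moving the initialization down rather than changing what is validated.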