Hi all,

So I've checked the log, and it seems that the expired delegation token error was triggered during resource localization. Maybe there's something wrong with my Hadoop setup, since the NMs are supposed to receive a valid token from the RM in order to localize resources automatically.
Regards,
Kiên

2020-11-17 10:28:55,972 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: { hdfs://xxxxx:8020/user/xxx/.flink/application_1604481558884_0006/lib/flink-table-blink_2.12-1.11.2.jar, 1604482517793, FILE, null } failed: Got expired delegation token id
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): Got expired delegation token id
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1444)
        at org.apache.hadoop.ipc.Client.call(Client.java:1354)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2020-11-17 10:28:55,973 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e99_1604481558884_0006_04_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED

On Tue, Nov 17, 2020 at 5:33 PM Kien Truong <duckientru...@gmail.com> wrote:

> Hi Yangze,
>
> Thanks for checking.
>
> I'm not using the new application mode, but the old single-job yarn-cluster mode.
>
> I'll try to get some more logs tomorrow.
>
> Regards,
> Kien
>
> On 17 Nov 2020 at 16:37, Yangze Guo <karma...@gmail.com> wrote:
>
> Hi,
>
> There is a login operation in YarnEntrypointUtils.logYarnEnvironmentInformation without the keytab. One suspect is that Flink may access HDFS when it tries to build the PackagedProgram.
>
> Does this issue only happen in the application mode? If so, I would cc @kkloudas.
>
> Best,
> Yangze Guo
>
> On Tue, Nov 17, 2020 at 4:52 PM Yangze Guo <karma...@gmail.com> wrote:
> >
> > Hi,
> >
> > AFAIK, Flink does exclude the HDFS_DELEGATION_TOKEN in the HadoopModule when the user provides the keytab and principal. I'll try to do a deeper investigation to figure out whether there is any HDFS access before the HadoopModule is installed.
> >
> > Best,
> > Yangze Guo
> >
> > On Tue, Nov 17, 2020 at 4:36 PM Kien Truong <duckientru...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Yes, I did. There are also logs about logging in with the keytab successfully in both the Job Manager and the Task Manager.
> > >
> > > I found some YARN docs about token renewal on AM restart:
> > >
> > > > Therefore, to survive AM restart after token expiry, your AM has to get the NMs to localize the keytab or make no HDFS accesses until (somehow) a new token has been passed to them from a client.
> > >
> > > Maybe Flink did access HDFS with an expired token, before switching to use the localized keytab?
> > >
> > > Regards,
> > > Kien
> > >
> > > On 17 Nov 2020 at 15:14, Yangze Guo <karma...@gmail.com> wrote:
> > > >
> > > > Hi, Kien,
> > > >
> > > > Did you configure "security.kerberos.login.principal" and "security.kerberos.login.keytab" together? If you only set the keytab, it will not take effect.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Tue, Nov 17, 2020 at 3:03 PM Kien Truong <duckientru...@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > We are having an issue where the Flink Application Master is unable to automatically restart the Flink job after its delegation token has expired.
> > > > >
> > > > > We are using Flink 1.11 with YARN 3.1.1 in single-job yarn-cluster mode. We have also added a valid keytab configuration, and the TaskManagers are able to log in with keytabs correctly. However, it seems the YARN Application Master still uses delegation tokens instead of the keytab.
> > > > >
> > > > > Any idea how to resolve this would be much appreciated.
> > > > >
> > > > > Thanks,
> > > > > Kien
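For reference, the two settings Yangze mentions look roughly like this in flink-conf.yaml. This is only a minimal sketch; the keytab path and principal below are placeholders, not values taken from this thread:

    # Both keys must be set together; a keytab without a principal will not take effect.
    security.kerberos.login.keytab: /path/to/flink.keytab
    security.kerberos.login.principal: flinkuser@EXAMPLE.COM
    # Optional: also allow login from an existing Kerberos ticket cache.
    security.kerberos.login.use-ticket-cache: true

With both set, Flink can ship the keytab to the YARN containers and log in from it, rather than depending on the HDFS delegation token, which cannot be renewed past its maximum lifetime.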