Hi all,

I've checked the log, and it seems the expired delegation token error was
triggered during resource localization.
Maybe there's something wrong with my Hadoop setup; the NMs are supposed to
get a valid token from the RM in order to localize resources automatically.

Regards,
Kiên

2020-11-17 10:28:55,972 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> {
> hdfs://xxxxx:8020/user/xxx/.flink/application_1604481558884_0006/lib/flink-table
> -blink_2.12-1.11.2.jar, 1604482517793, FILE, null } failed: Got expired
> delegation token id
> org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
> Got expired delegation token id
>         at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1444)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1354)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>         at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
>         at
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at
> org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
>         at
> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
>         at
> org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
>         at
> org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
>         at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> 2020-11-17 10:28:55,973 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
> Container container_e99_1604481558884_0006_04_000001 transitioned from
> LOCALIZING to LOCALIZATION_FAILED
>


On Tue, Nov 17, 2020 at 5:33 PM Kien Truong <duckientru...@gmail.com> wrote:

> Hi Yangze,
>
> Thanks for checking.
>
> I'm not using the new application mode, but the old single-job
> yarn-cluster mode.
>
> I'll try to get some more logs tomorrow.
>
> Regards,
> Kien
>
> On 17 Nov 2020 at 16:37, Yangze Guo <karma...@gmail.com> wrote:
>
> Hi,
>
> There is a login operation in
> YarnEntrypointUtils.logYarnEnvironmentInformation without the keytab.
> One suspicion is that Flink may access HDFS when it tries to build
> the PackagedProgram.
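>
> For context, a keytab-based login in Hadoop normally goes through
> UserGroupInformation before any HDFS access. A minimal sketch of that
> pattern (this is not Flink's actual code, and the principal and keytab
> path are placeholders):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.security.UserGroupInformation;
>
>     Configuration conf = new Configuration();
>     conf.set("hadoop.security.authentication", "kerberos");
>     UserGroupInformation.setConfiguration(conf);
>     // Any HDFS access made before this login falls back to the
>     // delegation tokens shipped by YARN, which may have expired.
>     UserGroupInformation.loginUserFromKeytab(
>         "flink/host@EXAMPLE.COM", "/path/to/flink.keytab");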
>
> Does this issue only happen in the application mode? If so, I would cc
> @kkloudas.
>
> Best,
> Yangze Guo
>
> On Tue, Nov 17, 2020 at 4:52 PM Yangze Guo <karma...@gmail.com> wrote:
> >
> > Hi,
> >
> > AFAIK, Flink does exclude the HDFS_DELEGATION_TOKEN in the
> > HadoopModule when the user provides the keytab and principal. I'll
> > do a deeper investigation to figure out whether there is any HDFS
> > access before the HadoopModule is installed.
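> >
> > As a rough illustration (not Flink's actual code), excluding that
> > token from a Hadoop Credentials object named `credentials` would
> > look roughly like:
> >
> >     import org.apache.hadoop.io.Text;
> >     import org.apache.hadoop.security.Credentials;
> >     import org.apache.hadoop.security.token.Token;
> >
> >     Credentials filtered = new Credentials();
> >     for (Token<?> token : credentials.getAllTokens()) {
> >         // Keep everything except the HDFS delegation token, so HDFS
> >         // access uses the keytab login instead of a token that can
> >         // expire.
> >         if (!token.getKind().equals(new Text("HDFS_DELEGATION_TOKEN"))) {
> >             filtered.addToken(token.getService(), token);
> >         }
> >     }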
> >
> > Best,
> > Yangze Guo
> >
> >
> > On Tue, Nov 17, 2020 at 4:36 PM Kien Truong <duckientru...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Yes, I did. There are also logs showing successful keytab logins in
> > > both the Job Manager and the Task Manager.
> > >
> > > I found some YARN docs about token renewal on AM restart:
> > >
> > > > Therefore, to survive AM restart after token expiry, your AM has to get 
> > > > the NMs to localize the keytab or make no HDFS accesses until (somehow) 
> > > > a new token has been passed to them from a client.
> > >
> > > Maybe Flink did access HDFS with an expired token before switching
> > > to the localized keytab?
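> > >
> > > If so, an explicit re-login from the localized keytab before the
> > > first HDFS access should avoid it. A hedged sketch with the standard
> > > Hadoop API (assuming the process was logged in from a keytab):
> > >
> > >     import org.apache.hadoop.security.UserGroupInformation;
> > >
> > >     // Renews the Kerberos TGT from the keytab if it is close to
> > >     // expiring; a no-op when the login was not keytab-based.
> > >     UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();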
> > >
> > > Regards,
> > > Kien
> > >
> > > On 17 Nov 2020 at 15:14, Yangze Guo <karma...@gmail.com> wrote:
> > >
> > > Hi, Kien,
> > >
> > > Did you configure "security.kerberos.login.principal" and
> > > "security.kerberos.login.keytab" together? If you only set the
> > > keytab, it will not take effect.
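> > >
> > > For example, in flink-conf.yaml the two options are set side by side
> > > (the principal and keytab path below are placeholders, not real
> > > values):
> > >
> > >     security.kerberos.login.keytab: /path/to/flink.keytab
> > >     security.kerberos.login.principal: flink/host@EXAMPLE.COM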
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Nov 17, 2020 at 3:03 PM Kien Truong <duckientru...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We are having an issue where the Flink Application Master is unable
> > > > to automatically restart the Flink job after its delegation token
> > > > has expired.
> > > >
> > > > We are using Flink 1.11 with YARN 3.1.1 in single-job yarn-cluster
> > > > mode. We have also added a valid keytab configuration, and the
> > > > taskmanagers are able to log in with keytabs correctly. However, it
> > > > seems the YARN Application Master still uses delegation tokens
> > > > instead of the keytab.
> > > >
> > > > Any idea how to resolve this would be much appreciated.
> > > >
> > > > Thanks,
> > > > Kien
>
>
