Hi everyone,

We're encountering a persistent `HDFS_DELEGATION_TOKEN ... can't be found in cache` error when the ApplicationMaster restarts in YARN application mode with Kerberos enabled on Flink 1.18.1. We'd appreciate your help understanding the root cause and finding the proper fix.
When the AM crashes and YARN restarts it (a new container attempt), the new attempt fails immediately with exitCode: -1000 and the following error message:

```
Application application_1732503259026_5493 failed 2 times in previous 10000 milliseconds due to AM Container for appattempt_..._000002 exited with exitCode: -1000
Diagnostics: token (token for streaming: HDFS_DELEGATION_TOKEN owner=streaming/[email protected], renewer=, realuser=, issueDate=1782377318282, maxDate=1845449318282, sequenceNumber=426567, masterKeyId=1321) can't be found in cache
```

Through code tracing, we believe the root cause is the following: the submission-context token (T0) expires and is evicted from the NameNode's token cache, while Flink's internal token (T1) lives only in the AM process's UGI memory and is lost when the process restarts. The new attempt is therefore left with only the expired T0.

What we have tried so far:

- We set `security.kerberos.token.provider.hadoopfs.renewer: yarn` in flink-conf.yaml, expecting the YARN RM's DelegationTokenRenewer to keep T0 alive at the application level. However, the config does not seem to take effect (note that the diagnostics above still show an empty `renewer=`).
- We tried the approach from the PR https://github.com/apache/flink/pull/26101, but this didn't help either.

Our question: what is the recommended way to keep the token alive across AM restarts in YARN application mode?

Could anyone help us with this issue? Thank you very much.
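P.S. In case it helps with the diagnosis, here is a small sketch for decoding the `issueDate` and `maxDate` values from the diagnostics above. The class and method names (`TokenLifetime`, `lifetime`) are just illustrative and are not part of Flink or Hadoop; only the two timestamps come from the log.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative helper (hypothetical name): decodes the epoch-millisecond
// fields that appear in the HDFS_DELEGATION_TOKEN diagnostics.
public class TokenLifetime {

    // Time between the token's issue date and its maximum lifetime date.
    static Duration lifetime(long issueDateMs, long maxDateMs) {
        return Duration.between(
                Instant.ofEpochMilli(issueDateMs),
                Instant.ofEpochMilli(maxDateMs));
    }

    public static void main(String[] args) {
        // Values copied from the diagnostics message in this post.
        long issueDate = 1782377318282L;
        long maxDate = 1845449318282L;

        System.out.println("issueDate (UTC): " + Instant.ofEpochMilli(issueDate));
        System.out.println("maxDate   (UTC): " + Instant.ofEpochMilli(maxDate));
        System.out.println("max lifetime in days: " + lifetime(issueDate, maxDate).toDays());
    }
}
```

The difference between the two values is 63,072,000,000 ms, i.e. 730 days, so the token seems to have been far from its maximum lifetime when the restart failed. That would be consistent with (though it does not prove) our suspicion that the token lapsed because nothing renewed it within the renew interval, rather than because it hit `maxDate`.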
