Hi everyone,

We're seeing a persistent "HDFS_DELEGATION_TOKEN ... can't be found in cache" 
error when the ApplicationMaster (AM) restarts in YARN application mode with 
Kerberos enabled on Flink 1.18.1. We'd appreciate your help understanding the 
root cause and finding the proper fix.

When the AM crashes and YARN restarts it (a new container attempt), the new 
attempt fails immediately with exitCode: -1000 and the following error message:

```
Application application_1732503259026_5493 failed 2 times in previous 10000 
milliseconds
due to AM Container for appattempt_..._000002 exited with exitCode: -1000
Diagnostics: token (token for streaming: HDFS_DELEGATION_TOKEN
owner=streaming/[email protected], renewer=, realuser=,
issueDate=1782377318282, maxDate=1845449318282,
sequenceNumber=426567, masterKeyId=1321) can't be found in cache
```

Through code tracing, we believe the root cause is that the submission-context 
token (T0) expires and is evicted from the NameNode's token cache, while 
Flink's internal token (T1) lives only in the AM process's UGI memory and is 
lost when the process restarts.
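
As a sanity check on this theory, here is a small sketch that decodes the 
issueDate and maxDate values from the diagnostics above. The 24-hour renew 
interval is an assumption (we believe it is the stock default for 
dfs.namenode.delegation.token.renew-interval, but our cluster's value may 
differ), and the variable names are ours, not anything from Flink or Hadoop. 
Since the log shows an empty renewer, the sketch also assumes nobody renews 
the token after it is issued.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the diagnostics above (milliseconds since the Unix epoch).
ISSUE_DATE_MS = 1782377318282
MAX_DATE_MS = 1845449318282

# Assumption: stock default of dfs.namenode.delegation.token.renew-interval.
# The value actually configured on the cluster may differ.
ASSUMED_RENEW_INTERVAL = timedelta(hours=24)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


issued = from_epoch_ms(ISSUE_DATE_MS)
max_date = from_epoch_ms(MAX_DATE_MS)

# Hard upper bound on the token's life, regardless of renewals.
max_lifetime = max_date - issued

# If nothing renews the token, it should expire one renew interval after
# issue (never later than maxDate).
unrenewed_expiry = min(issued + ASSUMED_RENEW_INTERVAL, max_date)

print("issued:            ", issued.isoformat())
print("maxDate:           ", max_date.isoformat())
print("max lifetime (days):", max_lifetime.days)
print("expiry if unrenewed:", unrenewed_expiry.isoformat())
```

The gap between issueDate and maxDate works out to 730 days, so the max 
lifetime itself does not look like what we are hitting. If the assumption 
above holds, an unrenewed T0 would expire roughly one renew interval after 
issue, which would match an AM restart failing once the job has been running 
longer than that.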

We set security.kerberos.token.provider.hadoopfs.renewer: yarn in 
flink-conf.yaml, expecting the YARN RM's DelegationTokenRenewer to keep T0 
alive at the application level. However, the setting does not take effect.

We also tried the approach from PR 
https://github.com/apache/flink/pull/26101, but it didn't help.

Our question: what is the recommended way to keep the token alive across AM 
restarts in YARN application mode?
Any help with this issue would be much appreciated. Thank you very much.
