Hi,

Now I see the situation. In short, from Flink's point of view there is no way to refresh the AM Container tokens, since that is YARN's responsibility, so this is a known issue. I reported and discussed this with the YARN guys back when we added the token framework to Spark, but nothing has happened since then.
All in all, the YARN log aggregation scenario I've shared is not something Flink can solve. I can imagine the following options, depending on your possibilities:

* Not advised because it increases the attack surface, but cheap: increase the token max lifetime
* Implement a custom HadoopFSDelegationTokenReceiver which somehow updates the AM tokens on the YARN side outside of Flink (hacky, but viable; see the rough sketch after the quoted mail below)
* Try to turn off all YARN features which depend on tokens, like log aggregation etc.
* Switch to k8s

BR,
G

On Wed, Aug 14, 2024 at 4:55 PM dpp <pengpeng.d...@foxmail.com> wrote:

> Hi,
> I have reviewed your response and retested using the officially released
> version 1.17.2 in the Kerberos scenario, but the same issue still occurred.
>
> In Yarn mode, I simulated the stop of a NodeManager node where the
> JobManager is located, triggering the high availability feature of Flink
> jobs. I set the Flink job to restart 3 times. After 3 retries, the Flink
> JobManager's Yarn container failed to start, with the following exception
> reported:
>
> Diagnostics Info:
> AM Container for appattempt_1723616701630_0003_000003 exited with
> exitCode: -1000
> Failing this attempt. Diagnostics: [2024-08-14 16:12:49.183] token (token
> for mr: HDFS_DELEGATION_TOKEN owner=xxx/x...@xxx.com, renewer=, realUser=,
> issueDate=1723621278534, maxDate=1723621878534, sequenceNumber=130,
> masterKeyId=24) is expired, current time: 2024-08-14 16:12:49,152+0800
> expected renewal time: 2024-08-14 15:46:18,534+0800
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (token for mr: HDFS_DELEGATION_TOKEN owner=xxx/x...@xxx.com, renewer=,
> realUser=, issueDate=1723621278534, maxDate=1723621878534,
> sequenceNumber=130, masterKeyId=24) is expired, current time: 2024-08-14
> 16:12:49,152+0800 expected renewal time: 2024-08-14 15:46:18,534+0800
>
> Although I observed the HDFS delegation token being periodically refreshed
> in the logs of Flink's JobManager and TaskManager, the AM Container did not
> use the latest delegation token to start after the job failed and a retry
> was attempted.
>
> So, it seems to me that the new delegation token framework of Flink has not
> solved the problem of using an expired delegation token when the container
> starts.
>
> I think my situation is somewhat similar to the description in the link
> below, and I am not sure if our new framework has addressed this scenario:
>
> One tricky detail is important. Even if the YARN client sets delegation
> tokens at the initial stage, the AM must re-obtain tokens at startup because
> otherwise an AM restart may fail (we've struggled with this in Spark). Please
> consider the following situation:
>
> - YARN client obtains delegation tokens and sets them on the container
> - AM starts and uses the HDFS token for log aggregation
> - AM dies for whatever reason after the HDFS token max lifetime
>   (typically 7 days)
> - AM restarts with the old token
> - Log aggregation fails immediately because of the expired token
>
> (FLIP-211: Kerberos delegation token framework - Apache Flink - Apache
> Software Foundation
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework>)
>
> You can verify this Kerberos scenario, thank you.
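For the custom receiver option above, here is a minimal, untested sketch in Java. It assumes the DelegationTokenReceiver SPI from FLIP-211 (org.apache.flink.core.security.token.DelegationTokenReceiver with serviceName/init/onNewTokensObtained, as of Flink 1.17; please double-check the package and signatures against your version). The class name, the config key and the whole "push to YARN" part are hypothetical placeholders, because updating the application's stored credentials has to happen outside of Flink:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.security.token.DelegationTokenReceiver;

/** Hypothetical receiver that hands freshly obtained HDFS tokens to an external updater. */
public class ExternalYarnTokenPushReceiver implements DelegationTokenReceiver {

    private Path dropLocation;

    @Override
    public String serviceName() {
        // The service name decides which provider's tokens get delivered to this receiver;
        // "hadoopfs" is the built-in HDFS provider's name. Depending on the Flink version
        // you may have to replace the built-in HadoopFSDelegationTokenReceiver instead of
        // registering a second receiver under the same name.
        return "hadoopfs";
    }

    @Override
    public void init(Configuration configuration) throws Exception {
        // Hypothetical config key: where the refreshed token bytes should be dropped.
        dropLocation = Paths.get(configuration.getString(
                "security.external-token-drop.path", "/tmp/flink-tokens.bin"));
    }

    @Override
    public void onNewTokensObtained(byte[] tokens) throws Exception {
        // The payload should be the serialized Hadoop Credentials the framework distributes.
        // Here we only persist them; an external process would then have to update the
        // application's credentials on the YARN side, which Flink itself cannot do.
        Files.write(dropLocation, tokens);
    }
}

The receiver would be picked up via the usual ServiceLoader mechanism (a META-INF/services/org.apache.flink.core.security.token.DelegationTokenReceiver entry on the classpath). For the cheaper options the knobs are on the Hadoop side rather than in Flink: dfs.namenode.delegation.token.max-lifetime in hdfs-site.xml controls the 7-day maximum, and yarn.log-aggregation-enable=false in yarn-site.xml turns log aggregation off.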
>
>
> ------------------ Original Message ------------------
> From: "Gabor Somogyi" <gabor.g.somo...@gmail.com>
> Sent: Tuesday, August 13, 2024, 11:25 PM
> To: "user" <user@flink.apache.org>
> Subject: Re: Flink jobs failed in Kerberos delegation token
>
> Hi,
>
> As a general suggestion, please use the official releases, since we're not
> able to analyze any kind of custom code with cherry-picks and potentially
> hand-made conflict resolution.
>
> When you state "needs to be restarted due to an exception", what kind of
> exception are you referring to?
> I mean, what kind of file operation is happening? A full stack trace could
> be useful too...
>
> The reason why I'm asking is that there are features which are known not to
> work, like YARN log aggregation, but Flink data processing must work after
> the TM has registered itself at the JM. When the mentioned registration
> happens, the TM receives a set of fresh tokens which must be used for data
> processing.
>
> BR,
> G
>
>
>> From: dpp <pengpeng.d...@foxmail.com>
>> Date: Sat, Aug 10, 2024 at 6:42 AM
>> Subject: Flink jobs failed in Kerberos delegation token
>> To: user <user@flink.apache.org>
>>
>>
>> Hello, I am currently using Flink version 1.15.2 and have encountered an
>> issue with the HDFS delegation token expiring after 7 days in a Kerberos
>> scenario.
>> I have seen the new delegation token framework (
>> https://issues.apache.org/jira/browse/FLINK-21232) and have merged the
>> code commits from 1 to 12 (Sub-Tasks 1-12) in that link into my Flink
>> version 1.15.2.
>> Now it is possible to refresh the delegation token periodically.
>> However, after 7 days, if the JobManager or TaskManager needs to be
>> restarted due to an exception, I found that the Yarn container used to
>> start the JM/TM still uses the HDFS_DELEGATION_TOKEN that was generated
>> the first time the job was submitted. And it also reports an error similar
>> to 'token (HDFS_DELEGATION_TOKEN token 31615466 for xx) can't be found in
>> cache'.
>> So, the new delegation token framework did not take effect. I'm using the
>> default setup of Flink and delegation tokens are not managed elsewhere.
>> Could anyone help me with this issue? Thank you very much.
>