[ https://issues.apache.org/jira/browse/FLINK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
slankka reopened FLINK-37378:
-----------------------------

Thanks to [dinchamion (Greg 'Dinchamion' Fazekas) · GitHub|https://github.com/dinchamion]. His PR in the Cloudera repo points out the important renewer configuration.
But it is unfair to close my Jira ticket. I did not even create a PR for Flink; I only provided information for anyone who is facing the same problem.

> Yarn log aggregation fails with Kerberos DT issues
> --------------------------------------------------
>
>                 Key: FLINK-37378
>                 URL: https://issues.apache.org/jira/browse/FLINK-37378
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: slankka
>            Priority: Major
>              Labels: docuentation
>
> Thanks to [~gaborgsomogyi], who created FLINK-28608. We found it helpful
> for solving the log aggregation failure of long-running Flink on Yarn applications.
> So I suggest that the configuration of the token provider renewer should be
> documented.
> The problem is difficult to prove, but there is a way to verify it: shorten the
> HDFS delegation token lifetimes so that the failure shows up within minutes.
> {code:java}
> dfs.namenode.delegation.key.update-interval 86400000 (1 day) # change to 180000 (3 min)
> dfs.namenode.delegation.token.max-lifetime 604800000 (7 days) # change to 360000 (6 min)
> dfs.namenode.delegation.token.renew-interval 86400000 (1 day) # change to 180000 (3 min)
> {code}
>
> Normally, after 7 days (the default), you will find that the Yarn log aggregation status
> is TIMEDOUT.
> It does not matter which Hadoop release is used (we use Apache Hadoop 3.3.6).
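As a quick sanity check on the millisecond values quoted in the settings above, here is a minimal Python sketch that converts them to readable durations. The names `SETTINGS_MS` and `describe` are hypothetical helpers for illustration only; the property names and numbers are the ones from the issue. Note that 360000 ms works out to 6 minutes.

```python
from datetime import timedelta

# Millisecond values quoted in the issue: (default, shortened test value).
SETTINGS_MS = {
    "dfs.namenode.delegation.key.update-interval": (86_400_000, 180_000),
    "dfs.namenode.delegation.token.max-lifetime": (604_800_000, 360_000),
    "dfs.namenode.delegation.token.renew-interval": (86_400_000, 180_000),
}


def describe(ms: int) -> str:
    """Render a millisecond value as a human-readable duration."""
    return str(timedelta(milliseconds=ms))


if __name__ == "__main__":
    for key, (default_ms, test_ms) in SETTINGS_MS.items():
        print(f"{key}: default {describe(default_ms)}, test {describe(test_ms)}")
```

With the shortened values, a token that cannot be renewed should expire a few minutes after it is issued, which makes the failure reproducible without waiting 7 days.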
>
> *How did we find the problem?*
> Example of the token log line when log aggregation succeeds (Flink-1.13.0):
> {code:java}
> token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.12.13...@aaa.bbb.com, renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368
> {code}
> Example when it fails (Flink-1.17.0); note the empty renewer:
> {code:java}
> token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.2.100....@aaa.bbb.com, renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508
> {code}
>
> *Solution we found*
> If Flink is deployed on Yarn, this configuration is important to keep Yarn log
> aggregation working when a Flink job terminates (FAILED, FINISHED, KILLED) after
> it has been running for 7 days or more.
> It is not configured by default. If Flink runs for 7 days without this setting,
> Yarn log aggregation fails.
> {code:java}
> # since Flink-1.16
> security.kerberos.token.provider.%s.renewer
> # if deployed on Yarn
> security.kerberos.token.provider.hadoopfs.renewer: yarn
> {code}
>
> BTW, we also found that [dinchamion (Greg 'Dinchamion' Fazekas) ·
> GitHub|https://github.com/dinchamion] (not me) at Cloudera pointed out the
> importance of this in the link below, but he has not created a pull request yet.
> Proof link:
> [https://github.com/cloudera/flink-tutorials/pull/44]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
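The two token log lines quoted in the issue can be compared with a short script. The following is a minimal Python sketch, assuming the log line keeps the `key=value, key=value` layout shown in the issue; `parse_token_line` is a hypothetical helper, not part of Flink or Hadoop. It extracts the renewer and computes the token lifetime as `maxDate - issueDate`.

```python
import re
from datetime import timedelta

# Log lines copied verbatim from the issue (addresses are elided in the source).
OK_LINE = (
    "token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.12.13...@aaa.bbb.com, "
    "renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368"
)
FAILED_LINE = (
    "token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.2.100....@aaa.bbb.com, "
    "renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508"
)


def parse_token_line(line: str) -> dict:
    """Extract the renewer and the lifetime (maxDate - issueDate) from a token log line."""
    # Collect every key=value pair; a value may be empty (e.g. "renewer=").
    fields = dict(re.findall(r"(\w+)=([^,\s]*)", line))
    lifetime_ms = int(fields["maxDate"]) - int(fields["issueDate"])
    return {
        "renewer": fields.get("renewer", ""),
        "lifetime": timedelta(milliseconds=lifetime_ms),
    }


if __name__ == "__main__":
    for label, line in (("ok", OK_LINE), ("failed", FAILED_LINE)):
        info = parse_token_line(line)
        renewer = info["renewer"] or "<empty>"
        print(f"{label}: renewer={renewer}, lifetime={info['lifetime']}")
```

For the successful line this gives a renewer of `yarn` and a lifetime of 7 days; for the failing line the renewer is empty and the lifetime is 6 minutes, which matches the shortened `max-lifetime` of 360000 ms used for the test.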