[jira] [Comment Edited] (FLINK-37378) Yarn log aggregation fails with Kerberos DT issues

slankka (Jira) Tue, 25 Feb 2025 05:00:11 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930325#comment-17930325
 ]


slankka edited comment on FLINK-37378 at 2/25/25 12:57 PM:
-----------------------------------------------------------

Thanks to [dinchamion (Greg 'Dinchamion' Fazekas) · 
GitHub|https://github.com/dinchamion]

His PR in the Cloudera repo points out the important renewer configuration.

But it's unfair, [~gaborgsomogyi], to close my Jira ticket. I did not even 
create a PR for Flink.

I am just providing information for anyone who is facing the same problem.


was (Author: adrian z):
Thanks to [dinchamion (Greg 'Dinchamion' Fazekas) · 
GitHub|https://github.com/dinchamion]

His PR in the Cloudera repo points out the important renewer configuration.

But it's unfair to close my Jira ticket. I did not even create a PR for Flink.

I am just providing information for anyone who is facing the same problem.

> Yarn log aggregation fails with Kerberos DT issues
> --------------------------------------------------
>
>                 Key: FLINK-37378
>                 URL: https://issues.apache.org/jira/browse/FLINK-37378
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: slankka
>            Priority: Major
>              Labels: docuentation
>
> Thanks to [~gaborgsomogyi], who created FLINK-28608. We found it helpful 
> for solving log aggregation failures of long-running Flink on Yarn 
> applications. So I suggest that the configuration of the token provider 
> renewer should be documented.
> The problem is difficult to prove, but there is still a way to verify it: 
> shorten the HDFS delegation token settings so that tokens expire within 
> minutes instead of days.
> {code:java}
> dfs.namenode.delegation.key.update-interval 86400000 (1 day)   # change to 180000 (3 min)
> dfs.namenode.delegation.token.max-lifetime 604800000 (7 days)  # change to 360000 (6 min)
> dfs.namenode.delegation.token.renew-interval 86400000 (1 day)  # change to 180000 (3 min)
> {code}
>  
> With the default settings, after 7 days you will find that the Yarn log 
> aggregation status is TIMEDOUT.
> This does not depend on the Hadoop release in use (ours is Apache Hadoop 
> 3.3.6).
>  
> *How did we find the problem?*
> Example log from a successful log aggregation (Flink-1.13.0):
> {code:java}
> token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.12.13...@aaa.bbb.com, renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368
> {code}
> Example log from a failed log aggregation (Flink-1.17.0):
> {code:java}
> token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.2.100....@aaa.bbb.com, renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508
> {code}
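The two log lines above can be compared mechanically. The following is a minimal sketch, not part of the original report: the helper name `summarize_token` and the regular expression are assumptions made for illustration, and the owner addresses stay elided exactly as in the issue. It extracts the renewer and computes the token lifetime as `maxDate - issueDate`.

```python
import re
from datetime import timedelta

# Log lines copied from the issue; owner addresses are elided there and stay elided here.
SUCCESS_LOG = (
    "token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.12.13...@aaa.bbb.com, "
    "renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368"
)
FAILED_LOG = (
    "token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.2.100....@aaa.bbb.com, "
    "renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508"
)


def summarize_token(log_line):
    """Hypothetical helper: return (renewer, lifetime) parsed from one log line."""
    # Collect every key=value pair; a value ends at the next comma or whitespace.
    fields = dict(re.findall(r"(\w+)=([^,\s]*)", log_line))
    # Both dates are epoch milliseconds, so their difference is the token lifetime.
    lifetime_ms = int(fields["maxDate"]) - int(fields["issueDate"])
    return fields["renewer"], timedelta(milliseconds=lifetime_ms)


for name, line in (("success", SUCCESS_LOG), ("failed", FAILED_LOG)):
    renewer, lifetime = summarize_token(line)
    print(f"{name}: renewer={renewer!r}, lifetime={lifetime}")
```

The successful token carries `renewer=yarn` and a 7-day lifetime, matching the default `dfs.namenode.delegation.token.max-lifetime`. The failed token has an empty renewer and a lifetime of 360000 ms (6 minutes), which matches the shortened test setting.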
>  
> *Solution we found*
> If Flink is deployed on Yarn, this configuration is important to keep Yarn 
> log aggregation working when a Flink job terminates (FAILED, FINISHED, 
> KILLED) 7 days or more after it started.
> It is not configured by default. If Flink runs for 7 days without this 
> configuration, Yarn log aggregation fails.
> {code:java}
> # since Flink-1.16
> security.kerberos.token.provider.%s.renewer
> # if deployed on Yarn
> security.kerberos.token.provider.hadoopfs.renewer: yarn {code}
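A quick way to check whether a deployment already sets this key is sketched below. This is an illustration under stated assumptions, not a Flink API: `find_renewer` is a hypothetical helper, and it assumes the configuration is available as flat `key: value` text in the style of `flink-conf.yaml` (a nested YAML layout would need a real YAML parser instead).

```python
# Key taken from the issue text above.
RENEWER_KEY = "security.kerberos.token.provider.hadoopfs.renewer"


def find_renewer(conf_text):
    """Hypothetical helper: return the configured renewer, or None if unset."""
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() == RENEWER_KEY:
            # An empty value counts as "not configured".
            return value.strip() or None
    return None


sample_conf = """
jobmanager.rpc.port: 6123
security.kerberos.token.provider.hadoopfs.renewer: yarn
"""
print(find_renewer(sample_conf))
```

If the helper returns `None` for a long-running Flink on Yarn job, the job is in the situation described above, where the token is issued with an empty renewer.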
>  
> BTW, we also found that [dinchamion (Greg 'Dinchamion' Fazekas) · 
> GitHub|https://github.com/dinchamion] (not me) at Cloudera points out the 
> importance of this in the link below, but he has not created a pull request 
> yet.
> Proof link:
> [https://github.com/cloudera/flink-tutorials/pull/44]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
