[jira] [Commented] (FLINK-37378) Yarn log aggregation fails with Kerberos DT issues

slankka (Jira) Tue, 25 Feb 2025 05:31:29 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-37378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930338#comment-17930338
 ]


slankka commented on FLINK-37378:
---------------------------------

I will verify this on Spark soon.

I also found that Spark has a similar problem, and it is related to you too.

The cause may be that the Driver and the Executors both need renewer=yarn, as 
I pointed out in this ticket.

 

In Spark 3.3.0+ someone changed the renewer to Ugi.currentUser (PRINCIPAL) when 
spark.master is not yarn. The change was made for K8s (the problem happens on 
Yarn or K8s).

[[SPARK-40612][CORE] Fixing the principal used for delegation token renewal on 
non-YARN resource managers by attilapiros · Pull Request #38048 · 
apache/spark|https://github.com/apache/spark/pull/38048]

> Yarn log aggregation fails with Kerberos DT issues
> --------------------------------------------------
>
>                 Key: FLINK-37378
>                 URL: https://issues.apache.org/jira/browse/FLINK-37378
>             Project: Flink
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: slankka
>            Priority: Major
>              Labels: docuentation
>
> Thanks to [~gaborgsomogyi], who created FLINK-28608. We found it helpful for 
> solving log aggregation failures of long-running Flink on Yarn applications. 
> So I suggest that the configuration of the token provider renewer be 
> documented.
> It is difficult to prove, but there is still a way to verify it: shorten the 
> HDFS delegation token settings so that the tokens expire within minutes.
> {code:java}
> # test values in milliseconds; the defaults are given in the comments
> dfs.namenode.delegation.key.update-interval 180000   # 3 min, default 86400000 (1 day)
> dfs.namenode.delegation.token.max-lifetime 360000    # 6 min, default 604800000 (7 days)
> dfs.namenode.delegation.token.renew-interval 180000  # 3 min, default 86400000 (1 day) {code}
>  
> Normally, after 7 days (the default), you will find that the Yarn log 
> aggregation status is TIMEDOUT.
> This happens regardless of the Hadoop release in use (Apache Hadoop 3.3.6 in 
> our case).
>  
> *How did we find the problem?*
> Log example where log aggregation succeeded (Flink-1.13.0):
> {code:java}
> token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.12.13...@aaa.bbb.com, 
> renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368{code}
> The failed example (Flink-1.17.0):
> {code:java}
> token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.2.100....@aaa.bbb.com, 
> renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508 {code}
>  
> *Solution we found*
> If Flink is deployed on Yarn, this configuration is important to keep Yarn log 
> aggregation working when a Flink job terminates (FAILED, FINISHED, KILLED) 
> more than 7 days after it started.
> It is not configured by default. If Flink runs for 7 days without this 
> setting, Yarn log aggregation fails.
> {code:java}
> # since Flink-1.16
> security.kerberos.token.provider.%s.renewer
> # if deployed on Yarn
> security.kerberos.token.provider.hadoopfs.renewer: yarn {code}
>  
> BTW, we also found that [dinchamion (Greg 'Dinchamion' Fazekas) · 
> GitHub|https://github.com/dinchamion] (not me) at Cloudera pointed out the 
> importance of this at the link below, but he has not created a pull request 
> yet.
> Proof link:
> [https://github.com/cloudera/flink-tutorials/pull/44]
>  
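
The two token log lines quoted in the description can be compared mechanically. The following is only a minimal sketch, not part of Flink or Hadoop: the class TokenLogCheck, its regular expression, and the helper names are hypothetical and assume the log line layout shown above (renewer=..., issueDate=..., maxDate=... in that order). It extracts the renewer and computes the token lifetime as maxDate minus issueDate.

```java
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Flink or Hadoop API.
public class TokenLogCheck {

    // Assumes the field order seen in the quoted log lines.
    private static final Pattern FIELDS = Pattern.compile(
            "renewer=([^,]*),.*?issueDate=(\\d+),\\s*maxDate=(\\d+)");

    static final class TokenInfo {
        final String renewer;
        final Duration lifetime;

        TokenInfo(String renewer, Duration lifetime) {
            this.renewer = renewer;
            this.lifetime = lifetime;
        }

        // An empty renewer is what the failing log line shows.
        boolean hasRenewer() {
            return !renewer.isEmpty();
        }
    }

    static TokenInfo parse(String logLine) {
        Matcher m = FIELDS.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a delegation token line: " + logLine);
        }
        long issueDate = Long.parseLong(m.group(2));
        long maxDate = Long.parseLong(m.group(3));
        // Both timestamps are epoch milliseconds, so the difference is the max lifetime.
        return new TokenInfo(m.group(1).trim(), Duration.ofMillis(maxDate - issueDate));
    }

    public static void main(String[] args) {
        String[] lines = {
            "token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.12.13...@aaa.bbb.com, "
                + "renewer=yarn, realUser=, issueDate=1739273095368, maxDate=1739877895368",
            "token for xxxx: HDFS_DELEGATION_TOKEN owner=xxxx/10.2.100....@aaa.bbb.com, "
                + "renewer=, realUser=, issueDate=1739953940508, maxDate=1739954300508"
        };
        for (String line : lines) {
            TokenInfo t = parse(line);
            System.out.println("renewer='" + t.renewer + "' hasRenewer=" + t.hasRenewer()
                    + " lifetime=" + t.lifetime);
        }
    }
}
```

For the first line the lifetime works out to 604800000 ms (7 days) with renewer=yarn. For the second it is 360000 ms (6 minutes) with an empty renewer, which matches the shortened max-lifetime used for the test.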



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
