The proxy user normally has Kerberos credentials, and with those it's possible to fetch new HDFS tokens. If you can keep the proxy user's Kerberos credentials up to date (which is not an easy task), then the YARN side can potentially work.
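For reference, the switch dpp asks about below is a ResourceManager setting. A minimal yarn-site.xml sketch; the property name is from the YARN configuration reference, but its exact behavior varies across Hadoop versions and should be verified against your distribution:

```xml
<!-- yarn-site.xml: allow the RM to use the proxy user's privileges
     (e.g. to renew tokens) for long-running applications.
     Verify behavior against your Hadoop version's docs. -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>
```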
G

On Wed, Aug 14, 2024 at 5:42 PM dpp <pengpeng.d...@foxmail.com> wrote:

> Thanks for the quick reply, good buddy.
> Based on the current situation, it seems that the Flink on YARN mode has
> some flaws in the Kerberos scenario, and the YARN team doesn't seem to be
> paying much attention to this issue, which is quite troublesome.
>
> I noticed there is a YARN parameter called
> `yarn.resourcemanager.proxy-user-privileges.enabled` that seems to be
> suitable for long-running applications. Are you familiar with this
> parameter? Can it be used to solve Flink's issue?
>
> ------------------ Original Message ------------------
> From: "Gabor Somogyi" <gabor.g.somo...@gmail.com>
> Sent: Wednesday, August 14, 2024, 11:17 PM
> To: "user" <user@flink.apache.org>
> Subject: Re: Flink jobs failed in Kerberos delegation token
>
> Hi,
>
> Now I see the situation. In short, from Flink's point of view there is no
> way to refresh the AM container's tokens, since that is YARN's
> responsibility, so this is a known issue.
> I reported and discussed this issue with the YARN folks back when we added
> the token framework to Spark, but nothing has happened since then.
>
> All in all, the YARN log aggregation scenario I shared is not something
> Flink can solve.
> I can imagine the following options, depending on your possibilities:
> * Not advised because it increases the attack surface, but cheap: increase
> the token max lifetime
> * Implement a custom HadoopFSDelegationTokenReceiver which somehow updates
> the AM tokens from outside YARN (hacky, but viable)
> * Try to turn off all YARN features that depend on tokens, like log
> aggregation etc.
> * Switch to k8s
>
> BR,
> G
>
> On Wed, Aug 14, 2024 at 4:55 PM dpp <pengpeng.d...@foxmail.com> wrote:
>
>> Hi,
>> I have reviewed your response and retested using the officially released
>> version 1.17.2 in the Kerberos scenario, but the same issue still occurred.
>>
>> In YARN mode, I simulated stopping the NodeManager node where the
>> JobManager is located, triggering the high-availability feature of Flink
>> jobs. I set the Flink job to restart 3 times. After 3 retries, the Flink
>> JobManager's YARN container failed to start, with the following exception
>> reported:
>>
>> Diagnostics Info:
>> AM Container for appattempt_1723616701630_0003_000003 exited with
>> exitCode: -1000
>> Failing this attempt. Diagnostics: [2024-08-14 16:12:49.183] token (token
>> for mr: HDFS_DELEGATION_TOKEN owner=xxx/x...@xxx.com, renewer=, realUser=,
>> issueDate=1723621278534, maxDate=1723621878534, sequenceNumber=130,
>> masterKeyId=24) is expired, current time: 2024-08-14 16:12:49,152+0800
>> expected renewal time: 2024-08-14 15:46:18,534+0800
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>> token (token for mr: HDFS_DELEGATION_TOKEN owner=xxx/x...@xxx.com, renewer=,
>> realUser=, issueDate=1723621278534, maxDate=1723621878534,
>> sequenceNumber=130, masterKeyId=24) is expired, current time: 2024-08-14
>> 16:12:49,152+0800 expected renewal time: 2024-08-14 15:46:18,534+0800
>>
>> Although I observed the HDFS delegation token being periodically
>> refreshed in the logs of Flink's JobManager and TaskManager, the AM
>> container did not use the latest delegation token to start after the job
>> failed and a retry was attempted.
>>
>> So, it seems to me that Flink's new delegation token framework has not
>> solved the problem of an expired delegation token being used when the
>> container starts.
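As an aside, the epoch-millisecond timestamps in the diagnostics above can be decoded to see what the test cluster's token settings actually were. A small sketch, using only the values copied from the log:

```python
from datetime import datetime, timezone, timedelta

# Values copied verbatim from the diagnostics above (ms since the epoch).
issue_date = 1723621278534
max_date = 1723621878534

tz = timezone(timedelta(hours=8))  # the log prints times as +0800
to_local = lambda ms: datetime.fromtimestamp(ms / 1000, tz)

print(to_local(issue_date))  # 2024-08-14 15:41:18.534000+08:00
print(to_local(max_date))    # 2024-08-14 15:51:18.534000+08:00
print((max_date - issue_date) / 60000)  # 10.0 -> max lifetime is 10 minutes
```

So this cluster apparently runs with the delegation token max lifetime shortened to 10 minutes rather than the default 7 days, presumably to reproduce the expiry faster; the logged "expected renewal time" (15:46:18) is issueDate plus 5 minutes, which would be a correspondingly shortened renew interval. The failure mode is the same as with the 7-day defaults.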
>>
>> I think my situation is somewhat similar to the description in the link
>> below, and I am not sure if our new framework has addressed this scenario:
>>
>> One tricky detail is important. Even if the YARN client sets delegation
>> tokens at the initial stage, the AM must re-obtain tokens at startup,
>> because otherwise an AM restart may fail (we've struggled with this in
>> Spark). Please consider the following situation:
>>
>> - The YARN client obtains delegation tokens and sets them on the
>>   container
>> - The AM starts and uses the HDFS token for log aggregation
>> - The AM dies for whatever reason after the HDFS token max lifetime
>>   (typically 7 days)
>> - The AM restarts with the old token
>> - Log aggregation fails immediately because of the expired token
>>
>> (FLIP-211: Kerberos delegation token framework - Apache Flink - Apache
>> Software Foundation
>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework>)
>>
>> You can verify this Kerberos scenario, thank you.
>>
>> ------------------ Original Message ------------------
>> From: "Gabor Somogyi" <gabor.g.somo...@gmail.com>
>> Sent: Tuesday, August 13, 2024, 11:25 PM
>> To: "user" <user@flink.apache.org>
>> Subject: Re: Flink jobs failed in Kerberos delegation token
>>
>> Hi,
>>
>> As a general suggestion, please use the official releases, since we're not
>> able to analyze custom code with cherry-picks and potentially hand-made
>> conflict resolution.
>>
>> When you state "needs to be restarted due to an exception", what kind of
>> exception are you referring to?
>> I mean, what kind of file operation is happening? A full stack trace would
>> be useful too...
>>
>> The reason I'm asking is that there are features which are known not to
>> work, like YARN log aggregation,
>> but Flink data processing must work after the TM has registered itself at
>> the JM.
>> When the mentioned registration happens,
>> the TM receives a set of fresh tokens which must be used for data
>> processing.
>>
>> BR,
>> G
>>
>>> From: dpp <pengpeng.d...@foxmail.com>
>>> Date: Sat, Aug 10, 2024 at 6:42 AM
>>> Subject: Flink jobs failed in Kerberos delegation token
>>> To: user <user@flink.apache.org>
>>>
>>> Hello, I am currently using Flink version 1.15.2 and have encountered an
>>> issue with the HDFS delegation token expiring after 7 days in a Kerberos
>>> scenario.
>>> I have seen the new delegation token framework
>>> (https://issues.apache.org/jira/browse/FLINK-21232) and have merged
>>> the code commits from Sub-Tasks 1-12 in that link into my Flink
>>> version 1.15.2.
>>> Now it is possible to refresh the delegation token periodically.
>>> However, after 7 days, if the JobManager or TaskManager needs to be
>>> restarted due to an exception, I found that the YARN container used to
>>> start the JM/TM still uses the HDFS delegation token that was generated
>>> the first time the job was submitted. It also reports an error similar
>>> to 'token (HDFS_DELEGATION_TOKEN token 31615466 for xx) can't be found in
>>> cache'.
>>> So the new delegation token framework did not take effect. I'm using
>>> Flink's default setup, and delegation tokens are not managed elsewhere.
>>> Could anyone help me with this issue? Thank you very much.
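For completeness, the "increase the token max lifetime" option Gabor lists is an HDFS NameNode setting. A sketch of the relevant hdfs-site.xml properties; the names are from the HDFS configuration reference, and the defaults shown (24 hours / 7 days) should be verified against your Hadoop version:

```xml
<!-- hdfs-site.xml (NameNode): delegation token lifetimes, in milliseconds.
     Raising the max lifetime widens the attack surface, as noted above. -->
<property>
  <name>dfs.namenode.delegation.token.renew-interval</name>
  <value>86400000</value><!-- default: 24 hours -->
</property>
<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>604800000</value><!-- default: 7 days; increase with care -->
</property>
```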