Hi 杨光,

Thanks a lot for reporting this and looking into it in such detail!
Your observations are correct: the changes from 1.3.2 to 1.4.0 in the 
YarnTaskManagerRunner cause the local keytab path in the TMs to be set 
incorrectly.

Unfortunately, AFAIK there is no workaround for this in 1.4.0.
Keytabs shipped to TMs live in the working directory of the corresponding YARN 
container, so the correct local path for the keytab cannot be known upfront.
The only scenario in which this would work is if all TM containers happen to be 
on the same NodeManager as the AM container.
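
To illustrate why the path has to be resolved inside each container, here is a 
minimal sketch. It is not the actual YarnTaskManagerRunner code: the class name, 
the method, and the keytab file name are hypothetical, and it assumes the 
container working directory is visible to the process through the PWD 
environment variable.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class KeytabPathSketch {

    // Hypothetical name under which the shipped keytab is localized in a container.
    static final String KEYTAB_FILE_NAME = "krb5.keytab";

    /**
     * Resolves the keytab against the working directory of the container this
     * process runs in, rather than reusing a path that was computed elsewhere.
     * Assumes the working directory is exposed as PWD in the given environment.
     */
    static Path resolveLocalKeytab(Map<String, String> env) {
        String workingDir = env.get("PWD");
        if (workingDir == null || workingDir.isEmpty()) {
            throw new IllegalStateException("Container working directory (PWD) is not set");
        }
        return Paths.get(workingDir, KEYTAB_FILE_NAME);
    }
}
```

Calling resolveLocalKeytab(System.getenv()) in two different containers yields 
two different paths, which is why a single path fixed ahead of time cannot be 
right for every TM.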

@Eron,
This is a recurrence of FLINK-5580 [1], and as you speculated, the TM is 
using the wrong keytab path again because it was not properly set.
I agree that the integration test scenario is best kept out of the main code. 
It actually seems to be the cause of the issue this time as well.
As you can see in [2], the change was only meant to refactor the integration 
test scenario code block, but accidentally affected the keytab path setting.
At the same time, we’ll need better unit test coverage for this, as it 
apparently breaks very easily.

I’ve filed a JIRA for this, with the comments so far included: FLINK-8270 [3].
I will propose it as a blocker for 1.4.1 / 1.5.0.

[1] https://issues.apache.org/jira/browse/FLINK-5580
[2] https://github.com/apache/flink/commit/7f1c23317453859ce3b136b6e13f698d3fee34a1#diff-a81afdf5ce0872836ac6fadb603d483e
[3] https://issues.apache.org/jira/browse/FLINK-8270


On 15 December 2017 at 4:12:24 PM, Tzu-Li (Gordon) Tai (tzuli...@apache.org) 
wrote:

