mbalassi commented on pull request #18144: URL: https://github.com/apache/flink/pull/18144#issuecomment-999855841
Hi @viirya, @gyfora and I have investigated the issue and we could reproduce the following [failure](https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=28340&view=logs&j=81be5d54-0dc6-5130-d390-233dd2956037&t=81e697a1-afb6-56e2-7d6c-47095b046a9f&l=3329) locally:
```
Dec 17 21:52:24 2021-12-17 21:52:07,419 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Could not create application program.
Dec 17 21:52:24 java.lang.IllegalArgumentException: Only "local" is supported as schema for application mode. This assumes that the jar is located in the image, not the Flink client. An example of such path is: local:///opt/flink/examples/streaming/WindowJoin.jar
Dec 17 21:52:24 	at org.apache.flink.kubernetes.utils.KubernetesUtils.lambda$checkJarFileForApplicationMode$2(KubernetesUtils.java:386) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
```
This happens in the Kubernetes native HA e2e [test](https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/test_kubernetes_application_ha.sh#L52-L74) after the forced restart of the jobmanager pod. Note that the job initially starts fine (and the non-HA test passes successfully), but recovery from a jobmanager pod failure is broken after your change. On restore, the new jobmanager reads the `flink-conf.yaml` from the relevant configmap, in the case of this test from `flink-config-flink-native-k8s-application-ha-1`.
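To illustrate why a `file://` value in `pipeline.jars` trips this up, here is a minimal, hypothetical sketch of the kind of scheme validation behind `KubernetesUtils.checkJarFileForApplicationMode` (not Flink's actual implementation; class and method names are made up):

```java
import java.net.URI;

// Hypothetical sketch: application mode assumes the job jar is baked into
// the image, so any scheme other than "local" is rejected at startup.
public class JarSchemeCheck {

    static void checkJarForApplicationMode(String jarUri) {
        String scheme = URI.create(jarUri).getScheme();
        if (!"local".equals(scheme)) {
            throw new IllegalArgumentException(
                    "Only \"local\" is supported as schema for application mode. "
                            + "This assumes that the jar is located in the image, not the Flink client.");
        }
    }

    public static void main(String[] args) {
        // Accepted: jar shipped inside the image.
        checkJarForApplicationMode("local:///opt/flink/examples/streaming/StateMachineExample.jar");
        try {
            // Rejected: this is the value the taskmanager wrote back into the configmap.
            checkJarForApplicationMode("file:///opt/flink/examples/streaming/StateMachineExample.jar");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected file:// jar as expected");
        }
    }
}
```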
The configmap initially contained the following line recording the location of the submitted job jar:
```
pipeline.jars: local:///opt/flink/examples/streaming/StateMachineExample.jar
```
However, after your change the taskmanager pod [overwrites](https://github.com/apache/flink/pull/18144/commits/acb7da45bd3c00778c82e3a9ad5a6cb8a3e0020a#diff-8c94eb0faebcdc0528869b32be042da7b30866e71c6279888f80f5b44cac661cR160) the configmap, which until now was only created by the jobmanager, with the following value, leading to the failure:
```
pipeline.jars: file:///opt/flink/examples/streaming/StateMachineExample.jar
```
The `file` versus `local` scheme would be trivial to code around, but investigating this broken test revealed a conceptual issue with your proposal: after your change, the accompanying resources always get overwritten by the taskmanager pod, which leads to unintended consequences. To my knowledge, Flink has always assumed that the jobmanager and taskmanagers share the same Hadoop and Kerberos configuration; letting these diverge can lead to gnarly bugs.

If we understand the original [issue description](https://issues.apache.org/jira/browse/FLINK-24674) correctly, you had a case where the Hadoop config was present in the taskmanager Docker image but not in the jobmanager one. We feel that this is a corner case that does not merit special coverage given the above consequences, and we suggest that you use one of the two existing mechanisms to ship your Hadoop config instead:
1. Specify it on the client that is doing the job submission via the `HADOOP_CONF_DIR` environment variable.
2. Deploy a long-lived configmap to the Kubernetes environment and reference it via `kubernetes.hadoop.conf.config-map.name`.

Taking a step back, your Hadoop config is probably environment specific and often differs between your test and production environments, so we would discourage you from baking it into the image itself; rather, keep it as part of the environment via either of the aforementioned methods. Based on this assessment we are advising against merging this change. Thank you for your contribution, let me know if you have any further questions, concerns or suggestions.
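For reference, the configmap-based mechanism for shipping the Hadoop config could look like the following sketch (the configmap name `my-hadoop-conf`, the cluster id, and the local config directory are illustrative assumptions, not values from this PR):

```shell
# Create a long-lived configmap from an existing Hadoop conf directory
# (directory path and configmap name are illustrative).
kubectl create configmap my-hadoop-conf --from-file=/path/to/hadoop/conf

# Reference it when deploying in application mode, so the jobmanager and
# taskmanager pods mount the same Hadoop configuration.
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-cluster \
    -Dkubernetes.hadoop.conf.config-map.name=my-hadoop-conf \
    local:///opt/flink/examples/streaming/StateMachineExample.jar
```

This keeps the Hadoop configuration in the environment rather than the image, so test and production clusters can reference different configmaps with the same image.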