Hi Boris,

without looking at the entrypoint in much detail, generally there should not be a race condition there:
* if the taskmanagers cannot connect to the resourcemanager they will retry (per default the timeout is 5 mins)
* if the JobManager does not get enough resources from the ResourceManager it will also wait for the resources/slots to be provided. The timeout there is also 5 minutes, I think. (A sketch of the corresponding flink-conf.yaml settings follows below the quoted mail.)

So, this should actually be pretty robust as long as the Taskmanager containers can eventually reach the Jobmanager.

Could you provide the Taskmanager/JobManager logs for such a failure case?

Cheers,

Konstantin

On Mon, Feb 18, 2019 at 1:07 AM Boris Lublinsky <boris.lublin...@lightbend.com> wrote:

> Following https://github.com/apache/flink/tree/release-1.7/flink-container/docker
> I have created an entry point, which looks as follows:
>
> #!/bin/sh
>
> ################################################################################
> # from https://github.com/apache/flink/blob/release-1.7/flink-container/docker/docker-entrypoint.sh
> # and https://github.com/docker-flink/docker-flink/blob/63b19a904fa8bfd1322f1d59fdb226c82b9186c7/1.7/scala_2.11-alpine/docker-entrypoint.sh
> ################################################################################
>
> # If unspecified, the hostname of the container is taken as the JobManager address
> JOB_MANAGER_RPC_ADDRESS=${JOB_MANAGER_RPC_ADDRESS:-$(hostname -f)}
>
> drop_privs_cmd() {
>     if [ $(id -u) != 0 ]; then
>         # Don't need to drop privs if EUID != 0
>         return
>     elif [ -x /sbin/su-exec ]; then
>         # Alpine
>         echo su-exec flink
>     else
>         # Others
>         echo gosu flink
>     fi
> }
>
> JOB_MANAGER="jobmanager"
> TASK_MANAGER="taskmanager"
>
> CMD="$1"
> shift
>
> if [ "${CMD}" = "help" ]; then
>     echo "Usage: $(basename $0) (${JOB_MANAGER}|${TASK_MANAGER}|help)"
>     exit 0
> elif [ "${CMD}" = "${JOB_MANAGER}" -o "${CMD}" = "${TASK_MANAGER}" ]; then
>     if [ "${CMD}" = "${TASK_MANAGER}" ]; then
>         TASK_MANAGER_NUMBER_OF_TASK_SLOTS=${TASK_MANAGER_NUMBER_OF_TASK_SLOTS:-$(grep -c ^processor /proc/cpuinfo)}
>
>         sed -i -e "s/jobmanager.rpc.address: localhost/jobmanager.rpc.address: ${JOB_MANAGER_RPC_ADDRESS}/g" "$FLINK_HOME/conf/flink-conf.yaml"
>         sed -i -e "s/taskmanager.numberOfTaskSlots: 1/taskmanager.numberOfTaskSlots: $TASK_MANAGER_NUMBER_OF_TASK_SLOTS/g" "$FLINK_HOME/conf/flink-conf.yaml"
>         echo "blob.server.port: 6124" >> "$FLINK_HOME/conf/flink-conf.yaml"
>         echo "query.server.port: 6125" >> "$FLINK_HOME/conf/flink-conf.yaml"
>
>         echo "Starting Task Manager"
>         echo "config file: " && grep '^[^\n#]' "$FLINK_HOME/conf/flink-conf.yaml"
>         exec $(drop_privs_cmd) "$FLINK_HOME/bin/taskmanager.sh" start-foreground
>     else
>         sed -i -e "s/jobmanager.rpc.address: localhost/jobmanager.rpc.address: ${JOB_MANAGER_RPC_ADDRESS}/g" "$FLINK_HOME/conf/flink-conf.yaml"
>         echo "blob.server.port: 6124" >> "$FLINK_HOME/conf/flink-conf.yaml"
>         echo "query.server.port: 6125" >> "$FLINK_HOME/conf/flink-conf.yaml"
>         echo "config file: " && grep '^[^\n#]' "$FLINK_HOME/conf/flink-conf.yaml"
>
>         if [ -z "$1" ]; then
>             exec $(drop_privs_cmd) "$FLINK_HOME/bin/jobmanager.sh" start-foreground "$@"
>         else
>             exec $FLINK_HOME/bin/standalone-job.sh start-foreground "$@"
>         fi
>     fi
> fi
>
> exec "$@"
>
> It does work for all the cases except running a standalone job.
> The problem, the way I understand it, is a race condition.
> In Kubernetes it takes several attempts to establish a connection between the Job and Task managers, while standalone-job.sh tries to start a job immediately once the cluster is created (before the connection is established).
>
> Is there a better option to implement starting a job on container startup?
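For reference, the roughly five-minute timeouts mentioned in the reply above correspond to settings in flink-conf.yaml. A minimal sketch, in the same append style the entrypoint already uses; the key names (taskmanager.registration.timeout, slot.request.timeout) are an assumption for Flink releases around 1.7/1.8, so check the configuration reference for your exact version (older releases use taskmanager.maxRegistrationDuration):

    # Hedged sketch: raise the TaskManager registration and slot-request
    # timeouts beyond their ~5 minute defaults. Key names assumed for
    # Flink 1.7/1.8; values are illustrative only.
    echo "taskmanager.registration.timeout: 10 min" >> "$FLINK_HOME/conf/flink-conf.yaml"
    echo "slot.request.timeout: 600000" >> "$FLINK_HOME/conf/flink-conf.yaml"  # milliseconds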
--
Konstantin Knauf | Solutions Architect
+49 160 91394525
<https://www.ververica.com/>

Follow us @VervericaData
--
Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
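On Boris's closing question about starting a job at container startup: the reply above suggests no explicit gating should be needed, since both sides retry on their own. If one nevertheless wanted to make the ordering explicit in the taskmanager branch of the entrypoint, a minimal, hypothetical sketch (assuming a netcat build that supports -z is present in the image and that jobmanager.rpc.port is the default 6123) could look like this:

    # Hypothetical helper, not part of the original entrypoint: block until
    # the JobManager's RPC port answers before starting the TaskManager.
    # Konstantin's reply indicates Flink's own retries normally make this
    # unnecessary.
    wait_for_jobmanager() {
        until nc -z "${JOB_MANAGER_RPC_ADDRESS}" 6123; do
            echo "Waiting for JobManager at ${JOB_MANAGER_RPC_ADDRESS}:6123 ..."
            sleep 2
        done
    }

It would be called just before the exec of taskmanager.sh in the taskmanager branch of the script above.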