Hi everyone,

We are beginning to run Flink on K8s and found the basic templates [1]
as well as the example Helm chart [2] very helpful. Also the discussion
about JobManager HA [3] and Patrick's talk [4] were very interesting.
All in all, it is delightful how easily everything can be set up and
how well it works out of the box.

Now we are looking for some best practices as far as job submission is
concerned. Having played with a few alternative options, we would like
to get some input on what other people are using. What we have looked
into so far:

 1. Packaging the job jar into e.g. the JM image and submitting manually
    (either from the UI or via `kubectl exec`; a rough sketch of the
    latter follows the list). Ideally, we would like to establish a more
    automated setup, preferably using native Kubernetes objects.
 2. Building a separate image whose responsibility it is to submit the
    job and keep it running. This could either use the REST API [5] or
    share the Flink config so that CLI calls connect to the existing
    cluster. When this client is scheduled as a Kubernetes deployment
    [6] and, e.g., the node running the client pod fails, the deployment
    starts a replacement pod, which submits the job again, so one ends
    up with duplicate jobs. One could build custom logic (poll whether
    the job exists, only submit if it does not; see the second sketch
    below), but this seems fragile, and it is conceivable that it could
    lead to weird timing issues such as different containers trying to
    submit at the same time. One solution would be to implement an
    atomic submit-if-not-exists, but I suppose this would need to
    involve some level of locking on the JM.
 3. Scheduling the client container from the option above as a
    Kubernetes job [7] (third sketch below). This seems somewhat
    unidiomatic for streaming jobs that are not expected to terminate,
    but one would not have to deal with duplicate Flink jobs. In the
    failure scenario described above, the (Flink) job would still be
    running on the Flink cluster; there just would not be a client
    attached to it (as the Kubernetes job would not be restarted). On
    the other hand, should the (Flink) job fail for some reason, there
    would be no way to restart it automatically.

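To illustrate, the manual submission from option 1 boils down to
something like this (the jar path and pod labels are placeholders for
whatever the image and templates actually use):

    # Look up the JobManager pod and invoke the Flink CLI inside it.
    JM_POD=$(kubectl get pods -l app=flink,component=jobmanager \
        -o jsonpath='{.items[0].metadata.name}')
    # Detached mode (-d) lets the exec call return after submission.
    kubectl exec "$JM_POD" -- flink run -d /opt/flink-job.jar
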
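The poll-then-submit logic from option 2 might look roughly like the
following entrypoint script for the client container (the job name and
jar path are made up, and the container is assumed to share the Flink
config so that the CLI reaches the existing cluster). The race window
between the check and the submission is exactly the fragility
described above:

    #!/bin/sh
    JOB_NAME="my-streaming-job"   # placeholder
    # Only submit if no running (-r) job with this name is listed.
    if flink list -r | grep -q "$JOB_NAME"; then
        echo "Job already running, not submitting."
    else
        flink run -d /opt/flink-job.jar
    fi
    # Not atomic: two containers passing the check at the same time
    # would still both submit.
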
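And a one-shot submission via a Kubernetes job, as in option 3, could
be created e.g. with `kubectl run` (the image name is made up; with
--restart=OnFailure, kubectl creates a Job object [7]):

    # Creates a batch Job; the client runs `flink run` without -d,
    # so it stays attached for the lifetime of the streaming job.
    kubectl run flink-job-client \
        --image=example.com/flink-job-client:latest \
        --restart=OnFailure \
        --command -- flink run /opt/flink-job.jar
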
Are we missing something obvious? Has the Flink community come up with
a default way of submitting Flink jobs on Kubernetes yet, or are there
people willing to share their experiences?

Best regards and happy holidays,
Max

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/kubernetes.html
[2] https://github.com/docker-flink/examples/tree/master/helm/flink
[3] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-HA-with-Kubernetes-without-Zookeeper-td15033.html
[4] https://www.youtube.com/watch?v=w721NI-mtAA (slides: https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-patrick-lucas-flink-in-containerland)
[5] https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html#submitting-programs
[6] https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
[7] https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
-- 
Maximilian Bode * maximilian.b...@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082