Hello,
We run Flink using the Spotify Flink Kubernetes operator (job cluster
mode). Everything works fine, including upgrades and crash recovery. We
do not run the job manager in HA mode.
One of the problems we have is that upon upgrades (or during testing),
the Flink cluster takes a very long time to start up:
* First the operator needs to create the cluster (JM+TM) and wait for
it to respond to API requests. This already takes a couple of minutes.
* Then the operator creates a job-submitter pod that submits the job
to the cluster. The job is packaged as a fat jar, but it is already
baked into the Docker images we use (so technically there would be no
need to "submit" it from a separate pod). The submission itself is
rather fast, though (the time between the job submitter seeing the
cluster online and the "hello" log from the main program is <1 min).
* Then the application needs to start up and load its state from the
latest savepoint, which again takes a couple of minutes.
All of these steps take quite some time, and we are looking to reduce
the startup time to make testing easier and to cut downtime during
upgrades.
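
To put numbers on the phases above, something like the following rough
sketch could be used. It is not what the operator does; it only polls
the Flink REST endpoint /jobs/overview and records when the API first
answers and when every job reports RUNNING. The base URL
(http://localhost:8081, e.g. via a port-forward to the job manager), the
function names and the polling intervals are my own assumptions. Note
also that the job-level RUNNING status may show up before all tasks have
finished restoring state, so the second number is closer to a lower
bound.

```python
import json
import time
import urllib.request


def fetch_job_states(base_url, timeout=5.0):
    """Return the job states from GET /jobs/overview, or None if the API is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/jobs/overview", timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error or a non-JSON body: treat as "not up yet".
        return None
    return [job.get("state") for job in body.get("jobs", [])]


def measure_startup(fetch, poll_interval=2.0, max_wait=1800.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `fetch` and return seconds until the API answered and until all jobs were RUNNING.

    `fetch` is any zero-argument callable returning a list of job states,
    or None while the REST API is not reachable. A value of None in the
    result means that phase was not reached within `max_wait` seconds.
    """
    start = clock()
    api_ready_s = None
    while clock() - start < max_wait:
        states = fetch()
        if states is not None and api_ready_s is None:
            # First successful response from the job manager.
            api_ready_s = clock() - start
        if states and all(state == "RUNNING" for state in states):
            return {"api_ready_s": api_ready_s, "job_running_s": clock() - start}
        sleep(poll_interval)
    return {"api_ready_s": api_ready_s, "job_running_s": None}


# Example usage (assumed URL, start it right when the upgrade is triggered):
#   result = measure_startup(lambda: fetch_job_states("http://localhost:8081"))
#   print(result)
```

The fetch function is passed in as a parameter so that the timing logic
can be tried out without a running cluster.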
So I have some questions:
* I wonder if the situation is the same for all Kubernetes operators.
I really need some kind of operator, because otherwise I have to
specify which savepoint to load from myself on every startup.
* What cluster startup time is considered acceptable / best
practice?
* If there are other tricks to reduce startup time, I would be very
interested in hearing about them :-)
There is also an ongoing discussion about running Flink on spot nodes. I
guess the startup time is relevant there too.
Thanks in advance,
Frank