Hello,

We run Flink using the Spotify Flink Kubernetes operator (job cluster mode). Everything works fine, including upgrades and crash recovery. We do not run the JobManager in HA mode.

One of the problems we have is that upon upgrades (or during testing), the Flink cluster takes a very long time to start up:

 * First the operator needs to create the cluster (JM+TM) and wait for
   it to respond to API requests. This already takes a couple of minutes.
 * Then the operator creates a job-submitter pod that submits the job
   to the cluster. The job is packaged as a fat jar, but it is already
   baked into the Docker images we use (so technically there would be no
   need to "submit" it from a separate pod). The submission itself is
   rather fast, though (the time between the job submitter seeing that
   the cluster is online and the "hello" log from the main program is
   <1min).
 * Then the application needs to start up and load its state from the
   latest savepoint, which again takes a couple of minutes.
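
To put numbers on these three phases, something like the rough sketch below could poll the JobManager REST API and record when each phase is first seen. It assumes the REST endpoint is reachable at a hypothetical http://localhost:8081 (for example through a port-forward) and uses the /overview and /jobs/overview endpoints. The function and phase names are made up for illustration, and a job state of RUNNING is only an approximation of "state loaded", since individual tasks may still be restoring.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=2.0):
    """Return the parsed JSON body of url, or None while the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error or non-JSON body: not ready yet.
        return None


def classify(overview, jobs):
    """Map two REST responses to one of the (hypothetical) phase names."""
    if overview is None:
        return "cluster_starting"
    job_list = (jobs or {}).get("jobs", [])
    if not job_list:
        return "waiting_for_submission"
    if any(job.get("state") == "RUNNING" for job in job_list):
        return "job_running"
    return "job_starting"


def measure_startup(base_url, fetch=fetch_json, clock=time.monotonic,
                    sleep=time.sleep, poll_interval=1.0, max_wait=900.0):
    """Poll until a job is RUNNING (or max_wait passes).

    Returns a dict mapping each phase to the seconds elapsed when it was first seen.
    """
    start = clock()
    first_seen = {}
    while True:
        elapsed = clock() - start
        overview = fetch(base_url + "/overview")
        jobs = fetch(base_url + "/jobs/overview") if overview is not None else None
        phase = classify(overview, jobs)
        first_seen.setdefault(phase, elapsed)
        if phase == "job_running" or elapsed >= max_wait:
            return first_seen
        sleep(poll_interval)


if __name__ == "__main__":
    # Assumed address; adjust to wherever the JobManager REST port is exposed.
    print(measure_startup("http://localhost:8081"))
```

Starting the script right before triggering the upgrade would give one timestamp per phase, which makes it easier to see which of the three steps dominates.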

All steps take quite some time, and we are looking to reduce the startup time, both to make testing easier and to have less downtime during upgrades. So I have some questions:

 * I wonder if the situation is the same for all Kubernetes operators.
   I really need some kind of operator, because otherwise I have to
   set which savepoint to load from myself on every startup.
 * What cluster startup time is considered acceptable / best
   practice?
 * If there are other tricks to reduce startup time, I would be very
   interested in knowing them :-)

There is also an ongoing discussion about running Flink on spot nodes. I guess the startup time is relevant there too.

Thanks in advance,
Frank