Hi Tzanko,

To make the container entrypoint work properly with HA, we need to fix the JobID (see https://issues.apache.org/jira/browse/FLINK-10291). At the moment, we generate a new JobID on every restart of the cluster entrypoint container. As a result, the system cannot find the existing checkpoints.
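
To illustrate why a regenerated JobID loses the checkpoints, here is a minimal Python sketch. It is not Flink code: `checkpoint_store`, `start_job`, and `complete_checkpoint` are hypothetical names, and the assumption that checkpoints are looked up by job ID is a simplification for illustration only.

```python
import uuid

# Hypothetical stand-in for the HA store: latest checkpoint keyed by job ID.
checkpoint_store = {}


def start_job(job_id=None):
    """Start (or restart) a job and return (job_id, restored_checkpoint).

    If no job_id is given, a fresh random one is generated, which mimics
    an entrypoint that creates a new ID on every restart.
    """
    if job_id is None:
        job_id = uuid.uuid4().hex
    # Recovery can only find a checkpoint stored under the same job ID.
    return job_id, checkpoint_store.get(job_id)


def complete_checkpoint(job_id, checkpoint):
    """Record the latest completed checkpoint for the given job ID."""
    checkpoint_store[job_id] = checkpoint


# Random ID per start: the restarted job cannot see the earlier checkpoint.
first_id, _ = start_job()
complete_checkpoint(first_id, "chk-1")
_, restored_random = start_job()

# Fixed ID (an arbitrary illustrative value): the restart finds the checkpoint.
FIXED_ID = "0" * 32
start_job(FIXED_ID)
complete_checkpoint(FIXED_ID, "chk-1")
_, restored_fixed = start_job(FIXED_ID)
```

In this sketch `restored_random` is `None` while `restored_fixed` is `"chk-1"`, which mirrors the behaviour described above: with a stable ID the restarted job can locate its earlier state.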
Fixing the JobID is not a big deal, and the fix should be included in the next bug fix release.

Cheers,
Till

On Thu, Sep 20, 2018 at 10:12 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Tzanko,
>
> Maybe Till is more appropriate to answer this question.
>
> Thanks, vino.
>
> Tzanko Matev <tsa...@gmail.com> wrote on Wed, Sep 19, 2018 at 5:47 PM:
>
>> Dear all,
>>
>> I am currently experimenting with a Flink 1.6.0 job cluster. The goal is
>> to run a streaming job on K8s. Right now I am using docker-compose to
>> experiment with the job cluster.
>>
>> I am trying to set up HA with ZooKeeper, but I seem to fail. I have a
>> docker-compose file which contains the following services:
>> - ZooKeeper
>> - Flink job manager
>> - Flink task manager
>>
>> The containers are set up as per the documentation for docker-compose,
>> but I have also set up the necessary HA settings in the conf file. However,
>> when I kill the job manager container and start it again, the job being
>> processed does not recover but always starts from scratch. Instead I get
>> the following error:
>>
>> > ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler -
>> > Could not retrieve the redirect address.
>> >
>> > java.util.concurrent.CompletionException:
>> > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing
>> > token not set: Ignoring message
>> > LocalFencedMessage(8c4887f5c13f6d907d82a55d97ac428f,
>> > LocalRpcInvocation(requestRestAddress(Time))) sent to
>> > akka.tcp://flink@blockprocessor-job-cluster:50000/user/dispatcher
>> > because the fencing token is null.
>>
>> Am I missing something? Is HA implemented for job clusters at all?
>>
>> Best wishes,
>> Tzanko Matev