And it works now. My mistake Boris Lublinsky FDP Architect boris.lublin...@lightbend.com https://www.lightbend.com/
> On Jun 3, 2019, at 10:18 PM, Xintong Song <tonysong...@gmail.com> wrote: > > If that is the case, then I would suggest you to check the following two > things: > 1. Is the HA mode configured properly in Flink configuration? There should be > a config option "high-availability" in your flink-conf.yarml. If not > configured, the default value would be "NONE". > 2. It "ClassPathJobGraphRetriever#retrieveJobGraph" actually invoked, and is > there any exceptions thrown from it. This is to verify whether the correct > code path for job cluster is invoked. > > Thank you~ > Xintong Song > > > On Tue, Jun 4, 2019 at 10:48 AM Boris Lublinsky > <boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com>> wrote: > I am running on k8 > Job master runs as a deployment of 1, so just killing a pod restarts it > > Boris Lublinsky > FDP Architect > boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com> > https://www.lightbend.com/ <https://www.lightbend.com/> >> On Jun 3, 2019, at 9:46 PM, Xintong Song <tonysong...@gmail.com >> <mailto:tonysong...@gmail.com>> wrote: >> >> So here are my questions: >> 1. What environment do you run Flink in? Is it locally, on Yarn or Mesos? >> 2. How do you trigger "restart a Job Master"? >> >> Thank you~ >> Xintong Song >> >> >> On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky >> <boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com>> wrote: >> Thanks, >> Thats what I thought initially. >> The issue is that because of this, during restart, it does not know which >> job was running before (it is obtained from submitted job graph store). >> Because this is empty, there is no restarted jobs and the cluster does not >> even try to restore checkpoints. >> I can see that checkpoints are stored correctly, but they are never accessed. >> >> Boris Lublinsky >> FDP Architect >> boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com> >> https://www.lightbend.com/ <https://www.lightbend.com/> >>> On Jun 3, 2019, at 9:23 PM, Xintong Song <tonysong...@gmail.com >>> <mailto:tonysong...@gmail.com>> wrote: >>> >>> Hi Boris, >>> >>> I think what you described that putJobGraph is not invoked in Flink job >>> cluster is by design and should not cause a failure of job recovering. For >>> a Flink job cluster, there is only one job graph to execute. Instead of >>> uploading job graph to an already running cluster (like in a session >>> cluster), the job graph in a Flink job cluster is uploaded before the >>> cluster is started, together with the Flink framework jars. Please refer to >>> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details. >>> >>> I think we need more information to find the root cause of your problem. >>> For example, can you explain what are the detailed operation steps do you >>> perform when you say "trying to restart a Job Master". >>> >>> Thank you~ >>> Xintong Song >>> >>> >>> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky >>> <boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com>> >>> wrote: >>> I am trying to experiment with Flink Job server with HA and I am noticing, >>> that in this case >>> method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I >>> can see that it is invoked in the case of session cluster when a job is >>> added) >>> As a result, when I am trying to restart a Job Master, it finds no running >>> jobs and is not trying to restore it. >>> Am I missing something? >>> >>> >>> >>> Boris Lublinsky >>> FDP Architect >>> boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com> >>> https://www.lightbend.com/ <https://www.lightbend.com/> >> >