Re: Flink job server with HA

Boris Lublinsky Mon, 03 Jun 2019 19:49:46 -0700

I am running on Kubernetes (k8s).
The Job Master runs as a Deployment with 1 replica, so simply killing the pod restarts it.


Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/

> On Jun 3, 2019, at 9:46 PM, Xintong Song <tonysong...@gmail.com> wrote:
> 
> So here are my questions:
> 1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
> 2. How do you trigger "restart a Job Master"?
> 
> Thank you~
> Xintong Song
> 
> 
> On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky 
> <boris.lublin...@lightbend.com> wrote:
> Thanks,
> That's what I thought initially.
> The issue is that, because of this, during a restart the cluster does not know 
> which job was running before (that is obtained from the submitted job graph store).
> Because the store is empty, no jobs are restarted and the cluster does not 
> even try to restore from checkpoints.
> I can see that checkpoints are stored correctly, but they are never accessed.
> 
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>> On Jun 3, 2019, at 9:23 PM, Xintong Song <tonysong...@gmail.com> wrote:
>> 
>> Hi Boris,
>> 
>> I think what you described, that putJobGraph is not invoked in a Flink job 
>> cluster, is by design and should not cause job recovery to fail. For a 
>> Flink job cluster, there is only one job graph to execute. Instead of 
>> uploading the job graph to an already running cluster (as in a session 
>> cluster), the job graph in a Flink job cluster is uploaded before the 
>> cluster is started, together with the Flink framework jars. Please refer to 
>> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.
>> 
>> I think we need more information to find the root cause of your problem. For 
>> example, can you explain the detailed steps you perform when you say 
>> "trying to restart a Job Master"?
>> 
>> Thank you~
>> Xintong Song
>> 
>> 
>> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky 
>> <boris.lublin...@lightbend.com> wrote:
>> I am experimenting with the Flink Job server with HA, and I am noticing 
>> that in this case the method putJobGraph in the class 
>> SubmittedJobGraphStore is never invoked. (I can see that it is invoked 
>> in the case of a session cluster when a job is added.)
>> As a result, when I try to restart a Job Master, it finds no running 
>> jobs and does not try to restore them.
>> Am I missing something?
>> 
>>  
>> 
>> Boris Lublinsky
>> FDP Architect
>> boris.lublin...@lightbend.com
>> https://www.lightbend.com/
>
