Thanks for your reply!
What I have seen is that the job terminates when there's intermittent loss
of connectivity with ZooKeeper. This is in fact the most common reason why
our jobs are terminating at this point. Worse, the job is unable to restore from a
checkpoint during some (not all) of these terminations.
Flink should try to pick the latest checkpoint and will only use the
savepoint if no newer checkpoint could be found.
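For context, a minimal sketch of the HA settings this recovery behaviour relies on; the ZooKeeper quorum, paths, and cluster-id below are illustrative placeholders, not values from this thread:

# flink-conf.yaml (sketch, placeholder values)
high-availability: zookeeper
# ZooKeeper ensemble that stores the pointers to completed checkpoints
high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
# Durable storage for HA metadata; the reference to the latest completed
# checkpoint lives here, which is why a restarted JobManager can prefer it
# over the savepoint passed on the command line
high-availability.storageDir: s3://my-bucket/flink/ha
# Keep this stable across JobManager restarts so the new JobManager looks
# up the same ZooKeeper path as the old one
high-availability.cluster-id: /my-job-cluster
state.checkpoints.dir: s3://my-bucket/flink/checkpoints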
Cheers,
Till
On Wed, Dec 16, 2020 at 10:13 PM vishalovercome wrote:
I'm not sure if this addresses the original concern. For instance consider
this sequence:
1. Job starts from savepoint
2. Job creates a few checkpoints
3. The job manager (just one in Kubernetes) crashes and restarts with the
commands specified in the Kubernetes manifest, which still carries the original
savepoint path (a manifest fragment of this shape is sketched below)
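For illustration, a hypothetical fragment of the kind of manifest step 3 refers to, assuming the flink-container job-cluster entrypoint; the image, class name, and savepoint path are made-up placeholders:

# Sketch of a job-cluster JobManager spec (placeholder values).
# --fromSavepoint is hard-coded in the pod spec, so every JobManager
# restart is handed the same, increasingly stale savepoint path.
apiVersion: batch/v1
kind: Job
metadata:
  name: my-flink-job-jobmanager
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: jobmanager
          image: my-registry/my-flink-job:1.9.0           # placeholder image
          args:
            - job-cluster
            - --job-classname
            - com.example.MyFlinkJob                      # placeholder class
            - --fromSavepoint
            - s3://my-bucket/savepoints/savepoint-abc123  # fixed path from the manifest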
I'm not the original poster, but I'm running into this same issue. What you
just described is exactly what I want. I presume you guys are using some
variant of this Helm chart,
https://github.com/docker-flink/examples/tree/master/helm/flink, to configure
your k8s cluster? I'm also assuming that this cl...
>>> Caused by: [No route to host]
>>> 2019-09-24 17:40:39,006 WARN akka.remote.transport.netty.NettyTransport
>>> - Remote connection to [null] failed with
>>> java.net.NoRouteToHostException: No route to host
>>>
>>> On Fri, Oct 11, 2019 at 9:39 AM Yun Tang wrote:
>>
>>> Hi Hao
>>>
>>> It seems that I misunderstood the background of usage for your case.
>>> High availability configuration targets fault tolerance, not general
>>> development evolution. If you want to change the job topology, just follow
>>> the general rule of restoring from a savepoint/checkpoint; do not rely on HA
>>> to do job migration.
>>
>> Best
>> Yun Tang
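As an aside, the "general rule" described above might look roughly like the sketch below for a job-cluster deployment; the cancel command in the comment, the class name, and the savepoint paths are illustrative placeholders, not values from this thread:

# Sketch of a planned upgrade (placeholder values):
# 1. stop the running job with a savepoint, e.g.
#    bin/flink cancel -s s3://my-bucket/savepoints <jobId>
# 2. point the new deployment at that fresh savepoint explicitly, instead
#    of relying on HA state to carry the job across the topology change
args:
  - job-cluster
  - --job-classname
  - com.example.MyFlinkJob                        # placeholder class
  - --fromSavepoint
  - s3://my-bucket/savepoints/savepoint-20191011  # freshly taken savepoint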
>> ----------
>> *From:* Hao Sun
>> *Sent:* Friday, October 11, 2019 8:33
>> *To:* Yun Tang
>> *Cc:* Vijay Bhaskar; Yang Wang; Sean Hester; Aleksandar Mastilovic; Yuval Itzchakov; user
>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>
>> Yep I know that option.
Best
Yun Tang
From: Vijay Bhaskar
Sent: Thursday, October 10, 2019 19:24
To: Yang Wang
Cc: Sean Hester; Aleksandar Mastilovic; Yun Tang; Hao Sun; Yuval Itzchakov; user
Subject: Re: Challenges Deploying Flink With Savepoints On Kubernetes
Thanks Yang. We will ...
>>>>>>>> If high availability is configured, I think the job
>>>>>>>> could recover as expected both after an unexpected crash and after a
>>>>>>>> restart.
>>>>>>>>
>>>>>>>> @Aleksandar Mastilovic, we would also be happy to see this
>>>>>>>> feature contributed to the community.
>>>>>>>
>>>>>>> [1].
>>>>>>> https://docs.google.com/document/d/1Z-VdJlPPEQoWT1WLm5woM4y0bFep4FrgdJ9ipQuRv8g/edit
>>>>>>>
>>>>>>> Best,
>>>>>>> Yang
>>>>>> On Thursday, September 26, 2019 at 4:11 AM, Aleksandar Mastilovic wrote:
>>>>>>
>>>>>>> Would you guys be interested in a ZooKeeper-less HA? I could ask the
>>>>>>> managers how they feel about open-sourcing the improvement.
>>>>>>
>>>>>> On Sep 25, 2019, at 11:49 AM, Yun Tang wrote:
>>>>>>
>>>>> As Aleksandar said, k8s with an HA configuration could solve your
>>>>> problem. There has already been some discussion about how to implement such
>>>>> HA in k8s if we don't have ZooKeeper; please refer to FLINK-11105 [1] and
>>>>> FLINK-12884 [2].
>>>>> Currently, ZooKeeper is the only high-availability service you can choose.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-11105
>>>> [2] https://issues.apache.org/jira/browse/FLINK-12884
>>>
>>> Best
>>> Yun Tang
>> ----------
>> *From:* Aleksandar Mastilovic
>> *Sent:* Thursday, September 26, 2019 1:57
>> *To:* Sean Hester
>> *Cc:* Hao Sun; Yuval Itzchakov; user
>> *Subject:* Re: Challenges Deploying Flink With Savepoints On Kubernetes
>>
>> Can’t you simply use JobManager in HA mode? It would pick up where it left
>> off if you don’t provide a Savepoint.
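For comparison, a hypothetical fragment of what that could look like in a job-cluster manifest: no --fromSavepoint argument at all, with the HA settings assumed to come from flink-conf.yaml (all names below are placeholders):

# Sketch: rely on HA for unplanned restarts instead of a savepoint path
# baked into the pod spec. The HA keys themselves (high-availability: zookeeper,
# high-availability.storageDir, ...) are assumed to live in flink-conf.yaml,
# e.g. mounted from a ConfigMap.
containers:
  - name: jobmanager
    image: my-registry/my-flink-job:1.9.0   # placeholder image
    args:
      - job-cluster
      - --job-classname
      - com.example.MyFlinkJob              # placeholder class
      # no --fromSavepoint: after a crash the restarted JobManager recovers
      # the latest completed checkpoint through the HA service instead of
      # re-applying an old savepoint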
One of the ways you could do this is to have a separate cluster job manager
program in Kubernetes which actually manages the jobs, so that you can
decouple the job control. While restarting a job, make sure to follow the
steps below:
a) First, the job manager takes a savepoint by killing the job and notes down
the savepoint path ...
thanks for all replies! i'll definitely take a look at the Flink k8s
Operator project.
i'll try to restate the issue to clarify. this issue is specific to
starting a job from a savepoint in job-cluster mode. in these cases the Job
Manager container is configured to run a single Flink job at start-up ...
I think I overlooked it. Good point. I am using Redis to save the path to
my savepoint; I might be able to set a TTL to avoid such an issue.
Hao Sun
On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov wrote:
Hi Hao,
I think he's exactly talking about the usecase where the JM/TM restart and
they come back up from the latest savepoint which might be stale by that
time.
On Tue, 24 Sep 2019, 19:24 Hao Sun wrote:
We always make a savepoint before we shutdown the job-cluster. So the
savepoint is always the latest. When we fix a bug or change the job graph,
it can resume well.
We only use checkpoints for unplanned downtime, e.g. K8S killed JM/TM,
uncaught exception, etc.
Maybe I do not understand your use case.
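For what it's worth, the configuration side of that split might look like the sketch below (the paths are placeholders); the savepoint before a planned shutdown would still be triggered explicitly, e.g. with bin/flink cancel -s:

# flink-conf.yaml (sketch, placeholder paths)
# Savepoints: taken deliberately before a planned shutdown or upgrade
state.savepoints.dir: s3://my-bucket/flink/savepoints
# Checkpoints: cover unplanned downtime (killed pods, uncaught exceptions, ...)
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
# Keep a few completed checkpoints around instead of only the newest one
state.checkpoints.num-retained: 3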
AFAIK there's currently nothing implemented to solve this problem, but a
possible fix could be implemented on top of
https://github.com/lyft/flinkk8soperator, which already has a pretty fancy
state machine for rolling upgrades. I'd love to be involved, as this is an
issue I've been thinking about as well.
hi all--we've run into a gap (knowledge? design? tbd?) for our use cases
when deploying Flink jobs to start from savepoints using the job-cluster
mode in Kubernetes.
we're running ~15 different jobs, all in job-cluster mode, using a mix of
Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine) ...