Great, my team is eager to get started. I’m curious what progress has been made so far?
-H

> On Feb 26, 2019, at 14:43, Chunhui Shi <c...@apache.org> wrote:
>
> Hi Heath and Till, thanks for offering help on reviewing this feature. I just reassigned the JIRAs to myself after an offline discussion with Jin. Let us work together to get Kubernetes integrated natively with Flink. Thanks.
>
>> On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann <trohrm...@apache.org> wrote:
>> Alright, I'll get back to you once the PRs are open. Thanks a lot for your help :-)
>>
>> Cheers,
>> Till
>>
>>> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton <halbr...@harm.org> wrote:
>>> My team and I are keen to help out with testing and review as soon as there is a pull request.
>>>
>>> -H
>>>
>>>> On Feb 11, 2019, at 00:26, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>
>>>> Hi Heath,
>>>>
>>>> I just learned that people from Alibaba have already made some good progress with FLINK-9953. I'm currently talking to them in order to see how we can merge this contribution into Flink as fast as possible. Since I'm quite busy due to the upcoming release, I hope that other community members will help out with the reviewing once the PRs are opened.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton <halbr...@harm.org> wrote:
>>>>> Has any progress been made on this? There are a number of folks in the community looking to help out.
>>>>>
>>>>> -H
>>>>>
>>>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>> >
>>>>> > Hi Derek,
>>>>> >
>>>>> > There is this issue [1] which tracks the active Kubernetes integration. Jin Sun has already started implementing some parts of it. There should also be some PRs open for it. Please check them out.
>>>>> >
>>>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>>>> >
>>>>> > Cheers,
>>>>> > Till
>>>>> >
>>>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <derekver...@gmail.com> wrote:
>>>>> >>
>>>>> >> Sounds good.
>>>>> >>
>>>>> >> Is someone working on this automation today?
>>>>> >>
>>>>> >> If not, although my time is tight, I may be able to work on a PR to get us started down the path of Kubernetes native cluster mode.
>>>>> >>
>>>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>>>> >>
>>>>> >> Hi Derek,
>>>>> >>
>>>>> >> What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next, you simply need to respawn the job cluster and provide it with the savepoint to resume from.
>>>>> >>
>>>>> >> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Till
>>>>> >>
>>>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com> wrote:
>>>>> >>>
>>>>> >>> Hi Derek,
>>>>> >>>
>>>>> >>> I think your automation steps look good. Recreating deployments should not take long, and as you mention, this way you can avoid unpredictable old/new version collisions.
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Andrey
>>>>> >>>
>>>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>> >>> >
>>>>> >>> > Hi Derek,
>>>>> >>> >
>>>>> >>> > I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.
>>>>> >>> >
>>>>> >>> > As for automating a similar process, I would recommend having a look at dA Platform [1], which is built on top of Kubernetes.
>>>>> >>> >
>>>>> >>> > Best,
>>>>> >>> >
>>>>> >>> > Dawid
>>>>> >>> >
>>>>> >>> > [1] https://data-artisans.com/platform-overview
>>>>> >>> >
>>>>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>>>> >>> >>
>>>>> >>> >> I'm looking at the job cluster mode; it looks great, and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.
>>>>> >>> >>
>>>>> >>> >> I do need to ask some questions, because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.
>>>>> >>> >>
>>>>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.
>>>>> >>> >>
>>>>> >>> >> I want to upgrade the code or configuration and start from a savepoint, in an automated way.
>>>>> >>> >>
>>>>> >>> >> Best I can figure, I cannot just update the deployment resources in Kubernetes and allow the containers to restart in an arbitrary order.
>>>>> >>> >>
>>>>> >>> >> Instead, I expect sequencing is important, something along the lines of this:
>>>>> >>> >>
>>>>> >>> >> 1. issue savepoint command on leader
>>>>> >>> >> 2. wait for savepoint
>>>>> >>> >> 3. destroy all leader and taskmanager containers
>>>>> >>> >> 4. deploy new leader, with savepoint url
>>>>> >>> >> 5. deploy new taskmanagers
>>>>> >>> >>
>>>>> >>> >> For example, I imagine old taskmanagers (with an old version of my job) attaching to the new leader and causing a problem.
>>>>> >>> >>
>>>>> >>> >> Does that sound right, or am I overthinking it?
>>>>> >>> >>
>>>>> >>> >> If not, has anyone tried implementing any automation for this yet?
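For reference, a minimal sketch of the upgrade sequence Derek outlines above (cancel with savepoint, tear everything down, redeploy from the savepoint) might look roughly like the following Python script. The job id, deployment names, manifest path, and savepoint directory are hypothetical placeholders, and parsing the savepoint path out of the `flink cancel -s` output is only an assumption about the CLI's text output; adapt all of it to your own setup.

    #!/usr/bin/env python3
    # Sketch only: job id, deployment names, and paths below are placeholders.
    import re
    import subprocess

    JOB_ID = "<job-id>"
    SAVEPOINT_DIR = "s3://my-bucket/savepoints"   # hypothetical savepoint target
    JM_DEPLOYMENT = "flink-job-cluster"           # hypothetical k8s deployment names
    TM_DEPLOYMENT = "flink-taskmanager"
    NEW_MANIFESTS = "k8s/"                        # hypothetical dir with new manifests

    def run(cmd):
        """Run a command, fail loudly on error, and return its stdout."""
        print("+", " ".join(cmd))
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    # Steps 1 + 2: cancel with savepoint; the CLI blocks until the savepoint is done.
    out = run(["flink", "cancel", "-s", SAVEPOINT_DIR, JOB_ID])

    # Assumption: the completed savepoint path appears as a URI in the CLI output.
    match = re.search(r"\b\w+://\S+", out)
    if match is None:
        raise RuntimeError("no savepoint path found in output:\n" + out)
    savepoint = match.group(0)
    print("savepoint completed at", savepoint)

    # Step 3: tear down the old leader and taskmanagers so stale containers
    # cannot attach to the new leader.
    run(["kubectl", "delete", "deployment", JM_DEPLOYMENT, TM_DEPLOYMENT])

    # Steps 4 + 5: deploy the new version. The new job-cluster manifest is assumed
    # to receive `savepoint` (e.g. templated into the container args); doing that
    # substitution is left to whatever templating tool you use.
    run(["kubectl", "apply", "-f", NEW_MANIFESTS])
    print("redeployed; job should resume from", savepoint)

Deleting both deployments before applying the new manifests is what avoids the old/new version collision Derek and Andrey mention; how the savepoint path gets into the new job cluster's arguments depends on how your manifests and entrypoint are set up.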