Hi Heath and Till, thanks for offering to help review this feature. I have just reassigned the JIRAs to myself after an offline discussion with Jin. Let us work together to get Kubernetes integrated natively with Flink. Thanks.
On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann <trohrm...@apache.org> wrote:
> Alright, I'll get back to you once the PRs are open. Thanks a lot for your help :-)
>
> Cheers,
> Till
>
> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton <halbr...@harm.org> wrote:
>
>> My team and I are keen to help out with testing and review as soon as there is a pull request.
>>
>> -H
>>
>> On Feb 11, 2019, at 00:26, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> Hi Heath,
>>
>> I just learned that people from Alibaba have already made some good progress on FLINK-9953. I'm currently talking to them in order to see how we can merge this contribution into Flink as fast as possible. Since I'm quite busy due to the upcoming release, I hope that other community members will help out with the reviewing once the PRs are opened.
>>
>> Cheers,
>> Till
>>
>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton <halbr...@harm.org> wrote:
>>
>>> Has any progress been made on this? There are a number of folks in the community looking to help out.
>>>
>>> -H
>>>
>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>> >
>>> > Hi Derek,
>>> >
>>> > There is this issue [1], which tracks the active Kubernetes integration. Jin Sun has already started implementing some parts of it. There should also be some PRs open for it. Please check them out.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <derekver...@gmail.com> wrote:
>>> >>
>>> >> Sounds good.
>>> >>
>>> >> Is someone working on this automation today?
>>> >>
>>> >> If not, although my time is tight, I may be able to work on a PR to get us started down the path of a Kubernetes native cluster mode.
>>> >>
>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>> >>
>>> >> Hi Derek,
>>> >>
>>> >> What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next, you simply respawn the job cluster, providing it with the savepoint to resume from.
>>> >>
>>> >> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>> >>
>>> >> Cheers,
>>> >> Till
>>> >>
>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com> wrote:
>>> >>>
>>> >>> Hi Derek,
>>> >>>
>>> >>> I think your automation steps look good. Recreating the deployments should not take long, and as you mention, this way you can avoid unpredictable old/new version collisions.
>>> >>>
>>> >>> Best,
>>> >>> Andrey
>>> >>>
>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>> >>> >
>>> >>> > Hi Derek,
>>> >>> >
>>> >>> > I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.
>>> >>> >
>>> >>> > As for automating a similar process, I would recommend having a look at dA Platform [1], which is built on top of Kubernetes.
>>> >>> >
>>> >>> > Best,
>>> >>> >
>>> >>> > Dawid
>>> >>> >
>>> >>> > [1] https://data-artisans.com/platform-overview
>>> >>> >
>>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>> >>> >>
>>> >>> >> I'm looking at the job cluster mode; it looks great, and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.
>>> >>> >>
>>> >>> >> I do need to ask some questions, because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.
>>> >>> >>
>>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.
>>> >>> >>
>>> >>> >> I want to upgrade the code or configuration and start from a savepoint, in an automated way.
>>> >>> >>
>>> >>> >> Best I can figure, I cannot just update the deployment resources in Kubernetes and allow the containers to restart in an arbitrary order.
>>> >>> >>
>>> >>> >> Instead, I expect sequencing is important, something along the lines of this:
>>> >>> >>
>>> >>> >> 1. issue savepoint command on leader
>>> >>> >> 2. wait for savepoint
>>> >>> >> 3. destroy all leader and taskmanager containers
>>> >>> >> 4. deploy new leader, with savepoint url
>>> >>> >> 5. deploy new taskmanagers
>>> >>> >>
>>> >>> >> For example, without that ordering I imagine old taskmanagers (with an old version of my job) attaching to the new leader and causing a problem.
>>> >>> >>
>>> >>> >> Does that sound right, or am I overthinking it?
>>> >>> >>
>>> >>> >> If not, has anyone tried implementing any automation for this yet?
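
For reference, here is a minimal, untested sketch of the savepoint-and-redeploy sequence discussed in the quoted thread (cancel with savepoint, tear everything down, redeploy from the savepoint). It is only an illustration, not something prescribed by Flink: the job id, savepoint directory, deployment names, manifest files, and the parsing of the CLI output below are all placeholders or assumptions about a particular setup.

#!/usr/bin/env python3
# Rough sketch of the upgrade sequence described above. Assumes the Flink CLI
# and kubectl are on the PATH; every concrete name here is a placeholder.
import re
import subprocess

JOB_ID = "00000000000000000000000000000000"    # placeholder job id
SAVEPOINT_TARGET = "s3://my-bucket/savepoints"  # placeholder savepoint directory


def run(cmd):
    """Run a command, echo it, and return its stdout."""
    print("+", " ".join(cmd))
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Steps 1 + 2: cancel with savepoint on the leader; the CLI blocks until the
# savepoint has completed and reports its location.
out = run(["flink", "cancel", "-s", SAVEPOINT_TARGET, JOB_ID])
match = re.search(r"Savepoint stored in (\S+)", out)  # assumption about the CLI output format
savepoint_path = match.group(1) if match else SAVEPOINT_TARGET

# Step 3: destroy the old leader and taskmanager containers so stale pods with
# the old job version cannot re-attach to the new leader.
run(["kubectl", "delete", "deployment", "flink-jobmanager", "flink-taskmanager"])

# Step 4: deploy the new leader; the savepoint path must be templated into the
# jobmanager manifest (e.g. as an argument to the job cluster entry point)
# before applying it.
print("resume from savepoint:", savepoint_path)
run(["kubectl", "apply", "-f", "jobmanager.yaml"])   # assumed to reference savepoint_path

# Step 5: deploy the new taskmanagers once the leader is up.
run(["kubectl", "apply", "-f", "taskmanager.yaml"])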