Re: long lived standalone job session cluster in kubernetes

Till Rohrmann Tue, 02 Apr 2019 11:42:38 -0700

Hi Heath,

I think some of the PRs are already open and ready for review [1, 2].


[1] https://issues.apache.org/jira/browse/FLINK-10932
[2] https://issues.apache.org/jira/browse/FLINK-10935

Cheers,
Till

On Wed, Feb 27, 2019 at 10:48 AM Heath Albritton <halbr...@harm.org> wrote:

> Great, my team is eager to get started.  I’m curious what progress has
> been made so far?
>
> -H
>
> On Feb 26, 2019, at 14:43, Chunhui Shi <c...@apache.org> wrote:
>
> Hi Heath and Till, thanks for offering help on reviewing this feature. I
> just reassigned the JIRAs to myself after offline discussion with Jin. Let
> us work together to get kubernetes integrated natively with flink. Thanks.
>
> On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Alright, I'll get back to you once the PRs are open. Thanks a lot for
>> your help :-)
>>
>> Cheers,
>> Till
>>
>> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton <halbr...@harm.org>
>> wrote:
>>
>>> My team and I are keen to help out with testing and review as soon as
>>> there is a pull request.
>>>
>>> -H
>>>
>>> On Feb 11, 2019, at 00:26, Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>> Hi Heath,
>>>
>>> I just learned that people from Alibaba already made some good progress
>>> with FLINK-9953. I'm currently talking to them in order to see how we can
>>> merge this contribution into Flink as fast as possible. Since I'm quite
>>> busy due to the upcoming release I hope that other community members will
>>> help out with the reviewing once the PRs are opened.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton <halbr...@harm.org>
>>> wrote:
>>>
>>>> Has any progress been made on this?  There are a number of folks in
>>>> the community looking to help out.
>>>>
>>>>
>>>> -H
>>>>
>>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>> >
>>>> > Hi Derek,
>>>> >
>>>> > there is this issue [1] which tracks the active Kubernetes
>>>> integration. Jin Sun already started implementing some parts of it. There
>>>> should also be some PRs open for it. Please check them out.
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <derekver...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> Sounds good.
>>>> >>
>>>> >> Is someone working on this automation today?
>>>> >>
>>>> >> If not, although my time is tight, I may be able to work on a PR for
>>>> getting us started down the path to a Kubernetes-native cluster mode.
>>>> >>
>>>> >>
>>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>>> >>
>>>> >> Hi Derek,
>>>> >>
>>>> >> what I would recommend is to trigger the cancel-with-savepoint
>>>> command [1]. This will create a savepoint and terminate the job
>>>> execution. Next you simply need to respawn the job cluster, providing
>>>> it with the savepoint to resume from.
>>>> >>
>>>> >> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>>> >>
>>>> >> Cheers,
>>>> >> Till
>>>> >>
>>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <
>>>> and...@data-artisans.com> wrote:
>>>> >>>
>>>> >>> Hi Derek,
>>>> >>>
>>>> >>> I think your automation steps look good.
>>>> >>> Recreating deployments should not take long
>>>> >>> and as you mention, this way you can avoid unpredictable old/new
>>>> version collisions.
>>>> >>>
>>>> >>> Best,
>>>> >>> Andrey
>>>> >>>
>>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org>
>>>> wrote:
>>>> >>> >
>>>> >>> > Hi Derek,
>>>> >>> >
>>>> >>> > I am not an expert in kubernetes, so I will cc Till, who should
>>>> be able
>>>> >>> > to help you more.
>>>> >>> >
>>>> >>> > As for automating a similar process, I would recommend
>>>> having a
>>>> >>> > look at dA platform[1] which is built on top of kubernetes.
>>>> >>> >
>>>> >>> > Best,
>>>> >>> >
>>>> >>> > Dawid
>>>> >>> >
>>>> >>> > [1] https://data-artisans.com/platform-overview
>>>> >>> >
>>>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>>> >>> >>
>>>> >>> >> I'm looking at the job cluster mode, it looks great and I am
>>>> >>> >> considering migrating our jobs off our "legacy" session cluster
>>>> and
>>>> >>> >> into Kubernetes.
>>>> >>> >>
>>>> >>> >> I do need to ask some questions because I haven't found a lot of
>>>> >>> >> details in the documentation about how it works yet, and I gave
>>>> up
>>>> >>> >> following the DI around in the code after a while.
>>>> >>> >>
>>>> >>> >> Let's say I have a deployment for the job "leader" in HA with
>>>> ZK, and
>>>> >>> >> another deployment for the taskmanagers.
>>>> >>> >>
>>>> >>> >> I want to upgrade the code or configuration and start from a
>>>> >>> >> savepoint, in an automated way.
>>>> >>> >>
>>>> >>> >> Best I can figure, I cannot just update the deployment
>>>> resources in
>>>> >>> >> kubernetes and allow the containers to restart in an arbitrary
>>>> order.
>>>> >>> >>
>>>> >>> >> Instead, I expect sequencing is important, something along the
>>>> lines
>>>> >>> >> of this:
>>>> >>> >>
>>>> >>> >> 1. issue savepoint command on leader
>>>> >>> >> 2. wait for savepoint
>>>> >>> >> 3. destroy all leader and taskmanager containers
>>>> >>> >> 4. deploy new leader, with savepoint url
>>>> >>> >> 5. deploy new taskmanagers
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> For example, I imagine old taskmanagers (with an old version of
>>>> my
>>>> >>> >> job) attaching to the new leader and causing a problem.
>>>> >>> >>
>>>> >>> >> Does that sound right, or am I overthinking it?
>>>> >>> >>
>>>> >>> >> If not, has anyone tried implementing any automation for this
>>>> yet?
>>>> >>> >>
>>>> >>> >
>>>> >>>
>>>>
>>>
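
The upgrade sequence discussed in this thread (cancel with savepoint, tear down the old leader and taskmanagers, redeploy the leader from the savepoint, then redeploy the taskmanagers) could be sketched roughly as below. This is only an illustration under assumptions, not something taken from the thread: the function name, the deployment names, and the manifest file names are hypothetical, and the sketch only builds the command lines in order rather than running them. It assumes the `flink` CLI and `kubectl` are available wherever the commands are eventually run, and that the savepoint path reported by the cancel command is written into the new leader's manifest before step 4 (how that is done depends on your deployment).

```python
from typing import List


def build_upgrade_commands(
    job_id: str,
    savepoint_dir: str,
    leader_deployment: str,
    taskmanager_deployment: str,
    leader_manifest: str,
    taskmanager_manifest: str,
) -> List[List[str]]:
    """Return the upgrade commands in the order they should run.

    All argument values are placeholders chosen by the caller; nothing
    here is executed, so the plan can be reviewed before use.
    """
    return [
        # Steps 1-2: cancel the job with a savepoint. Whether the CLI
        # blocks until the savepoint is written should be verified for
        # the Flink version in use.
        ["flink", "cancel", "-s", savepoint_dir, job_id],
        # Step 3: remove the old leader and taskmanager deployments so
        # that no old taskmanager can attach to the new leader.
        ["kubectl", "delete", "deployment",
         leader_deployment, taskmanager_deployment],
        # Step 4: deploy the new leader. The manifest is assumed to
        # already reference the savepoint produced in steps 1-2.
        ["kubectl", "apply", "-f", leader_manifest],
        # Step 5: deploy the new taskmanagers.
        ["kubectl", "apply", "-f", taskmanager_manifest],
    ]


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    plan = build_upgrade_commands(
        job_id="<job-id>",
        savepoint_dir="<savepoint-dir>",
        leader_deployment="my-job-leader",
        taskmanager_deployment="my-job-taskmanager",
        leader_manifest="leader.yaml",
        taskmanager_manifest="taskmanager.yaml",
    )
    for command in plan:
        print(" ".join(command))
```

Keeping the plan as plain data makes the ordering explicit and easy to review; a real automation would run each command in turn and stop at the first failure.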
