Hi Heath and Till, thanks for offering to help review this feature. I have just reassigned the JIRAs to myself after an offline discussion with Jin. Let us work together to get Kubernetes integrated natively with Flink. Thanks.
On Fri, Feb 15, 2019 at 12:19 AM Till Rohrmann <trohrm...@apache.org> wrote:
> Alright, I'll get back to you once the PRs are open. Thanks a lot for your help :-)
>
> Cheers,
> Till
>
> On Thu, Feb 14, 2019 at 5:45 PM Heath Albritton <halbr...@harm.org> wrote:
>
>> My team and I are keen to help out with testing and review as soon as there is a pull request.
>>
>> -H
>>
>> On Feb 11, 2019, at 00:26, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> Hi Heath,
>>
>> I just learned that people from Alibaba have already made some good progress on FLINK-9953. I'm currently talking to them in order to see how we can merge this contribution into Flink as fast as possible. Since I'm quite busy due to the upcoming release, I hope that other community members will help out with the reviewing once the PRs are opened.
>>
>> Cheers,
>> Till
>>
>> On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton <halbr...@harm.org> wrote:
>>
>>> Has any progress been made on this? There are a number of folks in the community looking to help out.
>>>
>>> -H
>>>
>>> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>> >
>>> > Hi Derek,
>>> >
>>> > There is this issue [1], which tracks the active Kubernetes integration. Jin Sun has already started implementing some parts of it. There should also be some PRs open for it. Please check them out.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/FLINK-9953
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <derekver...@gmail.com> wrote:
>>> >>
>>> >> Sounds good.
>>> >>
>>> >> Is someone working on this automation today?
>>> >>
>>> >> If not, although my time is tight, I may be able to work on a PR to get us started down the path of a Kubernetes native cluster mode.
>>> >>
>>> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
>>> >>
>>> >> Hi Derek,
>>> >>
>>> >> What I would recommend is to trigger the cancel-with-savepoint command [1]. This will create a savepoint and terminate the job execution. Next, you simply respawn the job cluster, providing it with the savepoint to resume from.
>>> >>
>>> >> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
>>> >>
>>> >> Cheers,
>>> >> Till
>>> >>
>>> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <and...@data-artisans.com> wrote:
>>> >>>
>>> >>> Hi Derek,
>>> >>>
>>> >>> I think your automation steps look good. Recreating the deployments should not take long, and as you mention, this way you can avoid unpredictable old/new version collisions.
>>> >>>
>>> >>> Best,
>>> >>> Andrey
>>> >>>
>>> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>> >>> >
>>> >>> > Hi Derek,
>>> >>> >
>>> >>> > I am not an expert in Kubernetes, so I will cc Till, who should be able to help you more.
>>> >>> >
>>> >>> > As for automating a similar process, I would recommend having a look at dA Platform [1], which is built on top of Kubernetes.
>>> >>> >
>>> >>> > Best,
>>> >>> >
>>> >>> > Dawid
>>> >>> >
>>> >>> > [1] https://data-artisans.com/platform-overview
>>> >>> >
>>> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
>>> >>> >>
>>> >>> >> I'm looking at the job cluster mode; it looks great, and I am considering migrating our jobs off our "legacy" session cluster and into Kubernetes.
>>> >>> >>
>>> >>> >> I do need to ask some questions, because I haven't found a lot of details in the documentation about how it works yet, and I gave up following the DI around in the code after a while.
>>> >>> >>
>>> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and another deployment for the taskmanagers.
>>> >>> >>
>>> >>> >> I want to upgrade the code or configuration and start from a savepoint, in an automated way.
>>> >>> >>
>>> >>> >> Best I can figure, I cannot just update the deployment resources in Kubernetes and allow the containers to restart in an arbitrary order.
>>> >>> >>
>>> >>> >> Instead, I expect sequencing is important, something along the lines of this:
>>> >>> >>
>>> >>> >> 1. issue savepoint command on leader
>>> >>> >> 2. wait for savepoint
>>> >>> >> 3. destroy all leader and taskmanager containers
>>> >>> >> 4. deploy new leader, with savepoint url
>>> >>> >> 5. deploy new taskmanagers
>>> >>> >>
>>> >>> >> For example, without that ordering I imagine old taskmanagers (with an old version of my job) attaching to the new leader and causing a problem.
>>> >>> >>
>>> >>> >> Does that sound right, or am I overthinking it?
>>> >>> >>
>>> >>> >> If not, has anyone tried implementing any automation for this yet?
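
For reference, here is a minimal, untested sketch of the savepoint-and-redeploy sequence discussed in the quoted thread (cancel with savepoint, tear everything down, redeploy from the savepoint). It is only an illustration, not something prescribed by Flink: the job id, savepoint directory, deployment names, manifest files, and the parsing of the CLI output below are all placeholders or assumptions about a particular setup.

#!/usr/bin/env python3
# Rough sketch of the upgrade sequence described above. Assumes the Flink CLI
# and kubectl are on the PATH; every concrete name here is a placeholder.
import re
import subprocess

JOB_ID = "00000000000000000000000000000000"    # placeholder job id
SAVEPOINT_TARGET = "s3://my-bucket/savepoints"  # placeholder savepoint directory


def run(cmd):
    """Run a command, echo it, and return its stdout."""
    print("+", " ".join(cmd))
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Steps 1 + 2: cancel with savepoint on the leader; the CLI blocks until the
# savepoint has completed and reports its location.
out = run(["flink", "cancel", "-s", SAVEPOINT_TARGET, JOB_ID])
match = re.search(r"Savepoint stored in (\S+)", out)  # assumption about the CLI output format
savepoint_path = match.group(1) if match else SAVEPOINT_TARGET

# Step 3: destroy the old leader and taskmanager containers so stale pods with
# the old job version cannot re-attach to the new leader.
run(["kubectl", "delete", "deployment", "flink-jobmanager", "flink-taskmanager"])

# Step 4: deploy the new leader; the savepoint path must be templated into the
# jobmanager manifest (e.g. as an argument to the job cluster entry point)
# before applying it.
print("resume from savepoint:", savepoint_path)
run(["kubectl", "apply", "-f", "jobmanager.yaml"])   # assumed to reference savepoint_path

# Step 5: deploy the new taskmanagers once the leader is up.
run(["kubectl", "apply", "-f", "taskmanager.yaml"])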