Hi Hao, I think he's talking about exactly the use case where the JM/TM restart and come back up from the latest savepoint, which might be stale by that time.
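For what it's worth, the workaround Sean describes further down (a custom entrypoint that records that the configured savepoint has already been consumed) could look roughly like the sketch below. This is only an illustration under assumptions I'm making up: the /mnt/state mount, the INITIAL_SAVEPOINT_PATH variable, and the SavepointGate wrapper are all hypothetical, not anything Flink ships.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SavepointGate {

        public static void main(String[] args) throws Exception {
            // Hypothetical marker file on a persistent volume (or an object in GCS).
            Path marker = Paths.get("/mnt/state/savepoint-consumed");
            // Hypothetical env var set by the Helm chart for the initial deployment.
            String savepoint = System.getenv("INITIAL_SAVEPOINT_PATH");

            List<String> entrypointArgs = new ArrayList<>(Arrays.asList(args));
            if (savepoint != null && !savepoint.isEmpty() && Files.notExists(marker)) {
                // First start after a deploy: restore from the savepoint once,
                // then remember that it has been consumed (in practice you might
                // only write the marker after a successful restore).
                entrypointArgs.add("--fromSavepoint");
                entrypointArgs.add(savepoint);
                Files.createFile(marker);
            }
            // Otherwise fall through without a savepoint argument, so an unplanned
            // JM restart recovers from the latest checkpoint (given HA is enabled).

            // Hand off to the real job-cluster entrypoint with the adjusted args,
            // e.g. by invoking its main() or exec'ing the standard entrypoint script.
        }
    }

The same idea would work just as well as a shell entrypoint; the point is simply that the savepoint argument gets applied at most once per deployment.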
On Tue, 24 Sep 2019, 19:24 Hao Sun, <ha...@zendesk.com> wrote:

> We always make a savepoint before we shut down the job cluster, so the
> savepoint is always the latest. When we fix a bug or change the job graph,
> it can resume well.
> We only use checkpoints for unplanned downtime, e.g. K8s killed JM/TM,
> uncaught exception, etc.
>
> Maybe I do not understand your use case well, but I do not see a need to
> start from a checkpoint after a bug fix.
> From what I know, you can currently use a checkpoint as a savepoint as well.
>
> Hao Sun
>
>
> On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> AFAIK there's currently nothing implemented to solve this problem, but a
>> possible fix could be implemented on top of
>> https://github.com/lyft/flinkk8soperator, which already has a pretty
>> fancy state machine for rolling upgrades. I'd love to be involved, as this
>> is an issue I've been thinking about as well.
>>
>> Yuval
>>
>> On Tue, Sep 24, 2019 at 5:02 PM Sean Hester <sean.hes...@bettercloud.com>
>> wrote:
>>
>>> hi all--we've run into a gap (knowledge? design? tbd?) for our use cases
>>> when deploying Flink jobs to start from savepoints using the job-cluster
>>> mode in Kubernetes.
>>>
>>> we're running ~15 different jobs, all in job-cluster mode, using a mix
>>> of Flink 1.8.1 and 1.9.0, under GKE (Google Kubernetes Engine). these are
>>> all long-running streaming jobs, all essentially acting as microservices.
>>> we're using Helm charts to configure all of our deployments.
>>>
>>> we have a number of use cases where we want to restart a job from a
>>> savepoint to replay recent events, e.g. when we've enhanced the job logic
>>> or fixed a bug. but after the deployment we want the job to resume its
>>> "long-running" behavior, where any unplanned restart resumes from the
>>> latest checkpoint.
>>>
>>> the issue we run into is that any obvious/standard/idiomatic Kubernetes
>>> deployment includes the savepoint argument in the configuration. if the
>>> Job Manager container(s) have an unplanned restart, when they come back
>>> up they will start from the savepoint instead of resuming from the latest
>>> checkpoint. everything is working as configured, but that's not exactly
>>> what we want. we want the savepoint argument to be transient somehow
>>> (only used during the initial deployment), but Kubernetes doesn't really
>>> support the concept of transient configuration.
>>>
>>> i can see a couple of potential solutions that involve either custom
>>> code in the jobs or custom logic in the container (i.e. a custom
>>> entrypoint script that records in a file on a persistent volume or GCS
>>> that the configured savepoint has already been used, and potentially
>>> when/why/by which deployment). but these seem like unexpected and hacky
>>> solutions. before we head down that road i wanted to ask:
>>>
>>> - is this already a solved problem that i've missed?
>>> - is this issue already on the community's radar?
>>>
>>> thanks in advance!
>>>
>>> --
>>> Sean Hester | Senior Staff Software Engineer | m. 404-828-0865
>>> 3525 Piedmont Rd. NE, Building 6, Suite 500, Atlanta, GA 30305
>>> <http://www.bettercloud.com>
>>>
>>
>> --
>> Best Regards,
>> Yuval Itzchakov.
>>
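P.S. On Hao's point about using a checkpoint as a savepoint: as far as I know that only works if the job retains its checkpoints externally, otherwise they are cleaned up when the job terminates. A minimal sketch of that configuration (the interval and cleanup policy are just example values):

    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetainedCheckpointsExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 60 seconds (example interval).
            env.enableCheckpointing(60_000);

            // Keep completed checkpoints when the job is cancelled or fails,
            // so their path can later be passed to the restore/savepoint
            // argument much like a savepoint path.
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // ... build the actual job here and call env.execute(...) ...
        }
    }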