Re: Failed job restart - flink on yarn

Jamie Grier Fri, 01 Jul 2016 16:36:07 -0700

Hi Prabhu,

Have you taken a look at Flink's savepoints feature?  This allows you to
make snapshots of your job's state on demand and then at any time restart
your job from that point:
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/savepoints.html
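
As a rough sketch of that workflow (not a tested recipe), the script below prints the three CLI steps: trigger a savepoint, cancel the job, and resubmit on YARN with larger containers. It assumes the job is still running when the savepoint is triggered. The job ID, savepoint path, jar name, and container count are placeholders I made up, and 20480 MB is taken from the 20G figure in your mail.

```shell
# Dry run: RUN defaults to "echo", so each command is printed, not executed.
RUN="${RUN-echo}"

# Placeholder values (assumptions for illustration); replace with your own.
JOB_ID="${JOB_ID:-<jobID>}"
SAVEPOINT_PATH="${SAVEPOINT_PATH:-<savepointPath>}"
JOB_JAR="${JOB_JAR:-your-job.jar}"
CONTAINERS="${CONTAINERS:-4}"

restart_from_savepoint() {
  # 1. Trigger a savepoint of the running job; Flink prints the savepoint path.
  $RUN bin/flink savepoint "$JOB_ID"
  # 2. Cancel the job once the savepoint has completed.
  $RUN bin/flink cancel "$JOB_ID"
  # 3. Resubmit on YARN, resuming from the savepoint (-s), with
  #    20 GB TaskManager containers (-ytm takes a value in MB).
  $RUN bin/flink run -m yarn-cluster -yn "$CONTAINERS" -ytm 20480 \
    -s "$SAVEPOINT_PATH" "$JOB_JAR"
}

restart_from_savepoint
```

For a real run, execute the printed commands one at a time from the Flink distribution directory, since step 3 needs the savepoint path that step 1 prints.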


Also note that you can use Flink's disk-backed (RocksDB) state backend if
your job's state is larger than fits in memory.  See
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/state_backends.html#the-rocksdbstatebackend


-Jamie


On Fri, Jul 1, 2016 at 1:34 PM, vpra...@gmail.com <vpra...@gmail.com> wrote:

> Hi,
>
> I have a Flink streaming job that reads from Kafka and performs an
> aggregation in a window. It ran fine for a while; however, when the number
> of events in a window crossed a certain limit, the YARN containers failed
> with Out Of Memory. The job was running with 10G containers.
>
> We have about 64G memory on the machine and now I want to restart the job
> with a 20G container (we ran some tests and 20G should be good enough to
> accommodate all the elements from the window).
>
> Is there a way to restart the job from the last checkpoint ?
>
> When I resubmit the job, it starts from the last committed offsets; however,
> the events that were held in the window at the time of checkpointing seem
> to get lost. Is there a way to recover the events that were buffered within
> the window and checkpointed before the failure?
>
> Thanks,
> Prabhu
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Failed-job-restart-flink-on-yarn-tp7764.html
> Sent from the Apache Flink User Mailing List archive at Nabble.com.
>



-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com
