Hi Prabhu,

Have you taken a look at Flink's savepoints feature? It lets you take a snapshot of your job's state on demand and restart the job from that point at any time: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/savepoints.html
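In case it helps, here is roughly what the savepoint workflow looks like from the command line. Treat it as a sketch rather than a recipe: the two helper functions are just for illustration and only print the commands, `FLINK` is an assumed path to the CLI in your Flink distribution, and `<jobID>`, `<savepointPath>` and `<your-job.jar>` are placeholders. Note that the savepoint has to be triggered while the job is still running.

```shell
#!/usr/bin/env bash
# Dry-run sketch: these helpers only print the Flink CLI commands; nothing is executed.
# FLINK is an assumed path to the CLI inside your Flink distribution.
FLINK="${FLINK:-bin/flink}"

# Trigger a savepoint for a running job; the CLI reports the savepoint path.
savepoint_cmd() {
  printf '%s savepoint %s\n' "$FLINK" "$1"
}

# Resubmit the job from a savepoint; the remaining arguments
# (job jar, job arguments) are passed through unchanged.
restore_cmd() {
  local savepoint_path="$1"; shift
  printf '%s run -s %s %s\n' "$FLINK" "$savepoint_path" "$*"
}

# Placeholders, to be replaced with your own values.
savepoint_cmd "<jobID>"
restore_cmd "<savepointPath>" "<your-job.jar>"
```

Any other options you normally pass to `flink run` (for example your YARN settings) would go alongside `-s` as usual.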
Also note that you can use Flink's disk-backed state backend if your job's state is larger than what fits in memory. See https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/state_backends.html#the-rocksdbstatebackend

-Jamie

On Fri, Jul 1, 2016 at 1:34 PM, vpra...@gmail.com <vpra...@gmail.com> wrote:

> Hi,
>
> I have a Flink streaming job that reads from Kafka and performs an
> aggregation in a window. It ran fine for a while; however, when the number
> of events in a window crossed a certain limit, the YARN containers failed
> with Out Of Memory. The job was running with 10G containers.
>
> We have about 64G of memory on the machine, and now I want to restart the
> job with a 20G container (we ran some tests and 20G should be good enough
> to accommodate all the elements from the window).
>
> Is there a way to restart the job from the last checkpoint?
>
> When I resubmit the job, it starts from the last committed offsets;
> however, the events that were held in the window at the time of
> checkpointing seem to get lost. Is there a way to recover the events that
> were buffered within the window and checkpointed before the failure?
>
> Thanks,
> Prabhu
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Failed-job-restart-flink-on-yarn-tp7764.html
> Sent from the Apache Flink User Mailing List archive at Nabble.com.

--
Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
ja...@data-artisans.com