Hi Jamie,

Thanks for the reply.

Yes, I looked at savepoints. I want to start my job only from the last checkpoint, which means I would have to keep track of when the checkpoint was taken and then trigger a savepoint. I am not sure this is the way to go.

My state backend is HDFS, and I can see that the checkpoint path contains the data that had been buffered in the window. I want to start the job in such a way that it reads the data checkpointed before the failure and continues processing.

I realise that checkpoints are used whenever a container fails and a new container is obtained. In my case, the job failed because container failures reached the maximum allowed number of failures.

Thanks,
Prabhu

On Fri, Jul 1, 2016 at 3:54 PM, Jamie Grier [via Apache Flink User Mailing List archive.] <ml-node+s2336050n7767...@n4.nabble.com> wrote:

> Hi Prabhu,
>
> Have you taken a look at Flink's savepoints feature? It allows you to take snapshots of your job's state on demand and then restart your job from that point at any time:
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/savepoints.html
>
> Also note that you can use Flink's disk-backed state backend if your job state is larger than what fits in memory. See
> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/streaming/state_backends.html#the-rocksdbstatebackend
>
> -Jamie
>
> On Fri, Jul 1, 2016 at 1:34 PM, [hidden email] <[hidden email]> wrote:
>
>> Hi,
>>
>> I have a Flink streaming job that reads from Kafka and performs an aggregation in a window. It ran fine for a while; however, when the number of events in a window crossed a certain limit, the YARN containers failed with out-of-memory errors. The job was running with 10G containers.
>>
>> We have about 64G of memory on the machine, and now I want to restart the job with a 20G container (we ran some tests, and 20G should be enough to accommodate all the elements in the window).
>>
>> Is there a way to restart the job from the last checkpoint?
>>
>> When I resubmit the job, it starts from the last committed offsets; however, the events that were held in the window at the time of checkpointing seem to get lost. Is there a way to recover the events that were buffered in the window and checkpointed before the failure?
>>
>> Thanks,
>> Prabhu
>>
>> --
>> View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Failed-job-restart-flink-on-yarn-tp7764.html
>> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
>
> --
>
> Jamie Grier
> data Artisans, Director of Applications Engineering
> @jamiegrier <https://twitter.com/jamiegrier>
> [hidden email]
--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Failed-job-restart-flink-on-yarn-tp7764p7771.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
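A minimal toy model, in plain Java and not Flink's API, of the loss described in this thread: resubmitting from the committed offsets alone restarts with an empty window, while restoring a full checkpoint brings back the offset and the buffered window contents together. The in-memory list stands in for a Kafka partition, and all names (`WindowedJob`, `Snapshot`, `restartFromOffsetOnly`, `restoreFromSnapshot`) are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model, NOT Flink's API: an in-memory list stands in for a Kafka partition,
 * and a plain list stands in for the window state an operator buffers.
 */
final class CheckpointToyModel {

    /** A checkpoint: the next offset to read plus a copy of the buffered window contents. */
    static final class Snapshot {
        final int offset;
        final List<String> window;

        Snapshot(int offset, List<String> window) {
            this.offset = offset;
            this.window = List.copyOf(window); // copy, so later processing cannot change the snapshot
        }
    }

    /** A job that reads events in order and buffers them in a window until it fires. */
    static final class WindowedJob {
        private final List<String> topic;
        private final List<String> window = new ArrayList<>();
        private int offset;

        WindowedJob(List<String> topic, int startOffset, List<String> initialWindow) {
            this.topic = topic;
            this.offset = startOffset;
            this.window.addAll(initialWindow);
        }

        /** Reads up to n more events from the topic into the window buffer. */
        void consume(int n) {
            for (int i = 0; i < n && offset < topic.size(); i++) {
                window.add(topic.get(offset++));
            }
        }

        /** Captures offset and window together, so they stay consistent with each other. */
        Snapshot checkpoint() {
            return new Snapshot(offset, window);
        }

        /** Emits the buffered events and clears the window. */
        List<String> fireWindow() {
            List<String> out = List.copyOf(window);
            window.clear();
            return out;
        }
    }

    /** Resubmitting with only the committed offset: the window starts empty. */
    static WindowedJob restartFromOffsetOnly(List<String> topic, Snapshot cp) {
        return new WindowedJob(topic, cp.offset, List.of());
    }

    /** Restoring the whole checkpoint: offset and window buffer come back together. */
    static WindowedJob restoreFromSnapshot(List<String> topic, Snapshot cp) {
        return new WindowedJob(topic, cp.offset, cp.window);
    }

    public static void main(String[] args) {
        List<String> topic = List.of("e0", "e1", "e2", "e3", "e4");
        WindowedJob job = new WindowedJob(topic, 0, List.of());
        job.consume(3);                   // window now holds e0, e1, e2
        Snapshot cp = job.checkpoint();   // offset 3, window [e0, e1, e2]
        // Assume the job now fails before the window fires.

        WindowedJob lossy = restartFromOffsetOnly(topic, cp);
        lossy.consume(2);
        System.out.println("offset only:  " + lossy.fireWindow());    // [e3, e4]

        WindowedJob restored = restoreFromSnapshot(topic, cp);
        restored.consume(2);
        System.out.println("full restore: " + restored.fireWindow()); // [e0, e1, e2, e3, e4]
    }
}
```

In this model, the offset-only path reproduces the symptom reported above (events e0 to e2 never reach the window output), while the full restore re-emits them. The design point it illustrates is that the offset and the window buffer must be snapshotted and restored as one unit; restoring either one alone leaves them inconsistent.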