Hi,

Have you looked into fine-grained recovery? https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+:+Fine+Grained+Recovery+from+Task+Failures
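
If it helps as a starting point, fine-grained recovery is controlled by the failover strategy in flink-conf.yaml. A minimal sketch — please double-check the option name and supported values against the docs for the Flink version you are running:

    # flink-conf.yaml
    # restart only the tasks/region affected by a failure instead of the full job
    jobmanager.execution.failover-strategy: region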
Stefan cc'ed might be able to give you some pointers about configuration.

Best,
Aljoscha

> On 6. Mar 2018, at 22:35, Ashish Pokharel <ashish...@yahoo.com> wrote:
>
> Hi Gordon,
>
> The issue really is we are trying to avoid checkpointing as datasets are
> really heavy and all of the states are really transient in a few of our apps
> (flushed within few seconds). So high volume/velocity and transient nature of
> state make those app good candidates to just not have checkpoints.
>
> We do have offsets committed to Kafka AND we have “some” tolerance for gap /
> duplicate. However, we do want to handle “graceful” restarts / shutdown. For
> shutdown, we have been taking savepoints (which works great) but for restart,
> we just can’t find a way.
>
> Bottom line - we are trading off resiliency for resource utilization and
> performance but would like to harden apps for production deployments as much
> as we can.
>
> Hope that makes sense.
>
> Thanks, Ashish
>
>> On Mar 6, 2018, at 10:19 PM, Tzu-Li Tai <tzuli...@gmail.com> wrote:
>>
>> Hi Ashish,
>>
>> Could you elaborate a bit more on why you think the restart of all operators
>> lead to data loss?
>>
>> When restart occurs, Flink will restart the job from the latest complete
>> checkpoint. All operator states will be reloaded with state written in that
>> checkpoint, and the position of the input stream will also be re-winded.
>>
>> I don't think there is a way to force a checkpoint before restarting occurs,
>> but as I mentioned, that should not be required, because the last complete
>> checkpoint will be used.
>> Am I missing something in your particular setup?
>>
>> Cheers,
>> Gordon
>>
>> --
>> Sent from:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
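
As a side note, the setup you describe (no checkpointing, offsets committed to Kafka, some tolerance for gaps/duplicates, plus a restart strategy for resilience) would look roughly like the sketch below. This is only an illustration assuming the Kafka 0.11 connector and a recent (1.4-ish) API; topic name, group id, broker address, and restart values are placeholders, not anything from your job:

    import java.util.Properties;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

    public class TransientStateJobSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpointing is deliberately left disabled; small gaps/duplicates are tolerated.

            // Let failed tasks restart a few times before the job gives up (placeholder values).
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");  // placeholder
            props.setProperty("group.id", "transient-state-job");  // placeholder
            // With checkpointing disabled, offset committing falls back to Kafka's periodic auto-commit.
            props.setProperty("enable.auto.commit", "true");
            props.setProperty("auto.commit.interval.ms", "5000");

            FlinkKafkaConsumer011<String> consumer =
                    new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props);
            // On a restart, resume from the offsets last committed to Kafka for this group.
            consumer.setStartFromGroupOffsets();

            // (the actual transient, quickly-flushed state logic would go here)
            env.addSource(consumer).print();

            env.execute("transient-state job (sketch)");
        }
    }

With something like this, a restart — whether triggered by the restart strategy or by resubmitting the job — picks up from whatever offsets were last auto-committed, which matches the "some tolerance for gaps/duplicates" trade-off rather than exactly-once.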