Fabian, Stephan, All,
A while back I started a discussion about an event-based checkpointing
policy that would help in some of our high-volume data pipelines. This is an
effort to put the idea in front of the community and understand which
capabilities could support these types of use cases, how many others feel the
same need, and whether this could become a feature that makes it into a user
story.

Use Case Summary:
- Extremely high volume of data (events from consumer devices, with a
  customer base of over 100M)
- Multiple events need to be combined by a windowed streaming app, grouped by
  key (something like the 5-minute floor of the timestamp plus unique
  identifiers for customer devices)
- "Most" events for a group/key arrive within a few seconds, if not
  milliseconds; however, events can sometimes be delayed or lost in transport
  (so delayed-event handling and timeouts are needed)
- Extremely low data loss is acceptable ("extremely low" is pretty vague, but
  hopefully the details below clarify it)
- Because of the volume and the transient nature of the source, checkpointing
  is turned off (this saves on writes to persistence, as states/sessions are
  active for only a few seconds during processing)
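
To make the grouping key in the use case concrete, here is a minimal sketch in
plain Java. It is only an illustration, not our actual job code: the class
name DeviceWindowKey, the method names, and the "deviceId@bucketStart" key
format are all made up for this example, and nothing here is a Flink API.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Flink): builds the grouping key described
// in the use case, i.e. device identifier plus 5-minute floor of the timestamp.
final class DeviceWindowKey {
    // Assumed bucket size: the 5-minute floor mentioned in the use case.
    static final long BUCKET_MILLIS = TimeUnit.MINUTES.toMillis(5);

    private DeviceWindowKey() {}

    // Floors an epoch-millisecond timestamp to the start of its 5-minute bucket.
    // Math.floorDiv keeps the result correct for negative timestamps as well.
    static long floorToBucket(long epochMillis) {
        return Math.floorDiv(epochMillis, BUCKET_MILLIS) * BUCKET_MILLIS;
    }

    // Combines the device identifier and the bucket start into a single key.
    // The "@" separator is an arbitrary choice for this sketch.
    static String of(String deviceId, long epochMillis) {
        return deviceId + "@" + floorToBucket(epochMillis);
    }
}
```

With this, two events from the same device that fall inside the same 5-minute
span get the same key, so they end up in the same group/session.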

Problem Summary:
Of course, none of the above is out of the norm for Flink, and as a matter of
fact we already have a Flink app doing this. The issue arises with graceful
shutdowns and operator failures (e.g. Kafka timeouts). On an operator failure,
the entire job graph restarts, which essentially flushes out the in-memory
states/sessions. I think there is a feature in the works (not sure if it made
it into 1.5) to perform selective restarts, which would limit the damage but
would still result in data loss. It also doesn't help when application
restarts are needed. We did try the savepoint route for explicit restarts, but
I think MemoryStateBackend ran into issues with larger states or something
along those lines (not certain). We obviously cannot recover an operator that
actually fails, because its own state could be unrecoverable. However, it
feels like Flink already has a lot of the plumbing needed for the overall
problem: allowing some sort of recoverable state so that graceful shutdowns
and restarts incur minimal data loss.

Solutions:
Some in the community replied to my last email with decent ideas, such as an
event-based checkpointing trigger (on shutdown, on restart, etc.) or
life-cycle hooks (onCancel, onRestart, etc.) in Functions that can be
implemented if this type of behavior is needed.
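
As a rough picture of what the life-cycle hook idea could look like from a
user's point of view, here is a small sketch in plain Java. To be clear, this
is purely hypothetical: the LifecycleHooks interface, the SessionBuffer class,
and the Map standing in for durable storage are assumptions for illustration.
Flink does not define onCancel/onRestart callbacks like these today.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hook interface; these callbacks do not exist in Flink.
interface LifecycleHooks {
    void onCancel();   // would run before a graceful shutdown
    void onRestart();  // would run once the job is back up
}

// Toy stand-in for a windowed function that keeps short-lived sessions in memory.
final class SessionBuffer implements LifecycleHooks {
    private final Map<String, List<String>> open = new HashMap<>();
    // Assumption: a plain Map plays the role of some durable store.
    private final Map<String, List<String>> durable;

    SessionBuffer(Map<String, List<String>> durable) {
        this.durable = durable;
    }

    // Appends an event to the in-memory session for its key.
    void add(String key, String event) {
        open.computeIfAbsent(key, k -> new ArrayList<>()).add(event);
    }

    int openSessions() {
        return open.size();
    }

    @Override
    public void onCancel() {
        // Copy every in-flight session to the store, then drop it from memory.
        for (Map.Entry<String, List<String>> e : open.entrySet()) {
            durable.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        open.clear();
    }

    @Override
    public void onRestart() {
        // Reload whatever was persisted at shutdown, then empty the store.
        for (Map.Entry<String, List<String>> e : durable.entrySet()) {
            open.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        durable.clear();
    }
}
```

The point of the sketch is that state is written only on the shutdown event,
so normal processing pays no periodic checkpointing cost. It does nothing for
the hard-failure case, where the operator's own state may be unrecoverable.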

I would appreciate feedback from the community on how useful this might be for
others, and from core contributors on their thoughts as well.

Thanks in advance,
Ashish
