Hi all,

Hope everyone is doing well!

I am running into what looks like a deadlock (the application stalls) in a
Flink streaming job when it is restored from a savepoint.

The job has a slowly moving stream (S1) that needs to be “stateful” and a
continuous stream (S2) that is “joined” with S1. Some loss or repetition is
acceptable in S2, so for that stream we can rely on something like Kafka
consumer state across restarts. However, S2 needs to be iterated through the
state built from S1 a few times (mostly 2). The state is fairly large (it ends
up being 15GB on HDFS).

When the job is restarted with no S2 data on the topic, it starts up, restores
its state and completes its initial checkpoint within 3 minutes. However, when
it is started from a savepoint and S2 data is actually present in Kafka, the
application seems to come to a halt. Looking at checkpoint progress, every
attempt appears to be stuck until it times out at around 10 minutes. If the
iteration on the stream is removed, the app starts and checkpoints
successfully even while S2 is flowing in.

Unfortunately we are working in a hosted environment for both data and
platform, so debugging with thread dumps etc. will be challenging.
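
To make the shape of the loop concrete, here is a minimal plain-Java sketch
of the "iterate S2 through S1 state a few times" logic described above. It is
not the actual job and uses no Flink APIs: the class name, the Event type, the
MAX_PASSES constant and the key-to-key lookup are all hypothetical stand-ins,
and the in-memory queue only models the feedback edge of the iteration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IterationShapeSketch {
    // Assumption: each S2 event makes at most this many passes ("mostly 2").
    static final int MAX_PASSES = 2;

    /** Hypothetical S2 event: the current key plus the number of passes done. */
    static final class Event {
        final String key;
        final int passes;

        Event(String key, int passes) {
            this.key = key;
            this.passes = passes;
        }
    }

    /** Runs every S2 key through the S1 lookup MAX_PASSES times. */
    static List<String> run(Map<String, String> s1State, List<String> s2Keys) {
        // The queue stands in for the feedback edge of the iteration.
        Deque<Event> feedback = new ArrayDeque<>();
        for (String key : s2Keys) {
            feedback.add(new Event(key, 0));
        }

        List<String> output = new ArrayList<>();
        while (!feedback.isEmpty()) {
            Event e = feedback.poll();
            // "Join" against the S1 state; unknown keys pass through unchanged.
            String resolved = s1State.getOrDefault(e.key, e.key);
            Event next = new Event(resolved, e.passes + 1);
            if (next.passes < MAX_PASSES) {
                feedback.add(next);   // goes around the loop again
            } else {
                output.add(next.key); // leaves the loop
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> s1 = new HashMap<>();
        s1.put("a", "b");
        s1.put("b", "c");
        System.out.println(run(s1, Arrays.asList("a", "x")));
    }
}
```

In the real job the lookup map would be the restored S1 state and the queue
would be the iteration's feedback channel, which is where the interaction
with checkpointing comes in.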

I couldn’t find a known issue for this, but was wondering if anyone has seen
such behavior or knows of any related issues from the past. It does look like
checkpointing has to be set to forced to get an iterative job to checkpoint in
the first place (an option that is already marked deprecated; we are on
version 1.8.2 as of now). I do understand the challenges around consistent
checkpointing of iterative streams. As I mentioned earlier, what I really want
to maintain for the most part is the state of the slowly moving dimensions.
Iterations do solve the problem at hand (multiple loops of logic) pretty
gracefully, but not being able to restore from a savepoint will be a show
stopper.
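
For reference, the forced-checkpointing setup mentioned above looks roughly
like the sketch below on the Flink 1.8 DataStream API. This is a toy
stand-in, not the actual job: the class name, the 60 second interval, the
fromElements source (in place of the Kafka consumer for S2) and the map step
(in place of the lookup against S1 state) are all assumptions made for
illustration. It needs the Flink 1.8 streaming dependency on the classpath.

```java
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ForcedCheckpointIterationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep every operator at parallelism 1 so the feedback stream matches
        // the parallelism of the iteration head in this toy example.
        env.setParallelism(1);

        // Deprecated overload: the third argument forces checkpointing even
        // though the job graph contains an iteration.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE, true);

        // Placeholder for S2: each element is the number of passes done so far.
        DataStream<Integer> s2 = env.fromElements(0, 0, 0);

        // The iteration head stops waiting after 5 s without feedback, so
        // this bounded toy job can finish.
        IterativeStream<Integer> loop = s2.iterate(5_000L);

        // Stand-in for the pass over S1 state: just count the pass.
        DataStream<Integer> body = loop.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer passes) {
                return passes + 1;
            }
        });

        // Elements with fewer than 2 passes are fed back into the loop.
        loop.closeWith(body.filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer passes) {
                return passes < 2;
            }
        }));

        // Elements that completed 2 passes leave the loop.
        body.filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer passes) {
                return passes >= 2;
            }
        }).print();

        env.execute("forced-checkpoint-iteration-sketch");
    }
}
```

Note that records in flight on the feedback edge are not part of the
checkpoint, which is the reason the force flag is required at all.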

I would appreciate any pointers or suggestions.

Thanks in advance, 

Ashish
