I'm new to Flink and am trying to understand how recovery works using state backends. I've read the documentation and am now trying to run a simple test to demonstrate it: I'd like to test the recovery of a Flink job and see how its state is restored from where it left off when 'disaster hits'.
Please note that this whole test is being done on a Windows workstation through my IDE. I am running a LocalFlinkMiniCluster and have enabled checkpointing using FsStateBackend, with Kafka as the source. When running this Flink job, I see that a new directory is created within the FsStateBackend base directory, named with a randomly generated JobID. I assume that if a Task fails within the job, the state stored in the backend will be used to restart the relevant operator instances from the most recent checkpoint.

I have tried simulating this by throwing an exception in one of the operators, though I'm not sure what the expected behaviour is: will the Task be killed, or will just that 'bad' tuple be ignored?

More importantly, I would like to simulate a more 'drastic' failure: my whole Flink cluster going down. In my test I would do this simply by killing my single LocalFlinkMiniCluster process. In that case, I would like my job to resume when I restart the Flink cluster. However, when I do that, my code launches a new job, with the same code, but running under a new Job ID. How do I get it to run with the same Job ID so that it can use the stored state to recover?

Am I approaching this test in the right way? If not, please give me some pointers to better simulate a real system. (Note that in a real system, we would like to run on a single-node cluster.)

Thanks,
Hayden Marchant
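P.S. For reference, here is roughly the state backend configuration I have in mind if I move off the programmatic FsStateBackend setup and onto a standalone cluster. This is only a sketch: the directory paths are placeholders for my machine, and the keys are the standard `flink-conf.yaml` ones from the docs, so please correct me if they don't apply to my Flink version:

```
# flink-conf.yaml (sketch -- paths below are placeholders for my workstation)

# Use the filesystem-backed state backend (equivalent to FsStateBackend)
state.backend: filesystem

# Where completed checkpoint data is written
state.checkpoints.dir: file:///C:/flink/checkpoints

# Default target directory for manually triggered savepoints
state.savepoints.dir: file:///C:/flink/savepoints
```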