I'm new to Flink and am trying to understand how recovery works using state backends. I've read the documentation and am now trying to run a simple test to demonstrate it: I'd like to test the recovery of a Flink job and see how its state is restored from where it left off when 'disaster hits'.
Please note that this whole test is being done on a Windows workstation through my IDE. I am running a LocalFlinkMiniCluster and have enabled checkpointing using FsStateBackend, with Kafka as the source. When running this Flink job, I see that a new directory is created within the FsStateBackend base directory, named with a randomly generated JobID. I assume that if a Task fails within the job, the state stored in the backend will be used to restart the relevant operator instances from the most recent checkpoint.

I have tried simulating this by throwing an exception in one of the operators, though I'm not sure what the expected behaviour is: will the Task be killed, or will just that 'bad' tuple be ignored?

More importantly, I would like to simulate a more 'drastic' failure: my whole Flink cluster going down. In my test I would do this simply by killing my single LocalFlinkMiniCluster process. In that case, I would like my job to resume when I restart the Flink cluster. However, when I do that, my code launches a new job, with the same code, but running under a new Job ID. How do I get it to run with the same Job ID so that it can use the stored state to recover?

Am I approaching this test in the right way? If not, please give me some pointers to better simulate a real system. (Note that in a real system, we would like to run on a single-node cluster.)

Thanks,
Hayden Marchant
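P.S. For reference, here is roughly the state backend configuration I have in mind if I move off the programmatic FsStateBackend setup and onto a standalone cluster. This is only a sketch: the directory paths are placeholders for my machine, and the keys are the standard `flink-conf.yaml` ones from the docs, so please correct me if they don't apply to my Flink version:

```
# flink-conf.yaml (sketch -- paths below are placeholders for my workstation)

# Use the filesystem-backed state backend (equivalent to FsStateBackend)
state.backend: filesystem

# Where completed checkpoint data is written
state.checkpoints.dir: file:///C:/flink/checkpoints

# Default target directory for manually triggered savepoints
state.savepoints.dir: file:///C:/flink/savepoints
```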