Hey guys, With recent discussions around being able to shutdown and restart streaming jobs from specific checkpoints, there is another issue that I think needs tackling.
As far as I understand when a streaming job finishes the tasks are not notified for the last checkpoints and also jobs don't take a final checkpoint before shutting down. In my opinion this might lead to situations when the user cannot tell whether the job finished properly (with consistent states/ outputs) etc. To give you a concrete example, let's say I am using the RollingSink to produce exactly once output files. If the job finishes I think there will be some files that remain in the pending state and are never completed. The user then sees some complete files, and some pending files for the completed job. The question is then, how do I tell whether the pending files were actually completed properly no that the job is finished. Another example would be that I want to manually shut down my job at 12:00 and make sure that I produce every output up to that point. Is there any way to achieve this currently? I think we need to do 2 things to make this work: 1. Job shutdowns (finish/manual) should trigger a final checkpoint 2. These final checkpoints should actually be 2 phase checkpoints: checkpoint -> ack -> notify -> ack , then when the checkpointcoordinator gets all the notification acks it can tell the user that the system shut down cleanely. Unfortunately it can happen that for some reason the coordinator does not receive all the acks for a complete job, in that case it can warn the user that the checkpoint might be inconsistent. Let me know what you think! Cheers, Gyula