Hey guys,

With recent discussions around being able to shutdown and restart streaming
jobs from specific checkpoints, there is another issue that I think needs
tackling.

As far as I understand when a streaming job finishes the tasks are not
notified for the last checkpoints and also jobs don't take a final
checkpoint before shutting down.

In my opinion this might lead to situations when the user cannot tell
whether the job finished properly (with consistent states/ outputs) etc. To
give you a concrete example, let's say I am using the RollingSink to
produce exactly once output files. If the job finishes I think there will
be some files that remain in the pending state and are never completed. The
user then sees some complete files, and some pending files for the
completed job. The question is then, how do I tell whether the pending
files were actually completed properly no that the job is finished.

Another example would be that I want to manually shut down my job at 12:00
and make sure that I produce every output up to that point. Is there any
way to achieve this currently?

I think we need to do 2 things to make this work:
1. Job shutdowns (finish/manual) should trigger a final checkpoint
2. These final checkpoints should actually be 2 phase checkpoints:
checkpoint -> ack -> notify -> ack , then when the checkpointcoordinator
gets all the notification acks it can tell the user that the system shut
down cleanely.

Unfortunately it can happen that for some reason the coordinator does not
receive all the acks for a complete job, in that case it can warn the user
that the checkpoint might be inconsistent.

Let me know what you think!

Cheers,
Gyula

Reply via email to