Hi there,

So far, the checkpoint trigger is hardcoded in the CheckpointCoordinator, which fires periodically and pushes control messages to the task managers. It is implemented orthogonally to the business logic implemented in jobs.
Our scenario requires the master pipeline to flow control messages along with events through a distributed queue. When a follower pipeline source detects a control message (checkpoint_barrier_id / watermark / restore_to_checkpoint_id), it blocks itself and sends a message to the checkpoint coordinator requesting that specific checkpoint. Once it receives an ack from the checkpoint coordinator and has acked back, the source unblocks itself and keeps going. It would contain message de-dupe logic that always honors the first-seen control message and discards later duplicates, given the assumption that events consumed from the distributed queue are strongly ordered. (A rough sketch of this source-side loop is in the P.S. below.)

Pros:
- Allows pipeline authors to define customized checkpointing logic
- Allows breaking a large pipeline down into smaller ones connected via a streaming queue
- Leverages data locality by running a sub-pipeline near its dependency
- Pipes late-arriving events to a side pipeline and consolidates them there

Cons:
- Overhead of blocking the follower pipeline source
- Overhead of distributed queue latency
- Overhead of writing control messages to the distributed queue

Does that make sense?

Thanks,
Chen
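
P.S. To make the source-side handling concrete, here is a minimal sketch of the consume loop I have in mind. All of the types in it (QueueRecord, QueueConsumer, CoordinatorClient, Emitter) are placeholders I made up for illustration, not existing Flink APIs, and the real implementation would of course plug into the source/task lifecycle properly.

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/**
 * Sketch of a follower-pipeline source that consumes a strongly ordered
 * distributed queue carrying both data and control messages.
 * All nested types are hypothetical placeholders for this discussion.
 */
public class FollowerSourceSketch {

    /** A record pulled from the queue: either a data event or a control message. */
    interface QueueRecord {
        boolean isControl();
        long controlId();      // e.g. checkpoint_barrier_id or restore_to_checkpoint_id
        Object payload();
    }

    /** Queue consumer; ordering is assumed to be strong, per the proposal. */
    interface QueueConsumer {
        QueueRecord poll() throws InterruptedException;
    }

    /** Hypothetical RPC client towards the checkpoint coordinator. */
    interface CoordinatorClient {
        /** Ask the coordinator to run the given checkpoint; future completes on ack. */
        CompletableFuture<Void> requestCheckpoint(long checkpointId);
    }

    /** Downstream emitter for data events. */
    interface Emitter {
        void emit(Object payload);
    }

    // De-dupe state: the first-seen control id wins, later duplicates are discarded.
    private final Set<Long> seenControlIds = new HashSet<>();

    public void run(QueueConsumer queue, CoordinatorClient coordinator, Emitter out)
            throws InterruptedException {
        while (true) {
            QueueRecord record = queue.poll();
            if (record == null) {
                continue;
            }
            if (record.isControl()) {
                long id = record.controlId();
                if (!seenControlIds.add(id)) {
                    // Duplicate control message: discard, keep consuming.
                    continue;
                }
                // Block the source: stop emitting until the coordinator acks the checkpoint.
                coordinator.requestCheckpoint(id).join();
                // Ack received: unblock and keep going.
            } else {
                out.emit(record.payload());
            }
        }
    }
}

The blocking join() on the coordinator ack is exactly the first overhead listed under Cons above.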