Hi Teng,
I think if the system is slowed down enough, it can happen that some
parts of the graph are still restoring while others are already taking a
checkpoint. Because of how checkpointing works (barriers are sent along
the network connections between tasks), this should not be a problem,
though.
It would be good to check in the logs whether, for each individual task,
"restoring" comes before "checkpointing".
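To make that easy to spot, something like the following could be added to
the job. This is just a quick sketch I'm making up here (not anything from
the Flink code base): a function that logs a "restoring" line from
initializeState() and a "checkpointing" line from snapshotState(), tagged
with the subtask index, so the ordering can be checked per task.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs a "restoring" line on startup/restore and a "checkpointing" line when the
// barrier arrives, tagged with the subtask index so the two can be correlated per task.
public class RestoreOrderLoggingMap<T> extends RichMapFunction<T, T>
        implements CheckpointedFunction {

    private static final Logger LOG =
            LoggerFactory.getLogger(RestoreOrderLoggingMap.class);

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // Runs when the task starts up (isRestored() is true after a restart).
        LOG.info("subtask {} - restoring (isRestored={})",
                getRuntimeContext().getIndexOfThisSubtask(), context.isRestored());
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // Runs only once the barrier for a checkpoint reaches this task, so for any
        // single task this line must come after the "restoring" line above.
        LOG.info("subtask {} - checkpointing checkpoint {}",
                getRuntimeContext().getIndexOfThisSubtask(), context.getCheckpointId());
    }

    @Override
    public T map(T value) {
        return value;
    }
}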
Best,
Aljoscha
On 29.09.20 04:00, Teng Fei Liao wrote:
Hey all,
I've been trying to debug a job-recovery performance issue, and I'm
noticing some events in the timeline that seem unexpected to me. Here's
a brief outline of the first checkpoint following a job restart (a rough
sketch of my setup follows the list):
1. All tasks are deployed and transition into the RUNNING state.
2. I see logs for a subset of the initializeState calls ("{} - restoring
state" from TwoPhaseCommitSinkFunction).
3. A checkpoint is triggered ("Triggering checkpoint {} @ {} for job {}.").
4. I see more "{} - restoring state" logs.
5. The checkpoint completes ("Completed checkpoint {} for job {} ({} bytes
in {} ms).").
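For context, the setup is roughly the following. This is only a sketch with
placeholder broker/topic names, assuming an exactly-once FlinkKafkaProducer
(whose TwoPhaseCommitSinkFunction base class logs the "restoring state"
lines) and enabled checkpointing (the triggering/completion lines come from
the checkpoint coordinator).

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RestoreTimelineJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // With checkpointing enabled, the coordinator produces the
        // "Triggering checkpoint ..." / "Completed checkpoint ..." lines.
        env.enableCheckpointing(60_000);

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder

        KafkaSerializationSchema<String> schema =
                (element, timestamp) -> new ProducerRecord<>(
                        "output-topic", element.getBytes(StandardCharsets.UTF_8));

        // Exactly-once producer: extends TwoPhaseCommitSinkFunction, which logs
        // "{} - restoring state" from initializeState() on recovery.
        FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
                "output-topic", // placeholder topic
                schema,
                kafkaProps,
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

        env.fromElements("a", "b", "c")
                .addSink(producer);

        env.execute("restore-timeline-sketch");
    }
}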
The two questions I have are:
Are the initializations in 4), in the middle of a checkpoint, expected?
Since all the tasks transition to RUNNING in 1), I would think that they
are initialized there as well.
Are the initializations in 4) causing the checkpoint to take longer to
complete? During the checkpoint, I do see "{} - checkpoint {} complete,
committing transaction {} from checkpoint {}" logs (from
TwoPhaseCommitSinkFunction's notifyCheckpointComplete method), which
suggests that the Kafka producers in 2) and 4) are contributing to the
checkpoint.
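For reference, this is roughly the hook I mean. It's my own simplified
sketch, not the actual TwoPhaseCommitSinkFunction code: the runtime calls
notifyCheckpointComplete() on a task only after the checkpoint has been
acknowledged, and that is where a two-phase-commit sink commits what it
pre-committed during snapshotState().

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Simplified stand-in for a two-phase-commit sink: records are written into a pending
// transaction in invoke(), and the transaction is only committed once the runtime
// reports that the checkpoint covering those records has completed.
public class CommitOnNotifySink<T> extends RichSinkFunction<T>
        implements CheckpointListener {

    private static final Logger LOG =
            LoggerFactory.getLogger(CommitOnNotifySink.class);

    @Override
    public void invoke(T value, Context context) {
        // Write into the currently open transaction (elided in this sketch).
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // Called after checkpoint `checkpointId` has been acknowledged by all tasks;
        // this is the point where a TwoPhaseCommitSinkFunction-style sink commits and
        // logs lines like "checkpoint {} complete, committing transaction {} from
        // checkpoint {}".
        LOG.info("subtask {} - committing pending transaction for checkpoint {}",
                getRuntimeContext().getIndexOfThisSubtask(), checkpointId);
    }
}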
Thanks!
-Teng