Hi all,

I have been investigating this in the background for a few days now and
need some help from the community.

We noticed that every hour, when the tasks roll, we see a spike in the
ingestion lag for about 2-4 minutes. We have 180 tasks running on this
datasource.
[image: Screen Shot 2021-12-01 at 9.14.23 AM.png]

On further debugging of the task logs, we found that around the time the
ingestion lag spikes, *the gap between the pause and resume commands in the
task logs during checkpointing is wide, ranging from a few seconds to a
couple of minutes*. For example, in the following task logs you can see that
it was about 1.5 minutes.
{"@timestamp":"2021-11-18T*20:06:58.513Z*", "log.level":"DEBUG",
"message":"Received pause command, *pausing* ingestion until resumed.", "
service.name
":"druid/middleManager","event.dataset":"druid/middleManager.log","
process.thread.name
":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
{"@timestamp":"2021-11-18T*20:08:26.326Z*", "log.level":"DEBUG",
"message":"Received pause command, *pausing* ingestion until resumed.", "
service.name
":"druid/middleManager","event.dataset":"druid/middleManager.log","
process.thread.name
":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
{"@timestamp":"2021-11-18T*20:08:26.329Z*", "log.level":"DEBUG",
"message":"Received resume command, *resuming* ingestion.", "service.name
":"druid/middleManager","event.dataset":"druid/middleManager.log","
process.thread.name
":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
So this explains why ingestion is lagging: the *tasks are paused for a long
time*.

*Why are there 2 pauses during checkpointing and why such a huge gap?*
As a next step, I wanted to see why the gap is so wide. We realized that the
first pause happens when the task pauses itself here
<https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728>
while requesting a checkpoint from the supervisor, and the second pause
happens when the supervisor actually handles the checkpoint notice here
<https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548>.
Since the supervisor thread for this datasource takes a long time to process
all the notices queued ahead of this checkpoint notice, the ingestion task
ends up sitting in the paused state for a long time.
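
To make that concrete, here is a tiny standalone simulation of the handshake
as I understand it (this is not Druid code; the class and the per-notice
timings are made up, only the shape of the flow is taken from the links
above): the task pauses itself when it requests the checkpoint, the
checkpoint notice lands at the back of the supervisor's queue, and the task
is only resumed once the supervisor's single notice-handling thread finally
reaches that notice.

import java.util.concurrent.*;

public class CheckpointHandshakeSketch {
    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the supervisor's noticesQueue, drained by a single thread.
        BlockingQueue<Runnable> noticesQueue = new LinkedBlockingQueue<>();
        ExecutorService supervisorExec = Executors.newSingleThreadExecutor();
        CountDownLatch resumed = new CountDownLatch(1);

        supervisorExec.submit(() -> {
            try {
                while (true) {
                    noticesQueue.take().run();
                }
            } catch (InterruptedException e) {
                // shutting down
            }
        });

        // Pretend a few run notices from rolling tasks are already queued,
        // each taking ~6 seconds (as observed in our metrics).
        for (int i = 0; i < 3; i++) {
            noticesQueue.add(() -> sleep(6_000));
        }

        // Pause #1: the task pauses itself before asking the supervisor to checkpoint.
        long pausedAtMs = System.currentTimeMillis();
        System.out.println("task: paused itself, checkpoint requested");

        // The checkpoint notice goes to the back of the queue.
        noticesQueue.add(() -> {
            // Pause #2 + resume: the supervisor finally handles the checkpoint notice.
            System.out.println("supervisor: handling checkpoint notice, resuming task");
            System.out.printf("task: was paused for ~%.1fs%n",
                (System.currentTimeMillis() - pausedAtMs) / 1000.0);
            resumed.countDown();
        });

        resumed.await();
        supervisorExec.shutdownNow();
    }
}

With three fake 6-second run notices queued ahead of the checkpoint notice,
this prints a paused time of roughly 18 seconds; with the 100s of notices we
see during task rolls, the wait grows accordingly.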

*Why does the supervisor thread take such a long time to get to this
checkpoint notice?*
That was my next step in debugging.
Confirming our theory, we noticed that the *noticesQueue in the supervisor
does get backed up with 100s of notices every hour when tasks roll*.
[image: Screen Shot 2021-12-01 at 9.32.59 AM.png]
We also saw that *run_notice takes between 5s and 7s during task rolls*.
This backs up the noticesQueue, so the checkpoint notice sits in the queue
for a long time, which is what produces the ingestion lag spike whenever
tasks roll.
[image: Screen Shot 2021-12-01 at 9.34.29 AM.png]
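
To put rough numbers on it (the queue position here is an assumption for
illustration; only the ~6s per notice comes from our metrics): if the
checkpoint notice lands behind ~15 already-queued run notices, it waits
about 15 x 6s ≈ 90s before the supervisor gets to it, which matches the
~1.5 minute pause seen in the task logs above.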

*Why does run_notice take 5-7s to finish?*
When a task starts, it takes about 5s for its HTTP server to come up. Until
then, the supervisor thread is stuck in a loop trying to get the task status,
and this is what makes run_notice take about 5-7s to finish.
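
Roughly speaking, the time goes into a poll-and-retry loop like the sketch
below (illustrative only, not the actual supervisor code; the host, port,
task id and retry interval are placeholders), and while a loop like this
spins inside a run notice, the supervisor thread cannot pick up anything
else from the noticesQueue.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TaskStatusPollSketch {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder host/port/taskId -- purely illustrative.
        String taskId = "index_kafka_example_0000000";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://middlemanager-host:8100/druid/worker/v1/chat/" + taskId + "/status"))
            .GET().build();

        long startMs = System.currentTimeMillis();
        while (System.currentTimeMillis() - startMs < 10_000) {   // give up after 10s
            try {
                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.printf("task responded (HTTP %d) after %.1fs%n",
                    response.statusCode(), (System.currentTimeMillis() - startMs) / 1000.0);
                return;
            } catch (IOException e) {
                // The task's HTTP server is not up yet: wait and retry. Until it comes
                // up (~5s after task start in our case), the caller is blocked here.
                Thread.sleep(500);
            }
        }
        System.out.println("gave up waiting for the task's HTTP server");
    }
}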

*Questions to the community*
Do we need two pauses during checkpointing? Should the task pause itself
before requesting a checkpoint, given that the supervisor is going to pause
the task anyway while handling the notice? Or is it okay to remove the pause
in the TaskRunner before it sends the checkpoint request to the supervisor?
That would solve the ingestion lag issue completely, since there would no
longer be two pauses with such a huge gap in between.

Harini
Software Engineer, Confluent
