Hi Harini,

We have seen issues like this with task roll time, related to task queue
notifications on Overlord instances; I have a patch running internally
that resolves this.

These are my internal triage notes:
======
- Whenever task scheduling is happening (startup, ingestion task rollover,
redeployment of a datasource), the Overlord takes a long time to assign
tasks to workers. This compounds: tasks sit so long before deployment that
the Overlord starts failing them and has to relaunch them.

   - TaskQueue: notifyStatus(), which receives updates from the
   MiddleManagers, and the manage() loop, which controls services, both run
   through a single lock. For example, a shutdown request involves submitting
   downstream HTTP requests synchronously while holding the lock (see the
   first sketch below).
   - This means that, for a cluster with ~700 tasks, tasks are held up for
   nearly 1 second at a time, and only after each 1-second trip around the
   manage loop can 1-2 notifications be processed. For a fresh startup with
   700 tasks and a 1-second delay, that is 300-600-or-more seconds before the
   Overlord realises all the tasks have been started by the MiddleManagers.
   - Similar delays happen for other operations as well.
   - Sub-optimal logging code path (double-concatenating very long log
   entries); see the second sketch below.
   - ZkWorker: the worker fully deserializes all of the ZK payload data
   every time it looks up task IDs, rather than reading only the ID fields.
   Similarly, the same data is fetched repeatedly on task assignment (see the
   third sketch below).

=====
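
To make the lock-contention point concrete, here is a minimal Java sketch
of the pattern, not the actual TaskQueue code and not the patch itself; all
names here (TaskQueueSketch, notifyStatusBlocking, shutdownTaskOverHttp)
are hypothetical. The idea is simply to take status notifications off the
hot lock and to do the HTTP I/O outside the critical section:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical illustration; names do not match the real TaskQueue.
    class TaskQueueSketch
    {
      private final ReentrantLock giant = new ReentrantLock();
      private final BlockingQueue<String> statusUpdates = new LinkedBlockingQueue<>();

      // Shape of the problem: every status notification fights manage() for
      // the same lock, and slow HTTP I/O happens inside the critical section.
      void notifyStatusBlocking(String taskId)
      {
        giant.lock();
        try {
          shutdownTaskOverHttp(taskId); // network call while holding the lock
        }
        finally {
          giant.unlock();
        }
      }

      // Decoupled shape: notifications are only enqueued, which never blocks
      // on the lock or on I/O.
      void notifyStatus(String taskId)
      {
        statusUpdates.add(taskId);
      }

      // manage() drains the queue, reconciles state under the lock, and only
      // performs the slow HTTP calls after the lock is released.
      void manageLoopIteration()
      {
        List<String> drained = new ArrayList<>();
        statusUpdates.drainTo(drained);

        giant.lock();
        try {
          // reconcile desired vs. actual task state using 'drained' (omitted)
        }
        finally {
          giant.unlock();
        }

        for (String taskId : drained) {
          shutdownTaskOverHttp(taskId); // I/O outside the critical section
        }
      }

      private void shutdownTaskOverHttp(String taskId)
      {
        // stand-in for the synchronous shutdown request to the MiddleManager
      }
    }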
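
For the logging point, a small sketch of the difference between eager
concatenation and parameterized logging (plain SLF4J here purely for
illustration, not Druid's own Logger wrapper):

    import java.util.List;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class LoggingSketch
    {
      private static final Logger log = LoggerFactory.getLogger(LoggingSketch.class);

      void logStatuses(List<String> taskStatuses)
      {
        // Eager concatenation builds a very long string even when DEBUG is
        // disabled, and it can be built again inside a wrapping logger.
        log.debug("Task statuses: " + taskStatuses);

        // Parameterized logging defers building the message until the level
        // is known to be enabled.
        log.debug("Task statuses: {}", taskStatuses);
      }
    }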
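
And for the ZkWorker point, a rough sketch of keeping a lightweight task-ID
index so that membership checks do not re-deserialize every ZK payload
(again, hypothetical names such as WorkerTaskIndexSketch, not the actual
ZkWorker class):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.IOException;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical illustration; not the real ZkWorker.
    class WorkerTaskIndexSketch
    {
      private final ObjectMapper mapper = new ObjectMapper();
      // raw announcement bytes per task, as last seen in ZK
      private final Map<String, byte[]> payloads = new ConcurrentHashMap<>();
      private final Set<String> taskIds = ConcurrentHashMap.newKeySet();

      // Called from the ZK watcher when a child node is added or updated.
      void onChildUpdated(String taskId, byte[] payload)
      {
        payloads.put(taskId, payload);
        taskIds.add(taskId);
      }

      void onChildRemoved(String taskId)
      {
        payloads.remove(taskId);
        taskIds.remove(taskId);
      }

      // Cheap membership check: no deserialization at all.
      boolean isRunningTask(String taskId)
      {
        return taskIds.contains(taskId);
      }

      // Deserialize a payload only when a caller actually needs its contents.
      JsonNode getAnnouncement(String taskId) throws IOException
      {
        byte[] bytes = payloads.get(taskId);
        return bytes == null ? null : mapper.readTree(bytes);
      }
    }

The patch linked below contains the actual changes; these sketches are only
meant to show the general direction.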

The patch I have is here:
https://github.com/jasonk000/druid/pull/7/files

It fixes a couple of things, most importantly the task queue notification
system. The system is much more stable with high task counts and will
easily restart many tasks concurrently.

I have other perf issues I want to look at first before I can document it
fully, build a test case, rebase it on apache/master, etc. If you test it
out and it works, we could submit a PR to resolve this.

PS - I have a queue of similar fixes I'd like to submit, but I need some
time to do the documentation, build test cases, handle the upstreaming, etc.
If anyone wants to collaborate, I could open some issues and share my
partial notes.

Thanks
Jason

On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran
<hrajend...@confluent.io.invalid> wrote:

> Hi all,
>
> I have been investigating this in the background for a few days now and
> need some help from the community.
>
> We noticed that every hour, when the tasks roll, we see a spike in the
> ingestion lag for about 2-4 minutes. We have 180 tasks running on this
> datasource.
> [image: Screen Shot 2021-12-01 at 9.14.23 AM.png]
>
> On further debugging of the task logs, we found that around the time when
> the ingestion lag spikes up, *the gap between pause and resume commands in
> the task logs during checkpointing is wide, ranging from a few seconds to
> a couple of minutes*. For example, in the following task logs you can see
> that it was about 1.5 minutes.
> {"@timestamp":"2021-11-18T*20:06:58.513Z*", "log.level":"DEBUG",
> "message":"Received pause command, *pausing* ingestion until resumed.", "
> service.name
> ":"druid/middleManager","event.dataset":"druid/middleManager.log","
> process.thread.name
> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> {"@timestamp":"2021-11-18T*20:08:26.326Z*", "log.level":"DEBUG",
> "message":"Received pause command, *pausing* ingestion until resumed.", "
> service.name
> ":"druid/middleManager","event.dataset":"druid/middleManager.log","
> process.thread.name
> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> {"@timestamp":"2021-11-18T*20:08:26.329Z*", "log.level":"DEBUG",
> "message":"Received resume command, *resuming* ingestion.", "service.name
> ":"druid/middleManager","event.dataset":"druid/middleManager.log","
> process.thread.name
> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> So this explains why ingestion is lagging as the *tasks are paused for a
> long time*.
>
> *Why are there 2 pauses during checkpointing and why such a huge gap?*
> As a next step, I wanted to see why there is such a wide gap. Then we
> realized that the first pause is when the task pauses itself here
> <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728>
>  while
> requesting the supervisor for a checkpoint. And the second pause is when
> the supervisor actually handles the checkpoint notice here
> <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548>
> .
> And since the supervisor thread for this data source takes such a long
> time to process all the notices in the queue before this checkpoint notice,
> the ingestion task ends up being in the paused state for a long time.
>
> *Why does the supervisor thread take such a long time to get to this
> checkpoint notice?*
> That was my next step in debugging.
> Proving our theory, we noticed that the *noticesQueue in the supervisor
> does get backed up with 100s of notices every hour when tasks roll*.
> [image: Screen Shot 2021-12-01 at 9.32.59 AM.png]
> And we saw that *run_notice takes between 5s and 7s during task rolls*.
> And this backs up the noticesQueue, causing the checkpoint notice to sit
> in the queue for a long time, which leads to the ingestion lag spike
> whenever tasks roll.
> [image: Screen Shot 2021-12-01 at 9.34.29 AM.png]
>
> *Why does run_notice take 5-7s to finish?*
> When a task starts, it takes about 5s for its HTTP server to come up.
> Until then, the supervisor thread loops trying to get the task status,
> which causes run_notice to take about 5-7s to finish.
>
> *Questions to the community*
> Do we need 2 pauses during checkpointing? Should the task pause itself
> before requesting a checkpoint notice, given that the supervisor is anyway
> going to pause the task while handling the notice? Or is it okay to remove
> the pause in the TaskRunner before it sends a checkpoint notice request to
> the supervisor? This would immediately solve the ingestion lag issue
> completely as there won't be two pauses with such a huge gap in between.
>
> Harini
> Software Engineer, Confluent
>
