Hi Harini,

We have seen issues like this at task roll time, related to how task queue
notifications are handled on the overlord instances; I have a patch running
internally that resolves it.

These are my internal triage notes:

======
- Whenever task scheduling is happening (startup, ingest segment task
rollover, redeployment of a datasource), the Overlord takes a long time to
assign workers. This compounds because tasks sit so long before deployment
that the Overlord starts failing them and having to relaunch them.
- TaskQueue: notifyStatus(), which receives updates from the middle managers,
and the manage() loop, which controls services, both run through a single
lock. For example, a shutdown request submits downstream HTTP requests
synchronously while holding the lock (rough sketch below).
- In a cluster with ~700 tasks, this means the lock is held for nearly
1 second per pass of the manage loop, and only 1-2 notifications can be
processed between passes. For a fresh startup with 700 tasks and a 1-second
delay, that is 300-600 or more seconds before the Overlord realises all the
tasks have been started by the middle managers.
- Similar delays happen for other operations.
- Sub-optimal logging code path (double-concatenating very long log entries;
sketch below).
- ZkWorker: the worker fully deserializes all the ZK payload data every time
it looks up task IDs, rather than reading only the ID fields; similarly, it
repeatedly re-fetches the same data on task assignment (sketch below).
=====

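To make the TaskQueue point concrete, here is a rough, self-contained sketch
of the shape of the problem and one direction for a fix. Class and method
names are simplified stand-ins, not the actual Druid code:

// Sketch only -- names are invented, this is not the actual TaskQueue code.
import java.util.concurrent.TimeUnit;

public class TaskQueueLockSketch
{
  private final Object giantLock = new Object();

  // Stand-in for the synchronous HTTP shutdown request to a middle manager.
  private void blockingShutdownCall(String taskId)
  {
    try {
      TimeUnit.MILLISECONDS.sleep(500); // simulate the network round trip
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  private void updateBookkeeping(String taskId, String status)
  {
    // cheap in-memory state update
  }

  // Shape of the problem: the blocking call happens while the lock is held,
  // so the manage() loop and every other notification queue up behind it.
  public void notifyStatusToday(String taskId, String status)
  {
    synchronized (giantLock) {
      blockingShutdownCall(taskId); // network I/O under the lock
      updateBookkeeping(taskId, status);
    }
  }

  // Direction of a fix: keep only the in-memory bookkeeping under the lock
  // and do the blocking I/O after the lock is released.
  public void notifyStatusPatched(String taskId, String status)
  {
    synchronized (giantLock) {
      updateBookkeeping(taskId, status);
    }
    blockingShutdownCall(taskId);
  }
}

The actual patch touches more than this; treat the "patched" version as an
illustration of the direction rather than the literal change.
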
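The logging bullet is the usual concatenation-versus-parameterized issue; a
tiny sketch (slf4j-style placeholders used here purely for illustration, not
necessarily the exact call sites):

// Sketch only: building very long log strings eagerly vs. letting the
// logger format them lazily.
import java.util.List;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingSketch
{
  private static final Logger log = LoggerFactory.getLogger(LoggingSketch.class);

  public void logAssignment(String workerHost, List<String> taskIds)
  {
    // Costly: with hundreds of task IDs this builds (and re-concatenates) a
    // very long string even when DEBUG logging is disabled.
    log.debug("Assigning tasks " + taskIds + " to worker " + workerHost);

    // Cheaper: the message is only formatted if DEBUG is actually enabled.
    log.debug("Assigning tasks {} to worker {}", taskIds, workerHost);
  }
}
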
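And for the ZkWorker bullet, the idea is to stop binding the full task
announcement object when all we need is the task ID. Again a simplified
sketch with invented class and field names, not the real announcement schema:

// Sketch only: read just the "id" field out of each ZK payload instead of
// deserializing the whole announcement on every lookup.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ZkWorkerSketch
{
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Stand-in for the real announcement, which carries far more data.
  public static class AnnouncementSketch
  {
    public String id;
    public String dataSource;
    public String payload;
  }

  // Today (sketch): every lookup binds the complete object for every task.
  public Set<String> runningTaskIdsSlow(List<byte[]> zkPayloads) throws Exception
  {
    Set<String> ids = new HashSet<>();
    for (byte[] bytes : zkPayloads) {
      AnnouncementSketch announcement = MAPPER.readValue(bytes, AnnouncementSketch.class);
      ids.add(announcement.id);
    }
    return ids;
  }

  // Cheaper (sketch): pull only the ID field, and ideally cache the result
  // per ZK node so repeated lookups don't re-parse at all.
  public Set<String> runningTaskIdsFast(List<byte[]> zkPayloads) throws Exception
  {
    Set<String> ids = new HashSet<>();
    for (byte[] bytes : zkPayloads) {
      JsonNode node = MAPPER.readTree(bytes);
      ids.add(node.get("id").asText());
    }
    return ids;
  }
}
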
The patch I have is here: https://github.com/jasonk000/druid/pull/7/files

It fixes a couple of things, most importantly the task queue notification
system. With it, the system is much more stable at high task counts and will
easily restart many tasks concurrently. I have other perf issues I want to
look at before I can document it fully, build a test case, rebase it on
apache/master, etc. If you test it out and it works, we could submit a PR
that would resolve this.

PS - I have a queue of similar fixes I'd like to submit, but I need some time
for the documentation, test cases, upstreaming, etc. If anyone wants to
collaborate, I could open some Issues and share my partial notes.

Thanks
Jason

On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran
<hrajend...@confluent.io.invalid> wrote:

> Hi all,
>
> I have been investigating this in the background for a few days now and
> need some help from the community.
>
> We noticed that every hour, when the tasks roll, we see a spike in the
> ingestion lag for about 2-4 minutes. We have 180 tasks running on this
> datasource.
> [image: Screen Shot 2021-12-01 at 9.14.23 AM.png]
>
> On further debugging of the task logs, we found that around the time the
> ingestion lag spikes up, *the gap between the pause and resume commands in
> the task logs during checkpointing is wide, ranging from a few seconds to a
> couple of minutes*. For example, in the following task logs you can see that
> it was about 1.5 minutes.
>
> {"@timestamp":"2021-11-18T*20:06:58.513Z*", "log.level":"DEBUG",
> "message":"Received pause command, *pausing* ingestion until resumed.",
> "service.name":"druid/middleManager",
> "event.dataset":"druid/middleManager.log",
> "process.thread.name":"task-runner-0-priority-0",
> "log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> {"@timestamp":"2021-11-18T*20:08:26.326Z*", "log.level":"DEBUG",
> "message":"Received pause command, *pausing* ingestion until resumed.",
> "service.name":"druid/middleManager",
> "event.dataset":"druid/middleManager.log",
> "process.thread.name":"task-runner-0-priority-0",
> "log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
> {"@timestamp":"2021-11-18T*20:08:26.329Z*", "log.level":"DEBUG",
> "message":"Received resume command, *resuming* ingestion.",
> "service.name":"druid/middleManager",
> "event.dataset":"druid/middleManager.log",
> "process.thread.name":"task-runner-0-priority-0",
> "log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"}
>
> So this explains why ingestion is lagging: the *tasks are paused for a
> long time*.
>
> *Why are there 2 pauses during checkpointing, and why such a huge gap?*
> As a next step, I wanted to see why there is such a wide gap. We realized
> that the first pause is when the task pauses itself here
> <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728>
> while requesting a checkpoint from the supervisor, and the second pause is
> when the supervisor actually handles the checkpoint notice here
> <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548>.
> Since the supervisor thread for this datasource takes such a long time to
> process all the notices in the queue ahead of this checkpoint notice, the
> ingestion task ends up staying in the paused state for a long time.
>
> *Why does the supervisor thread take such a long time to get to this
> checkpoint notice?*
> That was my next step in debugging. Confirming our theory, we noticed that
> the *noticesQueue in the supervisor does get backed up with 100s of notices
> every hour when tasks roll*.
> [image: Screen Shot 2021-12-01 at 9.32.59 AM.png]
> And we saw that *run_notice takes between 5s and 7s during task rolls*.
> This backs up the noticesQueue, leaving the checkpoint notice in the queue
> for a long time and causing the ingestion lag spike whenever tasks roll.
> [image: Screen Shot 2021-12-01 at 9.34.29 AM.png]
>
> *Why does run_notice take 5-7s to finish?*
> When a task starts, it takes about 5s for its HTTP server to come up. So,
> until then, the supervisor thread is in a loop trying to get the task
> status, and this causes run_notice to take about 5-7s to finish.
>
> *Questions to the community*
> Do we need 2 pauses during checkpointing? Should the task pause itself
> before requesting a checkpoint given that the supervisor is going to pause
> the task anyway while handling the notice? Or is it okay to remove the
> pause in the TaskRunner before it sends a checkpoint notice request to the
> supervisor? This would immediately solve the ingestion lag issue
> completely, as there would no longer be two pauses with such a huge gap in
> between.
>
> Harini
> Software Engineer, Confluent
>
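
One more sketch, on the supervisor side Harini describes above: the notices
are drained by a single thread, so a checkpoint notice queued behind hundreds
of multi-second run_notices inherits all of their latency, and the task stays
paused for the whole wait (15-20 run_notices at 5-7s each is already a
1.5-2 minute gap, which lines up with the logs above). This is a
self-contained toy with invented names, not the actual
SeekableStreamSupervisor code:

// Toy model of the queueing shape only -- names are invented.
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class NoticeQueueSketch
{
  interface Notice
  {
    void handle() throws InterruptedException;
  }

  // Stand-in for run_notice: several seconds, e.g. polling a task whose
  // HTTP server has not come up yet.
  static class RunNotice implements Notice
  {
    public void handle() throws InterruptedException
    {
      TimeUnit.SECONDS.sleep(5);
    }
  }

  // Stand-in for the checkpoint notice: the task paused itself when it asked
  // for the checkpoint and stays paused until this handler finally runs.
  static class CheckpointNotice implements Notice
  {
    public void handle()
    {
      // pause/checkpoint/resume work would happen here
    }
  }

  private final LinkedBlockingQueue<Notice> notices = new LinkedBlockingQueue<>();

  public void submit(Notice notice)
  {
    notices.add(notice);
  }

  // A single thread drains the queue, so one slow notice type delays every
  // notice queued behind it.
  public void runLoop() throws InterruptedException
  {
    while (!Thread.currentThread().isInterrupted()) {
      notices.take().handle();
    }
  }
}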