Thanks for the update, Jason. We shall wait for the builds to pass. Also, are you planning to get PR #12099 reviewed by the community anytime soon?
Harini Software Engineer, Observability +1 412 708 3872 On Wed, Jan 5, 2022 at 8:25 PM Jason Koch <jk...@netflix.com> wrote: > Harini - these are as far as I can progress them; 12096 is good, 12097 is > good except for what seems to me to be a CI issue, and 12099 is "done" but > has a CI issue as well, I'll ask for help below. > > Gian / Frank - looking for some help on these: > > - On #12099, it is rejecting because of an IntelliJ inspection issue, I've > added the `noinspection` pragma but it is being ignored by the linter. I'm > not sure how to progress fixing this, apart from rewriting the solution > which would make it less "obvious" to reason about. > > - On #12097, the failures are at a libc-related package installation, I > assume these are ignorable! > > Thanks! > Jason > > > On Wed, Jan 5, 2022 at 9:03 AM Harini Rajendran <hrajend...@confluent.io> > wrote: > >> That's great! Thanks. I'll keep an eye out on those PRs. >> >> Harini >> Software Engineer, Observability >> +1 412 708 3872 >> >> >> >> On Wed, Jan 5, 2022 at 11:00 AM Jason Koch <jk...@netflix.com> wrote: >> >>> Thanks for the prompt - yes I'll get these fixed. They are code coverage >>> / linter fixes, I had mistakenly assumed they were flaky tests. I'll aim >>> to fix these today. >>> >>> On Wed, Jan 5, 2022 at 7:25 AM Harini Rajendran <hrajend...@confluent.io> >>> wrote: >>> >>>> Hi Jason, >>>> >>>> I was taking a look at your PRs and see that the CI build is failing for 2 >>>> of them. Do you know why those are failing and are you planning to fix >>>> them? >>>> >>>> Harini >>>> Software Engineer, Observability >>>> +1 412 708 3872 >>>> >>>> >>>> >>>> On Mon, Jan 3, 2022 at 7:59 PM Harini Rajendran < >>>> hrajend...@confluent.io> wrote: >>>> >>>>> Hi Jason, >>>>> >>>>> I shall take a look at these 3 PRs and see if we can try these out in >>>>> our test environment. >>>>> >>>>> Also, we use AWS RDS as the metadata engine. >>>>> >>>>> Harini >>>>> Software Engineer, Observability >>>>> +1 412 708 3872 >>>>> >>>>> >>>>> >>>>> On Fri, Dec 31, 2021 at 3:22 PM Jason Koch <jk...@netflix.com> wrote: >>>>> >>>>>> Hi Harini, >>>>>> >>>>>> I had a chance to look at the checkpoint behaviour you mentioned in >>>>>> more detail, and found two codepaths where the RunNotice code ends up in >>>>>> the TaskQueue, and hits the same locks. I'd be interested if you want to >>>>>> try the three related PRs I have submitted. (I added more detail to the >>>>>> issue https://github.com/apache/druid/issues/11414). >>>>>> >>>>>> Failing that, I think the best next step would be some stack traces >>>>>> and/or profiler output from the supervisor at the time of rollover. It >>>>>> would also be useful to know which metadata storage solution you are >>>>>> using, >>>>>> as the RunNotice interacts with the metadata engine too. >>>>>> >>>>>> Thanks >>>>>> Jason >>>>>> >>>>>> On Fri, Dec 3, 2021 at 2:43 PM Jason Koch <jk...@netflix.com> wrote: >>>>>> >>>>>>> Gian, >>>>>>> >>>>>>> I've submitted a PR to gianm/tq-scale-test that provides a >>>>>>> concurrent test (and fixes a concurrency bug I found along the way). >>>>>>> The >>>>>>> change uses an 8ms response time for shutdown acknowledgment, and a >>>>>>> 2 >>>>>>> second time for shutdown completion/notification. >>>>>>> >>>>>>> Based on this test, >>>>>>> - the serial TaskQueue times out after 60sec for each test, and >>>>>>> - the concurrent TaskQueue passes in ~10sec per test: >>>>>>> >>>>>>> https://github.com/gianm/druid/pull/3/files >>>>>>> >>>>>>> Let me know your thoughts. 
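For readers following along, here is a minimal sketch of the simulated-task timing such a test relies on; the class and method names are illustrative, not the code in gianm/tq-scale-test. Shutdown acknowledges in ~8 ms, while the completion future fires ~2 s later on another thread, the way a middle manager reports completion asynchronously.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Illustrative only: shutdown() acknowledges quickly (~8 ms), while the
    // "task finished" future completes ~2 s later on a separate thread.
    public class ShutdownTimingSketch
    {
      static class SimulatedTask
      {
        final String id;
        final CompletableFuture<String> status = new CompletableFuture<>();
        SimulatedTask(String id) { this.id = id; }
      }

      private final ScheduledExecutorService completions = Executors.newScheduledThreadPool(8);

      // Acknowledge fast, complete later - similar to a Kafka task's /stop just setting a flag.
      void shutdown(SimulatedTask task) throws InterruptedException
      {
        Thread.sleep(8); // acknowledgment latency
        completions.schedule(() -> task.status.complete("SUCCESS"), 2, TimeUnit.SECONDS);
      }

      public static void main(String[] args) throws Exception
      {
        ShutdownTimingSketch runner = new ShutdownTimingSketch();
        List<SimulatedTask> tasks = IntStream.range(0, 1000)
            .mapToObj(i -> new SimulatedTask("task-" + i))
            .collect(Collectors.toList());

        long start = System.nanoTime();
        ExecutorService callers = Executors.newFixedThreadPool(16);
        for (SimulatedTask task : tasks) {
          callers.submit(() -> { runner.shutdown(task); return null; });
        }
        callers.shutdown();
        callers.awaitTermination(5, TimeUnit.MINUTES);

        // A queue only quiesces once every completion notification has arrived.
        CompletableFuture.allOf(tasks.stream().map(t -> t.status).toArray(CompletableFuture[]::new)).join();
        System.out.printf("quiesced %d simulated tasks in %d ms%n", tasks.size(), (System.nanoTime() - start) / 1_000_000);
        runner.completions.shutdown();
      }
    }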
>>>>>>> >>>>>>> Thanks >>>>>>> Jason >>>>>>> >>>>>>> On Fri, Dec 3, 2021 at 11:41 AM Jason Koch <jk...@netflix.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Gian >>>>>>>> >>>>>>>> > Jason, also interesting findings! I took a crack at rebasing your >>>>>>>> patch on >>>>>>>> > master and adding a scale test for the TaskQueue with 1000 >>>>>>>> simulated tasks: >>>>>>>> > >>>>>>>> https://github.com/apache/druid/compare/master...gianm:tq-scale-test >>>>>>>> <https://github.com/apache/druid/compare/master...gianm:tq-scale-test>. >>>>>>>> When >>>>>>>> > I run the scale test, "doMassLaunchAndExit" passes quickly but >>>>>>>> > "doMassLaunchAndShutdown" times out. I suppose shutting down lots >>>>>>>> of tasks >>>>>>>> > is still a bottleneck. >>>>>>>> >>>>>>>> Looks good, I'll come back to the test below. >>>>>>>> >>>>>>>> > Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should >>>>>>>> be pretty >>>>>>>> > straightforward to make the shutdown API asynchronous, which >>>>>>>> would help >>>>>>>> > speed up anything that is shutting down lots of tasks all at >>>>>>>> once. Would >>>>>>>> > that be helpful in your environments? Or are the changes to move >>>>>>>> shutdown >>>>>>>> > out of critical sections going to be enough? >>>>>>>> >>>>>>>> This would be the logical next step for some environments, but I >>>>>>>> didn't need to go >>>>>>>> that far. For this particular cluster we are reading from Kafka, >>>>>>>> which is a >>>>>>>> SeekableStreamIndexRunner, the /stop call does not stop directly, >>>>>>>> it only sets a >>>>>>>> flag, so, the response back to overlord comes back in >>>>>>>> single-digit-milliseconds. >>>>>>>> >>>>>>>> Some extra detail on the specific interaction as related to Kafka >>>>>>>> might make >>>>>>>> the problem/fix more clear ... >>>>>>>> >>>>>>>> == >>>>>>>> >>>>>>>> Suppose for simplicity we have 500 tasks to roll, and each takes >>>>>>>> 2ms to >>>>>>>> acknowledge a stop request. TaskQueue#L322-L332 is going to issue >>>>>>>> 500x2ms >>>>>>>> requests to stop all of them, which will take approx 1 second to >>>>>>>> complete. >>>>>>>> >>>>>>>> Notice it is doing this whilst holding the lock. >>>>>>>> >>>>>>>> After cleanup those tasks will issue a status update via ZK -> >>>>>>>> RemoteTaskRunner >>>>>>>> ::taskComplete that they have completed. That taskComplete fires >>>>>>>> the future >>>>>>>> completion which lands back in TaskQueue::notifyStatus where the >>>>>>>> TaskQueue >>>>>>>> can now update state that the task has finished. >>>>>>>> >>>>>>>> But - notifyStatus can *only* proceed once the lock has been >>>>>>>> released, and then >>>>>>>> it claims the lock, and calls removeTaskInternal. At this point the >>>>>>>> lock is released, >>>>>>>> and, maybe a few more concurrent ZK->notifyStatus() calls proceed. >>>>>>>> Let's >>>>>>>> suppose we got lucky, and we processed 10 requests, which have now >>>>>>>> been >>>>>>>> removed from the TaskQueue. >>>>>>>> >>>>>>>> At some point though, TaskQueue manage loop is going to get that >>>>>>>> lock, and we >>>>>>>> are now at 490 tasks that the queue believes are running, which we >>>>>>>> expected >>>>>>>> to be stopped, so we issue another 490*2ms=980ms of HTTP stop >>>>>>>> requests. >>>>>>>> And then, maybe, we get another 10 notifyStatus complete .... and >>>>>>>> we issue 480, >>>>>>>> and so on. 
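To make the shape of that contention concrete, here is a simplified sketch; it is not Druid's actual TaskQueue code, and the field and method names are only illustrative. The manage() sweep re-issues synchronous stops while holding the one lock that notifyStatus() also needs before it can record a completion, so completions can only trickle in between sweeps.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified sketch of the contention pattern described above - not Druid's actual
    // TaskQueue, just the shape of the problem: one lock guards both the manage() sweep
    // and the completion callback, and the sweep does slow I/O while holding it.
    class TaskQueueContentionSketch
    {
      private final Object giant = new Object();
      private final Map<String, Boolean> believedRunning = new ConcurrentHashMap<>();

      // Periodic sweep: re-issues a stop for every task it still thinks is running.
      void manage()
      {
        synchronized (giant) {
          for (String taskId : believedRunning.keySet()) {
            httpStop(taskId); // ~2 ms each; 500 tasks means ~1 s with the lock held
          }
        }
      }

      // Fired via the ZK watcher / future completion when a task reports it has finished.
      void notifyStatus(String taskId)
      {
        synchronized (giant) {            // blocked until the whole manage() sweep releases the lock
          believedRunning.remove(taskId); // only now does the queue stop re-sending stops to this task
        }
      }

      private void httpStop(String taskId)
      {
        try {
          Thread.sleep(2); // stand-in for the synchronous shutdown HTTP round trip
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }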
>>>>>>>> >>>>>>>> Evidently, this is going to take a long time for the TaskQueue to >>>>>>>> quiesce to the >>>>>>>> correct state, and things are a little confused across the cluster >>>>>>>> until that >>>>>>>> happens. And, to top it off, tasks and supervisor get confused as >>>>>>>> to progress, >>>>>>>> so the task is marked failed, and put back in queue to restart, >>>>>>>> which means >>>>>>>> it takes longer. >>>>>>>> >>>>>>>> The fix is basically to make sure that the TaskQueue::notifyStatus >>>>>>>> can proceed >>>>>>>> to update the task state without blocking. Following that we get a >>>>>>>> flood of ZK >>>>>>>> updates in short order, so making the logging & ZK processing more >>>>>>>> efficient >>>>>>>> significantly reduces the time for quiesce to complete. >>>>>>>> >>>>>>>> == >>>>>>>> >>>>>>>> So back to the test, it looks good, and I think some tweaks need to >>>>>>>> happen to >>>>>>>> replicate the above: >>>>>>>> (1) the taskQueue.add() and shutdown() calls should be run >>>>>>>> concurrently from lots >>>>>>>> of threads >>>>>>>> (2) the TestTaskRunner::shutdown(taskId) call should respond in >>>>>>>> ~5ms, and >>>>>>>> concurrently sleep some time, followed by a >>>>>>>> knownTasks.get(taskId).setResult(...), which I think will trigger >>>>>>>> the notification loop. >>>>>>>> >>>>>>>> I'll take a shot at this after lunch today. >>>>>>>> >>>>>>>> Sample logfile of a quick clean shutdown of a kafka task, from >>>>>>>> overlord view: >>>>>>>> >>>>>>>> https://gist.githubusercontent.com/jasonk000/40c11dce3faed44d3a89c27e0227b982/raw/a456a4cddd31f508e618321e03ef8b5241931904/druid.log >>>>>>>> >>>>>>>> Thanks >>>>>>>> Jason >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Dec 2, 2021 at 4:58 AM Gian Merlino <g...@apache.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Harini, those are interesting findings. I'm not sure if the two >>>>>>>>> pauses are >>>>>>>>> necessary, but my thought is that it ideally shouldn't matter >>>>>>>>> because the >>>>>>>>> supervisor shouldn't be taking that long to handle its notices. A >>>>>>>>> couple >>>>>>>>> things come to mind about that: >>>>>>>>> >>>>>>>>> 1) Did you see what specifically the supervisor is doing when it's >>>>>>>>> handling >>>>>>>>> the notices? Maybe from a stack trace? We should look into >>>>>>>>> optimizing it, >>>>>>>>> or making it asynchronous or something, depending on what it is. >>>>>>>>> 2) Although, there isn't really a need to trigger a run for every >>>>>>>>> single >>>>>>>>> task status change anyway; I think it's ok to coalesce them. This >>>>>>>>> patch >>>>>>>>> would do it: https://github.com/apache/druid/pull/12018 >>>>>>>>> >>>>>>>>> Jason, also interesting findings! I took a crack at rebasing your >>>>>>>>> patch on >>>>>>>>> master and adding a scale test for the TaskQueue with 1000 >>>>>>>>> simulated tasks: >>>>>>>>> >>>>>>>>> https://github.com/apache/druid/compare/master...gianm:tq-scale-test. >>>>>>>>> When >>>>>>>>> I run the scale test, "doMassLaunchAndExit" passes quickly but >>>>>>>>> "doMassLaunchAndShutdown" times out. I suppose shutting down lots >>>>>>>>> of tasks >>>>>>>>> is still a bottleneck. >>>>>>>>> >>>>>>>>> Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should be >>>>>>>>> pretty >>>>>>>>> straightforward to make the shutdown API asynchronous, which would >>>>>>>>> help >>>>>>>>> speed up anything that is shutting down lots of tasks all at once. >>>>>>>>> Would >>>>>>>>> that be helpful in your environments? 
Or are the changes to move >>>>>>>>> shutdown >>>>>>>>> out of critical sections going to be enough? >>>>>>>>> >>>>>>>>> On Wed, Dec 1, 2021 at 1:27 PM Jason Koch >>>>>>>>> <jk...@netflix.com.invalid> wrote: >>>>>>>>> >>>>>>>>> > Hi Harini, >>>>>>>>> > >>>>>>>>> > We have seen issues like this related to task roll time, related >>>>>>>>> to task >>>>>>>>> > queue notifications on overlord instances; I have a patch running >>>>>>>>> > internally that resolves this. >>>>>>>>> > >>>>>>>>> > These are my internal triage notes: >>>>>>>>> > ====== >>>>>>>>> > - Whenever task scheduling is happening (startup, ingest segment >>>>>>>>> task >>>>>>>>> > rollover, redeployment of datasource) Overlord takes a long time >>>>>>>>> to assign >>>>>>>>> > workers. This compounds because tasks sit so long before >>>>>>>>> deployment that the overlord >>>>>>>>> > starts failing tasks and having to relaunch them. >>>>>>>>> > >>>>>>>>> > - TaskQueue: notifyStatus() which receives updates from the >>>>>>>>> > middlemanagers, and the manage() loop which controls >>>>>>>>> services, run >>>>>>>>> > through >>>>>>>>> > a single lock. For example, the shutdown request involves >>>>>>>>> submitting >>>>>>>>> > downstream HTTP requests synchronously (while holding the >>>>>>>>> lock). >>>>>>>>> > - This means for a cluster with ~700 tasks that tasks are >>>>>>>>> held for >>>>>>>>> > nearly 1 second, and only after each 1 second around the >>>>>>>>> manage loop can >>>>>>>>> > 1-2 >>>>>>>>> > notifications be processed. For a new startup, with 700 >>>>>>>>> tasks, and a >>>>>>>>> > 1sec >>>>>>>>> > delay, that is 300-600 or more seconds for the overlord to >>>>>>>>> realise all >>>>>>>>> > the >>>>>>>>> > tasks are started by the middle manager. >>>>>>>>> > - Similar delays happen for any other operations. >>>>>>>>> > - Sub-optimal logging code path (double-concatenating very long >>>>>>>>> log >>>>>>>>> > entries), >>>>>>>>> > - ZkWorker: Worker fully deserializing all ZK payload data >>>>>>>>> every time >>>>>>>>> > it looks up task IDs rather than only looking at the ID fields. >>>>>>>>> > Similarly, >>>>>>>>> > it repeatedly re-fetches data on task assignment. >>>>>>>>> > >>>>>>>>> > ===== >>>>>>>>> > >>>>>>>>> > The patch I have is here: >>>>>>>>> > https://github.com/jasonk000/druid/pull/7/files >>>>>>>>> > >>>>>>>>> > It fixes a couple of things, most importantly the task queue >>>>>>>>> notification >>>>>>>>> > system. The system is much more stable with high task counts and >>>>>>>>> will >>>>>>>>> > easily restart many tasks concurrently. >>>>>>>>> > >>>>>>>>> > I have other perf issues I want to look at first before I can >>>>>>>>> document it >>>>>>>>> > fully, build a test case, rebase it on apache/master, etc. If >>>>>>>>> you test it >>>>>>>>> > out, and it works, we could submit a PR that would resolve it. >>>>>>>>> > >>>>>>>>> > PS - I have a queue of similar fixes I'd like to submit, but >>>>>>>>> need some time >>>>>>>>> > to do the documentation, build test cases, upstreaming, etc.; if >>>>>>>>> anyone >>>>>>>>> > wants to collaborate, I could open some Issues and share my >>>>>>>>> partial notes. >>>>>>>>> > >>>>>>>>> > Thanks >>>>>>>>> > Jason >>>>>>>>> > >>>>>>>>> > On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran >>>>>>>>> > <hrajend...@confluent.io.invalid> wrote: >>>>>>>>> > >>>>>>>>> > > Hi all, >>>>>>>>> > > >>>>>>>>> > > I have been investigating this in the background for a few >>>>>>>>> days now and >>>>>>>>> > > need some help from the community. 
>>>>>>>>> > > >>>>>>>>> > > We noticed that every hour, when the tasks roll, we see a >>>>>>>>> spike in the >>>>>>>>> > > ingestion lag for about 2-4 minutes. We have 180 tasks running >>>>>>>>> on this >>>>>>>>> > > datasource. >>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.14.23 AM.png] >>>>>>>>> > > >>>>>>>>> > > On further debugging of the task logs, we found out that around >>>>>>>>> the duration >>>>>>>>> > > when the ingestion lag spikes up, *the gap between pause and >>>>>>>>> resume >>>>>>>>> > > commands in the task logs during checkpointing is wide, >>>>>>>>> ranging from a few >>>>>>>>> > > seconds to a couple of minutes*. For example, in the following task >>>>>>>>> logs you >>>>>>>>> > > can see that it was about 1.5 minutes. >>>>>>>>> > > {"@timestamp":"2021-11-18T*20:06:58.513Z*", "log.level":"DEBUG", "message":"Received pause command, *pausing* ingestion until resumed.", "service.name":"druid/middleManager","event.dataset":"druid/middleManager.log","process.thread.name":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"} >>>>>>>>> > > {"@timestamp":"2021-11-18T*20:08:26.326Z*", "log.level":"DEBUG", "message":"Received pause command, *pausing* ingestion until resumed.", "service.name":"druid/middleManager","event.dataset":"druid/middleManager.log","process.thread.name":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"} >>>>>>>>> > > {"@timestamp":"2021-11-18T*20:08:26.329Z*", "log.level":"DEBUG", "message":"Received resume command, *resuming* ingestion.", "service.name":"druid/middleManager","event.dataset":"druid/middleManager.log","process.thread.name":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"} >>>>>>>>> > > So this explains why ingestion is lagging as the *tasks are >>>>>>>>> paused for a >>>>>>>>> > > long time*. >>>>>>>>> > > >>>>>>>>> > > *Why are there 2 pauses during checkpointing and why such a >>>>>>>>> huge gap?* >>>>>>>>> > > As a next step, I wanted to see why there is such a wide gap. >>>>>>>>> Then we >>>>>>>>> > > realized that the first pause is when the task pauses itself >>>>>>>>> here >>>>>>>>> > > <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728> while >>>>>>>>> > > requesting the supervisor for a checkpoint. And the second >>>>>>>>> pause is when >>>>>>>>> > > the supervisor actually handles the checkpoint notice here >>>>>>>>> > > <https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548>. 
>>>>>>>>> > > And since the supervisor thread for this data source takes >>>>>>>>> such a long >>>>>>>>> > > time to process all the notices in the queue before this >>>>>>>>> checkpoint >>>>>>>>> > notice, >>>>>>>>> > > the ingestion task ends up being in the paused state for a >>>>>>>>> long time. >>>>>>>>> > > >>>>>>>>> > > *Why does the supervisor thread take such a long time to get >>>>>>>>> to this >>>>>>>>> > > checkpoint notice?* >>>>>>>>> > > That was my next step in debugging. >>>>>>>>> > > Proving our theory, we noticed that the *noticesQueue in the >>>>>>>>> supervisor >>>>>>>>> > > does get backed up with 100s of notices every hour when tasks >>>>>>>>> roll*. >>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.32.59 AM.png] >>>>>>>>> > > And we saw that *run_notice takes between 5s and 7s during >>>>>>>>> task rolls*. >>>>>>>>> > > This backs up the noticesQueue, leaving the checkpoint notice >>>>>>>>> in the queue >>>>>>>>> > > for a long time and causing the ingestion lag spike whenever >>>>>>>>> tasks roll. >>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.34.29 AM.png] >>>>>>>>> > > >>>>>>>>> > > *Why does run_notice take 5-7s to finish?* >>>>>>>>> > > When the task starts, it takes about 5s for the HTTP server to >>>>>>>>> come up. >>>>>>>>> > > So, until then, the supervisor thread is in a loop trying to get >>>>>>>>> the task >>>>>>>>> > > status, which causes run_notice to take about 5-7s to finish. >>>>>>>>> > > >>>>>>>>> > > *Questions to the community* >>>>>>>>> > > Do we need 2 pauses during checkpointing? Should the task >>>>>>>>> pause itself >>>>>>>>> > > before requesting a checkpoint notice given that the >>>>>>>>> supervisor is >>>>>>>>> > anyway >>>>>>>>> > > going to pause the task while handling the notice? Or is it >>>>>>>>> okay to >>>>>>>>> > remove >>>>>>>>> > > the pause in the TaskRunner before it sends a checkpoint >>>>>>>>> notice request >>>>>>>>> > to >>>>>>>>> > > the supervisor? This would immediately solve the ingestion lag >>>>>>>>> issue >>>>>>>>> > > completely as there won't be two pauses with such a huge gap >>>>>>>>> in between. >>>>>>>>> > > >>>>>>>>> > > Harini >>>>>>>>> > > Software Engineer, Confluent >>>>>>>>> > > >>>>>>>>> >>>>>>>>
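For anyone skimming the thread later, a toy sketch of the head-of-line blocking described above; this is not the real SeekableStreamSupervisor code, and the names are illustrative. A single thread drains notices in FIFO order, so one slow RunNotice delays every notice queued behind it, including the CheckpointNotice that is keeping a task paused.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Toy model of a single notice-handling thread - illustrative only.
    class NoticeQueueSketch
    {
      interface Notice
      {
        void handle() throws InterruptedException;
      }

      private final BlockingQueue<Notice> notices = new LinkedBlockingQueue<>();

      void submit(Notice notice)
      {
        notices.add(notice);
      }

      // Supervisor-style loop: if task rolls enqueue many 5-7 s run notices ahead of a
      // checkpoint notice, the checkpointing task stays paused until the queue drains to it.
      void runLoop() throws InterruptedException
      {
        while (!Thread.currentThread().isInterrupted()) {
          Notice next = notices.poll(1, TimeUnit.SECONDS);
          if (next != null) {
            next.handle();
          }
        }
      }
    }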