> Thanks for the update, Jason. We shall wait for the builds to pass.

The builds seem to be passing now, as far as I can tell. The remaining failures are in the docker image build and look like issues with the test or build environment itself, but an expert opinion can perhaps correct me.
> Also, are you planning to get #12099 PR reviewed by the community anytime soon?

I think it would help if you could provide a real-world perspective on whether this patch set solves your issues, if that's at all possible. I'm not intending to actively seek any reviews until I have a few more PRs together. Gian, Frank, and Xavier have all provided quality input and reviews on a few of the PRs so far, and it is a quiet / holiday period for many.

Thanks
Jason

On Thu, Jan 6, 2022 at 8:19 AM Harini Rajendran <hrajend...@confluent.io> wrote: > Thanks for the update, Jason. We shall wait for the builds to pass. > Also, are you planning to get #12099 PR reviewed by the community > anytime soon? > > Harini > Software Engineer, Observability > +1 412 708 3872 > > > > On Wed, Jan 5, 2022 at 8:25 PM Jason Koch <jk...@netflix.com> wrote: > >> Harini - these are as far as I can progress them; 12096 is good, 12097 is >> good except for what seems to me to be a CI issue, and 12099 is "done" but >> has a CI issue as well, I'll ask for help below. >> >> Gian / Frank - looking for some help on these: >> >> - On #12099, it is rejecting because of an Intellij Inspect issue, I've >> added the `noinspection` pragma but it is being ignored by the linter. I'm >> not sure how to progress fixing this, apart from rewriting the solution >> which would make it less "obvious" to reason about. >> >> - On #12097, the failures are at a libc-related package installation, I >> assume these are ignorable! >> >> Thanks! >> Jason >> >> >> On Wed, Jan 5, 2022 at 9:03 AM Harini Rajendran <hrajend...@confluent.io> >> wrote: >> >>> That's great! Thanks. I'll keep an eye out on those PRs. >>> >>> Harini >>> Software Engineer, Observability >>> +1 412 708 3872 >>> >>> >>> >>> On Wed, Jan 5, 2022 at 11:00 AM Jason Koch <jk...@netflix.com> wrote: >>> >>>> Thanks for the prompt - yes I'll get these fixed. They are code >>>> coverage / linter fixes, I had mistakenly assumed they were flake-y tests. >>>> I'll aim to fix these today. >>>> >>>> On Wed, Jan 5, 2022 at 7:25 AM Harini Rajendran < >>>> hrajend...@confluent.io> wrote: >>>> >>>>> Hi Jason, >>>>> >>>>> I was taking a look at your PRs and see that CI build is failing for 2 >>>>> of them. Do you know why those are failing and are you planning to fix >>>>> them? >>>>> >>>>> Harini >>>>> Software Engineer, Observability >>>>> +1 412 708 3872 >>>>> >>>>> >>>>> >>>>> On Mon, Jan 3, 2022 at 7:59 PM Harini Rajendran < >>>>> hrajend...@confluent.io> wrote: >>>>> >>>>>> Hi Jason, >>>>>> >>>>>> I shall take a look at these 3 PRs and see if we can try these out in >>>>>> our test environment. >>>>>> >>>>>> Also, we use AWS RDS as the metadata engine. >>>>>> >>>>>> Harini >>>>>> Software Engineer, Observability >>>>>> +1 412 708 3872 >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Dec 31, 2021 at 3:22 PM Jason Koch <jk...@netflix.com> wrote: >>>>>> >>>>>>> Hi Harini, >>>>>>> >>>>>>> I had a chance to look at the checkpoint behaviour you mentioned in >>>>>>> more detail, and found two codepaths where the RunNotice code ends up in >>>>>>> the TaskQueue, and hits the same locks. I'd be interested if you want to >>>>>>> try the three related PRs I have submitted. (I added more detail to the >>>>>>> issue https://github.com/apache/druid/issues/11414). >>>>>>> >>>>>>> Failing that, I think the best next step would be some stack traces >>>>>>> / and/or profiler output from the supervisor at the time of rollover. 
It >>>>>>> would also be useful to know which metadata storage solution you are >>>>>>> using >>>>>>> as the RunNotice interacts with metadata engine too. >>>>>>> >>>>>>> Thanks >>>>>>> Jason >>>>>>> >>>>>>> On Fri, Dec 3, 2021 at 2:43 PM Jason Koch <jk...@netflix.com> wrote: >>>>>>> >>>>>>>> Gian, >>>>>>>> >>>>>>>> I've submitted a PR to gianm/tq-scale-test that provides a >>>>>>>> concurrent test, (and fixes a concurrency bug I found along the way). >>>>>>>> The >>>>>>>> change uses an 8millis response time for shutdown acknowledgment, and >>>>>>>> a 2 >>>>>>>> second time for shutdown completion/notification. >>>>>>>> >>>>>>>> Based on this test, >>>>>>>> - serial TaskQueue timeout occurs after 60sec for each test, and, >>>>>>>> - concurrent TaskQueue passes in ~10sec per test, >>>>>>>> >>>>>>>> https://github.com/gianm/druid/pull/3/files >>>>>>>> >>>>>>>> Let me know your thoughts. >>>>>>>> >>>>>>>> Thanks >>>>>>>> Jason >>>>>>>> >>>>>>>> On Fri, Dec 3, 2021 at 11:41 AM Jason Koch <jk...@netflix.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Gian >>>>>>>>> >>>>>>>>> > Jason, also interesting findings! I took a crack at rebasing >>>>>>>>> your patch on >>>>>>>>> > master and adding a scale test for the TaskQueue with 1000 >>>>>>>>> simulated tasks: >>>>>>>>> > >>>>>>>>> https://github.com/apache/druid/compare/master...gianm:tq-scale-test >>>>>>>>> <https://github.com/apache/druid/compare/master...gianm:tq-scale-test>. >>>>>>>>> When >>>>>>>>> > I run the scale test, "doMassLaunchAndExit" passes quickly but >>>>>>>>> > "doMassLaunchAndShutdown" times out. I suppose shutting down >>>>>>>>> lots of tasks >>>>>>>>> > is still a bottleneck. >>>>>>>>> >>>>>>>>> Looks good, I'll come back to the test below. >>>>>>>>> >>>>>>>>> > Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should >>>>>>>>> be pretty >>>>>>>>> > straightforward to make the shutdown API asynchronous, which >>>>>>>>> would help >>>>>>>>> > speed up anything that is shutting down lots of tasks all at >>>>>>>>> once. Would >>>>>>>>> > that be helpful in your environments? Or are the changes to move >>>>>>>>> shutdown >>>>>>>>> > out of critical sections going to be enough? >>>>>>>>> >>>>>>>>> This would be the logical next step for some environments, but I >>>>>>>>> didn't need to go >>>>>>>>> that far. For this particular cluster we are reading from Kafka, >>>>>>>>> which is a >>>>>>>>> SeekableStreamIndexRunner, the /stop call does not stop directly, >>>>>>>>> it only sets a >>>>>>>>> flag, so, the response back to overlord comes back in >>>>>>>>> single-digit-milliseconds. >>>>>>>>> >>>>>>>>> Some extra detail on the specific interaction as related to Kafka >>>>>>>>> might make >>>>>>>>> the problem/fix more clear ... >>>>>>>>> >>>>>>>>> == >>>>>>>>> >>>>>>>>> Suppose for simplicity we have 500 tasks to roll, and each takes >>>>>>>>> 2ms to >>>>>>>>> acknowledge a stop request. TaskQueue#L322-L332 is going to issue >>>>>>>>> 500x2ms >>>>>>>>> requests to stop all of them, which will take approx 1 second to >>>>>>>>> complete. >>>>>>>>> >>>>>>>>> Notice it is doing this whilst holding the lock. >>>>>>>>> >>>>>>>>> After cleanup those tasks will issue a status update via ZK -> >>>>>>>>> RemoteTaskRunner >>>>>>>>> ::taskComplete that they have completed. That taskComplete fires >>>>>>>>> the future >>>>>>>>> completion which lands back in TaskQueue::notifyStatus where the >>>>>>>>> TaskQueue >>>>>>>>> can now update state that the task has finished. 
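To make the arithmetic above concrete, here is a minimal sketch (simplified, hypothetical names, not the real TaskQueue code) of the shape being described: the manage loop issues synchronous stop requests while holding the same lock the notification path needs.

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    // Sketch only: synchronous shutdowns issued while holding the lock
    // that task-completion notifications also need.
    class LockHeldShutdownSketch
    {
      private final Object giantLock = new Object();

      // manage(): one blocking "stop" per task, all under the lock.
      void manage(List<String> tasksToShutDown) throws InterruptedException
      {
        synchronized (giantLock) {
          for (String taskId : tasksToShutDown) {
            // stand-in for the HTTP stop acknowledgment (~2 ms each);
            // 500 tasks means roughly 1 second spent inside the lock
            TimeUnit.MILLISECONDS.sleep(2);
          }
        }
      }

      // notifyStatus(): cannot record a completion until manage() lets go.
      void notifyStatus(String taskId)
      {
        synchronized (giantLock) {
          // remove the finished task from the queue's internal state
        }
      }
    }

With that shape, only the few notifications that squeeze in between manage() passes get processed, which is the slow 490 / 480 / ... drain described next.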
>>>>>>>>> >>>>>>>>> But - notifyStatus can *only* proceed once the lock has been >>>>>>>>> released, and then >>>>>>>>> it claims the lock, and calls removeTaskInternal. At this point >>>>>>>>> the lock is released, >>>>>>>>> and, maybe a few more concurrent ZK->notifyStatus() calls proceed. >>>>>>>>> Let's >>>>>>>>> suppose we got lucky, and we processed 10 requests, which have now >>>>>>>>> been >>>>>>>>> removed from the TaskQueue. >>>>>>>>> >>>>>>>>> At some point though, TaskQueue manage loop is going to get that >>>>>>>>> lock, and we >>>>>>>>> are now at 490 tasks that the queue believes are running, which we >>>>>>>>> expected >>>>>>>>> to be stopped, so we issue another 490*2ms=980ms of HTTP stop >>>>>>>>> requests. >>>>>>>>> And then, maybe, we get another 10 notifyStatus complete .... and >>>>>>>>> we issue 480, >>>>>>>>> and so on. >>>>>>>>> >>>>>>>>> Evidently, this is going to take a long time for the TaskQueue to >>>>>>>>> quiesce to the >>>>>>>>> correct state, and things are a little confused across the cluster >>>>>>>>> until that >>>>>>>>> happens. And, to top it off, tasks and supervisor get confused as >>>>>>>>> to progress, >>>>>>>>> so the task is marked failed, and put back in queue to restart, >>>>>>>>> which means >>>>>>>>> it takes longer. >>>>>>>>> >>>>>>>>> The fix is basically to make sure that the TaskQueue::notifyStatus >>>>>>>>> can proceed >>>>>>>>> to update the task state without blocking. Following that we get a >>>>>>>>> flood of ZK >>>>>>>>> updates in short order, so making the logging & ZK processing more >>>>>>>>> efficient >>>>>>>>> significantly reduces the time for quiesce to complete. >>>>>>>>> >>>>>>>>> == >>>>>>>>> >>>>>>>>> So back to the test, it looks good, and I think some tweaks need >>>>>>>>> to happen to >>>>>>>>> replicate the above: >>>>>>>>> (1) the taskQueue.add() and shutdown() calls should be run >>>>>>>>> concurrently from lots >>>>>>>>> of threads >>>>>>>>> (2) the TestTaskRunner::shutdown(taskId) call should respond in >>>>>>>>> ~5ms, and >>>>>>>>> concurrently sleep some time, followed by a >>>>>>>>> knownTasks.get(taskId).setResult(...), which I think will trigger >>>>>>>>> the notification loop. >>>>>>>>> >>>>>>>>> I'll take a shot at this after lunch today. >>>>>>>>> >>>>>>>>> Sample logfile of a quick clean shutdown of a kafka task, from >>>>>>>>> overlord view: >>>>>>>>> >>>>>>>>> https://gist.githubusercontent.com/jasonk000/40c11dce3faed44d3a89c27e0227b982/raw/a456a4cddd31f508e618321e03ef8b5241931904/druid.log >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Jason >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Dec 2, 2021 at 4:58 AM Gian Merlino <g...@apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Harini, those are interesting findings. I'm not sure if the two >>>>>>>>>> pauses are >>>>>>>>>> necessary, but my thought is that it ideally shouldn't matter >>>>>>>>>> because the >>>>>>>>>> supervisor shouldn't be taking that long to handle its notices. A >>>>>>>>>> couple >>>>>>>>>> things come to mind about that: >>>>>>>>>> >>>>>>>>>> 1) Did you see what specifically the supervisor is doing when >>>>>>>>>> it's handling >>>>>>>>>> the notices? Maybe from a stack trace? We should look into >>>>>>>>>> optimizing it, >>>>>>>>>> or making it asynchronous or something, depending on what it is. >>>>>>>>>> 2) Although, there isn't really a need to trigger a run for every >>>>>>>>>> single >>>>>>>>>> task status change anyway; I think it's ok to coalesce them. 
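As a rough illustration of that coalescing idea (hypothetical names only; the PR linked just after this sketch is the actual change): every task-status change simply requests a run, and at most one run stays queued at a time, so a burst of status changes costs one or two supervisor runs instead of one per task.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch of coalescing: many requestRun() calls, at most one queued run.
    class CoalescingRunNotifier
    {
      private final ExecutorService exec = Executors.newSingleThreadExecutor();
      private final AtomicBoolean runQueued = new AtomicBoolean(false);

      void requestRun(Runnable runOnce)
      {
        // only the first caller after a run has been queued submits a new one
        if (runQueued.compareAndSet(false, true)) {
          exec.submit(() -> {
            runQueued.set(false);  // changes arriving from here on queue another run
            runOnce.run();         // one run services everything seen so far
          });
        }
      }
    }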
This >>>>>>>>>> patch >>>>>>>>>> would do it: https://github.com/apache/druid/pull/12018 >>>>>>>>>> >>>>>>>>>> Jason, also interesting findings! I took a crack at rebasing your >>>>>>>>>> patch on >>>>>>>>>> master and adding a scale test for the TaskQueue with 1000 >>>>>>>>>> simulated tasks: >>>>>>>>>> >>>>>>>>>> https://github.com/apache/druid/compare/master...gianm:tq-scale-test. >>>>>>>>>> When >>>>>>>>>> I run the scale test, "doMassLaunchAndExit" passes quickly but >>>>>>>>>> "doMassLaunchAndShutdown" times out. I suppose shutting down lots >>>>>>>>>> of tasks >>>>>>>>>> is still a bottleneck. >>>>>>>>>> >>>>>>>>>> Looking at RemoteTaskRunner and HttpRemoteTaskRunner, it should >>>>>>>>>> be pretty >>>>>>>>>> straightforward to make the shutdown API asynchronous, which >>>>>>>>>> would help >>>>>>>>>> speed up anything that is shutting down lots of tasks all at >>>>>>>>>> once. Would >>>>>>>>>> that be helpful in your environments? Or are the changes to move >>>>>>>>>> shutdown >>>>>>>>>> out of critical sections going to be enough? >>>>>>>>>> >>>>>>>>>> On Wed, Dec 1, 2021 at 1:27 PM Jason Koch >>>>>>>>>> <jk...@netflix.com.invalid> wrote: >>>>>>>>>> >>>>>>>>>> > Hi Harini, >>>>>>>>>> > >>>>>>>>>> > We have seen issues like this related to task roll time, >>>>>>>>>> related to task >>>>>>>>>> > queue notifications on overlord instances; I have a patch >>>>>>>>>> running >>>>>>>>>> > internally that resolves this. >>>>>>>>>> > >>>>>>>>>> > These are my internal triage notes: >>>>>>>>>> > ====== >>>>>>>>>> > - Whenever task scheduling is happening (startup, ingest >>>>>>>>>> segment task >>>>>>>>>> > rollover, redeployment of datasource) Overlord takes a long >>>>>>>>>> time to assign >>>>>>>>>> > workers. This compounds because tasks sit so long before >>>>>>>>>> deployment that it >>>>>>>>>> > starts failing tasks and having to relaunch them. >>>>>>>>>> > >>>>>>>>>> > - TaskQueue: notifyStatus() which receives updates from the >>>>>>>>>> > middlemanagers, and the manage() loop which controls >>>>>>>>>> services, runs >>>>>>>>>> > through >>>>>>>>>> > a single lock. For example, the shutdown request involves >>>>>>>>>> submitting >>>>>>>>>> > downstream HTTP requests synchronously (while holding the >>>>>>>>>> lock). >>>>>>>>>> > - This means for a cluster with ~700 tasks that tasks are >>>>>>>>>> held for >>>>>>>>>> > nearly 1 second, and only after each 1 second around the >>>>>>>>>> manage loop can >>>>>>>>>> > 1-2 >>>>>>>>>> > notifications be processed. For a new startup, with 700 >>>>>>>>>> tasks, and a >>>>>>>>>> > 1sec >>>>>>>>>> > delay, that is 300-600-or-more seconds for the overlord to >>>>>>>>>> realise all >>>>>>>>>> > the >>>>>>>>>> > tasks are started by the middle manager. >>>>>>>>>> > - Similar delays happen for any other operations. >>>>>>>>>> > - Sub-optimal logging code path (double-concatenating very >>>>>>>>>> long log >>>>>>>>>> > entries), >>>>>>>>>> > - ZkWorker: Worker fully deserializing all ZK payload data >>>>>>>>>> every time >>>>>>>>>> > looking up task IDs rather than only looking at the ID >>>>>>>>>> fields. >>>>>>>>>> > Similarly, >>>>>>>>>> > repeat fetching data on task assignment. >>>>>>>>>> > >>>>>>>>>> > ===== >>>>>>>>>> > >>>>>>>>>> > The patch I have is here: >>>>>>>>>> > https://github.com/jasonk000/druid/pull/7/files >>>>>>>>>> > >>>>>>>>>> > It fixes a couple of things, most importantly the task queue >>>>>>>>>> notification >>>>>>>>>> > system. 
The system is much more stable with high task counts >>>>>>>>>> and will >>>>>>>>>> > easily restart many tasks concurrently. >>>>>>>>>> > >>>>>>>>>> > I have other perf issues I want to look at first before I can >>>>>>>>>> document it >>>>>>>>>> > fully, build a test case, rebase it on apache/master, etc. If >>>>>>>>>> you test it >>>>>>>>>> > out, and it works, we could submit a PR that would resolve it. >>>>>>>>>> > >>>>>>>>>> > PS - I have a queue of similar fixes I'd like to submit, but >>>>>>>>>> need some time >>>>>>>>>> > to do the documentation, build test cases, upstreaming, etc, if >>>>>>>>>> anyone >>>>>>>>>> > wants to collaborate, I could open some Issues and share my >>>>>>>>>> partial notes. >>>>>>>>>> > >>>>>>>>>> > Thanks >>>>>>>>>> > Jason >>>>>>>>>> > >>>>>>>>>> > On Wed, Dec 1, 2021 at 12:59 PM Harini Rajendran >>>>>>>>>> > <hrajend...@confluent.io.invalid> wrote: >>>>>>>>>> > >>>>>>>>>> > > Hi all, >>>>>>>>>> > > >>>>>>>>>> > > I have been investigating this in the background for a few >>>>>>>>>> days now and >>>>>>>>>> > > need some help from the community. >>>>>>>>>> > > >>>>>>>>>> > > We noticed that every hour, when the tasks roll, we see a >>>>>>>>>> spike in the >>>>>>>>>> > > ingestion lag for about 2-4 minutes. We have 180 tasks >>>>>>>>>> running on this >>>>>>>>>> > > datasource. >>>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.14.23 AM.png] >>>>>>>>>> > > >>>>>>>>>> > > On further debugging of task logs, we found out that around >>>>>>>>>> the duration >>>>>>>>>> > > when the ingestion lag spikes up, *the gap between pause and >>>>>>>>>> resume >>>>>>>>>> > > commands in the task logs during checkpointing are wide >>>>>>>>>> ranging from few >>>>>>>>>> > > seconds to couple minutes*. For example, in the following >>>>>>>>>> task logs you >>>>>>>>>> > > can see that it was about 1.5 minutes. >>>>>>>>>> > > {"@timestamp":"2021-11-18T*20:06:58.513Z*", >>>>>>>>>> "log.level":"DEBUG", >>>>>>>>>> > > "message":"Received pause command, *pausing* ingestion until >>>>>>>>>> resumed.", " >>>>>>>>>> > > service.name >>>>>>>>>> > > >>>>>>>>>> ":"druid/middleManager","event.dataset":"druid/middleManager.log"," >>>>>>>>>> > > process.thread.name >>>>>>>>>> > > >>>>>>>>>> > >>>>>>>>>> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"} >>>>>>>>>> > > {"@timestamp":"2021-11-18T*20:08:26.326Z*", >>>>>>>>>> "log.level":"DEBUG", >>>>>>>>>> > > "message":"Received pause command, *pausing* ingestion until >>>>>>>>>> resumed.", " >>>>>>>>>> > > service.name >>>>>>>>>> > > >>>>>>>>>> ":"druid/middleManager","event.dataset":"druid/middleManager.log"," >>>>>>>>>> > > process.thread.name >>>>>>>>>> > > >>>>>>>>>> > >>>>>>>>>> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"} >>>>>>>>>> > > {"@timestamp":"2021-11-18T*20:08:26.329Z*", >>>>>>>>>> "log.level":"DEBUG", >>>>>>>>>> > > "message":"Received resume command, *resuming* ingestion.", " >>>>>>>>>> > service.name >>>>>>>>>> > > >>>>>>>>>> ":"druid/middleManager","event.dataset":"druid/middleManager.log"," >>>>>>>>>> > > process.thread.name >>>>>>>>>> > > >>>>>>>>>> > >>>>>>>>>> ":"task-runner-0-priority-0","log.logger":"org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner"} >>>>>>>>>> > > So this explains why ingestion is lagging as the *tasks are >>>>>>>>>> paused for a >>>>>>>>>> > > long time*. 
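For reference, the gap in that example can be checked straight from the "@timestamp" fields quoted above; a quick sketch:

    import java.time.Duration;
    import java.time.Instant;

    // Pause-to-resume gap, using the timestamps from the log lines above.
    class PauseGapCheck
    {
      public static void main(String[] args)
      {
        Instant paused = Instant.parse("2021-11-18T20:06:58.513Z");   // first "pausing" entry
        Instant resumed = Instant.parse("2021-11-18T20:08:26.329Z");  // "resuming" entry
        System.out.println(Duration.between(paused, resumed));        // PT1M27.816S, roughly 1.5 minutes
      }
    }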
>>>>>>>>>> > > >>>>>>>>>> > > *Why are there 2 pauses during checkpointing and why such a >>>>>>>>>> huge gap?* >>>>>>>>>> > > As a next step, I wanted to see why there is such a wide gap. >>>>>>>>>> Then we >>>>>>>>>> > > realized that the first pause is when the task pauses itself >>>>>>>>>> here >>>>>>>>>> > > < >>>>>>>>>> > >>>>>>>>>> https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L728 >>>>>>>>>> > >>>>>>>>>> > while >>>>>>>>>> > > requesting the supervisor for a checkpoint. And the second >>>>>>>>>> pause is when >>>>>>>>>> > > the supervisor actually handles the checkpoint notice here >>>>>>>>>> > > < >>>>>>>>>> > >>>>>>>>>> https://github.com/confluentinc/druid/blob/185ab56e42577dad6b077b415959512b0cd96345/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L2548 >>>>>>>>>> > > >>>>>>>>>> > > . >>>>>>>>>> > > And since the supervisor thread for this data source takes >>>>>>>>>> such a long >>>>>>>>>> > > time to process all the notices in the queue before this >>>>>>>>>> checkpoint >>>>>>>>>> > notice, >>>>>>>>>> > > the ingestion task ends up being in the paused state for a >>>>>>>>>> long time. >>>>>>>>>> > > >>>>>>>>>> > > *Why does the supervisor thread take such a long time to get >>>>>>>>>> to this >>>>>>>>>> > > checkpoint notice?* >>>>>>>>>> > > That was my next step in debugging. >>>>>>>>>> > > Proving our theory, we noticed that the *noticesQueue in the >>>>>>>>>> supervisor >>>>>>>>>> > > does get backed up with 100s of notices every hour when tasks >>>>>>>>>> roll*. >>>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.32.59 AM.png] >>>>>>>>>> > > And we saw that *run_notice takes between 5s and 7s during >>>>>>>>>> task rolls*. >>>>>>>>>> > > And this causes backing up of noticesQueue causing checkpoint >>>>>>>>>> notice to >>>>>>>>>> > be >>>>>>>>>> > > in the queue for long leading to ingestion lag spike whenever >>>>>>>>>> tasks roll. >>>>>>>>>> > > [image: Screen Shot 2021-12-01 at 9.34.29 AM.png] >>>>>>>>>> > > >>>>>>>>>> > > *Why does run_notice take 5-7s to finish?* >>>>>>>>>> > > When the task starts, it takes about 5s for the HTTP server >>>>>>>>>> to come up. >>>>>>>>>> > > So, till then the supervisor thread is in a loop trying to >>>>>>>>>> get the task >>>>>>>>>> > > status and this causes run_notice to take about 5-7s to >>>>>>>>>> finish. >>>>>>>>>> > > >>>>>>>>>> > > *Questions to the community* >>>>>>>>>> > > Do we need 2 pauses during checkpointing? Should the task >>>>>>>>>> pause itself >>>>>>>>>> > > before requesting a checkpoint notice given that the >>>>>>>>>> supervisor is >>>>>>>>>> > anyways >>>>>>>>>> > > going to pause the task while handling the notice? Or is it >>>>>>>>>> okay to >>>>>>>>>> > remove >>>>>>>>>> > > the pause in the TaskRunner before it sends a checkpoint >>>>>>>>>> notice request >>>>>>>>>> > to >>>>>>>>>> > > the supervisor? This would immediately solve the ingestion >>>>>>>>>> lag issue >>>>>>>>>> > > completely as there won't be two pauses with such a huge gap >>>>>>>>>> in between. >>>>>>>>>> > > >>>>>>>>>> > > Harini >>>>>>>>>> > > Software Engineer, Confluent >>>>>>>>>> > > >>>>>>>>>> > >>>>>>>>>> >>>>>>>>>
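To restate the double-pause question in code form, here is a hypothetical sketch (simplified stand-in names, not the actual SeekableStream task runner or supervisor classes):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch of the two pauses described in this thread.
    class DoublePauseSketch
    {
      private final BlockingQueue<Runnable> noticesQueue = new LinkedBlockingQueue<>();

      // Task side: pause first, then ask the supervisor for a checkpoint.
      void taskRequestsCheckpoint()
      {
        pauseIngestion();                                // pause #1
        noticesQueue.add(this::handleCheckpointNotice);  // sits behind every earlier notice
        // the task stays paused until the supervisor drains the queue down to this
        // notice, which is where the multi-minute gap appears when run notices
        // take 5-7 s each during task rolls
      }

      // Supervisor side: handling the notice pauses the task again
      // (an HTTP round trip in reality).
      void handleCheckpointNotice()
      {
        pauseIngestion();                                // pause #2
        // ... checkpoint bookkeeping ...
        resumeIngestion();
      }

      private void pauseIngestion() { /* stand-in for the task's pause endpoint */ }
      private void resumeIngestion() { /* stand-in for the task's resume endpoint */ }
    }

Removing pause #1, or deferring it until the supervisor actually handles the notice, is essentially the option raised in the questions above.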