Re: [DISCUSS] KIP-1335: Bounded concurrency for partition reassignment via kafka-reassign-partitions.sh

Manan Gupta Mon, 01 Jun 2026 23:08:08 -0700

Hey Luke

Thanks, that is a fair concern when the reassignment tool is embedded in
something that assumes kafka-reassign-partitions.sh returns quickly (for
example a short-lived script or a controller reconcile loop that blocks on
one subprocess).
A few clarifications on what is going on today:


Where the “long run” lives
The pacing loops run inside the tool process (or an in-process Admin if
someone calls the command entry point from Java). They do not change the
broker contract: each alterPartitionReassignments call is still bounded and
already returns per-partition futures for acceptance of the reassignment.
What is long-running is the optional wait between steps (non-incremental)
or the pipeline driver (incremental), which repeatedly uses the normal read
APIs (listPartitionReassignments, metadata/describe-style reads) until
replicas match the target. That “wait for replication” work cannot be
turned into a single future today; the cluster does not expose
“reassignment fully complete” as one shot per partition on the alter result
itself, so any implementation—tool or operator—must poll or re-check state
unless it exits and delegates that to something else (as with separate
--verify).
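
To make the difference between the two pacing loops concrete, here is a small simulation of the P1..P10, N = 3 example quoted further down this thread. It is only a sketch, not the tool's code: the function names and the per-partition durations are made up for illustration, and time is in abstract units.

```python
import heapq


def batch_schedule(durations, n):
    """Non-incremental: submit up to n, wait for the whole step, then move on.

    Returns (start_times, total_time).
    """
    starts, clock = [], 0
    for i in range(0, len(durations), n):
        step = durations[i:i + n]
        starts.extend([clock] * len(step))
        clock += max(step)  # the slowest partition gates the next step
    return starts, clock


def incremental_schedule(durations, n):
    """Incremental: keep up to n in flight and refill as soon as one finishes.

    Returns (start_times, total_time).
    """
    starts, clock = [], 0
    in_flight = []  # min-heap of finish times
    for d in durations:
        if len(in_flight) == n:
            clock = heapq.heappop(in_flight)  # wait for the earliest finisher
        starts.append(clock)
        heapq.heappush(in_flight, clock + d)
    total = max(in_flight) if in_flight else clock
    return starts, total


# Hypothetical durations: P3 is slow (10 units), everything else takes 1 unit.
durations = [1, 1, 10, 1, 1, 1, 1, 1, 1, 1]
batch_starts, batch_total = batch_schedule(durations, 3)
incr_starts, incr_total = incremental_schedule(durations, 3)
```

With those made-up durations, the batch schedule cannot start P4 until t=10 and finishes at t=13, while the incremental schedule starts P4 at t=1 and finishes at t=10. In both cases the wall-clock time is spent waiting in the client, not in any single RPC.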

Relationship to the classic execute vs verify split
The non-blocking pattern you describe is already the legacy model:
--execute submits and returns; --verify (or another process) observes
progress. This KIP adds optional blocking in the tool on purpose so
operators who want pacing do not have to hand-chunk JSON and orchestrate
waves themselves. If a deployment must not hold a process open, it can
still use --reassignment-batch-size 0 (legacy one-shot execute + verify),
or external automation that submits smaller JSON files and sleeps between
runs: same traffic shape, more moving parts for the operator.
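
For reference, the "external automation" option could look roughly like the sketch below: split one reassignment JSON plan into smaller plans that are each passed to a separate --execute run. The helper name chunk_plan is hypothetical, and treating a batch size of 0 as "one shot" simply mirrors the flag semantics described above.

```python
import json


def chunk_plan(plan_json, batch_size):
    """Split a reassignment plan into smaller plans of at most batch_size partitions.

    batch_size <= 0 returns the plan unchanged as a single chunk (one-shot).
    """
    plan = json.loads(plan_json)
    partitions = plan["partitions"]
    if batch_size <= 0:
        return [plan]
    version = plan.get("version", 1)
    return [
        {"version": version, "partitions": partitions[i:i + batch_size]}
        for i in range(0, len(partitions), batch_size)
    ]


# Example plan in the usual reassignment JSON layout (topic name is illustrative).
example_plan = json.dumps({
    "version": 1,
    "partitions": [
        {"topic": "test-1", "partition": p, "replicas": [1, 2]}
        for p in range(5)
    ],
})
chunks = chunk_plan(example_plan, 2)  # three plans: 2 + 2 + 1 partitions
```

Each chunk would then be written to its own file and submitted in turn, with the operator's script deciding when the previous wave is done.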

> Return futures so the admin client can check
For the submit step, the client already gets futures per partition from
alterPartitionReassignments. For completion, there is no single future to
return that replaces polling; you would either keep polling inside the
client library (same duration, different API shape) or push that
responsibility to the caller. Refactoring the shell tool into a stateful
“resume” CLI or a library API that streams progress events could be useful,
but it is a larger follow-up (new UX, persistence, idempotency) rather than
a small tweak to this KIP.
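
As a rough illustration of "keep polling, whoever owns the loop", the wait could be sketched like this. read_state is a hypothetical callable standing in for whatever combines listPartitionReassignments with a metadata/describe read, and the optional timeout is my own addition for the sketch; the KIP does not define one.

```python
import time


def is_complete(tp, target, active, replicas):
    """Done when the partition is no longer reassigning and replicas match the target."""
    return tp not in active and replicas.get(tp) == target


def wait_for_step(targets, read_state, poll_interval_s=1.0, timeout_s=None,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until every partition in targets is complete.

    targets: {partition: target replica list}
    read_state: hypothetical callable returning (active_set, {partition: replicas})
    Returns True on completion, False if the (assumed) timeout expires first.
    """
    deadline = None if timeout_s is None else clock() + timeout_s
    while True:
        active, replicas = read_state()
        pending = [tp for tp, target in targets.items()
                   if not is_complete(tp, target, active, replicas)]
        if not pending:
            return True
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Whether this loop lives in the shell tool, in a client library, or in the caller, the number of reads and the elapsed time are the same; only the API shape changes.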

Practical guidance for K8s / operators
For controllers that cannot block, the intended pattern is to not wrap the
blocking paced mode in the reconcile path: run reassignment as a Job, a
sidecar, or use Admin directly with your own bounded reconcile loop and
timeouts. Paced mode targets interactive or batch maintenance workflows
where holding one client open is acceptable.
Paced --execute only blocks inside the tool process; broker and Admin RPC
semantics are unchanged. --verify already polls for completion, so a
long-lived client for observation is not new—this KIP adds optional waits
between submits so operators are not forced to hand-chunk JSON.
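
For the "use Admin directly with your own bounded reconcile loop" option, one possible shape is a pass that never sleeps: it drops finished partitions, tops up to the cap, and returns. This is a sketch under my own assumptions; reconcile_once and is_done are hypothetical names, and the actual submit and status reads are left to the caller.

```python
from collections import deque


def reconcile_once(pending, in_flight, is_done, cap):
    """One short, non-blocking pass over the plan.

    pending: deque of partitions not yet submitted (mutated in place)
    in_flight: set of partitions submitted earlier
    is_done: hypothetical callable, True once a partition has finished
    Returns (new_in_flight, to_submit_now).
    """
    still_running = {tp for tp in in_flight if not is_done(tp)}
    to_submit = []
    while pending and len(still_running) + len(to_submit) < cap:
        to_submit.append(pending.popleft())
    return still_running | set(to_submit), to_submit


# First pass with nothing in flight and a cap of 2 (partition names illustrative).
queue = deque(["P1", "P2", "P3"])
in_flight, first_wave = reconcile_once(queue, set(), lambda tp: False, 2)
```

The controller would call this once per reconcile, submit whatever comes back in to_submit_now, persist the in-flight set, and requeue itself, so no single reconcile holds a client open for the whole plan.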

If you want a non-blocking paced mode (e.g. “submit only this step and
exit” with a marker file), that would be worth a separate discussion or KIP
so we do not overload this one.

Regards
Manan Gupta

On Tue, Jun 2, 2026 at 8:10 AM Luke Chen <[email protected]> wrote:

> Hi Manan,
>
> LC3: Thanks for updating the KIP to make it clear.
>
> LC4: Thanks for the explanation.
> But that makes me realize that the batch mode (incremental or
> non-incremental) is a long-running admin client process.
> If I remember correctly, in admin client, we try not to make each
> operation a long-running process, so we can see there are operations that
> return futures to the admin client, or like the "--execute" and "--verify"
> example in reassignment operations.
> Making it a long-running operation will block other operations if it's run
> within a script or K8S operator.
> Could we change that?
> For example, we return a list of futures for each partition, and the admin
> client can check the future status to know if the specific partition has
> submitted or not?
>
> Thanks,
> Luke
>
> On Mon, Jun 1, 2026 at 6:18 PM Manan Gupta <[email protected]> wrote:
>
> > Hey Luke
> >
> > LC1: Sure, I have updated the KIP now with the example.
> >
> > > LC3: How does the batch mode know that all N partitions are completed?
> >
> > Batch mode does poll. After each alterPartitionReassignments call for a
> > step, the tool does not infer completion from that RPC alone—the alter
> > returns when the controller has accepted the reassignment, not when
> > replication has fully caught up.
> > Between steps, the tool enters a wait loop: it uses the Admin client to
> > read the cluster’s current reassignment and replica state for the
> > partitions in that step, applies the same completion check the
> > reassignment tool already uses for verification (partition no longer in
> > an active reassignment and the live replica set matches the target in
> > the JSON),
> > sleeps for --reassignment-poll-interval-ms, and repeats until every
> > partition in that step satisfies that condition. Only then does it submit
> > the next step.
> > So “wait until complete” is implemented as repeated observation + sleep,
> > not a single blocking call that magically completes when replication
> > finishes. The KIP text has been updated to spell this out so it is not
> > mistaken for a passive wait with no polling.
> >
> >
> > > LC4: What will it show when some partitions are still waiting to be
> > progressed?
> >
> > We can separate two things: stdout from --execute, and --verify (separate
> > command).
> > Non-incremental batch (--reassignment-batch-size without --incremental)
> > The tool prints how many batches there will be, then for each step lines
> > such as "starting batch i of n" and "waiting for batch i to complete
> > before the next." That matches what we saw in testing, for example:
> >
> > ```
> > Submitting partition reassignments in 6 batches of up to 2 partitions each.
> > Starting reassignment batch 1 of 6 (2 partitions)...
> > Waiting for reassignment batch 1 of 6 to complete before starting the next batch.
> > ```
> >
> > The same pattern then repeats for batch 2, and so on.
> >
> > During the “Waiting …” phase there is no per-partition line item for
> > “still copying” or for partitions not yet submitted in later batches;
> > those partitions are simply not in flight until their batch starts. If
> > someone needs partition-level status during that time, they can run
> > --verify in another terminal or use cluster metrics; --verify still only
> > distinguishes completed vs in progress for partitions that are part of
> > the plan and visible in metadata / reassignment state, not “waiting in a
> > future batch” as a distinct label.
> >
> > Incremental (--incremental)
> > After the one-line mode banner, the tool emits a line each time a
> > partition finishes and the next is submitted, for example:
> >
> > ```
> > Incremental mode: keeping up to 2 partition reassignments in flight until all have been submitted.
> > Partition test-1-0 finished reassignment; submitting next from queue if any.
> > ```
> >
> > (and similarly for test-1-1, test-10-1, test-10-0, …)
> > So incremental mode already gives clearer liveness than batch-only waits:
> > you see completions as they happen, which helps distinguish “working”
> > from “stuck” better than the batch wait lines alone.
> >
> >
> > > LC5: indefinite polling
> > Today there is no maximum wait time on the batch-completion loops: the
> > tool keeps periodically re-reading cluster state until every partition in
> > the current step satisfies the completion condition, or the operator
> > stops the process. If reassignments are slow rather than stuck (which is
> > common when strict inter-broker or replica throttles are applied), the
> > wait can legitimately take a long time; that is expected and not by
> > itself a sign of a hang.
> > Because there is no built-in deadline yet, operators who need to stop
> > should interrupt the tool and use the supported cancel path (--cancel
> > with an appropriate JSON) if they want to back out active reassignments,
> > then reassess throttles, plan size, or pacing. Adding a dedicated
> > reassignment wait timeout would be a follow-up: it needs clear semantics
> > (what happens on expiry, how that interacts with partial plans and the
> > existing --timeout flag used for log directory moves), which is why this
> > KIP does not introduce that knob yet.
> >
> >
> > > LC6: Default poll interval
> >
> > Agreed that a 500 ms default is aggressive from a controller-load
> > perspective for clusters that already list reassignments often. The
> > implementation default has been raised to 1000 ms (1 second) for both the
> > inter-step wait path and the incremental loop, and the KIP documents that
> > default accordingly. Operators who want less Admin traffic can set
> > --reassignment-poll-interval-ms higher (for example 3–5 seconds); the
> > flag exists so that trade-off is explicit and tunable per environment.
> >
> > Regards,
> > Manan Gupta
> >
> > On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote:
> >
> > > Hi Manan,
> > >
> > > LC1: Thanks for the explanation. It's clear to me now.
> > > I think we should also put this example and the "How to choose" part in
> > > the KIP.
> > >
> > > Some more questions:
> > > LC3. How does the batch mode know that all N partitions are completed
> > > and then start the next batch?
> > > It looks like we don't poll the status when in batch mode. How do we
> > > know that?
> > >
> > > LC4. What will it show when some partitions are still waiting to be
> > > progressed?
> > > Currently, the --verify only shows "is completed" or "is still in
> > > progress".
> > > Should we have an output for the partitions that are sitting in the
> > > batch queue?
> > >
> > > LC5. As you've pointed out, there could be a possibility that it will
> > > poll indefinitely.
> > > Why can't we set a timer for it?
> > > Any concerns about it?
> > >
> > > LC6. "reassignment-poll-interval-ms" default to 500ms is too
> > > aggressive.
> > > I think from users' perspective, any interval < 3 seconds or 5 seconds
> > > is considered acceptable.
> > > So could we increase it to at least 1 second?
> > >
> > > Thank you,
> > > Luke
> > >
> > > On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]>
> > > wrote:
> > >
> > > > Hey Luke
> > > > Thank you for reviewing the proposal.
> > > >
> > > > LC1:
> > > > Please excuse me if my explanation of the two different modes was
> > > > unclear.
> > > >
> > > > In non-incremental mode the tool walks the plan in steps. Each step
> > > > submits up to N partition reassignments, then waits until every
> > > > partition in that step has finished before it opens the next step. The
> > > > slowest partition in the current step holds up the entire next step.
> > > >
> > > > In incremental mode N is not “how big each step is.” It is how many
> > > > partition reassignments from this plan may be active at the same time.
> > > > The tool keeps refilling up to N: whenever any single partition
> > > > completes, it can start the next one from the queue. There is no rule
> > > > that the whole group of N must finish together before new work starts.
> > > >
> > > > Example: 10 partitions in sorted order P1 through P10, N equals 3.
> > > >
> > > > Non-incremental: Step one submits P1 P2 P3 and waits until all three
> > > > are done. Step two submits P4 P5 P6 and waits until all three are done.
> > > > Step three submits P7 P8 P9 and waits until all three are done. Step
> > > > four submits P10 only. If P3 is slow, P4 cannot start until P3
> > > > finishes, even if P1 and P2 are already done.
> > > >
> > > > Incremental: The tool first submits P1 P2 P3 so three reassignments
> > > > are active. If P2 finishes first, it can submit P4 while P1 and P3 are
> > > > still running, still keeping three active when possible. It continues
> > > > that way until every partition in the plan has been submitted and the
> > > > in-flight work drains according to the tool semantics. If P3 is slow,
> > > > P4 can still start as soon as some other slot frees up.
> > > >
> > > > How to choose: use non-incremental if you want clear steps and a
> > > > strict “this whole batch finished before the next batch starts” story.
> > > > Use incremental if you want steadier utilization when finish times
> > > > differ and you do not want one slow partition to block starting
> > > > unrelated partitions beyond the cap of N at once.
> > > >
> > > > LC2:
> > > > Both these values are the same; I have updated the KIP to reflect that
> > > > now.
> > > >
> > > > Regards
> > > > Manan Gupta
> > > >
> > > >
> > > > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]> wrote:
> > > >
> > > > > Hi Manan,
> > > > >
> > > > > Thanks for the KIP.
> > > > > This is a good improvement.
> > > > >
> > > > > Questions:
> > > > > 1. After reading the KIP, I still don't understand the difference
> > > > > between "incremental mode" and "non-incremental mode".
> > > > > From what I can see, they both run reassignment-batch-size
> > > > > partitions at a time.
> > > > > What's the difference between them?
> > > > > Could you explain more?
> > > > > Maybe some examples would be helpful to help users know the
> > > > > difference and how they choose them.
> > > > >
> > > > >
> > > > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS" and
> > > > > "reassignment-poll-interval-ms".
> > > > > What's the difference between them?
> > > > >
> > > > >
> > > > > Thank you,
> > > > > Luke
> > > > >
> > > > >
> > > > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hey TaiJuWu
> > > > > >
> > > > > > Thank you for reviewing the KIP; my response is inline.
> > > > > >
> > > > > > > TJ00: If we have multiple batch requests, how do you handle
> > > > > > > single batch failure?
> > > > > > - If a submit step fails, the tool returns immediately with errors
> > > > > > and does not enqueue the rest; partitions already submitted stay
> > > > > > under the controller’s reassignment as they do today.
> > > > > > - The process exits with a TerseException listing the failed
> > > > > > partitions and the error message from the broker/controller (the
> > > > > > same pattern as a single-shot execute when some alters fail).
> > > > > >
> > > > > > > TJ01: If there is a long time operation, how can the users know
> > > > > > > it still running instead of hang?
> > > > > > - Controller / cluster side: ongoing reassignments and replication
> > > > > > (metrics, kafka-reassign-partitions --list, Admin / JMX).
> > > > > > - --verify in another terminal shows progress toward the target.
> > > > > > Batch wait is mostly quiet; incremental is a bit chattier; true
> > > > > > progress is best observed from cluster state or --verify, not only
> > > > > > from stdout during the wait loop.
> > > > > >
> > > > > > Thanks,
> > > > > > Manan Gupta
> > > > > >
> > > > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Manan,
> > > > > > >
> > > > > > > Thanks for the KIP. Just a few questions.
> > > > > > >
> > > > > > > TJ00: If we have multiple batch requests, how do you handle a
> > > > > > > single batch failure?
> > > > > > >
> > > > > > > TJ01: If there is a long-running operation, how can users know it
> > > > > > > is still running rather than hung?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > TaiJuWu
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, May 18, 2026 at 6:09 PM Manan Gupta <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hey Kamal
> > > > > > > >
> > > > > > > > Thank you for your comments.
> > > > > > > >
> > > > > > > > > Should we have a configurable list poll interval?
> > > > > > > > The current fixed interval of 500 ms should not degrade the
> > > > > > > > controller, but I agree that operators should have an option to
> > > > > > > > change this value. I have updated the KIP to also take another
> > > > > > > > parameter, reassignment-poll-interval-ms, to override the
> > > > > > > > default value of 500 ms.
> > > > > > > >
> > > > > > > > > Shall we extend the batching logic to also
> > > > > > > > > kafka-leader-election script?
> > > > > > > > Good point, I will pick this up as a separate KIP as a follow-up
> > > > > > > > to this KIP.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Manan
> > > > > > > >
> > > > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash <
> > > > > > > > [email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi Manan,
> > > > > > > > >
> > > > > > > > > Thanks for improving the user-facing tools! Overall LGTM. Few
> > > > > > > > > questions:
> > > > > > > > >
> > > > > > > > > 1. Should we have a configurable list poll interval? With
> > > > > > > > > 500ms, does it poll the controller often to list the currently
> > > > > > > > > running reassignments for large partitions?
> > > > > > > > > 2. Shall we extend the batching logic to also
> > > > > > > > > kafka-leader-election script?
> > > > > > > > > It will be useful when running with --all-topic-partitions.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Kamal
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hello
> > > > > > > > > >
> > > > > > > > > > Gentle reminder to review the KIP.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Manan
> > > > > > > > > >
> > > > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > This email starts the discussion thread for *KIP-1335:
> > > > > > > > > > > Bounded concurrency for partition reassignment via
> > > > > > > > > > > kafka-reassign-partitions.sh*.
> > > > > > > > > > > The proposal adds optional reassignment-batch-size and
> > > > > > > > > > > incremental parameters to kafka-reassign-partitions.sh so
> > > > > > > > > > > operators can cap how many partition reassignments are
> > > > > > > > > > > submitted or kept in flight at once using the existing
> > > > > > > > > > > Admin API.
> > > > > > > > > > >
> > > > > > > > > > > I will appreciate your initial thoughts and feedback on the
> > > > > > > > > > > proposal.
> > > > > > > > > > >
> > > > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Manan
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
