Re: [DISCUSS] KIP-1335: Bounded concurrency for partition reassignment via kafka-reassign-partitions.sh

Luke Chen Mon, 01 Jun 2026 19:39:51 -0700

Hi Manan,

LC3: Thanks for updating the KIP to make it clear.


LC4: Thanks for the explanation.
But that makes me realize that the batch mode (incremental or
non-incremental) is a long-running admin client process.
If I remember correctly, we try not to make admin client operations
long-running. That is why many operations return futures to the caller,
and why reassignment is split into separate "--execute" and "--verify"
steps.
A long-running operation will block other operations if it is run within
a script or a K8s operator.
Could we change that?
For example, could we return a future for each partition, so the caller
can check the future's status to know whether that specific partition has
been submitted or not?
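
A rough sketch of the per-partition future idea, written as a plain Python
model rather than Kafka code. The class and method names (BoundedSubmitter,
start, on_reassignment_finished) are hypothetical and are not part of the
KIP or the Admin API. The model also assumes something outside the call
reports completions; who would do that is exactly the open question here.

```python
from collections import deque
from concurrent.futures import Future


class BoundedSubmitter:
    """Hypothetical model: one future per partition, returned immediately."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = set()
        self.queue = deque()
        self.futures = {}

    def start(self, partitions):
        # Returns right away; each future completes once its partition
        # has actually been submitted.
        for p in partitions:
            self.futures[p] = Future()
            self.queue.append(p)
        self._fill()
        return self.futures

    def _fill(self):
        # Submit queued partitions while there is room under the cap.
        while self.queue and len(self.in_flight) < self.max_in_flight:
            p = self.queue.popleft()
            self.in_flight.add(p)  # stand-in for the real alter request
            self.futures[p].set_result("submitted")

    def on_reassignment_finished(self, partition):
        # Assumed hook: called when a partition is observed as complete.
        self.in_flight.discard(partition)
        self._fill()


submitter = BoundedSubmitter(max_in_flight=2)
futures = submitter.start(["t-0", "t-1", "t-2"])
submitted = [p for p, f in futures.items() if f.done()]
```

With a cap of 2, the caller gets three futures back at once and can see
that only the first two partitions have been submitted so far.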

Thanks,
Luke

On Mon, Jun 1, 2026 at 6:18 PM Manan Gupta <[email protected]> wrote:

> Hey Luke
>
> LC1: Sure, I have updated the KIP now with the example.
>
> > LC3: How does the batch mode know that all N partitions are completed?
>
> Batch mode does poll. After each alterPartitionReassignments call for a
> step, the tool does not infer completion from that RPC alone—the alter
> returns when the controller has accepted the reassignment, not when
> replication has fully caught up.
> Between steps, the tool enters a wait loop: it uses the Admin client to
> read the cluster’s current reassignment and replica state for the
> partitions in that step, applies the same completion idea the reassignment
> tool already uses for verification (partition no longer in an active
> reassignment and the live replica set matches the target in the JSON),
> sleeps for --reassignment-poll-interval-ms, and repeats until every
> partition in that step satisfies that condition. Only then does it submit
> the next step.
> So “wait until complete” is implemented as repeated observation + sleep,
> not a single blocking call that magically completes when replication
> finishes. The KIP text has been updated to spell this out so it is not
> mistaken for a passive wait with no polling.
>
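
The wait loop described above can be sketched as a small Python model. This
is not the tool's code: read_state is a hypothetical callback standing in
for the Admin client reads, and the state dictionary layout is an
assumption made only for illustration.

```python
import time


def is_complete(state, target_replicas):
    # Complete when no reassignment is active for the partition and the
    # live replica set matches the target from the plan.
    return not state["reassigning"] and state["replicas"] == target_replicas


def wait_for_step(step, read_state, poll_interval_ms, sleep=time.sleep):
    """Poll until every partition in `step` is complete.

    `step` maps partition name -> target replica list.
    Returns the number of polls that were needed.
    """
    polls = 0
    while True:
        polls += 1
        pending = [p for p, target in step.items()
                   if not is_complete(read_state(p), target)]
        if not pending:
            return polls
        # Observe, then sleep, then observe again; there is no deadline.
        sleep(poll_interval_ms / 1000.0)
```

Passing sleep as a parameter is only there so the loop can be exercised
without real waiting.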
>
> > LC4: What will it show when some partitions are still waiting to be
> progressed?
>
> We can separate two things: stdout from --execute, and --verify (separate
> command).
> Non-incremental batch (--reassignment-batch-size without --incremental)
> The tool prints how many batches there will be, then for each step lines
> such as "starting batch i of n" and "waiting for batch i to complete before
> the next." That matches what we saw in testing, for example:
>
> ```
> Submitting partition reassignments in 6 batches of up to 2 partitions each.
> Starting reassignment batch 1 of 6 (2 partitions)...
> Waiting for reassignment batch 1 of 6 to complete before starting the next batch.
> ```
> The same pattern then repeats for batch 2, and so on.
>
> During the “Waiting …” phase there is no per-partition line item for “still
> copying” or for partitions not yet submitted in later batches; those
> partitions are simply not in flight until their batch starts. If someone
> needs partition-level status during that time, they can run --verify in
> another terminal or use cluster metrics; --verify still only distinguishes
> completed vs in progress for partitions that are part of the plan and
> reflectable in metadata / reassignment state, not “waiting in a future
> batch” as a distinct label.
>
> Incremental (--incremental)
> After the one-line mode banner, the tool emits a line each time a partition
> finishes and the next is submitted, for example:
>
> ```
> Incremental mode: keeping up to 2 partition reassignments in flight until all have been submitted.
> Partition test-1-0 finished reassignment; submitting next from queue if any.
> ```
> (and similarly for test-1-1, test-10-1, test-10-0, …)
> So incremental mode already gives clearer liveness than batch-only waits:
> you see completions as they happen, which helps distinguish “working” from
> “stuck” better than the batch wait lines alone.
>
>
> > LC5: indefinite polling
> Today there is no maximum wait time on the batch-completion loops: the tool
> keeps periodically re-reading cluster state until every partition in the
> current step satisfies the completion condition, or the operator stops the
> process. If reassignments are slow rather than stuck—which is common when
> strict inter-broker or replica throttles are applied—the wait can
> legitimately take a long time; that is expected and not by itself a sign of
> a hang.
> Because there is no built-in deadline yet, operators who need to stop
> should interrupt the tool and use the supported cancel path (--cancel with
> an appropriate JSON) if they want to back out active reassignments, then
> reassess throttles, plan size, or pacing. Adding a dedicated reassignment
> wait timeout would be a follow-up: it needs clear semantics (what happens
> on expiry, how that interacts with partial plans and the existing --timeout
> flag used for log directory moves), which is why this KIP does not
> introduce that knob yet.
>
>
> > LC6: Default poll interval
>
> Agreed that a 500 ms default is aggressive from a controller-load
> perspective for clusters that already list reassignments often. The
> implementation default has been raised to 1000 ms (1 second) for both the
> inter-step wait path and the incremental loop, and the KIP documents that
> default accordingly. Operators who want less Admin traffic can set
> --reassignment-poll-interval-ms higher (for example 3–5 seconds); the flag
> exists so that trade-off is explicit and tunable per environment.
>
> Regards,
> Manan Gupta
>
> On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote:
>
> > Hi Manan,
> >
> > LC1: Thanks for the explanation. It's clear to me now.
> > I think we should also put this example and the "How to choose" part in
> > the KIP.
> >
> > Some more questions:
> > LC3. How does the batch mode know that all N partitions are completed and
> > then start the next batch?
> > It looks like we don't poll the status when in batch mode. How do we know
> > that?
> >
> > LC4. What will it show when some partitions are still waiting to be
> > progressed?
> > Currently, the --verify only shows "is completed" or "is still in
> > progress".
> > Should we have an output for the partitions that are sitting in the batch
> > queue?
> >
> > LC5. As you've pointed out, there could be a possibility that it will
> > poll indefinitely.
> > Why can't we set a timer for it?
> > Any concerns about it?
> >
> > LC6. "reassignment-poll-interval-ms" default to 500ms is too aggressive.
> > I think from users' perspective, any interval < 3 seconds or 5 seconds is
> > considered acceptable.
> > So could we increase it to at least 1 second?
> >
> > Thank you,
> > Luke
> >
> > On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]> wrote:
> >
> > > Hey Luke
> > > Thank you for reviewing the proposal.
> > >
> > > LC1:
> > > Please excuse me if my explanation of the two different modes was
> > > unclear.
> > >
> > > In non-incremental mode the tool walks the plan in steps. Each step
> > > submits up to N partition reassignments, then waits until every
> > > partition in that step has finished before it opens the next step. The
> > > slowest partition in the current step holds up the entire next step.
> > >
> > > In incremental mode N is not “how big each step is.” It is how many
> > > partition reassignments from this plan may be active at the same time.
> > > The tool keeps refilling up to N: whenever any single partition
> > > completes, it can start the next one from the queue. There is no rule
> > > that the whole group of N must finish together before new work starts.
> > >
> > > Example: 10 partitions in sorted order P1 through P10, N equals 3.
> > >
> > > Non-incremental: Step one submits P1 P2 P3 and waits until all three
> > > are done. Step two submits P4 P5 P6 and waits until all three are done.
> > > Step three submits P7 P8 P9 and waits until all three are done. Step
> > > four submits P10 only. If P3 is slow, P4 cannot start until P3
> > > finishes, even if P1 and P2 are already done.
> > >
> > > Incremental: The tool first submits P1 P2 P3 so three reassignments are
> > > active. If P2 finishes first, it can submit P4 while P1 and P3 are
> > > still running, still keeping three active when possible. It continues
> > > that way until every partition in the plan has been submitted and the
> > > in-flight work drains according to the tool semantics. If P3 is slow,
> > > P4 can still start as soon as some other slot frees up.
> > >
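
The P1 through P10 example above can be checked with a small Python model
that computes when each partition would start under each mode. The
durations are made-up numbers (P3 takes 5 time units, everything else 1),
and the function names are hypothetical; nothing here is taken from the
tool itself.

```python
import heapq


def batch_start_times(durations, n):
    """Non-incremental: a step starts only after the previous step finishes."""
    starts, clock = [], 0
    for i in range(0, len(durations), n):
        step = durations[i:i + n]
        starts.extend([clock] * len(step))
        clock += max(step)  # the slowest partition gates the next step
    return starts


def incremental_start_times(durations, n):
    """Incremental: a partition starts as soon as one of the n slots frees up."""
    starts, finish_heap, clock = [], [], 0
    for d in durations:
        if len(finish_heap) == n:
            clock = heapq.heappop(finish_heap)  # earliest finish frees a slot
        starts.append(clock)
        heapq.heappush(finish_heap, clock + d)
    return starts


# Assumed durations for P1..P10; only P3 is slow.
durations = [1, 1, 5, 1, 1, 1, 1, 1, 1, 1]
batch = batch_start_times(durations, 3)
# batch == [0, 0, 0, 5, 5, 5, 6, 6, 6, 7]
incremental = incremental_start_times(durations, 3)
# incremental == [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]
```

In this model P4 starts at time 5 in non-incremental mode because it waits
for P3, but at time 1 in incremental mode because P1's slot has freed up.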
> > > How to choose: use non-incremental if you want clear steps and a strict
> > > “this whole batch finished before the next batch starts” story. Use
> > > incremental if you want steadier utilization when finish times differ
> > > and you do not want one slow partition to block starting unrelated
> > > partitions beyond the cap of N at once.
> > >
> > > LC2:
> > > Both these values are the same; I have updated the KIP to reflect that
> > > now.
> > >
> > > Regards
> > > Manan Gupta
> > >
> > >
> > > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]> wrote:
> > >
> > > > Hi Manan,
> > > >
> > > > Thanks for the KIP.
> > > > This is a good improvement.
> > > >
> > > > Questions:
> > > > 1. After reading the KIP, I still don't understand the difference
> > > > between "incremental mode" and "non-incremental mode".
> > > > From what I can see, they both process reassignment-batch-size
> > > > partitions at a time.
> > > > What's the difference between them?
> > > > Could you explain more?
> > > > Maybe some examples would help users understand the difference and
> > > > how to choose between them.
> > > >
> > > >
> > > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS" and
> > > > "reassignment-poll-interval-ms".
> > > > What's the difference between them?
> > > >
> > > >
> > > > Thank you,
> > > > Luke
> > > >
> > > >
> > > > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta <[email protected]>
> > > > > wrote:
> > > >
> > > > > Hey TaiJuWu
> > > > >
> > > > > Thank you for reviewing the KIP, my response is inline.
> > > > >
> > > > > > TJ00: If we have multiple batch requests, how do you handle
> > > > > > single batch failure?
> > > > > - If a submit step fails, the tool returns immediately with errors
> > > > > and does not enqueue the rest; partitions already submitted stay
> > > > > under the controller’s reassignment as they do today.
> > > > > - The process exits with a TerseException listing the failed
> > > > > partitions and the error message from the broker/controller (the
> > > > > same pattern as a single-shot execute when some alters fail).
> > > > >
> > > > > > TJ01: If there is a long-running operation, how can users know it
> > > > > > is still running rather than hung?
> > > > > - Controller / cluster side: ongoing reassignments and replication
> > > > > (metrics, kafka-reassign-partitions --list, Admin / JMX).
> > > > > - --verify in another terminal shows progress toward the target.
> > > > > Batch wait is mostly quiet; incremental is a bit chattier; true
> > > > > progress is best observed from cluster state or --verify, not only
> > > > > from stdout during the wait loop.
> > > > >
> > > > > Thanks,
> > > > > Manan Gupta
> > > > >
> > > > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > Hi Manan,
> > > > > >
> > > > > > Thanks for the KIP. Just a few questions.
> > > > > >
> > > > > > > TJ00: If we have multiple batch requests, how do you handle
> > > > > > > single batch failure?
> > > > > > >
> > > > > > > TJ01: If there is a long-running operation, how can users know
> > > > > > > it is still running rather than hung?
> > > > > >
> > > > > > Thanks,
> > > > > > TaiJuWu
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, May 18, 2026 at 6:09 PM Manan Gupta <[email protected]> wrote:
> > > > > >
> > > > > > > Hey Kamal
> > > > > > >
> > > > > > > Thank you for your comments.
> > > > > > >
> > > > > > > > > Should we have a configurable list poll interval?
> > > > > > > > The current fixed interval of 500 ms should not degrade the
> > > > > > > > controller, but I agree that operators should have an option
> > > > > > > > to change this value. I have updated the KIP to take another
> > > > > > > > parameter, reassignment-poll-interval-ms, which overrides the
> > > > > > > > default of 500 ms.
> > > > > > >
> > > > > > > > > Shall we extend the batching logic to also
> > > > > > > > > kafka-leader-election script?
> > > > > > > > Good point, I will pick this up as a separate KIP as a followup
> > > > > > > > to this KIP.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Manan
> > > > > > >
> > > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Manan,
> > > > > > > >
> > > > > > > > > Thanks for improving the user-facing tools! Overall LGTM. Few
> > > > > > > > > questions:
> > > > > > > > >
> > > > > > > > > 1. Should we have a configurable list poll interval? With
> > > > > > > > > 500ms, does it poll the controller often to list the
> > > > > > > > > currently running reassignments for large partitions?
> > > > > > > > > 2. Shall we extend the batching logic to also
> > > > > > > > > kafka-leader-election script?
> > > > > > > > > It will be useful when running with --all-topic-partitions.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Kamal
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta
> > > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello
> > > > > > > > >
> > > > > > > > > Gentle reminder to review the KIP.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Manan
> > > > > > > > >
> > > > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta
> > > > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > > This email starts the discussion thread for *KIP-1335:
> > > > > > > > > > > Bounded concurrency for partition reassignment via
> > > > > > > > > > > kafka-reassign-partitions.sh*.
> > > > > > > > > > > The proposal adds optional reassignment-batch-size and
> > > > > > > > > > > incremental parameters to kafka-reassign-partitions.sh
> > > > > > > > > > > so operators can cap how many partition reassignments
> > > > > > > > > > > are submitted or kept in flight at once using the
> > > > > > > > > > > existing Admin API.
> > > > > > > > > > >
> > > > > > > > > > > I would appreciate your initial thoughts and feedback on
> > > > > > > > > > > the proposal.
> > > > > > > > > >
> > > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Manan
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
