Re: [DISCUSS] KIP-1335: Bounded concurrency for partition reassignment via kafka-reassign-partitions.sh

Manan Gupta Mon, 01 Jun 2026 02:18:11 -0700

Hey Luke

LC1: Sure, I have updated the KIP now with the example.


> LC3: How does the batch mode know that all N partitions are completed?

Batch mode does poll. After each alterPartitionReassignments call for a
step, the tool does not infer completion from that RPC alone—the alter
returns when the controller has accepted the reassignment, not when
replication has fully caught up.
Between steps, the tool enters a wait loop: it uses the Admin client to
read the cluster’s current reassignment and replica state for the
partitions in that step, applies the same completion idea the reassignment
tool already uses for verification (partition no longer in an active
reassignment and the live replica set matches the target in the JSON),
sleeps for --reassignment-poll-interval-ms, and repeats until every
partition in that step satisfies that condition. Only then does it submit
the next step.
So “wait until complete” is implemented as repeated observation + sleep,
not a single blocking call that magically completes when replication
finishes. The KIP text has been updated to spell this out so it is not
mistaken for a passive wait with no polling.
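To make the loop concrete, here is a rough sketch of its shape in Python. This is not the tool's code. The names wait_for_step, step_is_complete and read_state are made up for illustration, and read_state stands in for whatever Admin client reads the tool performs. Comparing replica lists element by element is also an assumption of the sketch.

```python
import time
from typing import Callable, Dict, List, Set, Tuple

Partition = Tuple[str, int]  # (topic, partition index)
ClusterState = Tuple[Set[Partition], Dict[Partition, List[int]]]


def step_is_complete(
    targets: Dict[Partition, List[int]],
    in_flight: Set[Partition],
    current_replicas: Dict[Partition, List[int]],
) -> bool:
    """True when no partition in the step is still being reassigned and
    every live replica list equals the target from the plan."""
    return all(
        p not in in_flight and current_replicas.get(p) == replicas
        for p, replicas in targets.items()
    )


def wait_for_step(
    targets: Dict[Partition, List[int]],
    read_state: Callable[[], ClusterState],  # hypothetical stand-in for Admin reads
    poll_interval_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Observe, check, sleep, repeat. Returns the number of observations."""
    polls = 0
    while True:
        in_flight, current = read_state()
        polls += 1
        if step_is_complete(targets, in_flight, current):
            return polls
        sleep(poll_interval_ms / 1000.0)


# Scripted cluster states standing in for two successive reads.
states = iter([
    ({("test-1", 0)}, {("test-1", 0): [1, 2, 3, 4]}),  # still reassigning
    (set(), {("test-1", 0): [2, 3, 4]}),               # finished, matches target
])
polls = wait_for_step(
    {("test-1", 0): [2, 3, 4]},
    lambda: next(states),
    poll_interval_ms=1000,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(polls)  # prints 2
```

The point of the sketch is only that completion is decided by the observed state on each pass, never by the return of the alter call.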


> LC4: What will it show when some partitions are still waiting to be
progressed?

We can separate two things: stdout from --execute, and --verify (separate
command).
Non-incremental batch (--reassignment-batch-size without --incremental):
The tool prints how many batches there will be, then for each step lines
such as "starting batch i of n" and "waiting for batch i to complete before
the next." That matches what we saw in testing, for example:

```
Submitting partition reassignments in 6 batches of up to 2 partitions each.
Starting reassignment batch 1 of 6 (2 partitions)...
Waiting for reassignment batch 1 of 6 to complete before starting the next batch.
```
The same pattern then repeats for batch 2, and so on.

During the “Waiting …” phase there is no per-partition line item for “still
copying” or for partitions not yet submitted in later batches; those
partitions are simply not in flight until their batch starts. If someone
needs partition-level status during that time, they can run --verify in
another terminal or use cluster metrics; --verify still only distinguishes
completed vs in progress for partitions that are part of the plan and
reflectable in metadata / reassignment state, not “waiting in a future
batch” as a distinct label.
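
For reference, the way a plan falls into batches can be sketched as below, reusing the P1 through P10, N = 3 example from my LC1 reply. The helper split_into_batches is a hypothetical name for illustration, not a function in the tool.

```python
from typing import List, Sequence


def split_into_batches(partitions: Sequence[str], batch_size: int) -> List[List[str]]:
    """Cut the plan, in order, into consecutive steps of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(partitions[i:i + batch_size])
        for i in range(0, len(partitions), batch_size)
    ]


plan = [f"P{i}" for i in range(1, 11)]  # P1..P10
batches = split_into_batches(plan, 3)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number} of {len(batches)}: {batch}")
```

Partitions in batches 2 to 4 are the ones that are "not in flight" while batch 1 is being waited on, which is why neither stdout nor --verify has anything to report for them yet.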

Incremental (--incremental):
After the one-line mode banner, the tool emits a line each time a partition
finishes and the next is submitted, for example:

```
Incremental mode: keeping up to 2 partition reassignments in flight until all have been submitted.
Partition test-1-0 finished reassignment; submitting next from queue if any.
```
(and similarly for test-1-1, test-10-1, test-10-0, …)
So incremental mode already gives clearer liveness than batch-only waits:
you see completions as they happen, which helps distinguish “working” from
“stuck” better than the batch wait lines alone.
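
A small simulation of that refill behaviour, as a sketch only: run_incremental is a hypothetical function, the tick durations are invented, and the log strings are modelled on the lines quoted above rather than taken from the implementation.

```python
from collections import deque
from typing import Dict, List


def run_incremental(durations: Dict[str, int], cap: int) -> List[str]:
    """Simulate keeping up to `cap` reassignments in flight.

    `durations` maps a partition name to the number of poll ticks it needs
    (invented numbers). Returns the lines a tool of this shape might print.
    """
    queue = deque(durations)        # partitions in plan order
    in_flight: Dict[str, int] = {}  # partition -> remaining ticks
    log = [
        f"Incremental mode: keeping up to {cap} partition reassignments "
        "in flight until all have been submitted."
    ]
    while queue or in_flight:
        # Refill any free slots from the head of the queue.
        while queue and len(in_flight) < cap:
            partition = queue.popleft()
            in_flight[partition] = durations[partition]
        # One poll tick: every in-flight partition makes progress.
        for partition in list(in_flight):
            in_flight[partition] -= 1
            if in_flight[partition] <= 0:
                del in_flight[partition]
                log.append(
                    f"Partition {partition} finished reassignment; "
                    "submitting next from queue if any."
                )
    return log


log = run_incremental(
    {"test-1-0": 1, "test-1-1": 3, "test-10-1": 1, "test-10-0": 1}, cap=2
)
finished = [line.split()[1] for line in log[1:]]
print(finished)
```

With these made-up durations the slow partition test-1-1 occupies one slot for three ticks while the other slot turns over, so a completion line appears on every tick instead of one silent wait.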


> LC5: indefinite polling
Today there is no maximum wait time on the batch-completion loops: the tool
keeps periodically re-reading cluster state until every partition in the
current step satisfies the completion condition, or the operator stops the
process. If reassignments are slow rather than stuck—which is common when
strict inter-broker or replica throttles are applied—the wait can
legitimately take a long time; that is expected and not by itself a sign of
a hang.
Because there is no built-in deadline yet, operators who need to stop
should interrupt the tool and use the supported cancel path (--cancel with
an appropriate JSON) if they want to back out active reassignments, then
reassess throttles, plan size, or pacing. Adding a dedicated reassignment
wait timeout would be a follow-up: it needs clear semantics (what happens
on expiry, how that interacts with partial plans and the existing --timeout
flag used for log directory moves), which is why this KIP does not
introduce that knob yet.


> LC6: Default poll interval

Agreed that a 500 ms default is aggressive from a controller-load
perspective for clusters that already list reassignments often. The
implementation default has been raised to 1000 ms (1 second) for both the
inter-step wait path and the incremental loop, and the KIP documents that
default accordingly. Operators who want less Admin traffic can set
--reassignment-poll-interval-ms higher (for example 3–5 seconds); the flag
exists so that trade-off is explicit and tunable per environment.
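
As a back-of-the-envelope view of that trade-off, the sketch below computes an upper bound on observation cycles per minute for one running tool instance. It assumes each cycle costs no time beyond the sleep, and it says nothing about how many Admin requests one cycle issues; polls_per_minute is a hypothetical helper.

```python
def polls_per_minute(poll_interval_ms: int) -> float:
    """Upper bound on observation cycles per minute for one tool instance."""
    if poll_interval_ms <= 0:
        raise ValueError("poll_interval_ms must be positive")
    return 60_000 / poll_interval_ms


for interval_ms in (500, 1000, 3000, 5000):
    print(f"{interval_ms} ms -> {polls_per_minute(interval_ms):.0f} cycles/min")
```

Going from 500 ms to 1000 ms halves the ceiling from 120 to 60 cycles per minute, and 5000 ms brings it down to 12.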

Regards,
Manan Gupta

On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote:

> Hi Manan,
>
> LC1: Thanks for the explanation. It's clear to me now.
> I think we should also put this example and the "How to choose" part in the
> KIP.
>
> Some more questions:
> LC3. How does the batch mode know that all N partitions are completed and
> then start the next batch?
> It looks like we don't poll the status when in batch mode. How do we know
> that?
>
> LC4. What will it show when some partitions are still waiting to be
> progressed?
> Currently, the --verify only shows "is completed" or "is still in
> progress".
> Should we have an output for the partitions that are sitting in the batch
> queue?
>
> LC5. As you've pointed out, there could be a possibility that it will poll
> indefinitely.
> Why can't we set a timer for it?
> Any concerns about it?
>
> LC6. "reassignment-poll-interval-ms" default to 500ms is too aggressive.
> I think from users' perspective, any interval < 3 seconds or 5 seconds is
> considered acceptable.
> So could we increase it to at least 1 second?
>
> Thank you,
> Luke
>
> On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]> wrote:
>
> > Hey Luke
> > Thank you for reviewing the proposal.
> >
> > LC1:
> > Please excuse me if my explanation of the two different modes was
> unclear.
> >
> > In non-incremental mode the tool walks the plan in steps. Each step
> submits
> > up to N partition reassignments, then waits until every partition in that
> > step has finished before it opens the next step. The slowest partition in
> > the current step holds up the entire next step.
> >
> > In incremental mode N is not “how big each step is.” It is how many
> > partition reassignments from this plan may be active at the same time.
> The
> > tool keeps refilling up to N: whenever any single partition completes, it
> > can start the next one from the queue. There is no rule that the whole
> > group of N must finish together before new work starts.
> >
> > Example: 10 partitions in sorted order P1 through P10, N equals 3.
> >
> > Non-incremental: Step one submits P1 P2 P3 and waits until all three are
> > done. Step two submits P4 P5 P6 and waits until all three are done. Step
> > three submits P7 P8 P9 and waits until all three are done. Step four
> > submits P10 only. If P3 is slow, P4 cannot start until P3 finishes, even
> if
> > P1 and P2 are already done.
> >
> > Incremental: The tool first submits P1 P2 P3 so three reassignments are
> > active. If P2 finishes first, it can submit P4 while P1 and P3 are still
> > running, still keeping three active when possible. It continues that way
> > until every partition in the plan has been submitted and the in-flight
> work
> > drains according to the tool semantics. If P3 is slow, P4 can still start
> > as soon as some other slot frees up.
> >
> > How to choose: use non-incremental if you want clear steps and a strict
> > “this whole batch finished before the next batch starts” story. Use
> > incremental if you want steadier utilization when finish times differ and
> > you do not want one slow partition to block starting unrelated partitions
> > beyond the cap of N at once.
> >
> > LC2:
> > Both these values are the same, I have updated the KIP to reflect that
> now.
> >
> > Regards
> > Manan Gupta
> >
> >
> > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]> wrote:
> >
> > > Hi Manan,
> > >
> > > Thanks for the KIP.
> > > This is a good improvement.
> > >
> > > Questions:
> > > 1. After reading the KIP, I still don't understand the difference
> between
> > > "incremental mode" and "non-incremental mode".
> > > From what I can see, they both run reassignment-batch-size partitions
> > > at a time.
> > > What's the difference between them?
> > > Could you explain more?
> > > Maybe some examples would be helpful to help users know the difference
> > and
> > > how they choose them.
> > >
> > >
> > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS" and
> > > "reassignment-poll-interval-ms".
> > > What's the difference between them?
> > >
> > >
> > > Thank you,
> > > Luke
> > >
> > >
> > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta <[email protected]>
> > wrote:
> > >
> > > > Hey TaiJuWu
> > > >
> > > > Thank you for reviewing the KIP; my responses are inline.
> > > >
> > > > > TJ00: If we have multiple batch requests, how do you handle single
> > > batch
> > > > failure?
> > > > - If a submit step fails, the tool returns immediately with errors
> and
> > > does
> > > > not enqueue the rest; partitions already submitted stay under the
> > > > controller’s reassignment as they do today.
> > > > - The process exits with a TerseException listing the failed
> partitions
> > > and
> > > > the error message from the broker/controller (the same pattern as a
> > > > single-shot execute when some alters fail).
> > > >
> > > > > TJ01: If there is a long-running operation, how can users know it
> > > > > is still running rather than hung?
> > > > - Controller / cluster side: ongoing reassignments and replication
> > > > (metrics, kafka-reassign-partitions --list, Admin / JMX).
> > > > - --verify in another terminal shows progress toward the target.
> > > > Batch wait is mostly quiet; incremental is a bit chattier; true
> > progress
> > > is
> > > > best observed from cluster state or --verify, not only from stdout
> > during
> > > > the wait loop.
> > > >
> > > > Thanks,
> > > > Manan Gupta
> > > >
> > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]> wrote:
> > > >
> > > > > Hi Manan,
> > > > >
> > > > > > Thanks for the KIP. Just a few questions.
> > > > >
> > > > > TJ00: If we have multiple batch requests, how do you handle single
> > > batch
> > > > > failure?
> > > > >
> > > > > > TJ01: If there is a long-running operation, how can users know it
> > > > > > is still running rather than hung?
> > > > >
> > > > > Thanks,
> > > > > TaiJuWu
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, May 18, 2026 at 6:09 PM Manan Gupta <[email protected]> wrote:
> > > > >
> > > > > > Hey Kamal
> > > > > >
> > > > > > Thank you for your comments.
> > > > > >
> > > > > > > Should we have a configurable list poll interval?
> > > > > > The current fixed interval of 500ms should not degrade the
> > controller
> > > > > but I
> > > > > > agree that operators should have an option to change this value,
> > > > updated
> > > > > > the KIP to also take another parameter
> > reassignment-poll-interval-ms
> > > to
> > > > > > update the default value from 500 ms.
> > > > > >
> > > > > > > Shall we extend the batching logic to also
> kafka-leader-election
> > > > > script?
> > > > > > Good point, I will pick this up as a separate KIP as a followup
> to
> > > this
> > > > > > KIP.
> > > > > >
> > > > > > Thanks,
> > > > > > Manan
> > > > > >
> > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi Manan,
> > > > > > >
> > > > > > > Thanks for improving the user-facing tools! Overall LGTM. Few
> > > > > questions:
> > > > > > >
> > > > > > > 1. Should we have a configurable list poll interval? With
> 500ms,
> > > does
> > > > > it
> > > > > > > poll the controller often to list the currently running
> > > reassignments
> > > > > for
> > > > > > > large partitions?
> > > > > > > 2. Shall we extend the batching logic to also
> > kafka-leader-election
> > > > > > script?
> > > > > > > It will be useful when running with --all-topic-partitions.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Kamal
> > > > > > >
> > > > > > >
> > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta <
> > [email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hello
> > > > > > > >
> > > > > > > > Gentle reminder to review the KIP.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Manan
> > > > > > > >
> > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta <
> > [email protected]
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > This email starts the discussion thread for *KIP-1335:
> > Bounded
> > > > > > > > > concurrency for partition reassignment via
> > > > > > > kafka-reassign-partitions.sh*.
> > > > > > > > > The proposal adds optional reassignment-batch-size and
> > > > incremental
> > > > > > > > > parameters to kafka-reassign-partitions.sh so operators can
> > cap
> > > > how
> > > > > > > many
> > > > > > > > > partition reassignments are submitted or kept in flight at
> > once
> > > > > using
> > > > > > > > > > existing Admin API.
> > > > > > > > >
> > > > > > > > > I will appreciate your initial thoughts and feedback on the
> > > > > proposal.
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Manan
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
