Having read up further here, I'd like to point again to the last message
in that previous thread, as it makes a key point. This bit:

> But, if we consider recovering the metadata log from any _observer_, including
> brokers, making sure to pick the surviving process with the highest log
> offset, can this situation still happen? In order for a broker to
> experience the decrease, wouldn't it need to have a copy of the increasing
> log record on disk locally? And potentially then, that would also be the
> best copy to recover the cluster from?

I think KIP-1347 would not allow recovering from a broker, even if a
surviving broker holds a higher log end offset than any of the surviving
controllers. A key reason we chose to support that in our internal tooling
is that it minimizes the risk of any broker experiencing a leadership epoch
decrease. The only way a broker could experience a decrease is if we
recover from a log that is shorter than the one it holds -- so we don't do
that.

If a broker holds the longest log, we "simply" copy its log onto a
controller, and then add the voter set demotion records to the end of that
copy.
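
As a rough illustration of that selection rule, here is a minimal sketch.
It is not our actual tooling: the Node type, the node names, the offsets
and the tie-break in favour of controllers are all assumptions made up for
the example, and it compares only the log end offset, as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str            # hypothetical node identifier
    role: str            # "controller" or "broker"
    log_end_offset: int  # end offset of the metadata log on that node's disk


def pick_recovery_source(nodes):
    """Return the surviving node holding the longest metadata log.

    Ties go to controllers (an assumption of this sketch), since a
    controller's log would not need to be copied anywhere first.
    """
    if not nodes:
        raise ValueError("no surviving nodes to recover from")
    return max(nodes, key=lambda n: (n.log_end_offset, n.role == "controller"))


def nodes_that_would_regress(source, nodes):
    """Nodes whose local log is longer than the chosen source's log."""
    return [n for n in nodes if n.log_end_offset > source.log_end_offset]


# Made-up survivors: one controller and two brokers.
survivors = [
    Node("controller-1", "controller", 1040),
    Node("broker-4", "broker", 1075),
    Node("broker-5", "broker", 1062),
]

source = pick_recovery_source(survivors)
print(source.name, nodes_that_would_regress(source, survivors))
```

In this made-up case the longest log sits on a broker, so recovering from
the surviving controller instead would leave both brokers holding a longer
log than the one the cluster was recovered from.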

On Fri, 3 July 2026 at 12:38, Anton Agestam <[email protected]> wrote:

> Hi Paolo,
>
> Thanks for this KIP, it addresses a real problem that we at Aiven have
> also needed a solution for.
>
> Just to note first that this topic has been discussed before on the mailing
> list, see "KRaft controller disaster recovery", here:
> https://lists.apache.org/thread/84hbwwz46401vf81355v03ypyzkph32f.
>
> At Aiven, we have built custom tooling to handle such disaster recovery
> cases (or rather, best-effort restoring availability, as this is a data
> loss scenario).
>
> The tooling we have built works like this:
>
> - Operator runs a tool to identify the longest log copy available. As both
> brokers and controllers replicate the log, it may be that a broker holds
> the most extensive copy of the log. This works by inspecting the raft log
> on disk on each surviving node in the cluster and identifying the one with
> the highest log end offset.
> - The operator then invokes another tool on the chosen node that manually
> modifies the log on disk to reduce the voter set to only that controller,
> so that it gains quorum on its own.
> - From there we can start all surviving participants of the cluster up
> again, and our normal automations will scale the voter set back up to 3
> or 5.
>
> Let me know if it is of interest, and I can do what I can to share more of
> this custom tooling. We could likely open source it in some form.
>
> BR,
> Anton
>
> On Mon, 18 May 2026 at 15:55, Paolo Patierno <
> [email protected]> wrote:
>
>> Hi all,
>> I would like to start a discussion on KIP-1347
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-1347%3A+Overriding+voter+set+on+storage+formatting>,
>> which is about allowing the override of the voter set through the storage
>> formatting tool to recover from a disaster scenario where the KRaft quorum
>> can't be formed anymore. This KIP aims to fix KAFKA-20427
>> <https://issues.apache.org/jira/browse/KAFKA-20427>.
>> Any feedback is very welcome.
>>
>> Thanks,
>> Paolo.
>>
>> --
>> Paolo Patierno
>>
>> *Senior Principal Software Engineer @ IBM*
>> *CNCF Ambassador*
>>
>> Twitter : @ppatierno <http://twitter.com/ppatierno>
>> Linkedin : paolopatierno <http://it.linkedin.com/in/paolopatierno>
>> GitHub : ppatierno <https://github.com/ppatierno>
>>
>
