Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Igor Soarez Thu, 18 Aug 2022 08:23:09 -0700

Hi Jason,

Apologies for the delay in this reply.


Thank you for having having a look at this KIP and sharing your suggestions.

> 1. (nit): Instead of "storage id," maybe we could call it "directory id"?
> It seems a little clear since each log dir gets a unique id.

I agree, "directory id" is a more suitable name.
I've updated the KIP accordingly.

> 2. Rather than introducing a new RPC to communicate offline directories,
> would it be reasonable to add it to BrokerHeartbeat? My thinking is that we
> can let broker registration include the complete list of log dirs. The
> heartbeat can then be used to communicate online/offline status of each log
> dir.

Brokers may startup with offline log directories. If brokers only enumerate
log directories in the `BrokerRegistration` request the controller will have
to wait for the first heartbeat to make a decision about which log directories
are usable in each broker.
It would be better to avoid having to model this in-between phase, where
brokers are registered but the status of each log directory is still
undetermined. For this reason, I think `BrokerRegistration` should always
enumerate both online and offline log directories.

As to whether we should use a dedicated RPC to communicate online/offline
status, or extend `BrokerHeartbeat`, I don't think either approach is clearly
better over the other.

I can see good reasons to extend `BrokerHeartbeat`:

- We avoid having an extra RPC with a very narrow and specific functionality.

- There is overlap in the handling of the request with the current handling
  for `BrokerHeartbeat` in that it already a) changes the broker registration,
  and b) generates leader and ISR updates. When a log directory comes offline,
  both a) and b) would also need to happen.

- Continuously signalling the directory status can end up leading to higher
  resiliency in some eventual failures scenarios which could cause the one
  time RPC notification to be missed or incorrectly processed by the
  controller.

But I can also see reasons why we might prefer a separate RPC:

- If the broker includes a list of all online/offline log directories in every
  heartbeat request it creates unnecessary verbosity, it's not a very compact
  form of communication. We can mitigate this by having the broker send only
  changes in log directory status, compared with the broker's view of metadata,
  rather than the full list, but that also means a bit more complexity.

- It ties `broker.heartbeat.interval.ms` with the time to recovery from
  failed log directories. Perhaps this is OK, because it's already tied with
  the recovery time from failed brokers.

- It creates new space for bad requests. A second heartbeat could indicate a
  previously offline log directory as being online, which is currently not
  possible, as log directories can only come back online with a broker restart.
  Perhaps, this might actually be useful in the future, if we ever want to
  supporting bringing log directories back online without a broker restart.

So while I'm not sure it is the necessarily the best approach, yes, I think
it would be reasonable to use `BrokerHeartbeat` instead of a new RPC.

There is currently no "one-off" RPC to change non heartbeat-related aspects of
the broker registration, if there were, I think extending that request would be
best. What do you think about introducing an RPC to modify the broker
registration, rather than a dedicated RPC to signal offline dirs?
Such RPC could become useful later, as/if the broker registration evolves.

> 3. I think we have an opportunity to consolidate normal partition
> reassignment and log dir reassignment here. Since we are modifying the
> `Assignment` to consist of both replica id and storage id, couldn't we make
> a similar change to the `AlterPartitionReassignment` API? Basically that
> would let us treat all reassignments as moving from one log dir to another.
> The case handled by `AlterReplicaLogDir` currently would just become a
> special case where the replica id happens to be the same. This might also
> let us get rid of all the custom logic surrounding JBOD, such as the
> additional fetcher, which has proven difficult to maintain.

I really like this idea.

It's not a well known feature but KIP-113 introduced support for moving
partition replicas across brokers into specific log directories, by combining
requests of `AlterReplicaLogDir` and `AlterPartitionReassignment` and having
the broker remember the target directory as a preferred log dir. [1][2][3]

But the usability isn't great, as it involves multiple requests and expecting
errors as part of a successful operation. It would be really nice to extend
`AlterPartitionReassignment` to support specifying log directories and perhaps
deprecate `AlterReplicaLogDir`. So it would be great to consolidate these
operations.

We'd have to decide whether to break the current behavior - where a partition
replica can be moved to a different log directory while the broker is offline
without issues - or find a way to distinguish when log directory assignment
indicated in the metadata can be corrected (because the partition was moved by
an operator while the broker was offline) vs when it must be followed by the
broker (because a user called `AlterPartitionReassignment`).

Either way, my inclination is that this work is better suited for a follow-up
KIP, as the scope for this one already feels quite large.


Let me know what you think.

Thanks,

--
Igor


[1] 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories
[2] 
https://github.com/apache/kafka/commit/adefc8ea076354e07839f0319fee1fba52343b91#diff-66379edf0ca0d6a674194b6613f1b4cd5cac0c8eee366e67ada68975d43b3387
[3] https://lists.apache.org/thread/h0tb3hgrvwl36xvoj2nzb096of25v6mm

Re: [DISCUSS] KIP-858: Handle JBOD broker disk failure in KRaft

Reply via email to