Jason Gustafson created KAFKA-7601:
--------------------------------------
Summary: Handle message format downgrades during upgrade of
message format version
Key: KAFKA-7601
URL: https://issues.apache.org/jira/browse/KAFKA-7601
Project: Kafka
Issue Type: Bug
Reporter: Jason Gustafson
During an upgrade of the message format, there is a short time during which the
configured message format version is not consistent across all replicas of a
partition. As long as all brokers are on the same version, this typically does
not cause any problems. Followers will take whatever message format is used by
the leader. However, it is possible for leadership to change several times
between brokers which support the new format and those which support the old
format. This can cause the version used in the log to flap between the
different formats until the upgrade is complete.
For example, suppose broker 1 has been updated to use v2 and broker 2 is still
on v1. When broker 1 is the leader, all new messages will be written in the v2
format. When broker 2 is leader, v1 will be used. If there is any instability
in the cluster or if completion of the update is delayed, then the log will be
seen to switch back and forth between v1 and v2. Once the update is completed
and broker 2 begins using v2, then the message format will stabilize and
everything will generally be ok.
Downgrades of the message format are problematic, even if they are just
temporary. There are basically two issues:
1. We use the configured message format version to tell whether down-conversion
is needed. We assume that this is always the maximum version used in the
log, but that assumption fails in the case of a downgrade. In the worst case,
old clients will see the new format and likely fail.
2. The logic we use for finding the truncation offset during the become
follower transition does not handle flapping between message formats. When the
new format is used by the leader, then the epoch cache will be updated
correctly. When the old format is in use, the epoch cache won't be updated.
This can lead to incorrect responses to OffsetsForLeaderEpoch queries.
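To make the first issue concrete, here is a minimal sketch of the two ways the
down-conversion decision could be made. The class and method names are
hypothetical and are not the broker's actual code; message format versions are
modeled as plain integers (1 for v1, 2 for v2).

```java
import java.util.Arrays;
import java.util.List;

final class DownConversionSketch {
    // Hypothetical check that mirrors the assumption described above: the
    // configured format version is treated as the maximum version in the log.
    static boolean needsDownConversionByConfig(int configuredMagic, int clientMaxMagic) {
        return configuredMagic > clientMaxMagic;
    }

    // What the check would have to consider to stay correct after a downgrade:
    // the formats actually present in the range being fetched.
    static boolean needsDownConversionByLog(List<Integer> batchMagics, int clientMaxMagic) {
        for (int magic : batchMagics) {
            if (magic > clientMaxMagic) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Log written while leadership flapped: v2 batches first, then v1 batches.
        List<Integer> log = Arrays.asList(2, 2, 1, 1);
        int configuredOnCurrentLeader = 1; // current leader is still configured for v1
        int oldClientMax = 1;              // client only understands up to v1

        // prints "false": the config-based check misses the v2 batches
        System.out.println(needsDownConversionByConfig(configuredOnCurrentLeader, oldClientMax));
        // prints "true": the log really does contain batches the client cannot read
        System.out.println(needsDownConversionByLog(log, oldClientMax));
    }
}
```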
For the second point, the specific case we observed is something like this.
Broker 1 is the leader of epoch 0 and writes some messages to the log using the
v2 message format. Broker 2 then becomes the leader for epoch 1 and writes some
messages in the v1 format. On broker 2, the last entry in the epoch cache is
epoch 0. No entry is written for epoch 1 because it uses the old format. When
broker 1 became a follower, it sent an OffsetsForLeaderEpoch query to broker 2
for epoch 0. Since epoch 0 was the last entry in the cache, the log end offset
was returned. This resulted in localized log divergence.
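The scenario can be reproduced with a toy model of the epoch cache. This is
only a sketch under simplifying assumptions: the class is hypothetical, the
cache maps each epoch to its start offset, and the end-offset lookup returns
the start offset of the next higher cached epoch or, if there is none, the log
end offset. The message counts (10 and 5) are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class EpochCacheSketch {
    // epoch -> start offset of that epoch (toy stand-in for the epoch cache)
    private final TreeMap<Integer, Long> epochStartOffsets = new TreeMap<>();
    private long logEndOffset = 0L;

    // Append `count` messages written under `epoch` in format `magic`.
    // Only the new format (v2) updates the cache, as described above.
    void append(int epoch, int magic, long count) {
        if (magic >= 2 && !epochStartOffsets.containsKey(epoch)) {
            epochStartOffsets.put(epoch, logEndOffset);
        }
        logEndOffset += count;
    }

    // Simplified OffsetsForLeaderEpoch lookup: the end offset of an epoch is
    // the start offset of the next higher cached epoch, else the log end offset.
    long endOffsetFor(int requestedEpoch) {
        Map.Entry<Integer, Long> next = epochStartOffsets.higherEntry(requestedEpoch);
        return next != null ? next.getValue() : logEndOffset;
    }

    public static void main(String[] args) {
        // Broker 2's log when the format flaps: epoch 0 in v2, epoch 1 in v1.
        EpochCacheSketch flapped = new EpochCacheSketch();
        flapped.append(0, 2, 10);
        flapped.append(1, 1, 5);
        System.out.println(flapped.endOffsetFor(0)); // prints 15 (the log end offset)

        // Same history with v2 throughout: epoch 1 gets a cache entry.
        EpochCacheSketch stable = new EpochCacheSketch();
        stable.append(0, 2, 10);
        stable.append(1, 2, 5);
        System.out.println(stable.endOffsetFor(0)); // prints 10 (true end of epoch 0)
    }
}
```

In the flapped case the follower is told that epoch 0 extends to offset 15, so
it does not truncate to 10 and can keep messages that differ from the leader's.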
There are a few options to fix this problem. From a high level, we can either
be stricter about preventing downgrades of the message format, or we can add
additional logic to make downgrades safe.
(Disallow downgrades): As an example of the first approach, the leader could
always use the maximum of the last version written to the log and the
configured message format version.
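A minimal sketch of that rule, assuming the leader can cheaply look up the
format of the last batch in its log. The class and method names are
hypothetical, and versions are modeled as integers (1 for v1, 2 for v2).

```java
final class EffectiveFormatSketch {
    // Never write a format older than the one already at the end of the log.
    static int effectiveMagic(int lastWrittenMagic, int configuredMagic) {
        return Math.max(lastWrittenMagic, configuredMagic);
    }

    public static void main(String[] args) {
        // Leader still configured for v1, but the log already ends in v2: keep v2.
        System.out.println(effectiveMagic(2, 1)); // prints 2
        // Normal upgrade: log ends in v1 and the config now says v2.
        System.out.println(effectiveMagic(1, 2)); // prints 2
        // Nothing upgraded yet: stay on v1.
        System.out.println(effectiveMagic(1, 1)); // prints 1
    }
}
```

With this rule the format in the log is monotonic, so the configured version
on an upgraded broker is again an upper bound on what the log contains.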
(Allow downgrades): If we want to allow downgrades, then it makes sense to
invalidate and remove all entries in the epoch cache following the message
format downgrade. We would also need a solution for the problem of detecting
when down-conversion is needed for a fetch request. One option I've been
thinking about is enforcing the invariant that each segment uses only one
message format version. Whenever the message format changes, we need to roll a
new segment. Then we can simply remember which format is in use by each segment
to tell whether down-conversion is needed for a given fetch request.
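The one-format-per-segment invariant could look roughly like the sketch below.
Everything here is hypothetical (class names, the integer version encoding, and
the assumption that a fetch is served from the single segment containing the
fetch offset); it only illustrates the bookkeeping, not the real log layer.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentFormatSketch {
    static final class Segment {
        final long baseOffset;
        final int magic;   // the single format used by this segment
        long nextOffset;

        Segment(long baseOffset, int magic) {
            this.baseOffset = baseOffset;
            this.magic = magic;
            this.nextOffset = baseOffset;
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    // Append `count` messages in format `magic`, rolling a new segment
    // whenever the format differs from the active segment's format.
    void append(int magic, long count) {
        Segment active = segments.isEmpty() ? null : segments.get(segments.size() - 1);
        if (active == null || active.magic != magic) {
            long base = active == null ? 0L : active.nextOffset;
            active = new Segment(base, magic);
            segments.add(active);
        }
        active.nextOffset += count;
    }

    int segmentCount() {
        return segments.size();
    }

    // Down-conversion is decided from the format remembered for the segment
    // that contains the fetch offset, not from the configured version.
    boolean needsDownConversion(long fetchOffset, int clientMaxMagic) {
        Segment target = null;
        for (Segment s : segments) {
            if (s.baseOffset <= fetchOffset) {
                target = s;
            }
        }
        return target != null && target.magic > clientMaxMagic;
    }

    public static void main(String[] args) {
        SegmentFormatSketch log = new SegmentFormatSketch();
        log.append(2, 10); // offsets 0-9 in v2
        log.append(1, 5);  // format changed: roll, offsets 10-14 in v1
        log.append(2, 3);  // format changed again: roll, offsets 15-17 in v2

        System.out.println(log.segmentCount());             // prints 3
        System.out.println(log.needsDownConversion(3, 1));  // prints true
        System.out.println(log.needsDownConversion(12, 1)); // prints false
    }
}
```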