Hi, Jose,

Thanks for the reply. A few more comments.

41. That's a good point. With a compacted topic, the cleaning won't be
done until the active segment rolls. With snapshots, I guess we don't
have this restriction? So, it would be useful to guard against too
frequent snapshotting. Does the new proposal address this completely? If
the snapshot has only 1 record and new records keep updating the same
key, does that still cause snapshots to be generated frequently?

42. One of the reasons that we added the per partition limit is to allow
each partition to make relatively even progress during the catchup
phase. This helps Kafka Streams by potentially reducing the gap in
stream time between different partitions. If we can achieve the same
thing without the per partition limit, that will be fine too. "3. For
the remaining partitions send at most the average max bytes plus the
average remaining bytes." Do we guarantee that the request level
max_bytes can be filled up when possible? Could we document in the KIP
how we distribute the request level max_bytes to partitions? Also,
should we change Fetch accordingly?

46. If we don't bump the IBP, how do we make sure that a broker doesn't
issue a FetchSnapshotRequest when the receiving broker hasn't been
upgraded yet?

Thanks,

Jun

On Thu, Oct 1, 2020 at 10:09 AM Jose Garcia Sancio <jsan...@confluent.io>
wrote:

> Thank you for the quick response, Jun. Excuse the delayed reply, but I
> wanted to confirm some things regarding the IBP. See my comments below.
>
> Here are my changes to the KIP:
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=158864763&selectedPageVersions=30&selectedPageVersions=28
>
> > 40. LBO: Code wise, logStartOffset is used in so many places like Log,
> > ReplicaManager, LogCleaner, ReplicaFetcher, checkpoint files, etc. I am
> > not sure if it's worth renaming it in all those places. If the main
> > concern is to avoid confusion, we can just always spell out
> > logStartOffset.
>
> Done. Keeping it as LogStartOffset is better. I was also concerned
> with external tools that may be generating code from the JSON schema.
>
> > 41. metadata.snapshot.min.cleanable.ratio: Since the snapshot could be
> > empty initially, it's probably better to define the ratio as
> > new_data_size / (new_data_size + snapshot_size). This avoids the
> > divide-by-zero issue and is also consistent with the cleaner ratio
> > definition for compacted topics.
>
> I am assuming that snapshot_size is the size of the largest snapshot
> on disk, is that correct? If we use this formula then we will generate
> snapshots very quickly when the snapshot on disk is empty or very small.
>
> In general, what we care about is whether the replicated log has a lot
> of records that delete or update records in the snapshot. I was
> thinking of something along the lines of the following formula:
>
> ratio = (size of records that delete or update records in the
> snapshot) / (total size of the snapshot), when the total size of the
> snapshot is greater than zero
> ratio = 0, when the total size of the snapshot is zero
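>
> In code, that would be roughly the following sketch; the method and
> parameter names are just for illustration and not part of the KIP:
>
>     // Rough sketch, not the actual implementation. The two sizes are
>     // assumed to be tracked as records are appended to the replicated log.
>     double cleanableRatio(long deleteAndUpdateRecordBytes, long snapshotSizeBytes) {
>         if (snapshotSizeBytes == 0) {
>             // No snapshot on disk yet, so there is nothing to clean up.
>             return 0.0;
>         }
>         return (double) deleteAndUpdateRecordBytes / snapshotSizeBytes;
>     }
>
>     // A snapshot would be generated once this ratio is greater than or
>     // equal to metadata.snapshot.min.cleanable.ratio.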
>
> This means that in the extreme case where the replicated log only
> contains "addition" records, we never generate a snapshot. I think
> this is the desired behavior since generating a snapshot would consume
> disk bandwidth without saving any disk space. What do you think?
>
> >
> > 42. FetchSnapshotRequest: Since this is designed to fetch more than one
> > partition, it seems it's useful to have a per-partition maxBytes, in
> > addition to the request level maxBytes, just like a Fetch request?
>
> Yeah, we debated this in another thread started by Jason. The argument
> is that MaxBytes at the top level is all that we need if we implement
> the following heuristic (a code sketch follows the list):
>
> 1. Compute the average max bytes per partition by dividing the max by
> the number of partitions in the request.
> 2. For all of the partitions with remaining bytes less than this
> average max bytes, send all of those bytes and sum the remaining
> (unused) bytes.
> 3. For the remaining partitions send at most the average max bytes
> plus the average remaining bytes.
>
> Note that this heuristic will only be performed once and not at worst
> N times for N partitions.
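>
> To make that concrete, here is a rough sketch of the heuristic in
> Java. It is not from the KIP; the helper names and the map from
> partition to remaining snapshot bytes are just for illustration:
>
>     import java.util.ArrayList;
>     import java.util.HashMap;
>     import java.util.List;
>     import java.util.Map;
>     import org.apache.kafka.common.TopicPartition;
>
>     // Rough sketch, not the actual implementation. "remaining" maps each
>     // partition in the request to the number of snapshot bytes left to send.
>     static Map<TopicPartition, Integer> allocateMaxBytes(
>         Map<TopicPartition, Integer> remaining,
>         int maxBytes
>     ) {
>         Map<TopicPartition, Integer> allocation = new HashMap<>();
>         if (remaining.isEmpty()) {
>             return allocation;
>         }
>
>         // 1. Average max bytes per partition.
>         int average = maxBytes / remaining.size();
>         int unused = 0;
>         List<TopicPartition> large = new ArrayList<>();
>
>         // 2. Partitions that fit entirely within the average get all of
>         // their bytes; track the budget they leave unused.
>         for (Map.Entry<TopicPartition, Integer> entry : remaining.entrySet()) {
>             if (entry.getValue() <= average) {
>                 allocation.put(entry.getKey(), entry.getValue());
>                 unused += average - entry.getValue();
>             } else {
>                 large.add(entry.getKey());
>             }
>         }
>
>         // 3. The remaining partitions get the average max bytes plus an
>         // even share of the unused budget.
>         if (!large.isEmpty()) {
>             int extra = unused / large.size();
>             for (TopicPartition tp : large) {
>                 allocation.put(tp, average + extra);
>             }
>         }
>
>         return allocation;
>     }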
>
> What do you think? Besides consistency with Fetch requests, is there
> another reason to have MaxBytes per partition?
>
> > 43. FetchSnapshotResponse:
> > 43.1 I think the confusing part for OFFSET_OUT_OF_RANGE is
> > that FetchSnapshotRequest includes EndOffset. So, OFFSET_OUT_OF_RANGE
> > seems to suggest that the provided EndOffset is wrong, which is not
> > the intention for the error code.
>
> Yeah. Added a new error called POSITION_OUT_OF_RANGE.
>
> > 43.2 Position field seems to be the same as the one in
> > FetchSnapshotRequest. If we have both, should the requester verify the
> > consistency between the two values, and what should the requester do
> > if the two values don't match?
>
> Yeah, the Position in the response will be the same value as the
> Position in the request. I was thinking of only verifying Position
> against the state of the temporary snapshot file on disk. If Position
> is not equal to the size of the file, reject the response and send
> another FetchSnapshot request.
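>
> Roughly, the check on the requester would look something like the
> sketch below; all helper names are assumed for illustration and not
> part of the KIP:
>
>     import java.nio.ByteBuffer;
>
>     // Rough sketch of the requester-side validation. temporarySnapshotSize,
>     // sendFetchSnapshotRequest and appendToTemporarySnapshot are assumed
>     // helpers, not from the KIP.
>     void handleFetchSnapshotResponse(long responsePosition, ByteBuffer fetchedBytes) {
>         long onDiskSize = temporarySnapshotSize();
>
>         if (responsePosition != onDiskSize) {
>             // The response does not line up with the bytes already on
>             // disk: reject it and send another FetchSnapshot request with
>             // Position set to the current size of the temporary snapshot
>             // file.
>             sendFetchSnapshotRequest(onDiskSize);
>         } else {
>             // Otherwise append the fetched bytes to the temporary snapshot
>             // file and continue fetching.
>             appendToTemporarySnapshot(fetchedBytes);
>         }
>     }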
>
> > 44. metric: Would a metric that captures the lag in offset between
> > the last snapshot and the logEndOffset be useful?
>
> Yes. How about the difference between the last snapshot end offset and
> the high-watermark? A snapshot can only be created up to the
> high-watermark.
>
> I added this metric. Let me know if you still think we need a metric
> for the difference between the largest snapshot end offset and the
> logEndOffset.
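>
> For reference, this is roughly how the high-watermark based gauge
> could be registered with org.apache.kafka.common.metrics.Metrics; the
> metric name, group, and the two offset accessors are placeholders, not
> final names:
>
>     import org.apache.kafka.common.MetricName;
>     import org.apache.kafka.common.metrics.Gauge;
>     import org.apache.kafka.common.metrics.Metrics;
>
>     // Rough sketch; highWatermark() and latestSnapshotEndOffset() are
>     // assumed accessors, and "snapshot-lag" / "raft-metrics" are
>     // placeholder names.
>     void registerSnapshotLagMetric(Metrics metrics) {
>         MetricName name = metrics.metricName(
>             "snapshot-lag",
>             "raft-metrics",
>             "Offset difference between the high-watermark and the end offset of the latest snapshot"
>         );
>         metrics.addMetric(
>             name,
>             (Gauge<Long>) (config, now) -> highWatermark() - latestSnapshotEndOffset()
>         );
>     }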
>
> > 45. It seems the KIP assumes that every voter (leader and follower) and
> > observer has a local replicated log for __cluster_metadata. It would be
> > useful to make that clear in the overview section.
>
> Updated the overview section. I think that this decision affects the
> section "Changes to Leader Election". That section should not affect
> observers since they don't participate in leader elections. It also
> affects the section "Validation of Snapshot and Log" but it should be
> possible to fix that section if observers don't have the replicated
> log on disk.
>
> > 46. Does this KIP cover upgrading from older versions of Kafka? If so,
> > do we need IBP to guard the usage of modified FetchRequest and the new
> > FetchSnapshotRequest? If not, could we make it clear that upgrading
> > will be covered somewhere else?
>
> In short, I don't think we need to increase the IBP. When we implement
> snapshots for other topics like __consumer_offsets and
> __transaction_state, we will have to increase the IBP. We will need
> another KIP to do that. I updated the "Compatibility, Deprecation, and
> Migration Plan" section to include the following information.
>
> This KIP is only implemented for the internal topic
> __cluster_metadata. An increase of the inter-broker protocol (IBP)
> version is not required, since we are only implementing this for the
> __cluster_metadata topic partition; that partition is new and will be
> handled by the KafkaRaftClient. Internal and external clients for all
> other topics can ignore the SnapshotId, as that field will not be set
> for topic partitions other than __cluster_metadata.
>
>
> Thanks,
> --
> -Jose
>
