Hey Ryanne, Sorry I am late here. Thanks much for all the work! After reading through the latest KIP and all the previous discussion, I have some questions below:
1. Currently if there is topic created with "." in the topic name, would it cause correctness issue for this KIP? For example, will consumer reads from the topic that is not intended, and will API such as RemoteClusterUtils.upstreamClusters(...) return wrong string as the upstream cluster? If there is correctness issue which reduces the usability, we probably should mark it in the Java doc and fix these issues before considering the feature as ready and safe to use. Some future work such as "propose a special separator character" has been suggested in the email thread. Could we document these future work in the KIP if the work is necessary for MM2 to be used reliably? 2. Suppose topic_1 is replicated from cluster A to cluster B. And we have a pair of producer and consumer that produces and consumes from topic_1 in cluster A. Let's say cluster A crashed and we need to migrate this pair of consumer to use cluster B. According to the discussion in the email thread, producer will producer to topic_1 in cluster B whereas Consumer will consume from A.topic_1 in cluster B. It means that the message produced by this producer will not longer be consumed by this consumer. Would this be an issue? 3. Given that the "mechanism to migrate producers or consumers between mirrored clusters" is listed as the motivation of this KIP, do we need to specify in the KIP the migration procedure which we have discussed in depth in the email thread? For example, user needs to additionally call offsetsForTimes(...) after getting the timestamp from the checkpoint for each. 4. A lot of class names (e.g. MirrorCheckpointConnector) is included in the Public Interface section. Not sure if it is useful to specify the class name without specifying the class API and its usage. My understanding is that Public Interface section should include 1) minimum amount of the information that user can read in order to use the feature provided by the KIP; and 2) anything whose change incurs compatibility concern and thus KIP discussion. So I am wondering whether we should include in the Public Interface section 1) the class API for those classes listed in the section (or remove from the KIP if we have not decided the class API); 2) the usage of the new scripts (e.g. connect-mirror-maker.sh) with its arguments (e.g. this <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories#KIP-113:Supportreplicasmovementbetweenlogdirectories-Scripts> example); and 3) schema of the checkpoint topic and heartbeat topic which is currently in the Proposed Change section. 5. It is said in the motivation section that the design includes new metrics such as end-to-end replication latency across multiple data centers/clusters. Usually we treat metrics as public interface. Could we also specify these metrics (e.g. definition, type) in Public Interface section similar to this <https://cwiki.apache.org/confluence/display/KAFKA/KIP-237%3A+More+Controller+Health+Metrics#KIP-237:MoreControllerHealthMetrics-PublicInterfaces> example. 6. The KIP says that the RemoteClusterUtils is "subject to change". Are we going to have a future KIP to discuss and finalize these APIs before committing any code that implements the API? If so, it may be cleaner to specify this or even remove this class from the Public Interface section, and specify the future KIP in this KIP. 7. If we are going to determine the API for RemoteClusterUtils in this KIP, then I have one comment regarding the RemoteClusterUtils.translateOffsets(...). If I understand the discussion in the email thread correctly, this API reads the timestamp from the checkpoint topic and return it to the caller. If so, it seems more intuitive to call this e.g. checkpointTimestamp(...). And the doc probably should say "Find the timestamp of the last committed messaged in the source cluster...". Otherwise, could you briefly explain in the KIP how this API is implemented. 8. It is said in the email discussion that targetClusterAlias is needed in order to migrate consumer from the remote topic back to the source topic (i.e. fallback). When a consumer is migrated from A.topic_1 in cluster B to topic_1 in cluster A, how does consumer determine the start offset for topic_1 in cluster A? 9. Does RemoteClusterUtils.upstreamClusters(...) return clusters that is no longer an upstream cluster? If yes, it seems to reduced the usability of the API if user only wants to know the current upstream cluster list. If no, could you explain what is the semantics of the API (e.g. when an upstream cluster is considered outdated) and how is this API implemented to exclude outdated clusters. It may be useful to briefly explain the implementation in the proposed change section for those APIs that are not straightforward. 10. It is mentioned in the Remote Cluster Utils section that "it will be possible to find high water marks that consumers". Not sure how is high watermark used here. Can you explain a bit more, e.g. which API does this sentence refer to? 11. The APIs for the interface ReplicationPolicy is specified without explanation or Java doc. It is hard to determine how they can be used and whether these APIs are necessary. Should we include the Java doc and put this interface in the Public Interface section? 12. The migration section specifies that this API will be carried in three separate Kafka release. Not sure if we need three releases. In the first release, we implement this KIP and deprecate old API. In the subsequent release, we remove the old API. Does this make sense? 13. Since the newly added checkpoint topic is log-compacted, I am wondering whether we need to specify the log compaction key for this topic in the Public Interface section. Thanks, Dong