bump for another potential more discussion
On 2020/08/27 23:31:38, Ning Zhang <ning2008w...@gmail.com> wrote: > Hello Mickael, > > > 1. How does offset translation work with this new sink connector? > > Should we also include a CheckpointSinkConnector? > > CheckpointSourceConnector will be re-used as the same as current. When EOS is > enabled, we will run 3 connectors: > > MirrorSinkConnector (based on SinkConnector) > MirrorCheckpointConnector (based on SourceConnector) > MirrorHeartbeatConnector (based on SourceConnector) > > For the last two connectors (checkpoint, heartbeat), if we do not strictly > require EOS, it is probably OK to use current implementation on > SourceConnector. > > I will update the KIP to clarify this, if it sounds acceptable. > > > 2. Migrating to this new connector could be tricky as effectively the > > Connect runtime needs to point to the other cluster, so its state > > (stored in the __connect topics) is lost. Unfortunately there is no > > easy way today to prime Connect with offsets. Not necessarily a > > blocking issue but this should be described as I think the current > > Migration section looks really optimistic at the moment > > totally agree. I will update the migration part with notes about potential > service interruption, without careful planning. > > > 3. We can probably find a better name than "transaction.producer". > > Maybe we can follow a similar pattern than Streams (which uses > > "processing.guarantee")? > > "processing.guarantee" sounds better > > > 4. Transactional Ids used by the producer are generated based on the > > task assignments. If there's a single task, if it crashes and restarts > > it would still get the same id. Can this be an issue? > > From https://tgrez.github.io/posts/2019-04-13-kafka-transactions.html, the > author suggests to postfix transaction.id with <topic, partition>: > > "To avoid handling an external store we will use a static encoding similarly > as in spring-kafka: > The transactional.id is now the transactionIdPrefix appended with > <group.id>.<topic>.<partition>." > > I think as long as there is no more than one producer use same > "transaction.id" at the same time, it is OK. > > Also from my tests, this "transaction.id" assignment works fine with > failures. To tighten it up, I also tested to use "connector task id" in > "transaction.id". The "connector task id" is typically composed of > connector_name and task_id, which is also unique across all connectors in a > KC cluster. > > > 5. The logic in the KIP creates a new transaction every time put() is > > called. Is there a performance impact? > > It could be a performance hit if the transaction batch is too small under > high ingestion rate. The batch size depends on how many messages that > consumer poll each time. Maybe we could increase "max.poll.records" to have > larger batch size. > > Overall, thanks so much for the valuable feedback. If the responses sounds > good, I will do a cleanup of KIP. > > On 2020/08/27 09:59:57, Mickael Maison <mickael.mai...@gmail.com> wrote: > > Thanks Ning for the KIP. Having stronger guarantees when mirroring > > data would be a nice improvement! > > > > A few comments: > > 1. How does offset translation work with this new sink connector? > > Should we also include a CheckpointSinkConnector? > > > > 2. Migrating to this new connector could be tricky as effectively the > > Connect runtime needs to point to the other cluster, so its state > > (stored in the __connect topics) is lost. Unfortunately there is no > > easy way today to prime Connect with offsets. Not necessarily a > > blocking issue but this should be described as I think the current > > Migration section looks really optimistic at the moment > > > > 3. We can probably find a better name than "transaction.producer". > > Maybe we can follow a similar pattern than Streams (which uses > > "processing.guarantee")? > > > > 4. Transactional Ids used by the producer are generated based on the > > task assignments. If there's a single task, if it crashes and restarts > > it would still get the same id. Can this be an issue? > > > > 5. The logic in the KIP creates a new transaction every time put() is > > called. Is there a performance impact? > > > > On Fri, Aug 21, 2020 at 4:58 PM Ryanne Dolan <ryannedo...@gmail.com> wrote: > > > > > > Awesome, this will be a huge advancement. I also want to point out that > > > this KIP implements MirrorSinkConnector as well, finally, which is a very > > > often requested missing feature in my experience. > > > > > > Ryanne > > > > > > On Fri, Aug 21, 2020, 9:45 AM Ning Zhang <ning2008w...@gmail.com> wrote: > > > > > > > Hello, I wrote a KIP about MirrorMaker2 Exactly-once Semantics (EOS) > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-656%3A+MirrorMaker2+Exactly-once+Semantics > > > > At the high-level, it resembles the idea of how HDFS Sink Connector > > > > achieves EOS across clusters by managing and storing the consumer > > > > offsets > > > > in an external persistent storage, but also leverages the current Kafka > > > > EOS > > > > guarantee within a single cluster. I have done some experiments > > > > especially > > > > for the failure cases and I am very appreciated for comments and > > > > feedback > > > > on this KIP from bigger audience. > > > > > > >