Re: [DISCUSS] KIP-793: Sink Connectors: Support topic-mutating SMTs for async connectors (preCommit users)

Randall Hauch Mon, 03 Oct 2022 12:10:31 -0700

On Mon, Oct 3, 2022 at 11:45 AM Yash Mayya <yash.ma...@gmail.com> wrote:


> Hi Randall,
>
> Thanks for elaborating. I think these are all very good points and I see
> why the overloaded `SinkTask::put` method is a cleaner solution overall.
>
> > public void put(Collection<SinkRecord> records,
> > Map<SinkRecord, TopicPartition> updatedTopicPartitions)
>
> I think this should be
>
> `public void put(Collection<SinkRecord> records, Map<SinkRecord,
> TopicPartition> originalTopicPartitions)`
>
> instead because the sink records themselves have the updated topic
> partitions (i.e. after all transformations have been applied) and the KIP
> is proposing a way for the tasks to be able to access the original topic
> partition (i.e. before transformations have been applied).
>

Sounds good.


>
> > Of course, if the developer does not need separate methods, they can
> > easily have the older `put` method simply delegate to the newer method.
>
> If the developer does not need separate methods (i.e. they don't need to
> use this new addition), they can simply continue implementing just the
> older `put` method right?
>

Correct. We should update the JavaDoc of both methods to make this clear,
and in general to describe how the two methods are used and should be
implemented. That can be part of the PR, and the KIP doesn't need this
wording.
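
For illustration, here is a rough sketch of how the two methods could relate. It is not the final API: `TopicPartition` and `SinkRecord` below are simplified stand-ins for the real Kafka classes, and `SketchSinkTask`, `LegacyOnlyTask`, and `OffsetTrackingTask` are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the Kafka/Connect types, so the sketch compiles on its own.
final class TopicPartition {
    final String topic;
    final int partition;
    TopicPartition(String topic, int partition) { this.topic = topic; this.partition = partition; }
}

final class SinkRecord {
    final String topic;
    final int partition;
    final long offset;
    SinkRecord(String topic, int partition, long offset) {
        this.topic = topic; this.partition = partition; this.offset = offset;
    }
}

// Hypothetical stand-in for SinkTask, showing only the two put methods.
abstract class SketchSinkTask {
    /** Existing method: the only one that older runtimes know how to call. */
    public abstract void put(Collection<SinkRecord> records);

    /**
     * Proposed overload: newer runtimes would call this one. The default simply
     * delegates, so tasks that implement only the older method keep working.
     */
    public void put(Collection<SinkRecord> records,
                    Map<SinkRecord, TopicPartition> originalTopicPartitions) {
        put(records);
    }
}

// A task with no need for the original topic partitions implements only the older method.
class LegacyOnlyTask extends SketchSinkTask {
    final List<SinkRecord> seen = new ArrayList<>();

    @Override
    public void put(Collection<SinkRecord> records) {
        seen.addAll(records);
    }
}

// A task that wants the new information has the older method delegate to the newer one.
class OffsetTrackingTask extends SketchSinkTask {
    final List<TopicPartition> originals = new ArrayList<>();

    @Override
    public void put(Collection<SinkRecord> records) {
        // Older runtime: nothing is known about pre-transformation topic partitions.
        put(records, Map.of());
    }

    @Override
    public void put(Collection<SinkRecord> records,
                    Map<SinkRecord, TopicPartition> originalTopicPartitions) {
        for (SinkRecord record : records) {
            // An absent entry is treated as "topic and partition unchanged".
            TopicPartition fallback = new TopicPartition(record.topic, record.partition);
            originals.add(originalTopicPartitions.getOrDefault(record, fallback));
        }
    }
}
```

With this shape, a connector that never needs the original topic partitions is untouched, and one that does can keep all of its real logic in the two-argument method.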

>
> > Finally, this gives us a roadmap for *eventually* deprecating the older
> > method, once the Connect runtime versions without this change are old
> > enough.
>
> I'm not sure we'd ever want to deprecate the older method. Most common sink
> connector implementations do not do their own offset tracking with
> asynchronous processing and will probably never have a need for the
> additional parameter `Map<SinkRecord, TopicPartition>
> originalTopicPartitions` in the proposed new `put` method. These connectors
> can continue implementing only the existing `SinkTask::put` method which
> will be called by the default implementation of the newer overloaded `put`
> method.
>

+1


>
> > the pre-commit methods use the same `Map<TopicPartition,
> > OffsetAndMetadata> currentOffsets` data structure I'm suggesting be used.
>
> The data structure you're suggesting be used is a `Map<SinkRecord,
> TopicPartition>` which will map `SinkRecord` objects to the original topic
> partition of the corresponding `ConsumerRecord` right? To clarify, this is
> a new data structure that will need to be managed in the `WorkerSinkTask`.
>

Ah, you're right. Thanks for the correction.
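
To make that bookkeeping concrete, here is a minimal sketch of how the runtime might populate such a map while applying transformations. Everything in it is an assumption for illustration: `OriginalPartitionTracker` is a hypothetical helper, not existing `WorkerSinkTask` code, the `UnaryOperator<SinkRecord>` stands in for the SMT chain, and `TopicPartition` and `SinkRecord` are simplified stand-ins for the real classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Simplified stand-ins for the Kafka/Connect types; not the real classes.
final class TopicPartition {
    final String topic;
    final int partition;
    TopicPartition(String topic, int partition) { this.topic = topic; this.partition = partition; }
}

final class SinkRecord {
    final String topic;
    final int partition;
    final long offset;
    SinkRecord(String topic, int partition, long offset) {
        this.topic = topic; this.partition = partition; this.offset = offset;
    }
}

// Hypothetical helper mirroring the bookkeeping discussed for WorkerSinkTask.
final class OriginalPartitionTracker {
    final List<SinkRecord> batch = new ArrayList<>();
    final Map<SinkRecord, TopicPartition> originalTopicPartitions = new HashMap<>();

    /** Applies the transformation chain to one consumed record and remembers where it came from. */
    void add(String topic, int partition, long offset, UnaryOperator<SinkRecord> transformations) {
        SinkRecord transformed = transformations.apply(new SinkRecord(topic, partition, offset));
        if (transformed == null) {
            return; // dropped by a transformation, so nothing reaches the task
        }
        batch.add(transformed);
        // Store an entry only when the topic or partition changed; absence means "unchanged".
        if (!transformed.topic.equals(topic) || transformed.partition != partition) {
            originalTopicPartitions.put(transformed, new TopicPartition(topic, partition));
        }
    }
}
```

The runtime would then hand `batch` and `originalTopicPartitions` to the new `put` overload and clear both once the call returns.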

Best regards,
Randall


> Thanks,
> Yash


> On Mon, Oct 3, 2022 at 1:20 AM Randall Hauch <rha...@gmail.com> wrote:
>
> > Hi, Yash.
> >
> > > I'm not sure I quite understand why it would be "easier" for connector
> > > developers to account for implementing two different overloaded `put`
> > > methods (assuming that they want to use this new feature) versus using a
> > > try-catch block around `SinkRecord` access methods?
> >
> >
> > Using a try-catch around an API method that *might* be there is a very
> > unusual thing for most developers. Unfortunately, we've had to resort
> > to this atypical approach with Connect in places when there was no good
> > alternative. We seem to be relying upon this pattern because it's easier
> > for us, not because it offers a better experience for Connector
> > developers. IMO, if there's a practical alternative that uses normal
> > development practices and techniques, then we should use that
> > alternative. IIUC, there is at least one practical alternative for this
> > KIP that would not require developers to use the unusual try-catch to
> > handle the case where methods are not found.
> >
> > I also think having two `put` methods is easier when the Connector has to
> > do different things for different Connect runtimes, too. One of those
> > methods is called by newer Connect runtimes with the new behavior, and the
> > other method is called by an older Connect runtime. Of course, if the
> > developer does not need separate methods, they can easily have the older
> > `put` method simply delegate to the newer method.
> >
> > Finally, this gives us a roadmap for *eventually* deprecating the older
> > method, once the Connect runtime versions without this change are old
> > enough.
> >
> > > I think the advantage of going with the proposed approach in the KIP is
> > > that it wouldn't require extra book-keeping (the Map<SinkRecord,
> > > TopicPartition> in `WorkerSinkTask` in your proposed approach)
> > >
> >
> > The connector does have to do some of this bookkeeping in how they track
> > the topic partition offsets used in the `preCommit`, and the pre-commit
> > methods use the same `Map<TopicPartition, OffsetAndMetadata>
> > currentOffsets` data structure I'm suggesting be used.
> >
> > I hope that helps.
> >
> > Best regards,
> >
> > Randall
> >
> > On Mon, Sep 26, 2022 at 9:38 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> >
> > > Hi Randall,
> > >
> > > Thanks for reviewing the KIP!
> > >
> > > > That latter logic can get quite ugly.
> > >
> > > I'm not sure I quite understand why it would be "easier" for connector
> > > developers to account for implementing two different overloaded `put`
> > > methods (assuming that they want to use this new feature) versus using a
> > > try-catch block around `SinkRecord` access methods? In both cases, a
> > > connector developer would need to write additional code in order to
> > > ensure that their connector continues working with older Connect
> > > runtimes. Furthermore, we would probably need to carefully document what
> > > the implementation for the older `put` method should look like for
> > > connectors that want to use this new feature. I think the advantage of
> > > going with the proposed approach in the KIP is that it wouldn't require
> > > extra book-keeping (the Map<SinkRecord, TopicPartition> in
> > > `WorkerSinkTask` in your proposed approach) and also the fact that the
> > > try-catch based logic is an already established pattern through
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-610%3A+Error+Reporting+in+Sink+Connectors
> > > and other KIPs which added methods to source/sink connector/task
> > > contexts.
> > >
> > > Let me know if you still feel that having a new overloaded put method is
> > > a cleaner solution and I'd be happy to reconsider!
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Thu, Sep 22, 2022 at 11:18 PM Randall Hauch <rha...@gmail.com> wrote:
> > >
> > > > Hi, Yash. Thanks for picking up this KIP and discussion.
> > > >
> > > > The KIP includes this rejected alternative:
> > > >
> > > > > 4. Update SinkTask.put in any way to pass the new information outside
> > > > > SinkRecord (e.g. a Map or a derived class)
> > > > >
> > > > >    - Much more disruptive change without considerable pros
> > > > >
> > > > >
> > > > One advantage of doing this is that sink connector implementations can
> > > > more easily implement two different "put(...)" methods to handle running
> > > > in a variety of runtimes, without having to use try-catch logic around
> > > > the newer SinkRecord access methods. That latter logic can get quite
> > > > ugly.
> > > >
> > > > For example, the existing `put` method has this signature:
> > > >
> > > > public abstract void put(Collection<SinkRecord> records);
> > > >
> > > > If we added an overloaded method that passed in a map of the old
> > > > topic+partition for each record (and defined the absence of an entry as
> > > > having an unchanged topic and partition):
> > > >
> > > > public void put(Collection<SinkRecord> records,
> > > >                 Map<SinkRecord, TopicPartition> updatedTopicPartitions) {
> > > >     put(records);
> > > > }
> > > >
> > > > then a `SinkTask` implementation that wants to use this new feature could
> > > > simply implement both methods:
> > > >
> > > > public void put(Collection<SinkRecord> records) {
> > > >     // Running in an older runtime, so no tracking of SMT-modified
> > > >     // topic names or partitions
> > > >     put(records, Map.of());
> > > > }
> > > >
> > > > public void put(Collection<SinkRecord> records,
> > > >                 Map<SinkRecord, TopicPartition> updatedTopicPartitions) {
> > > >     // real logic here
> > > > }
> > > >
> > > > This seems a lot easier than having to use try-catch logic, yet still
> > > > allows sink connectors to utilize the new functionality and still work
> > > > with older Connect runtimes.
> > > >
> > > > WDYT?
> > > >
> > > > Randall
> > > >
> > > >
> > > > On Thu, Sep 8, 2022 at 7:03 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to (re)start a new discussion thread on KIP-793 (Kafka
> > > > > Connect) which proposes some additions to the public SinkRecord
> > > > > interface in order to support topic mutating SMTs for sink connectors
> > > > > that do their own offset tracking.
> > > > >
> > > > > Links:
> > > > >
> > > > > KIP:
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=191336830
> > > > >
> > > > > Older discussion thread:
> > > > > https://lists.apache.org/thread/00kcth6057jdcsyzgy1x8nb2s1cymy8h,
> > > > > https://lists.apache.org/thread/rzqkm0q5y5v3vdjhg8wqppxbkw7nyopj
> > > > >
> > > > > Jira: https://issues.apache.org/jira/browse/KAFKA-13431
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > >
> > >
> >
>
