I agree about the ease of use of adding a small subset of built-in
transformations.

But the same thing is true for connectors - there are maybe 5 super popular
OSS connectors and the rest is a very long tail. We drew the line at not
adding any, because that's the easiest and because we did not want to turn
Kafka into a collection of transformations.

I really don't want to end up with 135 (or even 20) transformations in
Kafka. So either we have a super-clear definition of what belongs and what
doesn't - or we put in one minimal example and the rest goes into the
ecosystem.

We can also start by putting transformations on GitHub and just see if
there is huge demand for them in Apache. It is easier to add stuff to the
project later than to remove functionality.



On Thu, Dec 15, 2016 at 11:59 AM, Shikhar Bhushan <shik...@confluent.io>
wrote:

> I have updated KIP-66
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect
> with the changes I proposed in the design.
>
> Gwen, I think the main downside to not including some transformations with
> Kafka Connect is that it seems less user-friendly if folks have to make
> sure to have the right transformation(s) on the classpath as well, besides
> their connector(s). Additionally, by going in with a small set included, we
> can encourage a consistent configuration and implementation style and
> provide utilities for e.g. data transformations, which I expect we will
> definitely need (discussed under 'Patterns for data transformations').
>
> It does get hard to draw the line once you go from 'none' to 'some'. To
> get discussion going in case we agree on 'some' rather than 'none', I
> added a table under 'Bundled transformations' for transformations which I
> think are worth including.
>
> For many of these, I have noticed their absence in the wild as a pain
> point --
> TimestampRouter:
> https://github.com/confluentinc/kafka-connect-elasticsearch/issues/33
> Mask:
> https://groups.google.com/d/msg/confluent-platform/3yHb8_mCReQ/sTQc3dNgBwAJ
> Insert:
> http://stackoverflow.com/questions/40664745/elasticsearch-connector-for-kafka-connect-offset-and-timestamp
> RegexRouter:
> https://groups.google.com/d/msg/confluent-platform/yEBwu1rGcs0/gIAhRp6kBwAJ
> NumericCast:
> https://github.com/confluentinc/kafka-connect-jdbc/issues/101#issuecomment-249096119
> TimestampConverter:
> https://groups.google.com/d/msg/confluent-platform/gGAOsw3Qeu4/8JCqdDhGBwAJ
> ValueToKey: https://github.com/confluentinc/kafka-connect-jdbc/pull/166
>
> In other cases, their functionality is already being implemented by
> connectors in divergent ways: RegexRouter, Insert, Replace, HoistToStruct,
> ExtractFromStruct
>
> On Wed, Dec 14, 2016 at 6:00 PM Gwen Shapira <g...@confluent.io> wrote:
>
> I'm a bit concerned about adding transformations in Kafka. NiFi has 150
> processors, presumably they are all useful for someone. I don't know if I'd
> want all of that in Apache Kafka. What's the downside of keeping it out? Or
> at least keeping the built-in set super minimal (Flume has like 3 built-in
> interceptors)?
>
> Gwen
>
> On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <shik...@confluent.io>
> wrote:
>
> > With regard to a), just using `ConnectRecord` with `newRecord` as a new
> > abstract method would be a fine choice. In prototyping, both options end
> > up looking pretty similar (in terms of how transformations are
> > implemented and how the runtime initializes and uses them), and I'm
> > starting to lean towards not adding a new interface into the mix.
> >
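The `newRecord`-as-abstract-method option discussed here can be sketched roughly as follows. All class and method names below are illustrative stand-ins, not the actual Connect API; the point is only that one transformation implementation can serve both record types:

```java
// Hypothetical sketch (names are illustrative, not the real Kafka Connect
// API): a base record type with an abstract newRecord(...) lets a single
// transformation implementation cover both source and sink records.
abstract class BaseRecord {
    final String topic;
    final Object value;

    BaseRecord(String topic, Object value) {
        this.topic = topic;
        this.value = value;
    }

    // Subclasses return a copy of their own narrower type, so callers
    // never need to know which concrete record they are holding.
    abstract BaseRecord newRecord(String topic, Object value);
}

class SourceRecordSketch extends BaseRecord {
    SourceRecordSketch(String topic, Object value) { super(topic, value); }

    @Override
    BaseRecord newRecord(String topic, Object value) {
        return new SourceRecordSketch(topic, value);
    }
}

class SinkRecordSketch extends BaseRecord {
    SinkRecordSketch(String topic, Object value) { super(topic, value); }

    @Override
    BaseRecord newRecord(String topic, Object value) {
        return new SinkRecordSketch(topic, value);
    }
}

public class TopicPrefixDemo {
    // One transformation, written once against the base type.
    static BaseRecord prefixTopic(BaseRecord r, String prefix) {
        return r.newRecord(prefix + r.topic, r.value);
    }

    public static void main(String[] args) {
        BaseRecord src = prefixTopic(new SourceRecordSketch("orders", "v"), "kafka_");
        BaseRecord snk = prefixTopic(new SinkRecordSketch("orders", "v"), "kafka_");
        System.out.println(src.topic + " " + src.getClass().getSimpleName());
        System.out.println(snk.topic + " " + snk.getClass().getSimpleName());
    }
}
```

The transformation never does an instanceof check or cast; the covariant copy method preserves the concrete type for it.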
> > On b) I think we should include a small set of useful transformations
> > with Connect, since they can be applicable across different connectors
> > and we should encourage some standardization for common operations. I'll
> > update KIP-66 soon, including a spec of transformations that I believe
> > are worth including.
> >
> > On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> > If anyone has time to review here, it'd be great to get feedback. I'd
> > imagine that the proposal itself won't be too controversial -- it keeps
> > transformations simple (by only allowing map/filter), doesn't affect the
> > rest of the framework much, and fits in with the general config structure
> > we've used elsewhere (although ConfigDef could use some updates to make
> > this easier...).
> >
> > I think the main open questions for me are:
> >
> > a) Is TransformableRecord worth it to avoid reimplementing small bits of
> > code (it allows a single implementation of the interface to trivially
> > apply to both Source and SinkRecords)? I think I prefer this, but it does
> > come with some commitment to another interface on top of ConnectRecord.
> > We could alternatively modify ConnectRecord, which would require fewer
> > changes.
> > b) How do folks feel about built-in transformations and the set that is
> > mentioned here? This brings us way back to the discussion of built-in
> > connectors. Transformations, especially when intended to be lightweight
> > and touch nothing besides the data already in the record, seem different
> > from connectors -- there might be quite a few, but hopefully a limited
> > number. Since we (hopefully) already factor out most
> > serialization-specific stuff via Converters, I think we can keep this
> > pretty limited. That said, I have no doubt some folks will (in my
> > opinion) abuse this feature to do data enrichment by querying external
> > systems, so building a bunch of transformations in could potentially
> > open the floodgates, or at least make decisions about what is included
> > vs. what should be 3rd-party muddy.
> >
> > -Ewen
> >
> >
> > On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <shik...@confluent.io>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have another iteration at a proposal for this feature here:
> > > https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
> > >
> > > I'd welcome your feedback and comments.
> > >
> > > Thanks,
> > >
> > > Shikhar
> > >
> > > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <e...@confluent.io>
> > > wrote:
> > >
> > > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <shik...@confluent.io>
> > > wrote:
> > >
> > > > >
> > > > >
> > > > > Hmm, operating on ConnectRecords probably doesn't work, since you
> > > > > need to emit the right type of record, which might mean
> > > > > instantiating a new one. I think that means we either need 2
> > > > > methods, one for SourceRecord, one for SinkRecord, or we'd need to
> > > > > limit what parts of the message you can modify (e.g. you can change
> > > > > the key/value via something like transformKey(ConnectRecord) and
> > > > > transformValue(ConnectRecord), but other fields would remain the
> > > > > same and the framework would handle allocating new
> > > > > Source/SinkRecords if needed)
> > > > >
> > > >
> > > > Good point, perhaps we could add an abstract method on ConnectRecord
> > > > that takes all the shared fields as parameters, and the
> > > > implementations return a copy of the narrower
> > > > SourceRecord/SinkRecord type as appropriate. Transformers would only
> > > > operate on ConnectRecord rather than caring about SourceRecord or
> > > > SinkRecord (in theory they could instanceof/cast, but the API should
> > > > discourage it).
> > > >
> > > >
> > > > > Is there a use case for hanging on to the original? I can't think
> > > > > of a transformation where you'd need to do that (or couldn't just
> > > > > order things differently so it isn't a problem).
> > > >
> > > >
> > > > Yeah maybe this isn't really necessary. No strong preference here.
> > > >
> > > > > That said, I do worry a bit that farming too much stuff out to
> > > > > transformers can result in "programming via config", i.e. a lot of
> > > > > the simplicity you get from Connect disappears in long config
> > > > > files. Standardization would be nice and might just avoid this (and
> > > > > doesn't cost that much implementing it in each connector), and I'd
> > > > > personally prefer something a bit less flexible but consistent and
> > > > > easy to configure.
> > > >
> > > >
> > > > Not sure what you're suggesting :-) Standardized config properties
> > > > for a small set of transformations, leaving it up to connectors to
> > > > integrate?
> > > >
> > >
> > > I just mean that you get to the point where you're practically writing
> > > a Kafka Streams application, you're just doing it through either an
> > > incredibly convoluted set of transformers and configs, or a single
> > > transformer with an incredibly convoluted set of configs. You basically
> > > get to the point where your config is a mini DSL and you're not really
> > > saving that much.
> > >
> > > The real question is how much we want to venture into the "T" part of
> > > ETL. I tend to favor minimizing how much we take on, since the rest of
> > > Connect isn't designed for it; it's designed around the E & L parts.
> > >
> > > -Ewen
> > >
> > >
> > > > > Personally, I'm skeptical of that level of flexibility in
> > > > > transformers -- it's getting awfully complex and certainly takes us
> > > > > pretty far from "config only" realtime data integration. It's not
> > > > > clear to me what the use cases are that aren't covered by a small
> > > > > set of common transformations that can be chained together (e.g.
> > > > > rename/remove fields, mask values, and maybe a couple more).
> > > > >
> > > >
> > > > I agree that we should have some standard transformations that we
> > > > ship with Connect, which users would ideally lean towards for routine
> > > > tasks. The ones you mention are some good candidates where I'd
> > > > imagine we can expose simple config, e.g.
> > > >    transform.filter.whitelist=x,y,z # filter to a whitelist of fields
> > > >    transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > >    topic.rename.replace=-/_
> > > >    topic.rename.prefix=kafka_
> > > > etc.
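A rename spec in the style shown above could be parsed and applied along these lines. This is only a hypothetical illustration of the idea, with a record value modeled as a plain Map; the class name and spec format are taken from the example config, not from any actual Connect API:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a field-rename transform driven by a config spec
// like "oldName1=>newName1, oldName2=>newName2".
public class RenameTransform {
    private final Map<String, String> renames = new HashMap<>();

    RenameTransform(String spec) {
        // Parse "old=>new" pairs separated by commas.
        for (String mapping : spec.split(",")) {
            String[] parts = mapping.trim().split("=>");
            renames.put(parts[0].trim(), parts[1].trim());
        }
    }

    // Return a copy of the value map with configured fields renamed;
    // unmapped fields pass through unchanged.
    Map<String, Object> apply(Map<String, Object> value) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : value.entrySet()) {
            out.put(renames.getOrDefault(e.getKey(), e.getKey()), e.getValue());
        }
        return out;
    }
}
```

A transform this small is exactly the kind of routine task where a bundled, consistently configured implementation would save every connector from reinventing it.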
> > > >
> > > > However, the ecosystem will invariably have more complex
> > > > transformers if we make this pluggable. And because ETL is messy,
> > > > that's probably a good thing if folks are able to do their data
> > > > munging orthogonally to connectors, so that connectors can focus on
> > > > the logic of how data should be copied from/to datastores and Kafka.
> > > >
> > > >
> > > > > In any case, we'd probably also have to change configs of
> > > > > connectors if we allowed configs like that, since presumably
> > > > > transformer configs will just be part of the connector config.
> > > > >
> > > >
> > > > Yeah, haven't thought much about how all the configuration would tie
> > > > together...
> > > >
> > > > I think we'd need the ability to:
> > > > - spec the transformer chain (fully-qualified class names? perhaps
> > > > special aliases for built-in ones? perhaps third-party FQCNs can be
> > > > assigned aliases by users in the chain spec, for easier configuration
> > > > and to uniquely identify a transformation when it occurs more than
> > > > once in a chain?)
> > > > - configure each transformer -- all properties prefixed with that
> > > > transformer's ID (FQCN / alias) get routed to it
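Taken together, a chain spec with user-assigned aliases plus alias-prefixed per-transformer properties might look like the fragment below. Every property and class name here is purely hypothetical, invented to illustrate the shape of the configuration, not anything specified by the KIP:

```properties
# Hypothetical connector config sketch: two chained transformers with
# user-assigned aliases ("mask", "router"); properties prefixed with an
# alias are routed to that transformer.
transform.chain=mask,router
transform.mask.class=org.example.MaskField
transform.mask.fields=ssn,credit_card
transform.router.class=org.example.RegexRouter
transform.router.regex=(.*)
transform.router.replacement=kafka_$1
```

Aliases also solve the uniqueness problem raised above: the same transformer class could appear twice in the chain under two aliases, each with its own configuration.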
> > > >
> > > > Additionally, I think we would probably want to allow for
> > > > topic-specific overrides
> > > > <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you want
> > > > certain transformations for one topic, but different ones for
> > > > another...)
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Ewen
> > >
> >
>
>
>
> --
> *Gwen Shapira*
> Product Manager | Confluent
> 650.450.2760 <(650)%20450-2760> | @gwenshap
> Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
> <http://www.confluent.io/blog>
>



-- 
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog
<http://www.confluent.io/blog>
