I'm a bit concerned about adding transformations in Kafka. NiFi has 150 processors, presumably they are all useful for someone. I don't know if I'd want all of that in Apache Kafka. What's the downside of keeping it out? Or at least keeping the built-in set super minimal (Flume has like 3 built-in interceptors)?
Gwen

On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <shik...@confluent.io> wrote:

> With regard to a), just using `ConnectRecord` with `newRecord` as a new abstract method would be a fine choice. In prototyping, both options end up looking pretty similar (in terms of how transformations are implemented and how the runtime initializes and uses them), and I'm starting to lean towards not adding a new interface into the mix.
>
> On b), I think we should include a small set of useful transformations with Connect, since they can be applicable across different connectors and we should encourage some standardization for common operations. I'll update KIP-66 soon with a spec of the transformations that I believe are worth including.
>
> On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> If anyone has time to review here, it'd be great to get feedback. I'd imagine that the proposal itself won't be too controversial -- it keeps transformations simple (by only allowing map/filter), doesn't affect the rest of the framework much, and fits in with the general config structure we've used elsewhere (although ConfigDef could use some updates to make this easier...).
>
> I think the main open questions for me are:
>
> a) Is TransformableRecord worth it to avoid reimplementing small bits of code? (It allows a single implementation of the interface to trivially apply to both SourceRecords and SinkRecords.) I think I prefer this, but it does come with some commitment to another interface on top of ConnectRecord. We could alternatively modify ConnectRecord, which would require fewer changes.
>
> b) How do folks feel about built-in transformations and the set that are mentioned here? This brings us way back to the discussion of built-in connectors.
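For a concrete sense of option (a), a minimal sketch of the `newRecord`-on-the-record-base-class shape might look like the following. All class and method names here are hypothetical stand-ins, not the actual Connect API:

```java
// Sketch of option (a): an abstract newRecord(...) factory on a shared
// record base class, so one Transformation implementation works for both
// source and sink records. Names are illustrative, not the real Connect API.
abstract class BaseRecord<R extends BaseRecord<R>> {
    final String topic;
    final Object key;
    final Object value;

    BaseRecord(String topic, Object key, Object value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }

    // Concrete subclasses return a copy of their own narrower type.
    abstract R newRecord(String topic, Object key, Object value);
}

// Hypothetical stand-in for SinkRecord.
class FakeSinkRecord extends BaseRecord<FakeSinkRecord> {
    FakeSinkRecord(String topic, Object key, Object value) {
        super(topic, key, value);
    }

    @Override
    FakeSinkRecord newRecord(String topic, Object key, Object value) {
        return new FakeSinkRecord(topic, key, value);
    }
}

// A transformation then needs only one implementation against the base type.
interface Transformation<R extends BaseRecord<R>> {
    R apply(R record); // returning null could signal "filter this record out"
}

public class TransformSketch {
    public static void main(String[] args) {
        // Example: mask the value while preserving topic and key.
        Transformation<FakeSinkRecord> maskValue =
                r -> r.newRecord(r.topic, r.key, null);

        FakeSinkRecord out = maskValue.apply(
                new FakeSinkRecord("orders", "k1", "secret"));
        System.out.println(out.topic + "/" + out.key + "/" + out.value);
    }
}
```

The self-referential generic bound is what lets a single transformation apply to both record types without casting, which is the reimplementation-avoidance benefit discussed above.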
> Transformations, especially when intended to be lightweight and touch nothing besides the data already in the record, seem different from connectors -- there might be quite a few, but hopefully a limited set. Since we (hopefully) already factor out most serialization-specific stuff via Converters, I think we can keep this pretty limited. That said, I have no doubt some folks will (in my opinion) abuse this feature to do data enrichment by querying external systems, so building a bunch of transformations in could potentially open the floodgates, or at least make decisions about what is included vs. what should be 3rd party muddy.
>
> -Ewen
>
> On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <shik...@confluent.io> wrote:
>
> > Hi all,
> >
> > I have another iteration at a proposal for this feature here:
> > https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
> >
> > I'd welcome your feedback and comments.
> >
> > Thanks,
> >
> > Shikhar
> >
> > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >
> > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <shik...@confluent.io> wrote:
> >
> > > > Hmm, operating on ConnectRecords probably doesn't work since you need to emit the right type of record, which might mean instantiating a new one. I think that means we either need 2 methods, one for SourceRecord, one for SinkRecord, or we'd need to limit what parts of the message you can modify (e.g. you can change the key/value via something like transformKey(ConnectRecord) and transformValue(ConnectRecord), but other fields would remain the same and the framework would handle allocating new Source/SinkRecords if needed).
> > >
> > > Good point, perhaps we could add an abstract method on ConnectRecord that takes all the shared fields as parameters, and the implementations return a copy of the narrower SourceRecord/SinkRecord type as appropriate. Transformers would only operate on ConnectRecord rather than caring about SourceRecord or SinkRecord (in theory they could instanceof/cast, but the API should discourage it).
> > >
> > > > Is there a use case for hanging on to the original? I can't think of a transformation where you'd need to do that (or couldn't just order things differently so it isn't a problem).
> > >
> > > Yeah, maybe this isn't really necessary. No strong preference here.
> > >
> > > > That said, I do worry a bit that farming too much stuff out to transformers can result in "programming via config", i.e. a lot of the simplicity you get from Connect disappears in long config files. Standardization would be nice and might just avoid this (and doesn't cost that much implementing it in each connector), and I'd personally prefer something a bit less flexible but consistent and easy to configure.
> > >
> > > Not sure what you're suggesting :-) Standardized config properties for a small set of transformations, leaving it up to connectors to integrate?
> >
> > I just mean that you get to the point where you're practically writing a Kafka Streams application; you're just doing it through either an incredibly convoluted set of transformers and configs, or a single transformer with an incredibly convoluted set of configs.
> > You basically get to the point where your config is a mini DSL and you're not really saving that much.
> >
> > The real question is how much we want to venture into the "T" part of ETL. I tend to favor minimizing how much we take on, since the rest of Connect isn't designed for it; it's designed around the E & L parts.
> >
> > -Ewen
> >
> > > > Personally I'm skeptical of that level of flexibility in transformers -- it's getting awfully complex and certainly takes us pretty far from "config only" realtime data integration. It's not clear to me what the use cases are that aren't covered by a small set of common transformations that can be chained together (e.g. rename/remove fields, mask values, and maybe a couple more).
> > >
> > > I agree that we should have some standard transformations that we ship with Connect that users would ideally lean towards for routine tasks. The ones you mention are some good candidates where I'd imagine we can expose simple config, e.g.:
> > >
> > > transform.filter.whitelist=x,y,z  # filter to a whitelist of fields
> > > transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > topic.rename.replace=-/_
> > > topic.rename.prefix=kafka_
> > >
> > > etc.
> > >
> > > However, the ecosystem will invariably have more complex transformers if we make this pluggable. And because ETL is messy, that's probably a good thing if folks are able to do their data munging orthogonally to connectors, so that connectors can focus on the logic of how data should be copied from/to datastores and Kafka.
> > >
> > > > In any case, we'd probably also have to change configs of connectors if we allowed configs like that, since presumably transformer configs will just be part of the connector config.
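Pulled together, the kind of flat, prefix-style properties sketched above would presumably sit directly in a connector's config. A hypothetical example, using the illustrative property names from this mail (not the final KIP-66 syntax) alongside the stock file-sink quickstart settings:

```properties
# Ordinary connector config...
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test

# ...with standard transformations configured via flat properties
# (property names illustrative, from this thread, not the final KIP-66 spec):
transform.filter.whitelist=x,y,z
transform.rename.spec=oldName1=>newName1,oldName2=>newName2
topic.rename.replace=-/_
topic.rename.prefix=kafka_
```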
> > > Yeah, haven't thought much about how all the configuration would tie together...
> > >
> > > I think we'd need the ability to:
> > > - spec the transformer chain (fully-qualified class names? perhaps special aliases for built-in ones? perhaps third-party FQCNs can be assigned aliases by users in the chain spec, for easier configuration and to uniquely identify a transformation when it occurs more than once in a chain?)
> > > - configure each transformer -- all properties prefixed with that transformer's ID (FQCN / alias) get routed to it
> > >
> > > Additionally, I think we would probably want to allow for topic-specific overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you want certain transformations for one topic, but different ones for another...)
> >
> > --
> > Thanks,
> > Ewen

--
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog <http://www.confluent.io/blog>
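The alias-plus-prefix scheme Shikhar describes (a chain spec naming each transformer, with every property under that alias routed to its instance) can be sketched roughly as follows. The `transforms` / `transforms.<alias>.` property names and the `org.example.*` classes are hypothetical; this mail predates the final KIP-66 config syntax:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of alias-based transformer-chain configuration: a "transforms"
// list declares aliases in chain order, and properties prefixed with
// "transforms.<alias>." are routed to that transformation instance.
// Property names are illustrative, not the final KIP-66 syntax.
public class TransformConfigSketch {

    static Map<String, Map<String, String>> groupByAlias(Map<String, String> props) {
        String[] aliases = props.getOrDefault("transforms", "").split(",");
        Map<String, Map<String, String>> grouped = new LinkedHashMap<>();
        for (String rawAlias : aliases) {
            String alias = rawAlias.trim();
            String prefix = "transforms." + alias + ".";
            Map<String, String> own = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : props.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    // Strip the prefix so each transformer sees only its own keys.
                    own.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            grouped.put(alias, own);
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("transforms", "rename, mask");
        props.put("transforms.rename.type", "org.example.Rename");   // hypothetical class
        props.put("transforms.rename.spec", "old_id=>id");
        props.put("transforms.mask.type", "org.example.MaskField");  // hypothetical class
        props.put("transforms.mask.fields", "ssn");

        Map<String, Map<String, String>> grouped = groupByAlias(props);
        System.out.println(grouped.get("rename").get("spec"));
        System.out.println(grouped.get("mask").get("fields"));
    }
}
```

Aliases also solve the duplicate-transformer problem mentioned above: the same class can appear twice in the chain under two aliases, each with its own isolated config namespace.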