Hi all,

I have another iteration of the proposal for this feature here:
https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
I'd welcome your feedback and comments.

Thanks,

Shikhar

On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <e...@confluent.io> wrote:

On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <shik...@confluent.io> wrote:

> > Hmm, operating on ConnectRecords probably doesn't work since you need to
> > emit the right type of record, which might mean instantiating a new one. I
> > think that means we either need 2 methods, one for SourceRecord, one for
> > SinkRecord, or we'd need to limit what parts of the message you can modify
> > (e.g. you can change the key/value via something like
> > transformKey(ConnectRecord) and transformValue(ConnectRecord), but other
> > fields would remain the same and the fmwk would handle allocating new
> > Source/SinkRecords if needed)
>
> Good point, perhaps we could add an abstract method on ConnectRecord that
> takes all the shared fields as parameters and the implementations return a
> copy of the narrower SourceRecord/SinkRecord type as appropriate.
> Transformers would only operate on ConnectRecord rather than caring about
> SourceRecord or SinkRecord (in theory they could instanceof/cast, but the
> API should discourage it).
>
> > Is there a use case for hanging on to the original? I can't think of a
> > transformation where you'd need to do that (or couldn't just order things
> > differently so it isn't a problem).
>
> Yeah, maybe this isn't really necessary. No strong preference here.
>
> > That said, I do worry a bit that farming too much stuff out to transformers
> > can result in "programming via config", i.e. a lot of the simplicity you
> > get from Connect disappears in long config files. Standardization would be
> > nice and might just avoid this (and doesn't cost that much implementing it
> > in each connector), and I'd personally prefer something a bit less flexible
> > but consistent and easy to configure.
>
> Not sure what you're suggesting :-) Standardized config properties for a
> small set of transformations, leaving it up to connectors to integrate?

I just mean that you get to the point where you're practically writing a
Kafka Streams application, you're just doing it through either an incredibly
convoluted set of transformers and configs, or a single transformer with an
incredibly convoluted set of configs. You basically get to the point where
your config is a mini DSL and you're not really saving that much.

The real question is how much we want to venture into the "T" part of ETL. I
tend to favor minimizing how much we take on, since the rest of Connect isn't
designed for it; it's designed around the E & L parts.

-Ewen

> > Personally I'm skeptical of that level of flexibility in transformers --
> > it's getting awfully complex and certainly takes us pretty far from
> > "config only" realtime data integration. It's not clear to me what the use
> > cases are that aren't covered by a small set of common transformations
> > that can be chained together (e.g. rename/remove fields, mask values, and
> > maybe a couple more).
>
> I agree that we should have some standard transformations that we ship with
> Connect that users would ideally lean towards for routine tasks. The ones
> you mention are some good candidates where I'd imagine we can expose simple
> config, e.g.
>
> transform.filter.whitelist=x,y,z  # filter to a whitelist of fields
> transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> topic.rename.replace=-/_
> topic.rename.prefix=kafka_
>
> etc.
> However, the ecosystem will invariably have more complex transformers if we
> make this pluggable. And because ETL is messy, that's probably a good thing
> if folks are able to do their data munging orthogonally to connectors, so
> that connectors can focus on the logic of how data should be copied from/to
> datastores and Kafka.
>
> > In any case, we'd probably also have to change configs of connectors if we
> > allowed configs like that, since presumably transformer configs will just
> > be part of the connector config.
>
> Yeah, haven't thought much about how all the configuration would tie
> together...
>
> I think we'd need the ability to:
>
> - spec the transformer chain (fully-qualified class names? perhaps special
>   aliases for built-in ones? perhaps third-party FQCNs can be assigned
>   aliases by users in the chain spec, for easier configuration and to
>   uniquely identify a transformation when it occurs more than once in a
>   chain?)
> - configure each transformer -- all properties prefixed with that
>   transformer's ID (FQCN / alias) get routed to it
>
> Additionally, I think we would probably want to allow for topic-specific
> overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you want
> certain transformations for one topic, but different ones for another...)

--
Thanks,
Ewen
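For illustration, a rough sketch of the ConnectRecord copy-method and
transformer interface being discussed above. All names and signatures below
are hypothetical (none of this exists in Connect today); it is only meant to
make the shape of the idea concrete.

    import java.util.Map;
    import org.apache.kafka.connect.data.Schema;

    // Sketched together here; these would live in separate files.

    // Hypothetical: ConnectRecord gains an abstract "copy with replacements"
    // method, so SourceRecord/SinkRecord each allocate their own concrete
    // type and transformers only ever deal with the base class.
    public abstract class ConnectRecord<R extends ConnectRecord<R>> {
        // ... existing shared fields/accessors: topic, kafkaPartition,
        // keySchema, key, valueSchema, value ...

        public abstract R newRecord(String topic, Integer kafkaPartition,
                                    Schema keySchema, Object key,
                                    Schema valueSchema, Object value);
    }

    // Hypothetical transformer SPI: configured from its slice of the
    // connector config, then applied to every record flowing through.
    public interface Transformation<R extends ConnectRecord<R>> {
        void configure(Map<String, ?> configs);

        // Return the record to pass downstream, typically built via
        // record.newRecord(...); returning the input unchanged is also fine.
        R apply(R record);
    }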
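Similarly, to make the chain/alias/prefix configuration scheme listed above
concrete, here is one possible shape for a connector config. The property
names, the aliasing convention, and the com.example transformation classes
are purely illustrative, not a settled design:

    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    file=/tmp/test.txt
    topic=connect-test

    # hypothetical: ordered chain of transformer aliases
    transforms=rename,prefix

    # hypothetical: each alias maps to an implementation, and all properties
    # prefixed with the alias are routed to that transformer
    transforms.rename.type=com.example.RenameField
    transforms.rename.spec=oldName1=>newName1,oldName2=>newName2
    transforms.prefix.type=com.example.TopicPrefix
    transforms.prefix.prefix=kafka_

Topic-specific overrides (the KAFKA-3962 idea) would presumably be layered on
top of whatever scheme ends up being chosen here.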