> Hmm, operating on ConnectRecords probably doesn't work since you need to
> emit the right type of record, which might mean instantiating a new one. I
> think that means we either need 2 methods, one for SourceRecord, one for
> SinkRecord, or we'd need to limit what parts of the message you can modify
> (e.g. you can change the key/value via something like
> transformKey(ConnectRecord) and transformValue(ConnectRecord), but other
> fields would remain the same and the fmwk would handle allocating new
> Source/SinkRecords if needed)
Good point, perhaps we could add an abstract method on ConnectRecord that
takes all the shared fields as parameters and the implementations return a
copy of the narrower SourceRecord/SinkRecord type as appropriate (see the
first sketch at the end of this message). Transformers would only operate
on ConnectRecord rather than caring about SourceRecord or SinkRecord (in
theory they could instanceof/cast, but the API should discourage it).

> Is there a use case for hanging on to the original? I can't think of a
> transformation where you'd need to do that (or couldn't just order things
> differently so it isn't a problem).

Yeah, maybe this isn't really necessary. No strong preference here.

> That said, I do worry a bit that farming too much stuff out to
> transformers can result in "programming via config", i.e. a lot of the
> simplicity you get from Connect disappears in long config files.
> Standardization would be nice and might just avoid this (and doesn't cost
> that much implementing it in each connector), and I'd personally prefer
> something a bit less flexible but consistent and easy to configure.

Not sure what you're suggesting :-) Standardized config properties for a
small set of transformations, leaving it up to connectors to integrate?

> Personally I'm skeptical of that level of flexibility in transformers --
> its getting awfully complex and certainly takes us pretty far from
> "config only" realtime data integration. It's not clear to me what the
> use cases are that aren't covered by a small set of common
> transformations that can be chained together (e.g. rename/remove fields,
> mask values, and maybe a couple more).

I agree that we should have some standard transformations that we ship with
Connect and that users would ideally lean towards for routine tasks. The
ones you mention are good candidates where I'd imagine we can expose simple
config, e.g.

transform.filter.whitelist=x,y,z # filter to a whitelist of fields
transform.rename.spec=oldName1=>newName1, oldName2=>newName2
topic.rename.replace=-/_
topic.rename.prefix=kafka_

etc.

However, the ecosystem will invariably have more complex transformers if we
make this pluggable. And because ETL is messy, that's probably a good thing
if folks are able to do their data munging orthogonally to connectors, so
that connectors can focus on the logic of how data should be copied from/to
datastores and Kafka.

> In any case, we'd probably also have to change configs of connectors if
> we allowed configs like that since presumably transformer configs will
> just be part of the connector config.

Yeah, I haven't thought much about how all the configuration would tie
together... I think we'd need the ability to:

- spec the transformer chain (fully-qualified class names? perhaps special
  aliases for built-in ones? perhaps third-party FQCNs can be assigned
  aliases by users in the chain spec, for easier configuration and to
  uniquely identify a transformation when it occurs more than once in a
  chain?)
- configure each transformer -- all properties prefixed with that
  transformer's ID (FQCN / alias) get routed to it

(See the configuration sketch at the end of this message for one way this
could look.)

Additionally, I think we would probably want to allow for topic-specific
overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you
want certain transformations for one topic, but different ones for
another...)
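
To make the copy-method idea concrete, here's a rough sketch of what it
might look like. All of the names here (newRecord, the accessors, the
parameter list) are invented for illustration, not taken from the existing
API:

import java.util.Map;
import org.apache.kafka.connect.data.Schema;

// Illustrative sketch only: method and accessor names are hypothetical.
public abstract class ConnectRecord {
    private final String topic;
    private final Integer kafkaPartition;
    private final Schema keySchema;
    private final Object key;
    private final Schema valueSchema;
    private final Object value;

    protected ConnectRecord(String topic, Integer kafkaPartition,
                            Schema keySchema, Object key,
                            Schema valueSchema, Object value) {
        this.topic = topic;
        this.kafkaPartition = kafkaPartition;
        this.keySchema = keySchema;
        this.key = key;
        this.valueSchema = valueSchema;
        this.value = value;
    }

    public String topic() { return topic; }
    public Integer kafkaPartition() { return kafkaPartition; }
    public Schema keySchema() { return keySchema; }
    public Object key() { return key; }
    public Schema valueSchema() { return valueSchema; }
    public Object value() { return value; }

    // Each subclass returns a copy of its own narrower type with the shared
    // fields replaced, carrying its type-specific fields over unchanged.
    public abstract ConnectRecord newRecord(String topic, Integer kafkaPartition,
                                            Schema keySchema, Object key,
                                            Schema valueSchema, Object value);
}

class SourceRecord extends ConnectRecord {
    private final Map<String, ?> sourcePartition;
    private final Map<String, ?> sourceOffset;

    public SourceRecord(Map<String, ?> sourcePartition, Map<String, ?> sourceOffset,
                        String topic, Integer kafkaPartition,
                        Schema keySchema, Object key,
                        Schema valueSchema, Object value) {
        super(topic, kafkaPartition, keySchema, key, valueSchema, value);
        this.sourcePartition = sourcePartition;
        this.sourceOffset = sourceOffset;
    }

    @Override
    public SourceRecord newRecord(String topic, Integer kafkaPartition,
                                  Schema keySchema, Object key,
                                  Schema valueSchema, Object value) {
        // Source partition/offset are preserved; only shared fields change.
        return new SourceRecord(sourcePartition, sourceOffset, topic,
                                kafkaPartition, keySchema, key, valueSchema, value);
    }
}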
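
A transformer would then only ever see the base type, and chaining is just
applying the transformers in order. Again, all naming here is hypothetical:

import java.util.List;

// Continuing the sketch above: transformers operate purely on ConnectRecord.
public interface Transformation {
    ConnectRecord apply(ConnectRecord record);
}

// Example: prefix the topic name. Because it goes through newRecord(), the
// same transformer works for source and sink records alike.
class TopicPrefix implements Transformation {
    private final String prefix;

    TopicPrefix(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public ConnectRecord apply(ConnectRecord r) {
        return r.newRecord(prefix + r.topic(), r.kafkaPartition(),
                           r.keySchema(), r.key(), r.valueSchema(), r.value());
    }
}

// The framework would apply a configured chain in order:
class TransformationChain {
    static ConnectRecord apply(List<Transformation> chain, ConnectRecord record) {
        ConnectRecord current = record;
        for (Transformation t : chain) {
            current = t.apply(current);
        }
        return current;
    }
}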
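
And on the configuration side, one possible shape for the chain spec and
per-transformer properties, in the same style as the examples earlier in
this message (all property names invented for illustration):

# Hypothetical connector config -- property names are invented.
# The chain: two built-in aliases plus a third-party transformer the user
# has bound to the alias "munge", so it can be configured (and even appear
# more than once in the chain) unambiguously.
transform.chain=filter,rename,munge
transform.munge.class=com.example.MungeTransformer

# Properties prefixed with a transformer's alias get routed to it.
transform.filter.whitelist=x,y,z
transform.rename.spec=oldName1=>newName1,oldName2=>newName2
transform.munge.mode=aggressive

# A topic-specific override in the spirit of KAFKA-3962: a different chain
# for one particular topic.
transform.chain.some-topic=rename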