I'm a bit concerned about adding transformations in Kafka. NiFi has 150 processors, presumably they are all useful for someone. I don't know if I'd want all of that in Apache Kafka. What's the downside of keeping it out? Or at least keeping the built-in set super minimal (Flume has like 3 built-in interceptors)?
Gwen

On Wed, Dec 14, 2016 at 1:36 PM, Shikhar Bhushan <shik...@confluent.io> wrote:

> With regard to a), just using `ConnectRecord` with `newRecord` as a new abstract method would be a fine choice. In prototyping, both options end up looking pretty similar (in terms of how transformations are implemented and how the runtime initializes and uses them), and I'm starting to lean towards not adding a new interface into the mix.
>
> On b), I think we should include a small set of useful transformations with Connect, since they can be applicable across different connectors and we should encourage some standardization for common operations. I'll update KIP-66 soon with a spec of the transformations that I believe are worth including.
>
> On Sat, Dec 10, 2016 at 11:52 PM Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> If anyone has time to review here, it'd be great to get feedback. I'd imagine that the proposal itself won't be too controversial -- it keeps transformations simple (by only allowing map/filter), doesn't affect the rest of the framework much, and fits in with the general config structure we've used elsewhere (although ConfigDef could use some updates to make this easier...).
>
> I think the main open questions for me are:
>
> a) Is TransformableRecord worth it to avoid reimplementing small bits of code? (It allows a single implementation of the interface to trivially apply to both SourceRecords and SinkRecords.) I think I prefer this, but it does come with some commitment to another interface on top of ConnectRecord. We could alternatively modify ConnectRecord, which would require fewer changes.
>
> b) How do folks feel about built-in transformations and the set that are mentioned here? This brings us way back to the discussion of built-in connectors.
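For a concrete sense of option (a), a minimal sketch of the `newRecord`-on-the-record-base-class shape might look like the following. All class and method names here are hypothetical stand-ins, not the actual Connect API:

```java
// Sketch of option (a): an abstract newRecord(...) factory on a shared
// record base class, so one Transformation implementation works for both
// source and sink records. Names are illustrative, not the real Connect API.
abstract class BaseRecord<R extends BaseRecord<R>> {
    final String topic;
    final Object key;
    final Object value;

    BaseRecord(String topic, Object key, Object value) {
        this.topic = topic;
        this.key = key;
        this.value = value;
    }

    // Concrete subclasses return a copy of their own narrower type.
    abstract R newRecord(String topic, Object key, Object value);
}

// Hypothetical stand-in for SinkRecord.
class FakeSinkRecord extends BaseRecord<FakeSinkRecord> {
    FakeSinkRecord(String topic, Object key, Object value) {
        super(topic, key, value);
    }

    @Override
    FakeSinkRecord newRecord(String topic, Object key, Object value) {
        return new FakeSinkRecord(topic, key, value);
    }
}

// A transformation then needs only one implementation against the base type.
interface Transformation<R extends BaseRecord<R>> {
    R apply(R record); // returning null could signal "filter this record out"
}

public class TransformSketch {
    public static void main(String[] args) {
        // Example: mask the value while preserving topic and key.
        Transformation<FakeSinkRecord> maskValue =
                r -> r.newRecord(r.topic, r.key, null);

        FakeSinkRecord out = maskValue.apply(
                new FakeSinkRecord("orders", "k1", "secret"));
        System.out.println(out.topic + "/" + out.key + "/" + out.value);
    }
}
```

The self-referential generic bound is what lets a single transformation apply to both record types without casting, which is the reimplementation-avoidance benefit discussed above.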
> Transformations, especially when intended to be lightweight and touch nothing besides the data already in the record, seem different from connectors -- there might be quite a few, but hopefully a limited set. Since we (hopefully) already factor out most serialization-specific stuff via Converters, I think we can keep this pretty limited. That said, I have no doubt some folks will (in my opinion) abuse this feature to do data enrichment by querying external systems, so building a bunch of transformations in could potentially open the floodgates, or at least make decisions about what is included vs. what should be 3rd party muddy.
>
> -Ewen
>
> On Wed, Dec 7, 2016 at 11:46 AM, Shikhar Bhushan <shik...@confluent.io> wrote:
>
> > Hi all,
> >
> > I have another iteration at a proposal for this feature here:
> > https://cwiki.apache.org/confluence/display/KAFKA/Connect+Transforms+-+Proposed+Design
> >
> > I'd welcome your feedback and comments.
> >
> > Thanks,
> >
> > Shikhar
> >
> > On Tue, Aug 2, 2016 at 7:21 PM Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >
> > On Thu, Jul 28, 2016 at 11:58 PM, Shikhar Bhushan <shik...@confluent.io> wrote:
> >
> > > > Hmm, operating on ConnectRecords probably doesn't work since you need to emit the right type of record, which might mean instantiating a new one. I think that means we either need 2 methods, one for SourceRecord, one for SinkRecord, or we'd need to limit what parts of the message you can modify (e.g. you can change the key/value via something like transformKey(ConnectRecord) and transformValue(ConnectRecord), but other fields would remain the same and the framework would handle allocating new Source/SinkRecords if needed).
> > >
> > > Good point, perhaps we could add an abstract method on ConnectRecord that takes all the shared fields as parameters, and the implementations return a copy of the narrower SourceRecord/SinkRecord type as appropriate. Transformers would only operate on ConnectRecord rather than caring about SourceRecord or SinkRecord (in theory they could instanceof/cast, but the API should discourage it).
> > >
> > > > Is there a use case for hanging on to the original? I can't think of a transformation where you'd need to do that (or couldn't just order things differently so it isn't a problem).
> > >
> > > Yeah, maybe this isn't really necessary. No strong preference here.
> > >
> > > > That said, I do worry a bit that farming too much stuff out to transformers can result in "programming via config", i.e. a lot of the simplicity you get from Connect disappears in long config files. Standardization would be nice and might just avoid this (and doesn't cost that much implementing it in each connector), and I'd personally prefer something a bit less flexible but consistent and easy to configure.
> > >
> > > Not sure what you're suggesting :-) Standardized config properties for a small set of transformations, leaving it up to connectors to integrate?
> >
> > I just mean that you get to the point where you're practically writing a Kafka Streams application; you're just doing it through either an incredibly convoluted set of transformers and configs, or a single transformer with an incredibly convoluted set of configs.
> > You basically get to the point where your config is a mini DSL and you're not really saving that much.
> >
> > The real question is how much we want to venture into the "T" part of ETL. I tend to favor minimizing how much we take on, since the rest of Connect isn't designed for it; it's designed around the E & L parts.
> >
> > -Ewen
> >
> > > > Personally I'm skeptical of that level of flexibility in transformers -- it's getting awfully complex and certainly takes us pretty far from "config only" realtime data integration. It's not clear to me what the use cases are that aren't covered by a small set of common transformations that can be chained together (e.g. rename/remove fields, mask values, and maybe a couple more).
> > >
> > > I agree that we should have some standard transformations that we ship with Connect that users would ideally lean towards for routine tasks. The ones you mention are some good candidates where I'd imagine we can expose simple config, e.g.:
> > >
> > > transform.filter.whitelist=x,y,z  # filter to a whitelist of fields
> > > transform.rename.spec=oldName1=>newName1, oldName2=>newName2
> > > topic.rename.replace=-/_
> > > topic.rename.prefix=kafka_
> > >
> > > etc.
> > >
> > > However, the ecosystem will invariably have more complex transformers if we make this pluggable. And because ETL is messy, that's probably a good thing if folks are able to do their data munging orthogonally to connectors, so that connectors can focus on the logic of how data should be copied from/to datastores and Kafka.
> > >
> > > > In any case, we'd probably also have to change configs of connectors if we allowed configs like that, since presumably transformer configs will just be part of the connector config.
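Pulled together, the kind of flat, prefix-style properties sketched above would presumably sit directly in a connector's config. A hypothetical example, using the illustrative property names from this mail (not the final KIP-66 syntax) alongside the stock file-sink quickstart settings:

```properties
# Ordinary connector config...
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink.txt
topics=connect-test

# ...with standard transformations configured via flat properties
# (property names illustrative, from this thread, not the final KIP-66 spec):
transform.filter.whitelist=x,y,z
transform.rename.spec=oldName1=>newName1,oldName2=>newName2
topic.rename.replace=-/_
topic.rename.prefix=kafka_
```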
> > > Yeah, haven't thought much about how all the configuration would tie together...
> > >
> > > I think we'd need the ability to:
> > > - spec the transformer chain (fully-qualified class names? perhaps special aliases for built-in ones? perhaps third-party FQCNs can be assigned aliases by users in the chain spec, for easier configuration and to uniquely identify a transformation when it occurs more than once in a chain?)
> > > - configure each transformer -- all properties prefixed with that transformer's ID (FQCN / alias) get routed to it
> > >
> > > Additionally, I think we would probably want to allow for topic-specific overrides <https://issues.apache.org/jira/browse/KAFKA-3962> (e.g. you want certain transformations for one topic, but different ones for another...)
> >
> > --
> > Thanks,
> > Ewen

--
*Gwen Shapira*
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter <https://twitter.com/ConfluentInc> | blog <http://www.confluent.io/blog>
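The alias-plus-prefix scheme Shikhar describes (a chain spec naming each transformer, with every property under that alias routed to its instance) can be sketched roughly as follows. The `transforms` / `transforms.<alias>.` property names and the `org.example.*` classes are hypothetical; this mail predates the final KIP-66 config syntax:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of alias-based transformer-chain configuration: a "transforms"
// list declares aliases in chain order, and properties prefixed with
// "transforms.<alias>." are routed to that transformation instance.
// Property names are illustrative, not the final KIP-66 syntax.
public class TransformConfigSketch {

    static Map<String, Map<String, String>> groupByAlias(Map<String, String> props) {
        String[] aliases = props.getOrDefault("transforms", "").split(",");
        Map<String, Map<String, String>> grouped = new LinkedHashMap<>();
        for (String rawAlias : aliases) {
            String alias = rawAlias.trim();
            String prefix = "transforms." + alias + ".";
            Map<String, String> own = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : props.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    // Strip the prefix so each transformer sees only its own keys.
                    own.put(e.getKey().substring(prefix.length()), e.getValue());
                }
            }
            grouped.put(alias, own);
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("transforms", "rename, mask");
        props.put("transforms.rename.type", "org.example.Rename");   // hypothetical class
        props.put("transforms.rename.spec", "old_id=>id");
        props.put("transforms.mask.type", "org.example.MaskField");  // hypothetical class
        props.put("transforms.mask.fields", "ssn");

        Map<String, Map<String, String>> grouped = groupByAlias(props);
        System.out.println(grouped.get("rename").get("spec"));
        System.out.println(grouped.get("mask").get("fields"));
    }
}
```

Aliases also solve the duplicate-transformer problem mentioned above: the same class can appear twice in the chain under two aliases, each with its own isolated config namespace.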