Hi Jorge, Looking good! Got a few more thoughts.
1. Sorry to revisit this, but I think we may want to adopt a slightly different escape syntax style. Backslashes are great, but since they're already used by JSON, using them as an escape sequence in field notation would also lead to some pretty ugly connector configs. Anyone who's had to write regular expressions with backslashes in Java is probably already familiar with this: "this\\\\.is\\\\.not\\\\.very\\\\.readable". What do you think about using the dot character to escape itself? In other words, to access a single field named "this.field", instead of using the syntax "this\.field" (which in JSON would have to be expressed as "this\\.field"), we could use "this..field", and for a single field named "this\field", instead of using the syntax "this\\field" (or, in JSON, "this\\\\field"), we could use "this\field" (or, in JSON, "this\\field"). 2. Could you flesh out the details on the new "field.style" property, including the type, default value, importance, and a preliminary docstring? See https://cwiki.apache.org/confluence/display/KAFKA/KIP-618%3A+Exactly-Once+Support+for+Source+Connectors#KIP618:ExactlyOnceSupportforSourceConnectors-Newproperties for an example. 3. Is the "Compatibility, Deprecation, and Migration Plan" section still accurate after the latest update? Seems like it's still written with the assumption that nested field syntax will be hardcoded or opt-in, which IIUC isn't the case anymore. 4. Nit: The "These SMTs do not require nested structure support" section mentions a "Drop" SMT. I think this may be referring to the Confluent Drop SMT, which isn't a part of Apache Kafka. Should we drop (heh) that SMT from the list? Or perhaps just replace it with "DropHeaders", which is currently missing from the list and shouldn't require any nested-field related updates? Cheers, Chris On Mon, Feb 28, 2022 at 2:12 PM Jorge Esteban Quilcate Otoya < quilcate.jo...@gmail.com> wrote: > Thank you, Chris! and sorry for the delayed response. > > Please, find my comments below: > > On Mon, 14 Feb 2022 at 17:34, Chris Egerton <chr...@confluent.io.invalid> > wrote: > > > Hi Jorge, > > > > Thanks for the KIP! I'd love to see support for nested fields added to > the > > out-of-the-box SMTs provided with Connect. Here are my initial thoughts: > > > > 1. I agree that there's a case to be made for expanding HoistField with a > > new config property for identifying a nested, to-be-hoisted field, but > the > > example in the KIP doesn't really demonstrate why this would be > valuable. I > > think it'd be helpful to expand the example to add other fields in order > to > > show how adding nested field support enables users to hoist a nested > field > > without dropping other fields from the value. Maybe something like this: > > > > source = nested.val > > field = line > > > > value (before): > > { > > "nested": { > > "val": 42, > > "other val": 96 > > } > > } > > > > value (after): > > { > > "nested": { > > "line": { > > "val": 42, > > } > > "other val": 96 > > } > > } > > > > 2. Nit: I think "source" is a little strange for the new HoistField > > property name. Maybe "hoisted" or "hoisted.field" would be more > > descriptive? > > > > > About 1. and 2.: > Agree. The example for this SMT is updated and have added the `hoisted` > configuration. > > > > 3. Is there a reasonable use case for expanding Flatten to be able to > > flatten specific fields? My understanding is that it's mostly useful for > > writing to systems like databases that don't support nested values and > > require everything to be a flat list of key-value pairs. Being able to > > flatten a nested field wouldn't provide any advantage for that use case. > > Are there other cases where it would? > > > > 4. I don't think we should unconditionally change the default delimiter > for > > Flatten. It's a backwards-incompatible, breaking change that could cause > > headaches for users. It might be reasonable to change the default value > > dynamically based on whether the user has specified a value for the > "field" > > property, but considering the motivation for changing the default is that > > it creates conflicts with the to-be-introduced nested field syntax (which > > could arise with downstream SMTs regardless of whether the user has > > explicitly configured Flatten with the "field" property), I don't know > that > > this would be too useful either. I have some thoughts below on how to > > handle possible conflicts between names with dots in their names and > dotted > > syntax for nested field references that should hopefully make either > change > > unnecessary. > > > > > Fair enough. With the support for nested fields in other SMTs, Flatten > could stay as it is. > This removes the need for (4) changing Flatten config as well. > > > > 5. I think it's fine to expand ExtractField to support nested notation, > but > > it might be worth noting in the rejected alternatives section that this > > isn't strictly necessary since you can replace any single invocation of > > that SMT that uses nested field notation with multiple invocations of it > > that use non-nested notation. > > > > Agree. Adding it. > > > > > > 6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs that do > > not require nested structure support. > > > > > Ack. Fixing it. > > > > 7. It may be rare for dots in field names to occur in the wild (although > I > > wouldn't be so certain of this), but unless we want to inflict headaches > on > > users of Flatten, I think we're going to have to think about conflicts > > between dotted notation and non-nested fields whose names contain dots. I > > don't think this is actually such a bad thing, though. I agree that > dotted > > notation is intuitive and pretty commonplace (in tools like jq, for > > example), so I'd like it if we could stick to it. What about introducing > > escape syntax, using a backslash? That way, users could disambiguate > > between "this.field" (which would refer to the nested field "field" under > > the top-level "this" field), and "this\.field" (which would refer to the > > field named "this.field"). Like with most languages that use the > backslash > > for escape sequences, it could also be used to escape itself, in the > event > > that a field name contains a backslash. I think this is more intuitive > and > > simpler than, e.g., adding a new config property to toggle the delimiter > to > > be used when parsing nested field references. > > > > I like this approach. Adding to the KIP. > > > > > > 8. I don't think we can unconditionally turn this feature on. The risk of > > breaking existing pipelines (especially ones that involve, for example, a > > combination of the Flatten and Cast SMTs) is pretty high. I think this > > should be an opt-in feature, at least until the next major release. One > way > > we could accomplish this is by introducing a new "field.style" (name > > obviously subject to change) property with values of "plain" (default) > and > > "nested". If set to "plain" then the current non-nested behavior is used, > > and if set to "nested", then the proposed nested behavior is. We can > > consider updating the default value to "nested" in a future major release > > (or even codify that plan in this KIP, if there's enough support for it). > > This would also leave the door open for adding recursive support to SMTs > in > > the future by adding a permitted value of "recursive". > > > > 9. One of the linked tickets in the "Motivation" section, > > https://issues.apache.org/jira/browse/KAFKA-10640, has an open PR and > KIP > > that propose adding recursive support to some SMTs. Have you considered > > this type of functionality for your KIP? Or is your aim to stick solely > to > > nested field support? > > > > I like the `field.style` configuration flag approach. > Thanks for pointing out the `recursive` approach. Will add `nested` at the > moment, let's check the demand for `recursive` to consider it as part of > this or another KIP. > > I have added the following on the KIP: > > ``` > Future KIPs could extend this support for: > > - Recursive notation: name a field and apply it to all fields across the > schema matching that name. > - Access to arrays: Adding `[]` notation to represent access to arrays and > applying SMTs to fields within an array. > ``` > > > > > > Cheers, > > > > Chris > > > > On Tue, Feb 8, 2022 at 1:23 PM Jorge Esteban Quilcate Otoya < > > quilcate.jo...@gmail.com> wrote: > > > > > Hi Dev team, > > > > > > I'd like to start a new discussion thread on Kafka Connect KIP-821: > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures > > > > > > This KIP is aimed to include support for nested structures on the > > existing > > > SMTs — where this make sense. > > > > > > Looking forward to your feedback. > > > > > > Regards, > > > Jorge. > > > > > >