Hi Jorge, Thanks for the KIP! I'd love to see support for nested fields added to the out-of-the-box SMTs provided with Connect. Here are my initial thoughts:
1. I agree that there's a case to be made for expanding HoistField with a new config property for identifying a nested, to-be-hoisted field, but the example in the KIP doesn't really demonstrate why this would be valuable. I think it'd be helpful to expand the example to add other fields in order to show how adding nested field support enables users to hoist a nested field without dropping other fields from the value. Maybe something like this: source = nested.val field = line value (before): { "nested": { "val": 42, "other val": 96 } } value (after): { "nested": { "line": { "val": 42, } "other val": 96 } } 2. Nit: I think "source" is a little strange for the new HoistField property name. Maybe "hoisted" or "hoisted.field" would be more descriptive? 3. Is there a reasonable use case for expanding Flatten to be able to flatten specific fields? My understanding is that it's mostly useful for writing to systems like databases that don't support nested values and require everything to be a flat list of key-value pairs. Being able to flatten a nested field wouldn't provide any advantage for that use case. Are there other cases where it would? 4. I don't think we should unconditionally change the default delimiter for Flatten. It's a backwards-incompatible, breaking change that could cause headaches for users. It might be reasonable to change the default value dynamically based on whether the user has specified a value for the "field" property, but considering the motivation for changing the default is that it creates conflicts with the to-be-introduced nested field syntax (which could arise with downstream SMTs regardless of whether the user has explicitly configured Flatten with the "field" property), I don't know that this would be too useful either. I have some thoughts below on how to handle possible conflicts between names with dots in their names and dotted syntax for nested field references that should hopefully make either change unnecessary. 5. I think it's fine to expand ExtractField to support nested notation, but it might be worth noting in the rejected alternatives section that this isn't strictly necessary since you can replace any single invocation of that SMT that uses nested field notation with multiple invocations of it that use non-nested notation. 6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs that do not require nested structure support. 7. It may be rare for dots in field names to occur in the wild (although I wouldn't be so certain of this), but unless we want to inflict headaches on users of Flatten, I think we're going to have to think about conflicts between dotted notation and non-nested fields whose names contain dots. I don't think this is actually such a bad thing, though. I agree that dotted notation is intuitive and pretty commonplace (in tools like jq, for example), so I'd like it if we could stick to it. What about introducing escape syntax, using a backslash? That way, users could disambiguate between "this.field" (which would refer to the nested field "field" under the top-level "this" field), and "this\.field" (which would refer to the field named "this.field"). Like with most languages that use the backslash for escape sequences, it could also be used to escape itself, in the event that a field name contains a backslash. I think this is more intuitive and simpler than, e.g., adding a new config property to toggle the delimiter to be used when parsing nested field references. 8. I don't think we can unconditionally turn this feature on. The risk of breaking existing pipelines (especially ones that involve, for example, a combination of the Flatten and Cast SMTs) is pretty high. I think this should be an opt-in feature, at least until the next major release. One way we could accomplish this is by introducing a new "field.style" (name obviously subject to change) property with values of "plain" (default) and "nested". If set to "plain" then the current non-nested behavior is used, and if set to "nested", then the proposed nested behavior is. We can consider updating the default value to "nested" in a future major release (or even codify that plan in this KIP, if there's enough support for it). This would also leave the door open for adding recursive support to SMTs in the future by adding a permitted value of "recursive". 9. One of the linked tickets in the "Motivation" section, https://issues.apache.org/jira/browse/KAFKA-10640, has an open PR and KIP that propose adding recursive support to some SMTs. Have you considered this type of functionality for your KIP? Or is your aim to stick solely to nested field support? Cheers, Chris On Tue, Feb 8, 2022 at 1:23 PM Jorge Esteban Quilcate Otoya < quilcate.jo...@gmail.com> wrote: > Hi Dev team, > > I'd like to start a new discussion thread on Kafka Connect KIP-821: > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures > > This KIP is aimed to include support for nested structures on the existing > SMTs — where this make sense. > > Looking forward to your feedback. > > Regards, > Jorge. >