Hi Jorge,

Thanks for the KIP! I'd love to see support for nested fields added to the
out-of-the-box SMTs provided with Connect. Here are my initial thoughts:

1. I agree that there's a case to be made for expanding HoistField with a
new config property for identifying a nested, to-be-hoisted field, but the
example in the KIP doesn't really demonstrate why this would be valuable. I
think it'd be helpful to expand the example to add other fields in order to
show how adding nested field support enables users to hoist a nested field
without dropping other fields from the value. Maybe something like this:

    source = nested.val
    field = line

    value (before):
        {
            "nested": {
                "val": 42,
                "other val": 96
            }
        }

    value (after):
        {
            "nested": {
                "line": {
                    "val": 42,
                }
                "other val": 96
            }
        }

2. Nit: I think "source" is a little strange for the new HoistField
property name. Maybe "hoisted" or "hoisted.field" would be more descriptive?

3. Is there a reasonable use case for expanding Flatten to be able to
flatten specific fields? My understanding is that it's mostly useful for
writing to systems like databases that don't support nested values and
require everything to be a flat list of key-value pairs. Being able to
flatten a nested field wouldn't provide any advantage for that use case.
Are there other cases where it would?

4. I don't think we should unconditionally change the default delimiter for
Flatten. It's a backwards-incompatible, breaking change that could cause
headaches for users. It might be reasonable to change the default value
dynamically based on whether the user has specified a value for the "field"
property, but considering the motivation for changing the default is that
it creates conflicts with the to-be-introduced nested field syntax (which
could arise with downstream SMTs regardless of whether the user has
explicitly configured Flatten with the "field" property), I don't know that
this would be too useful either. I have some thoughts below on how to
handle possible conflicts between names with dots in their names and dotted
syntax for nested field references that should hopefully make either change
unnecessary.

5. I think it's fine to expand ExtractField to support nested notation, but
it might be worth noting in the rejected alternatives section that this
isn't strictly necessary since you can replace any single invocation of
that SMT that uses nested field notation with multiple invocations of it
that use non-nested notation.

6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs that do
not require nested structure support.

7. It may be rare for dots in field names to occur in the wild (although I
wouldn't be so certain of this), but unless we want to inflict headaches on
users of Flatten, I think we're going to have to think about conflicts
between dotted notation and non-nested fields whose names contain dots. I
don't think this is actually such a bad thing, though. I agree that dotted
notation is intuitive and pretty commonplace (in tools like jq, for
example), so I'd like it if we could stick to it. What about introducing
escape syntax, using a backslash? That way, users could disambiguate
between "this.field" (which would refer to the nested field "field" under
the top-level "this" field), and "this\.field" (which would refer to the
field named "this.field"). Like with most languages that use the backslash
for escape sequences, it could also be used to escape itself, in the event
that a field name contains a backslash. I think this is more intuitive and
simpler than, e.g., adding a new config property to toggle the delimiter to
be used when parsing nested field references.

8. I don't think we can unconditionally turn this feature on. The risk of
breaking existing pipelines (especially ones that involve, for example, a
combination of the Flatten and Cast SMTs) is pretty high. I think this
should be an opt-in feature, at least until the next major release. One way
we could accomplish this is by introducing a new "field.style" (name
obviously subject to change) property with values of "plain" (default) and
"nested". If set to "plain" then the current non-nested behavior is used,
and if set to "nested", then the proposed nested behavior is. We can
consider updating the default value to "nested" in a future major release
(or even codify that plan in this KIP, if there's enough support for it).
This would also leave the door open for adding recursive support to SMTs in
the future by adding a permitted value of "recursive".

9. One of the linked tickets in the "Motivation" section,
https://issues.apache.org/jira/browse/KAFKA-10640, has an open PR and KIP
that propose adding recursive support to some SMTs. Have you considered
this type of functionality for your KIP? Or is your aim to stick solely to
nested field support?

Cheers,

Chris

On Tue, Feb 8, 2022 at 1:23 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Hi Dev team,
>
> I'd like to start a new discussion thread on Kafka Connect KIP-821:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-821%3A+Connect+Transforms+support+for+nested+structures
>
> This KIP is aimed to include support for nested structures on the existing
> SMTs — where this make sense.
>
> Looking forward to your feedback.
>
> Regards,
> Jorge.
>

Reply via email to