Re: [PR] KAFKA-16858: Throw DataException from validateValue on array and map schemas without inner schemas [kafka]

via GitHub Tue, 04 Jun 2024 09:53:39 -0700


gharris1727 commented on PR #16161:
URL: https://github.com/apache/kafka/pull/16161#issuecomment-2147992314

> Can you expound on the motivation for this change? IIRC we began to allow
null element schemas in order to support schema inference for empty collections
parsed by the Values class, which is valid IMO.

Framed in that way, I also think it would make sense to try to infer a
schema for empty lists. But I think that decision is undesirable in the broader
context of the Connect ecosystem.

I tried to determine how a `null` schema is interpreted in the Connect
type-system. When used in a ConnectRecord, it's a top-type, as it allows the
Object value to be any type. The way Values uses it is also as a top-type,
because a heterogeneous array infers a null schema. The current behavior for
validateValue is as a bottom-type, as no value is valid for a null schema
(although this seems to be an accident).

Much of our infrastructure is not prepared for element schemas to be null,
such as the SchemaBuilder#array() and SchemaBuilder#map() methods, which still
have null guards:
https://github.com/apache/kafka/blob/55d38efcc5505a5a1bddb08ba05f4d923f8050f9/connect/api/src/main/java/org/apache/kafka/connect/data/SchemaBuilder.java#L361-L383
The field builder has a null guard:
https://github.com/apache/kafka/blob/55d38efcc5505a5a1bddb08ba05f4d923f8050f9/connect/api/src/main/java/org/apache/kafka/connect/data/SchemaBuilder.java#L326-L327
and obviously the validateValue method was written without null schemas
in-mind.

If we changed all of the above functions to remove the null guards, it would
permit schemaless data within schema'd data. This would allow data sources to
break the single schema vs schemaless check that most SMTs use to determine if
they're operating on schema'd vs schemaless data. Rather than receiving
DataExceptions when assembling the structs in the data sources, it would likely
surface as NPEs in data sinks.

So I don't think we can change validateValue to be top-type without risking
more bad data and NPEs overall. I also don't think it's viable to make
ConnectRecord bottom-type, as schemaless data is intentionally supported. I
think the best path forward is to attempt to align with the SchemaBuilder
semantics before the SchemaBuilder(Type) constructor was made public, and
disallow null element schemas entirely. Then, the assumption that the root
Schema in ConnectRecord should be the only nullable schema will be correct. Of
course, we still have to change the Values class to be more conservative with
the types it outputs, which I opened this ticket for:
https://issues.apache.org/jira/browse/KAFKA-16870 .

One of the good things for us is that the null-schema'd collections from the
Values class and SimpleHeaderConverter don't commonly make it into the key or
value, or as fields in a struct, because AK doesn't have HeaderTo$Key or
HeaderTo$Value. This ticket actually isn't motivated by the Values parsing for
that reason.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] KAFKA-16858: Throw DataException from validateValue on array and map schemas without inner schemas [kafka]

Reply via email to