gharris1727 commented on PR #16161:
URL: https://github.com/apache/kafka/pull/16161#issuecomment-2147992314

   > Can you expound on the motivation for this change? IIRC we began to allow 
null element schemas in order to support schema inference for empty collections 
parsed by the Values class, which is valid IMO.
   
   Framed in that way, I also think it would make sense to try to infer a 
schema for empty lists. But I think that decision is undesirable in the broader 
context of the Connect ecosystem.
   
   I tried to determine how a `null` schema is interpreted in the Connect 
type-system. When used in a ConnectRecord, it's a top-type, as it allows the 
Object value to be any type. The way Values uses it is also as a top-type, 
because a heterogeneous array infers a null schema. The current behavior for 
validateValue is as a bottom-type, as no value is valid for a null schema 
(although this seems to be an accident).
   
   Much of our infrastructure is not prepared for element schemas to be null, 
such as the SchemaBuilder#array() and SchemaBuilder#map() methods, which still 
have null guards: 
https://github.com/apache/kafka/blob/55d38efcc5505a5a1bddb08ba05f4d923f8050f9/connect/api/src/main/java/org/apache/kafka/connect/data/SchemaBuilder.java#L361-L383
 The field builder has a null guard: 
https://github.com/apache/kafka/blob/55d38efcc5505a5a1bddb08ba05f4d923f8050f9/connect/api/src/main/java/org/apache/kafka/connect/data/SchemaBuilder.java#L326-L327
 and obviously the validateValue method was written without null schemas 
in-mind.
   
   If we changed all of the above functions to remove the null guards, it would 
permit schemaless data within schema'd data. This would allow data sources to 
break the single schema vs schemaless check that most SMTs use to determine if 
they're operating on schema'd vs schemaless data. Rather than receiving 
DataExceptions when assembling the structs in the data sources, it would likely 
surface as NPEs in data sinks.
   
   So I don't think we can change validateValue to be top-type without risking 
more bad data and NPEs overall. I also don't think it's viable to make 
ConnectRecord bottom-type, as schemaless data is intentionally supported. I 
think the best path forward is to attempt to align with the SchemaBuilder 
semantics before the SchemaBuilder(Type) constructor was made public, and 
disallow null element schemas entirely. Then, the assumption that the root 
Schema in ConnectRecord should be the only nullable schema will be correct. Of 
course, we still have to change the Values class to be more conservative with 
the types it outputs, which I opened this ticket for: 
https://issues.apache.org/jira/browse/KAFKA-16870 .
   
   One of the good things for us is that the null-schema'd collections from the 
Values class and SimpleHeaderConverter don't commonly make it into the key or 
value, or as fields in a struct, because AK doesn't have HeaderTo$Key or 
HeaderTo$Value. This ticket actually isn't motivated by the Values parsing for 
that reason.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to