Hi Devin,

This topic remains of great interest to me. I think there is still a wide
schema usability gap between traditional batch data systems (an RDBMS, for
example) and those in the messaging/streaming space.
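To make that gap concrete before getting to my questions: below is a rough
sketch (plain Python, no Pulsar APIs; the field names and allowed values are
invented for illustration) of the kind of per-message value check an RDBMS
gives you for free with a CHECK constraint, but which a streaming producer or
consumer currently has to hand-roll:

```python
import json

# Hypothetical per-message constraint, analogous to an RDBMS CHECK
# constraint: "action" must be one of a fixed set of values.
ALLOWED_ACTIONS = {"click", "impression", "hover"}

def validate_event(raw: bytes) -> dict:
    """Parse a JSON payload and reject it if 'action' is out of range."""
    event = json.loads(raw)
    if event.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {event.get('action')!r}")
    return event

# A conforming message passes; a drifted one is rejected at the edge.
ok = validate_event(b'{"user": "bob", "action": "click"}')
assert ok["action"] == "click"

try:
    validate_event(b'{"user": "bob", "action": "clik"}')  # typo'd by producer
except ValueError as e:
    print("rejected:", e)
```

In the database world this check happens at write time; in the streaming
world, as things stand, every team writes their own version of it (or none).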
> Pulsar provides no mechanism for real-time schema verification of
> message content.

Are you specifically referring to validation at the broker entry point and
not in the client?

> The user experience around maintaining types/schemas between apps in
> Pulsar is not good

What are we comparing this to, though? What would the ideal data developer
workflow look like?

Thanks,

Elliot.

On Mon, 13 Mar 2023 at 16:58, Devin Bost <devin.b...@gmail.com> wrote:

> > Sorry. I do not fully understand here. Is it also related to the "data
> > quality" problem that we discussed? For the consumer side, we can use
> > the AUTO_CONSUME schema to receive GenericObject (for JSON schema, you
> > can deal with JsonObject directly). For the producer side, I think yes.
> > We can either send an Object or bytes[] (AUTO_PRODUCE).
>
> I think there are two problems:
>
> 1. Pulsar provides no mechanism for real-time schema verification of
> message content. There's no way in Pulsar to verify that the type
> registered at compile-time matches the content of the message that is
> sent at runtime. Companies that want to guarantee schema conformance on
> a per-message basis are left to implement such a mechanism on their own.
>
> 2. The user experience around maintaining types/schemas between apps in
> Pulsar is not good, but for the purpose of this thread, let's focus on
> that first problem above.
>
> Devin G. Bost
>
> On Sun, Nov 20, 2022 at 8:02 PM PengHui Li <peng...@apache.org> wrote:
>
> > Hi, Devin
> >
> > Thanks for raising the great discussion. It looks like the salient
> > point is that Pulsar doesn't support native JSON Schema. Instead, the
> > schema is defined in the Avro standard but serialized to JSON format.
> > JSON Schema combines type-based and rule-based aspects. As this
> > article[1] says, "JSON Schema combines aspects of both a grammar-based
> > language and a rule-based one", whereas the Avro schema definition has
> > only the grammar-based aspect.
> >
> > [1]
> > https://yokota.blog/2021/03/29/understanding-json-schema-compatibility/
> >
> > > One of the issues with Pulsar's current implementation of schemas
> > > for JSON is the requirement to always have a POCO or some kind of
> > > type builder to construct the schema. This requirement can be
> > > cumbersome for users who only care about a few fields on the object.
> >
> > Sorry. I do not fully understand here. Is it also related to the "data
> > quality" problem that we discussed? For the consumer side, we can use
> > the AUTO_CONSUME schema to receive GenericObject (for JSON schema, you
> > can deal with JsonObject directly). For the producer side, I think
> > yes. We can either send an Object or bytes[] (AUTO_PRODUCE).
> >
> > > Plus, the use case is a little different compared to a DLQ or Retry
> > > topic because we'd like a way to handle content failures separately
> > > from other kinds of failures.
> >
> > Yes, I agree. It's not a job for the DLQ.
> >
> > Thanks,
> > Penghui
> >
> > On Thu, Nov 17, 2022 at 7:37 AM Devin Bost <devin.b...@gmail.com> wrote:
> >
> > > I appreciate all the thoughts and questions so far.
> > >
> > > One of the issues with Pulsar's current implementation of schemas
> > > for JSON is the requirement to always have a POCO or some kind of
> > > type builder to construct the schema. This requirement can be
> > > cumbersome for users who only care about a few fields on the object.
> > > Protobuf attempts to simplify the implementation of the mapping
> > > (from data to class) by having a language-independent mechanism for
> > > defining the data (so the POCO can be generated in the desired
> > > language), but obviously, that offers very few benefits for JSON.
> > > Additionally, Protobuf and Avro don't provide a way to express
> > > constraints on data *values*. Consider an example.
> > > Let's say a site is sending messages like this:
> > >
> > > {
> > >   "user": "bob",
> > >   "action": "click",
> > >   "trackingUrn": "urn:siteA:homepage:topNavigation:0.124",
> > >   "payload": {
> > >     . . . [ *highly nested or dynamic data* ] . . .
> > >   }
> > > }
> > >
> > > Here are some issues we might run into:
> > >
> > > 1. A consumer wants to take action on messages based on a single
> > > field. They only care about whether the field exists and has an
> > > allowed value. They don't want to spend a week trying to map each of
> > > the nested fields into a POCO and then worry about maintaining the
> > > POCO when nested sub-fields are updated by upstream teams with
> > > breaking changes. Consider these use cases:
> > >   - Validate that the "action" value is oneOf: ["click",
> > >     "impression", "hover"]. Route content based on the action unless
> > >     it's an unexpected value.
> > >   - Subfields change depending on the trackingUrn values.
> > > Consider the following:
> > >   A) In the validation use case, the app developer shouldn't need to
> > >      deal with any fields other than "action", but they should be
> > >      able to express or verify that "action" is part of a data
> > >      contract they have agreed to consume from.
> > >   B) Every app like this would need to add its own runtime
> > >      validation logic, and when many different apps are using their
> > >      own versions of validation, the implementations are brittle and
> > >      become hard to maintain. The solution to the brittleness is to
> > >      adopt a standard that solves the interoperability problem.
> > >   C) If subfields are dynamic, well, there's not a good way to
> > >      express that in Avro. Maybe the developer could use maps, but I
> > >      think that defeats the purpose.
> > >
> > > 2. We should be able to compose schemas from shared "schema
> > > components" for improved reusability. (Consider it like
> > > object-oriented schema design.)
> > > JSON Schema makes this possible (see the detailed write-up here:
> > > https://json-schema.org/blog/posts/bundling-json-schema-compound-documents)
> > > but Avro does not, so Avro schemas end up with duplication
> > > everywhere, and this duplication is burdensome for developers to
> > > maintain. Consequently, some developers avoid using schemas
> > > entirely, but that has its own consequences.
> > >
> > > 3. If a message's content is invalid, send the message to an
> > > "invalid message topic". Since the concerns above are mostly around
> > > data content at runtime, Avro doesn't help us here, but for JSON
> > > content, JSON Schema's validation spec
> > > (https://json-schema.org/draft/2020-12/json-schema-validation.html#name-overview)
> > > could. Plus, the use case is a little different compared to a DLQ or
> > > Retry topic because we'd like a way to handle content failures
> > > separately from other kinds of failures.
> > >
> > > (I'm sure I can think of more examples if I give it more thought.)
> > >
> > > Devin G. Bost
> > >
> > > On Wed, Nov 16, 2022 at 6:36 AM 丛搏 <congbobo...@gmail.com> wrote:
> > >
> > > > hi, Devin:
> > > >
> > > > First, Kafka itself doesn't support schemas; Confluent's platform
> > > > does. The Pulsar schema supports validation and versioning. Are
> > > > you running into schema versions created by automatic
> > > > registration, where the data source is unclear? I think you can
> > > > turn off the producer's automatic schema registration and control
> > > > schema changes through the management side.
> > > > Docs:
> > > > https://pulsar.apache.org/docs/2.10.x/schema-manage#schema-autoupdate
> > > >
> > > > Thanks,
> > > > bo
> > > >
> > > > On Mon, 14 Nov 2022 at 20:14, Elliot West
> > > > <elliot.w...@streamnative.io.invalid> wrote:
> > > >
> > > > > While we can get caught up in the specifics of exactly how JSON
> > > > > Schema is supported in the Kafka ecosystem, it is ultimately
> > > > > possible if desired, and is common, even if not part of
> > > > > open-source Apache Kafka.
> > > > >
> > > > > Devin's assertion is that JSON Schema-compliant payload
> > > > > validation and schema evolution are not currently supportable in
> > > > > the Pulsar ecosystem and that perhaps they should be.
> > > > >
> > > > > Elliot.
> > > > >
> > > > > On Fri, 11 Nov 2022 at 14:56, Elliot West
> > > > > <elliot.w...@streamnative.io> wrote:
> > > > >
> > > > > > Hey Devin,
> > > > > >
> > > > > > *"Kafka conforms to the JSON Schema specification"*
> > > > > > Only when using Confluent's Schema Registry.
> > > > > >
> > > > > > *"if a producer makes a change or omission, such as in a value
> > > > > > used for tracking, it might not surface until way down the
> > > > > > line"*
> > > > > > So let me understand this: although the producer has a schema,
> > > > > > it does not use it for validation of JSON (as would implicitly
> > > > > > occur for Avro)? Is this correct?
> > > > > >
> > > > > > I agree that robust support for schema, certainly at the
> > > > > > edges, is a cornerstone for a data system. I also agree that
> > > > > > it would be better to adopt existing standards rather than
> > > > > > implement them in a bespoke manner.
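[Editor's aside: to illustrate what adopting the existing standard could buy
here, even a toy evaluator of a few JSON Schema keywords lets producer and
consumer share one declarative contract instead of bespoke checks. This
sketch is plain Python with no external libraries, covers only "type",
"required", and "enum", and the schema document is invented for illustration;
it is not anything Pulsar or the thread participants ship.]

```python
# A fragment of a JSON Schema-style document. A real implementation would
# use a full draft-2020-12 validator; this toy evaluator handles only the
# "type", "required", and "enum" keywords.
SCHEMA = {
    "type": "object",
    "required": ["user", "action"],
    "properties": {
        "action": {"enum": ["click", "impression", "hover"]},
    },
}

def check(instance, schema) -> list:
    """Return a list of violation messages (empty means valid)."""
    errors = []
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return ["instance is not an object"]
    for field in schema.get("required", []):
        if field not in instance:
            errors.append(f"missing required field: {field}")
    for name, sub in schema.get("properties", {}).items():
        if name in instance and "enum" in sub and instance[name] not in sub["enum"]:
            errors.append(f"{name} not in {sub['enum']}")
    return errors

# Both sides of the topic can evaluate the same document.
assert check({"user": "bob", "action": "click"}, SCHEMA) == []
assert check({"user": "bob", "action": "swipe"}, SCHEMA) == [
    "action not in ['click', 'impression', 'hover']"
]
```

The point is not the evaluator itself but that the contract is data: it can
be registered, versioned, and queried independently of any POCO.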
> > > > > > I'd be interested to hear your thoughts on the concrete
> > > > > > improvements that you believe would be necessary, for example:
> > > > > >
> > > > > > * Producer validation of JSON occurs using "JSON Schema"
> > > > > > * Evolutions of JSON Schema conform to ...
> > > > > > * Users can declare a topic schema using a JSON Schema document
> > > > > > * Users can query a topic schema and have a JSON Schema
> > > > > >   document returned to them
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Elliot.
> > > > > >
> > > > > > On Thu, 10 Nov 2022 at 16:51, Devin Bost <devin.b...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> One of the areas where Kafka has an advantage over Pulsar is
> > > > > >> around data quality. Kafka conforms to the JSON Schema
> > > > > >> specification, which enables integration with any technology
> > > > > >> that conforms to the standard, such as for data validation,
> > > > > >> discoverability, lineage, versioning, etc.
> > > > > >>
> > > > > >> Pulsar's implementation is non-compliant with the standard,
> > > > > >> and producers and consumers have no built-in way in Pulsar to
> > > > > >> validate that values in their messages match expectations. As
> > > > > >> a consequence, if a producer makes a change or omission, such
> > > > > >> as in a value used for tracking, it might not surface until
> > > > > >> way down the line, and then it can be very difficult to track
> > > > > >> down the source of the problem, which kills the agility of
> > > > > >> teams responsible for maintaining apps using Pulsar. It's
> > > > > >> also bad PR, because then incidents are associated with
> > > > > >> Pulsar, even though the business might not understand that
> > > > > >> the data problem wasn't necessarily caused by Pulsar.
> > > > > >>
> > > > > >> What's the right way for us to address this problem?
> > > > > >>
> > > > > >> --
> > > > > >> Devin Bost
> > > > > >> Sent from mobile
> > > > > >> Cell: 801-400-4602

--

Elliot West

Senior Platform Engineer

elliot.w...@streamnative.io

streamnative.io

<https://github.com/streamnative>
<https://www.linkedin.com/company/streamnative>
<https://twitter.com/streamnativeio>