Hi, Devin

Thanks for raising this great discussion. It looks like the salient point is
that Pulsar doesn't support native JSON Schema. Instead, the schema is defined
in the Avro standard but serialized to JSON format. JSON Schema combines
grammar-based and rule-based aspects; as this article[1] says, "JSON Schema
combines aspects of both a grammar-based language and a rule-based one". The
Avro schema definition, by contrast, only has the grammar-based aspect.

[1] https://yokota.blog/2021/03/29/understanding-json-schema-compatibility/
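
For example, here is a minimal, untested sketch with the Pulsar Java client
that shows what I mean (the User class is purely illustrative):

import org.apache.pulsar.client.api.Schema;

public class SchemaDefinitionInspection {

    // Hypothetical POJO, used only to generate a schema we can inspect.
    public static class User {
        public String name;
        public int age;
    }

    public static void main(String[] args) {
        Schema<User> schema = Schema.JSON(User.class);
        // Prints an Avro-style record definition rendered as JSON, roughly
        // {"type":"record","name":"User","fields":[...]}, rather than a JSON
        // Schema document with "$schema", "properties", "required", and so on.
        System.out.println(schema.getSchemaInfo().getSchemaDefinition());
    }
}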

> One of the issues with Pulsar's current implementation of schemas for JSON
> is the requirement to always have a POCO or some kind of type builder to
> construct the schema. This requirement can be cumbersome for users who only
> care about a few fields on the object.

Sorry, I don't fully understand this point. Is it also related to the "data
quality" problem that we discussed? On the consumer side, we can use the
AUTO_CONSUME schema to receive a GenericObject (for a JSON schema, you can
work with the JsonObject directly).
On the producer side, I think yes: we can send either an Object or byte[]
(AUTO_PRODUCE).
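
For example (a rough, untested sketch with the Java client; the topic,
subscription, and field names are only placeholders):

import org.apache.pulsar.client.api.*;
import org.apache.pulsar.client.api.schema.GenericRecord;
import java.nio.charset.StandardCharsets;

public class AutoSchemaSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Consumer side: AUTO_CONSUME yields a GenericRecord, so no POCO is
        // needed just to read the one field the application cares about.
        Consumer<GenericRecord> consumer = client.newConsumer(Schema.AUTO_CONSUME())
                .topic("events")
                .subscriptionName("content-filter")
                .subscribe();
        Message<GenericRecord> msg = consumer.receive();
        Object action = msg.getValue().getField("action");
        System.out.println("action = " + action);
        consumer.acknowledge(msg);

        // Producer side: AUTO_PRODUCE_BYTES sends raw bytes that are checked
        // against the schema registered on the topic.
        Producer<byte[]> producer = client.newProducer(Schema.AUTO_PRODUCE_BYTES())
                .topic("events")
                .create();
        producer.send("{\"user\":\"bob\",\"action\":\"click\"}"
                .getBytes(StandardCharsets.UTF_8));

        client.close();
    }
}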

> Plus, the use case is a little different compared to a DLQ or Retry
> topic because we'd like a way to handle content failures separately from
> other kinds of failures.

Yes, I agree. It doesn't fall within the scope of the DLQ.

Thanks,
Penghui

On Thu, Nov 17, 2022 at 7:37 AM Devin Bost <devin.b...@gmail.com> wrote:

> I appreciate all the thoughts and questions so far.
>
> One of the issues with Pulsar's current implementation of schemas for JSON
> is the requirement to always have a POCO or some kind of type builder to
> construct the schema. This requirement can be cumbersome for users who only
> care about a few fields on the object.
> Protobuf attempts to simplify the implementation of the mapping (from data
> to class) by having a language-independent mechanism for defining the data
> (so the POCO can be generated in the desired language), but obviously, that
> offers very few benefits for JSON. Additionally, protobuf and Avro don't
> provide a way to express constraints on data *values*. Consider an example.
> Let's say a site is sending messages like this:
> {
> "user": "bob",
> "action": "click",
> "trackingUrn": "urn:siteA:homepage:topNavigation:0.124",
> "payload" : {
>    . . .
>    [ *highly nested or dynamic data* ]
>    . . .
>   }
> }
>
> Here are some issues we might run into:
> 1. A consumer wants to take action on messages based on a single field.
> They only care about whether the field exists and has an allowed value. They
> don't want to spend a week trying to map each of the nested fields into a
> POCO and then worry about maintaining the POCO when nested sub-fields are
> updated by upstream teams with breaking changes. Consider these use cases:
>    - Validate that the "action" value is oneOf: [ "click", "impression",
> "hover"]. Route content based on the action unless it's an unexpected
> value. (See the sketch after this list.)
>    - Subfields change depending on the trackingUrn values.
> Consider the following:
>    A) In the validation use case, the app developer shouldn't need to deal
> with any fields other than "action", but they should be able to express or
> verify that "action" is part of a data contract they have agreed to consume
> from.
>    B) Every app like this would need to add its own runtime validation
> logic, and when many different apps are using their own versions of
> validation, the implementations are brittle and become hard to maintain.
> The solution to the brittleness is to adopt a standard that solves the
> interoperability problem.
>    C) If subfields are dynamic, well, there's not a good way to express
> that in Avro. Maybe the developer could use maps, but I think that defeats
> the purpose.
> 2. We should be able to compose schemas from shared "schema components" for
> improved reusability. (Consider it like object-oriented schema design.)
> JSON Schema makes this possible (see the detailed write-up here:
> <https://json-schema.org/blog/posts/bundling-json-schema-compound-documents>)
> but Avro does not, so Avro schemas end up with duplication everywhere, and
> this duplication is burdensome for developers to maintain. Consequently,
> some developers avoid using schemas entirely, but that has its own
> consequences.
> 3. If a message's content is invalid, send the message to an "invalid
> message topic".  Since the concerns above are mostly around data content at
> runtime, Avro doesn't help us here, but for JSON content, JSON Schema's
> validation spec
> <https://json-schema.org/draft/2020-12/json-schema-validation.html#name-overview>
> could. Plus, the use case is a little different compared to a DLQ or Retry
> topic because we'd like a way to handle content failures separately from
> other kinds of failures.
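>
> To make the validation use case in point 1 concrete, here is a rough,
> untested sketch using the networknt json-schema-validator library (just one
> possible library; the schema and field values are illustrative only). Note
> that in JSON Schema terms, the "one of these values" constraint is expressed
> with the "enum" keyword:
>
> import com.fasterxml.jackson.databind.JsonNode;
> import com.fasterxml.jackson.databind.ObjectMapper;
> import com.networknt.schema.JsonSchema;
> import com.networknt.schema.JsonSchemaFactory;
> import com.networknt.schema.SpecVersion;
> import com.networknt.schema.ValidationMessage;
> import java.util.Set;
>
> public class ActionValidator {
>     // A JSON Schema that only constrains the single field we care about and
>     // ignores everything else in the (possibly highly nested) payload.
>     private static final String SCHEMA = "{"
>             + "\"type\": \"object\","
>             + "\"properties\": {\"action\": "
>             + "{\"enum\": [\"click\", \"impression\", \"hover\"]}},"
>             + "\"required\": [\"action\"]"
>             + "}";
>
>     public static void main(String[] args) throws Exception {
>         JsonSchema schema = JsonSchemaFactory
>                 .getInstance(SpecVersion.VersionFlag.V202012)
>                 .getSchema(SCHEMA);
>         JsonNode message = new ObjectMapper().readTree(
>                 "{\"user\":\"bob\",\"action\":\"click\",\"payload\":{}}");
>         Set<ValidationMessage> errors = schema.validate(message);
>         // An empty set means the message satisfies the contract; otherwise it
>         // could be routed to an "invalid message topic".
>         System.out.println(errors.isEmpty() ? "valid" : errors);
>     }
> }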
>
> (I'm sure I can think of more examples if I give it more thought.)
>
> Devin G. Bost
>
>
> On Wed, Nov 16, 2022 at 6:36 AM 丛搏 <congbobo...@gmail.com> wrote:
>
> > Hi Devin,
> > First, Kafka itself doesn't support schemas; Confluent's Schema Registry
> > does. Pulsar schema supports validation and versioning. Are you
> > encountering schema versions created by automatic registration, where the
> > data source is unclear? I think you can turn off the producer's automatic
> > schema registration and control schema changes through the management side.
> > doc: https://pulsar.apache.org/docs/2.10.x/schema-manage#schema-autoupdate
> >
> > Thanks,
> > bo
> >
> > Elliot West <elliot.w...@streamnative.io.invalid> wrote on Mon, Nov 14,
> > 2022 at 20:14:
> > >
> > > While we can get caught up in the specifics of exactly how JSON Schema is
> > > supported in the Kafka ecosystem, it is ultimately possible if desired, and
> > > is common, even if not part of open-source Apache Kafka.
> > >
> > > Devin's assertion is that JSON Schema compliant payload validation
> > > and schema evolution are not currently supportable in the Pulsar ecosystem
> > > and that perhaps they should be.
> > >
> > > Elliot.
> > >
> > >
> > > On Fri, 11 Nov 2022 at 14:56, Elliot West <elliot.w...@streamnative.io>
> > > wrote:
> > >
> > > > Hey Devin,
> > > >
> > > > *"Kafka conforms to the JSON Schema specification"*
> > > > Only when using Confluent's Schema Registry.
> > > >
> > > > *"if a producer makes a change or omission, such as in a value used for
> > > > tracking, it might not surface until way down the line"*
> > > > So let me understand this: although the producer has a schema, it does
> > > > not use it to validate the JSON (as would implicitly occur for Avro)?
> > > > Is this correct?
> > > >
> > > > I agree that robust support for schema, certainly at the edges, is a
> > > > cornerstone for a data system. I also agree that it would be better to
> > > > adopt existing standards rather than implement them in a bespoke manner.
> > > >
> > > > I'd be interested to hear your thoughts on concrete improvements that
> > > > you believe would be necessary - for example:
> > > >
> > > > * Producer validation of JSON occurs using "JSON Schema"
> > > > * Evolutions of JSON Schema conform to ...
> > > > * Users can declare topic schema using a JSON Schema document
> > > > * Users can query topic schema and have a JSON Schema document returned
> > > > to them
> > > >
> > > > Thanks,
> > > >
> > > > Elliot.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, 10 Nov 2022 at 16:51, Devin Bost <devin.b...@gmail.com>
> wrote:
> > > >
> > > >> One of the areas where Kafka has an advantage over Pulsar is around data
> > > >> quality. Kafka conforms to the JSON Schema specification, which enables
> > > >> integration with any technology that conforms to the standard, such as
> > > >> for data validation, discoverability, lineage, versioning, etc.
> > > >> Pulsar's implementation is non-compliant with the standard, and producers
> > > >> and consumers have no built-in way in Pulsar to validate that values in
> > > >> their messages match expectations. As a consequence, if a producer makes
> > > >> a change or omission, such as in a value used for tracking, it might not
> > > >> surface until way down the line, and then it can be very difficult to
> > > >> track down the source of the problem, which kills the agility of teams
> > > >> responsible for maintaining apps using Pulsar. It's also bad PR because
> > > >> then incidents are associated with Pulsar, even though the business might
> > > >> not understand that the data problem wasn't necessarily caused by Pulsar.
> > > >>
> > > >> What's the right way for us to address this problem?
> > > >>
> > > >> --
> > > >> Devin Bost
> > > >> Sent from mobile
> > > >> Cell: 801-400-4602
> > > >>
> > > >
> > > >
> > > > --
> > > >
> > > > Elliot West
> > > >
> > > > Senior Platform Engineer
> > > >
> > > > elliot.w...@streamnative.io
> > > >
> > > > streamnative.io
> > > >
> > > > <https://github.com/streamnative>
> > > > <https://www.linkedin.com/company/streamnative>
> > > > <https://twitter.com/streamnativeio>
> > > >
> > >
> > >
> >
>
