> Sorry, I do not fully understand here. Is it also related to the "data
> quality" problem that we discussed? For the consumer side, we can use the
> AUTO_CONSUME schema to receive a GenericObject (for JSON schema, you can
> deal with the JsonObject directly).
> For the producer side, I think yes. We can either send an Object or
> byte[] (AUTO_PRODUCE).
I think there are two problems:

1. Pulsar provides no mechanism for real-time schema verification of
message content. There's no way in Pulsar to verify that the type
registered at compile time matches the content of the message that is sent
at runtime. Companies that want to guarantee schema conformance on a
per-message basis are left to implement such a mechanism on their own.

2. The user experience around maintaining types/schemas between apps in
Pulsar is not good, but for the purpose of this thread, let's focus on the
first problem above.

Devin G. Bost


On Sun, Nov 20, 2022 at 8:02 PM PengHui Li <peng...@apache.org> wrote:

> Hi, Devin
>
> Thanks for raising this great discussion. It looks like the salient point
> is that Pulsar doesn't support native JSON Schema. Instead, the schema is
> defined in the Avro standard but serialized to JSON format. JSON Schema
> combines type-based and rule-based aspects. As this article[1] says,
> "JSON Schema combines aspects of both a grammar-based language and a
> rule-based one", but the Avro schema definition only has the
> grammar-based aspect.
>
> [1]
> https://yokota.blog/2021/03/29/understanding-json-schema-compatibility/
>
> > One of the issues with Pulsar's current implementation of schemas for
> > JSON is the requirement to always have a POCO or some kind of type
> > builder to construct the schema. This requirement can be cumbersome
> > for users who only care about a few fields on the object.
>
> Sorry, I do not fully understand here. Is it also related to the "data
> quality" problem that we discussed? For the consumer side, we can use the
> AUTO_CONSUME schema to receive a GenericObject (for JSON schema, you can
> deal with the JsonObject directly).
> For the producer side, I think yes. We can either send an Object or
> byte[] (AUTO_PRODUCE).
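To make the first problem concrete, here is a rough sketch of the kind of ad-hoc, per-message content check that each company currently has to build on its own. This is plain illustrative Python, not Pulsar API; the `action` field and its allowed values come from the example later in this thread, and the helper itself is hypothetical:

```python
import json

# Hypothetical, hand-rolled content check -- the kind of bespoke logic
# each team must currently write, since Pulsar won't verify message
# content against a schema at publish or delivery time.
ALLOWED_ACTIONS = {"click", "impression", "hover"}

def validate_content(raw: bytes) -> list:
    """Return a list of content errors (empty means the message is valid)."""
    errors = []
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as e:
        return ["not valid JSON: %s" % e]
    action = msg.get("action")
    if action is None:
        errors.append("missing required field 'action'")
    elif action not in ALLOWED_ACTIONS:
        errors.append("unexpected action %r" % action)
    return errors

print(validate_content(b'{"user": "bob", "action": "click"}'))  # []
print(validate_content(b'{"user": "bob", "action": "swipe"}'))  # ["unexpected action 'swipe'"]
```

Every team ends up re-inventing some variant of this, which is exactly the brittleness and interoperability problem being discussed.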
> > Plus, the use case is a little different compared to a DLQ or Retry
> > topic because we'd like a way to handle content failures separately
> > from other kinds of failures.
>
> Yes, I agree. It's not in the scope of the DLQ.
>
> Thanks,
> Penghui
>
> On Thu, Nov 17, 2022 at 7:37 AM Devin Bost <devin.b...@gmail.com> wrote:
>
> > I appreciate all the thoughts and questions so far.
> >
> > One of the issues with Pulsar's current implementation of schemas for
> > JSON is the requirement to always have a POCO or some kind of type
> > builder to construct the schema. This requirement can be cumbersome
> > for users who only care about a few fields on the object.
> > Protobuf attempts to simplify the implementation of the mapping (from
> > data to class) by having a language-independent mechanism for defining
> > the data (so the POCO can be generated in the desired language), but
> > obviously, that offers very few benefits for JSON. Additionally,
> > Protobuf and Avro don't provide a way to express constraints on data
> > *values*. Consider an example. Let's say a site is sending messages
> > like this:
> >
> > {
> >   "user": "bob",
> >   "action": "click",
> >   "trackingUrn": "urn:siteA:homepage:topNavigation:0.124",
> >   "payload": {
> >     . . .
> >     [ *highly nested or dynamic data* ]
> >     . . .
> >   }
> > }
> >
> > Here are some issues we might run into:
> > 1. A consumer wants to take action on messages based on a single
> > field. They only care about whether the field exists and has an
> > allowed value. They don't want to spend a week trying to map each of
> > the nested fields into a POCO and then worry about maintaining the
> > POCO when nested sub-fields are updated by upstream teams with
> > breaking changes. Consider these use cases:
> > - Validate that the "action" value is oneOf: ["click", "impression",
> > "hover"]. Route content based on the action unless it's an unexpected
> > value.
> > - Subfields change depending on the trackingUrn values.
> > Consider the following:
> > A) In the validation use case, the app developer shouldn't need to
> > deal with any fields other than "action", but they should be able to
> > express or verify that "action" is part of a data contract they have
> > agreed to consume from.
> > B) Every app like this would need to add its own runtime validation
> > logic, and when many different apps are using their own versions of
> > validation, the implementations are brittle and become hard to
> > maintain. The solution to the brittleness is to adopt a standard that
> > solves the interoperability problem.
> > C) If subfields are dynamic, well, there's not a good way to express
> > that in Avro. Maybe the developer could use maps, but I think that
> > defeats the purpose.
> > 2. We should be able to compose schemas from shared "schema
> > components" for improved reusability. (Consider it like
> > object-oriented schema design.) JSON Schema makes this possible (see
> > the detailed write-up here:
> > https://json-schema.org/blog/posts/bundling-json-schema-compound-documents)
> > but Avro does not, so Avro schemas end up with duplication everywhere,
> > and this duplication is burdensome for developers to maintain.
> > Consequently, some developers avoid using schemas entirely, but that
> > has its own consequences.
> > 3. If a message's content is invalid, send the message to an "invalid
> > message topic". Since the concerns above are mostly around data
> > content at runtime, Avro doesn't help us here, but for JSON content,
> > JSON Schema's validation spec
> > (https://json-schema.org/draft/2020-12/json-schema-validation.html#name-overview)
> > could. Plus, the use case is a little different compared to a DLQ or
> > Retry topic because we'd like a way to handle content failures
> > separately from other kinds of failures.
> >
> > (I'm sure I can think of more examples if I give it more thought.)
> >
> > Devin G. Bost
> >
> >
> > On Wed, Nov 16, 2022 at 6:36 AM 丛搏 <congbobo...@gmail.com> wrote:
> >
> > > hi, Devin:
> > > First, Kafka itself doesn't support schemas; Confluent does.
> > > The Pulsar schema supports validation and versioning. Are you
> > > encountering unexpected schema versions caused by automatic
> > > registration, where the data source is not clear? I think you can
> > > turn off the producer's automatic schema registration and control
> > > schema changes through the management side.
> > > doc:
> > > https://pulsar.apache.org/docs/2.10.x/schema-manage#schema-autoupdate
> > >
> > > Thanks,
> > > bo
> > >
> > > On Mon, Nov 14, 2022 at 20:14, Elliot West
> > > <elliot.w...@streamnative.io.invalid> wrote:
> > > >
> > > > While we can get caught up in the specifics of exactly how JSON
> > > > Schema is supported in the Kafka ecosystem, it is ultimately
> > > > possible if desired, and is common, even if not part of
> > > > open-source Apache Kafka.
> > > >
> > > > Devin's assertion is that JSON Schema-compliant payload validation
> > > > and schema evolution are not currently supportable in the Pulsar
> > > > ecosystem, and that perhaps they should be.
> > > >
> > > > Elliot.
> > > >
> > > >
> > > > On Fri, 11 Nov 2022 at 14:56, Elliot West
> > > > <elliot.w...@streamnative.io> wrote:
> > > >
> > > > > Hey Devin,
> > > > >
> > > > > *"Kafka conforms to the JSON Schema specification"*
> > > > > Only when using Confluent's Schema Registry.
> > > > >
> > > > > *"if a producer makes a change or omission, such as in a value
> > > > > used for tracking, it might not surface until way down the line"*
> > > > > So let me understand this: although the producer has a schema,
> > > > > it does not use it for validation of JSON (as would implicitly
> > > > > occur for Avro)? Is this correct?
> > > > >
> > > > > I agree that robust support for schema, certainly at the edges,
> > > > > is a cornerstone for a data system. I also agree that it would
> > > > > be better to adopt existing standards rather than implement them
> > > > > in a bespoke manner.
> > > > >
> > > > > I'd be interested to hear your thoughts on the concrete
> > > > > improvements that you believe would be necessary - for example:
> > > > >
> > > > > * Producer validation of JSON occurs using "JSON Schema"
> > > > > * Evolutions of JSON Schema conform to ...
> > > > > * Users can declare topic schema using a JSON Schema document
> > > > > * Users can query topic schema and have a JSON Schema document
> > > > > returned to them
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Elliot.
> > > > >
> > > > >
> > > > > On Thu, 10 Nov 2022 at 16:51, Devin Bost <devin.b...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> One of the areas where Kafka has an advantage over Pulsar is
> > > > >> around data quality. Kafka conforms to the JSON Schema
> > > > >> specification, which enables integration with any technology
> > > > >> that conforms to the standard, such as for data validation,
> > > > >> discoverability, lineage, versioning, etc.
> > > > >> Pulsar's implementation is non-compliant with the standard, and
> > > > >> producers and consumers have no built-in way in Pulsar to
> > > > >> validate that values in their messages match expectations. As a
> > > > >> consequence, if a producer makes a change or omission, such as
> > > > >> in a value used for tracking, it might not surface until way
> > > > >> down the line, and then it can be very difficult to track down
> > > > >> the source of the problem, which kills the agility of the teams
> > > > >> responsible for maintaining apps that use Pulsar. It's also bad
> > > > >> PR, because incidents then get associated with Pulsar even
> > > > >> though the business might not understand that the data problem
> > > > >> wasn't necessarily caused by Pulsar.
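The "invalid message topic" idea raised earlier in the thread, diverting content failures away from the normal DLQ/Retry path based on a JSON Schema constraint, could be sketched roughly like this. The schema dict is a genuine JSON Schema fragment (an `enum` constraint on `action`), but the tiny checker and the topic names are hypothetical illustrations, since Pulsar provides no such built-in mechanism today:

```python
import json

# A real JSON Schema fragment: "action" is required and must be one of
# the enumerated values. Only these two keywords are checked below.
ACTION_SCHEMA = {
    "type": "object",
    "required": ["action"],
    "properties": {
        "action": {"enum": ["click", "impression", "hover"]},
    },
}

def conforms(msg: dict, schema: dict) -> bool:
    """Minimal check covering just the 'required' and 'enum' keywords."""
    for field in schema.get("required", []):
        if field not in msg:
            return False
    for field, rule in schema.get("properties", {}).items():
        if field in msg and "enum" in rule and msg[field] not in rule["enum"]:
            return False
    return True

def route(raw: bytes) -> str:
    """Pick a destination topic based on content validity (topic names
    are placeholders for illustration)."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid-message-topic"
    return "tracking-events" if conforms(msg, ACTION_SCHEMA) else "invalid-message-topic"

print(route(b'{"action": "hover"}'))  # tracking-events
print(route(b'{"action": "swipe"}'))  # invalid-message-topic
```

A full implementation would delegate `conforms` to a JSON Schema validator library rather than hand-checking keywords, but the routing shape would be the same.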
> > > > >>
> > > > >> What's the right way for us to address this problem?
> > > > >>
> > > > >> --
> > > > >> Devin Bost
> > > > >> Sent from mobile
> > > > >> Cell: 801-400-4602
> > > > >
> > > > > --
> > > > >
> > > > > Elliot West
> > > > >
> > > > > Senior Platform Engineer
> > > > >
> > > > > elliot.w...@streamnative.io
> > > > >
> > > > > streamnative.io
> > > > >
> > > > > <https://github.com/streamnative>
> > > > > <https://www.linkedin.com/company/streamnative>
> > > > > <https://twitter.com/streamnativeio>
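As a follow-up to bo's pointer to the schema-autoupdate docs: disabling producers' automatic schema registration is configured per namespace with pulsar-admin. This is an ops fragment, not runnable here, and flag names may vary between Pulsar releases, so check the docs for your version:

```shell
# Disable automatic schema updates so that only the management side can
# evolve schemas ("tenant/namespace" is a placeholder).
bin/pulsar-admin namespaces set-is-allow-auto-update-schema --disable tenant/namespace

# Verify the setting.
bin/pulsar-admin namespaces get-is-allow-auto-update-schema tenant/namespace
```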