Hi Devin,

This topic remains of great interest to me. I think there is still a wide
schema usability gap between traditional batch data systems (an RDBMS, for
example) and those in the messaging/streaming space.
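To make that gap concrete before getting to my questions: below is a rough
sketch (plain Python, no Pulsar APIs; the field names and allowed values are
invented for illustration) of the kind of per-message value check an RDBMS
gives you for free with a CHECK constraint, but which a streaming producer or
consumer currently has to hand-roll:

```python
import json

# Hypothetical per-message constraint, analogous to an RDBMS CHECK
# constraint: "action" must be one of a fixed set of values.
ALLOWED_ACTIONS = {"click", "impression", "hover"}

def validate_event(raw: bytes) -> dict:
    """Parse a JSON payload and reject it if 'action' is out of range."""
    event = json.loads(raw)
    if event.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"unexpected action: {event.get('action')!r}")
    return event

# A conforming message passes; a drifted one is rejected at the edge.
ok = validate_event(b'{"user": "bob", "action": "click"}')
assert ok["action"] == "click"

try:
    validate_event(b'{"user": "bob", "action": "clik"}')  # typo'd by producer
except ValueError as e:
    print("rejected:", e)
```

In the database world this check happens at write time; in the streaming
world, as things stand, every team writes their own version of it (or none).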
> Pulsar provides no mechanism for real-time schema verification of
> message content.

Are you specifically referring to validation at the broker entry point and
not in the client?

> The user experience around maintaining types/schemas between apps in
> Pulsar is not good

What are we comparing this to, though? What would the ideal data developer
workflow look like?

Thanks,

Elliot.

On Mon, 13 Mar 2023 at 16:58, Devin Bost <devin.b...@gmail.com> wrote:

> > Sorry. I do not fully understand here. Is it also related to the "data
> > quality" problem that we discussed? For the consumer side, we can use
> > the AUTO_CONSUME schema to receive GenericObject (for JSON schema, you
> > can deal with JsonObject directly). For the producer side, I think yes.
> > We can either send an Object or bytes[] (AUTO_PRODUCE).
>
> I think there are two problems:
>
> 1. Pulsar provides no mechanism for real-time schema verification of
> message content. There's no way in Pulsar to verify that the type
> registered at compile-time matches the content of the message that is
> sent at runtime. Companies that want to guarantee schema conformance on
> a per-message basis are left to implement such a mechanism on their own.
>
> 2. The user experience around maintaining types/schemas between apps in
> Pulsar is not good, but for the purpose of this thread, let's focus on
> that first problem above.
>
> Devin G. Bost
>
> On Sun, Nov 20, 2022 at 8:02 PM PengHui Li <peng...@apache.org> wrote:
>
> > Hi, Devin
> >
> > Thanks for raising the great discussion. It looks like the salient
> > point is that Pulsar doesn't support native JSON Schema. Instead, the
> > schema is defined in the Avro standard but serialized to JSON format.
> > JSON Schema combines type-based and rule-based aspects. As this
> > article[1] says, "JSON Schema combines aspects of both a grammar-based
> > language and a rule-based one", whereas the Avro schema definition has
> > only the grammar-based aspect.
> >
> > [1]
> > https://yokota.blog/2021/03/29/understanding-json-schema-compatibility/
> >
> > > One of the issues with Pulsar's current implementation of schemas
> > > for JSON is the requirement to always have a POCO or some kind of
> > > type builder to construct the schema. This requirement can be
> > > cumbersome for users who only care about a few fields on the object.
> >
> > Sorry. I do not fully understand here. Is it also related to the "data
> > quality" problem that we discussed? For the consumer side, we can use
> > the AUTO_CONSUME schema to receive GenericObject (for JSON schema, you
> > can deal with JsonObject directly). For the producer side, I think
> > yes. We can either send an Object or bytes[] (AUTO_PRODUCE).
> >
> > > Plus, the use case is a little different compared to a DLQ or Retry
> > > topic because we'd like a way to handle content failures separately
> > > from other kinds of failures.
> >
> > Yes, I agree. It's not a job for the DLQ.
> >
> > Thanks,
> > Penghui
> >
> > On Thu, Nov 17, 2022 at 7:37 AM Devin Bost <devin.b...@gmail.com> wrote:
> >
> > > I appreciate all the thoughts and questions so far.
> > >
> > > One of the issues with Pulsar's current implementation of schemas
> > > for JSON is the requirement to always have a POCO or some kind of
> > > type builder to construct the schema. This requirement can be
> > > cumbersome for users who only care about a few fields on the object.
> > > Protobuf attempts to simplify the implementation of the mapping
> > > (from data to class) by having a language-independent mechanism for
> > > defining the data (so the POCO can be generated in the desired
> > > language), but obviously, that offers very few benefits for JSON.
> > > Additionally, Protobuf and Avro don't provide a way to express
> > > constraints on data *values*. Consider an example.
> > > Let's say a site is sending messages like this:
> > >
> > > {
> > >   "user": "bob",
> > >   "action": "click",
> > >   "trackingUrn": "urn:siteA:homepage:topNavigation:0.124",
> > >   "payload": {
> > >     . . . [ *highly nested or dynamic data* ] . . .
> > >   }
> > > }
> > >
> > > Here are some issues we might run into:
> > >
> > > 1. A consumer wants to take action on messages based on a single
> > > field. They only care about whether the field exists and has an
> > > allowed value. They don't want to spend a week trying to map each of
> > > the nested fields into a POCO and then worry about maintaining the
> > > POCO when nested sub-fields are updated by upstream teams with
> > > breaking changes. Consider these use cases:
> > >   - Validate that the "action" value is oneOf: ["click",
> > >     "impression", "hover"]. Route content based on the action unless
> > >     it's an unexpected value.
> > >   - Subfields change depending on the trackingUrn values.
> > > Consider the following:
> > >   A) In the validation use case, the app developer shouldn't need to
> > >      deal with any fields other than "action", but they should be
> > >      able to express or verify that "action" is part of a data
> > >      contract they have agreed to consume from.
> > >   B) Every app like this would need to add its own runtime
> > >      validation logic, and when many different apps are using their
> > >      own versions of validation, the implementations are brittle and
> > >      become hard to maintain. The solution to the brittleness is to
> > >      adopt a standard that solves the interoperability problem.
> > >   C) If subfields are dynamic, well, there's not a good way to
> > >      express that in Avro. Maybe the developer could use maps, but I
> > >      think that defeats the purpose.
> > >
> > > 2. We should be able to compose schemas from shared "schema
> > > components" for improved reusability. (Consider it like
> > > object-oriented schema design.)
> > > JSON Schema makes this possible (see the detailed write-up here:
> > > https://json-schema.org/blog/posts/bundling-json-schema-compound-documents)
> > > but Avro does not, so Avro schemas end up with duplication
> > > everywhere, and this duplication is burdensome for developers to
> > > maintain. Consequently, some developers avoid using schemas
> > > entirely, but that has its own consequences.
> > >
> > > 3. If a message's content is invalid, send the message to an
> > > "invalid message topic". Since the concerns above are mostly around
> > > data content at runtime, Avro doesn't help us here, but for JSON
> > > content, JSON Schema's validation spec
> > > (https://json-schema.org/draft/2020-12/json-schema-validation.html#name-overview)
> > > could. Plus, the use case is a little different compared to a DLQ or
> > > Retry topic because we'd like a way to handle content failures
> > > separately from other kinds of failures.
> > >
> > > (I'm sure I can think of more examples if I give it more thought.)
> > >
> > > Devin G. Bost
> > >
> > > On Wed, Nov 16, 2022 at 6:36 AM 丛搏 <congbobo...@gmail.com> wrote:
> > >
> > > > hi, Devin:
> > > >
> > > > First, Kafka itself doesn't support schemas; Confluent's platform
> > > > does. The Pulsar schema supports validation and versioning. Are
> > > > you running into schema versions created by automatic
> > > > registration, where the data source is unclear? I think you can
> > > > turn off the producer's automatic schema registration and control
> > > > schema changes through the management side.
> > > > Docs:
> > > > https://pulsar.apache.org/docs/2.10.x/schema-manage#schema-autoupdate
> > > >
> > > > Thanks,
> > > > bo
> > > >
> > > > On Mon, 14 Nov 2022 at 20:14, Elliot West
> > > > <elliot.w...@streamnative.io.invalid> wrote:
> > > >
> > > > > While we can get caught up in the specifics of exactly how JSON
> > > > > Schema is supported in the Kafka ecosystem, it is ultimately
> > > > > possible if desired, and is common, even if not part of
> > > > > open-source Apache Kafka.
> > > > >
> > > > > Devin's assertion is that JSON Schema-compliant payload
> > > > > validation and schema evolution are not currently supportable in
> > > > > the Pulsar ecosystem and that perhaps they should be.
> > > > >
> > > > > Elliot.
> > > > >
> > > > > On Fri, 11 Nov 2022 at 14:56, Elliot West
> > > > > <elliot.w...@streamnative.io> wrote:
> > > > >
> > > > > > Hey Devin,
> > > > > >
> > > > > > *"Kafka conforms to the JSON Schema specification"*
> > > > > > Only when using Confluent's Schema Registry.
> > > > > >
> > > > > > *"if a producer makes a change or omission, such as in a value
> > > > > > used for tracking, it might not surface until way down the
> > > > > > line"*
> > > > > > So let me understand this: although the producer has a schema,
> > > > > > it does not use it for validation of JSON (as would implicitly
> > > > > > occur for Avro)? Is this correct?
> > > > > >
> > > > > > I agree that robust support for schema, certainly at the
> > > > > > edges, is a cornerstone for a data system. I also agree that
> > > > > > it would be better to adopt existing standards rather than
> > > > > > implement them in a bespoke manner.
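[Editor's aside: to illustrate what adopting the existing standard could buy
here, even a toy evaluator of a few JSON Schema keywords lets producer and
consumer share one declarative contract instead of bespoke checks. This
sketch is plain Python with no external libraries, covers only "type",
"required", and "enum", and the schema document is invented for illustration;
it is not anything Pulsar or the thread participants ship.]

```python
# A fragment of a JSON Schema-style document. A real implementation would
# use a full draft-2020-12 validator; this toy evaluator handles only the
# "type", "required", and "enum" keywords.
SCHEMA = {
    "type": "object",
    "required": ["user", "action"],
    "properties": {
        "action": {"enum": ["click", "impression", "hover"]},
    },
}

def check(instance, schema) -> list:
    """Return a list of violation messages (empty means valid)."""
    errors = []
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return ["instance is not an object"]
    for field in schema.get("required", []):
        if field not in instance:
            errors.append(f"missing required field: {field}")
    for name, sub in schema.get("properties", {}).items():
        if name in instance and "enum" in sub and instance[name] not in sub["enum"]:
            errors.append(f"{name} not in {sub['enum']}")
    return errors

# Both sides of the topic can evaluate the same document.
assert check({"user": "bob", "action": "click"}, SCHEMA) == []
assert check({"user": "bob", "action": "swipe"}, SCHEMA) == [
    "action not in ['click', 'impression', 'hover']"
]
```

The point is not the evaluator itself but that the contract is data: it can
be registered, versioned, and queried independently of any POCO.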
> > > > > > I'd be interested to hear your thoughts on the concrete
> > > > > > improvements that you believe would be necessary, for example:
> > > > > >
> > > > > > * Producer validation of JSON occurs using "JSON Schema"
> > > > > > * Evolutions of JSON Schema conform to ...
> > > > > > * Users can declare a topic schema using a JSON Schema document
> > > > > > * Users can query a topic schema and have a JSON Schema
> > > > > >   document returned to them
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Elliot.
> > > > > >
> > > > > > On Thu, 10 Nov 2022 at 16:51, Devin Bost <devin.b...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > >> One of the areas where Kafka has an advantage over Pulsar is
> > > > > >> around data quality. Kafka conforms to the JSON Schema
> > > > > >> specification, which enables integration with any technology
> > > > > >> that conforms to the standard, such as for data validation,
> > > > > >> discoverability, lineage, versioning, etc.
> > > > > >>
> > > > > >> Pulsar's implementation is non-compliant with the standard,
> > > > > >> and producers and consumers have no built-in way in Pulsar to
> > > > > >> validate that values in their messages match expectations. As
> > > > > >> a consequence, if a producer makes a change or omission, such
> > > > > >> as in a value used for tracking, it might not surface until
> > > > > >> way down the line, and then it can be very difficult to track
> > > > > >> down the source of the problem, which kills the agility of
> > > > > >> teams responsible for maintaining apps using Pulsar. It's
> > > > > >> also bad PR, because then incidents are associated with
> > > > > >> Pulsar, even though the business might not understand that
> > > > > >> the data problem wasn't necessarily caused by Pulsar.
> > > > > >>
> > > > > >> What's the right way for us to address this problem?
> > > > > >>
> > > > > >> --
> > > > > >> Devin Bost
> > > > > >> Sent from mobile
> > > > > >> Cell: 801-400-4602

--

Elliot West

Senior Platform Engineer

elliot.w...@streamnative.io

streamnative.io

<https://github.com/streamnative>
<https://www.linkedin.com/company/streamnative>
<https://twitter.com/streamnativeio>