I appreciate all the thoughts and questions so far.

One of the issues with Pulsar's current implementation of schemas for JSON is the requirement to always have a POCO or some kind of type builder to construct the schema. This requirement can be cumbersome for users who only care about a few fields on the object. Protobuf attempts to simplify the mapping (from data to class) by providing a language-independent mechanism for defining the data (so the POCO can be generated in the desired language), but obviously that offers very few benefits for JSON. Additionally, protobuf and Avro don't provide a way to express constraints on data *values*.

Consider an example. Let's say a site is sending messages like this:

{
  "user": "bob",
  "action": "click",
  "trackingUrn": "urn:siteA:homepage:topNavigation:0.124",
  "payload": { . . . [ *highly nested or dynamic data* ] . . . }
}
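To make that concrete, here's a minimal sketch of a JSON Schema for such a message. This is purely illustrative (the "$id" URL is made up, and nothing here is something Pulsar supports today): it constrains only the fields a consumer actually cares about and deliberately leaves the dynamic payload open.

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/schemas/site-event.json",
  "type": "object",
  "required": ["action", "trackingUrn"],
  "properties": {
    "action": { "enum": ["click", "impression", "hover"] },
    "trackingUrn": { "type": "string", "pattern": "^urn:" },
    "payload": { "type": "object" }
  },
  "additionalProperties": true
}

A consumer could then validate and route against that contract with any off-the-shelf validator and never model the nested payload as a POCO. For illustration only (assuming the Python jsonschema package; the constant and topic names are invented for this example):

from jsonschema import validate, ValidationError

# Trimmed-down version of the schema sketched above; the name is made up
# for this example.
SITE_EVENT_SCHEMA = {
    "type": "object",
    "required": ["action", "trackingUrn"],
    "properties": {
        "action": {"enum": ["click", "impression", "hover"]},
    },
}

def route(msg: dict) -> str:
    # Validate only the agreed-upon fields; the nested payload stays opaque.
    try:
        validate(instance=msg, schema=SITE_EVENT_SCHEMA)
    except ValidationError:
        return "invalid-messages"        # hand content failures off separately
    return "events-" + msg["action"]     # route by the validated action value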
Here are some issues we might run into:

1. A consumer wants to take action on messages based on a single field. They only care whether the field exists and has an allowed value. They don't want to spend a week trying to map each of the nested fields into a POCO and then worry about maintaining the POCO when nested sub-fields are updated by upstream teams with breaking changes. Consider these use cases:
   - Validate that the "action" value is oneOf: [ "click", "impression", "hover" ]. Route content based on the action unless it's an unexpected value.
   - Subfields change depending on the trackingUrn values.
   Consider the following:
   A) In the validation use case, the app developer shouldn't need to deal with any fields other than "action", but they should be able to express or verify that "action" is part of a data contract they have agreed to consume from.
   B) Every app like this would need to add its own runtime validation logic, and when many different apps are using their own versions of validation, the implementations are brittle and become hard to maintain. The solution to the brittleness is to adopt a standard that solves the interoperability problem.
   C) If subfields are dynamic, there's not a good way to express that in Avro. Maybe the developer could use maps, but I think that defeats the purpose.

2. We should be able to compose schemas from shared "schema components" for improved reusability. (Think of it as object-oriented schema design.) JSON Schema makes this possible (see the detailed write-up here: <https://json-schema.org/blog/posts/bundling-json-schema-compound-documents>) but Avro does not, so Avro schemas end up with duplication everywhere, and that duplication is burdensome for developers to maintain. Consequently, some developers avoid using schemas entirely, but that has its own consequences.

3. If a message's content is invalid, send the message to an "invalid message topic". Since the concerns above are mostly about data content at runtime, Avro doesn't help us here, but for JSON content, JSON Schema's validation spec <https://json-schema.org/draft/2020-12/json-schema-validation.html#name-overview> could. The use case is also a little different from a DLQ or retry topic, because we'd like a way to handle content failures separately from other kinds of failures.

(I'm sure I can think of more examples if I give it more thought.)

Devin G. Bost

On Wed, Nov 16, 2022 at 6:36 AM 丛搏 <congbobo...@gmail.com> wrote:
> hi, Devin:
> First, Kafka itself doesn't support schema; Confluent does.
> Pulsar schema supports validation and versioning. Are you encountering
> schema versions caused by automatic registration, where the data source
> is unclear? I think you can turn off the producer's automatic schema
> registration and control schema changes through the management side.
> doc: https://pulsar.apache.org/docs/2.10.x/schema-manage#schema-autoupdate
>
> Thanks,
> bo
>
> Elliot West <elliot.w...@streamnative.io.invalid> wrote on Mon, Nov 14, 2022 at 20:14:
> >
> > While we can get caught up in the specifics of exactly how JSON Schema is
> > supported in the Kafka ecosystem, it is ultimately possible if desired, and
> > is common, even if not part of open-source Apache Kafka.
> >
> > Devin's assertion is that JSON Schema compliant payload validation
> > and schema evolution are not currently supportable in the Pulsar ecosystem
> > and that perhaps they should be.
> >
> > Elliot.
> >
> > On Fri, 11 Nov 2022 at 14:56, Elliot West <elliot.w...@streamnative.io>
> > wrote:
> >
> > > Hey Devin,
> > >
> > > *"Kafka conforms to the JSON Schema specification"*
> > > Only when using Confluent's Schema Registry.
> > >
> > > *"if a producer makes a change or omission, such as in a value used for
> > > tracking, it might not surface until way down the line"*
> > > So let me understand this: Although the producer has a schema, it does not
> > > use it for validation of JSON (as would implicitly occur for Avro)? Is this
> > > correct?
> > >
> > > I agree that robust support for schema, certainly at the edges, is a
> > > cornerstone for a data system. I also agree that it would be better to
> > > adopt existing standards rather than implement them in a bespoke manner.
> > >
> > > I'd be interested to hear your thoughts on concrete improvements that you
> > > believe would be necessary - for example:
> > >
> > > * Producer validation of JSON occurs using "JSON Schema"
> > > * Evolutions of JSON Schema conform to ...
> > > * Users can declare topic schema using a JSON Schema document
> > > * Users can query topic schema and have a JSON Schema document returned
> > > to them
> > >
> > > Thanks,
> > >
> > > Elliot.
> > >
> > > On Thu, 10 Nov 2022 at 16:51, Devin Bost <devin.b...@gmail.com> wrote:
> > >
> > >> One of the areas where Kafka has an advantage over Pulsar is around data
> > >> quality. Kafka conforms to the JSON Schema specification, which enables
> > >> integration with any technology that conforms to the standard, such as for
> > >> data validation, discoverability, lineage, versioning, etc.
> > >> Pulsar's implementation is non-compliant with the standard, and producers
> > >> and consumers have no built-in way in Pulsar to validate that values in
> > >> their messages match expectations. As a consequence, if a producer makes a
> > >> change or omission, such as in a value used for tracking, it might not
> > >> surface until way down the line, and then it can be very difficult to track
> > >> down the source of the problem, which kills the agility of teams
> > >> responsible for maintaining apps using Pulsar. It's also bad PR because
> > >> then incidents are associated with Pulsar, even though the business might
> > >> not understand that the data problem wasn't necessarily caused by Pulsar.
> > >>
> > >> What's the right way for us to address this problem?
> > >>
> > >> --
> > >> Devin Bost
> > >> Sent from mobile
> > >> Cell: 801-400-4602
> > >>
> > >
> > > --
> > >
> > > Elliot West
> > >
> > > Senior Platform Engineer
> > >
> > > elliot.w...@streamnative.io
> > >
> > > streamnative.io
> > >
> > > <https://github.com/streamnative>
> > > <https://www.linkedin.com/company/streamnative>
> > > <https://twitter.com/streamnativeio>
> >
>