Re: Data quality problem

Elliot West Mon, 14 Nov 2022 04:14:27 -0800

While we can get caught up in the specifics of exactly how JSON Schema is
supported in the Kafka ecosystem, that support is ultimately available if
desired, and is commonly used, even though it is not part of open-source
Apache Kafka.


Devin's assertion is that JSON Schema-compliant payload validation
and schema evolution are not currently supportable in the Pulsar ecosystem
and that perhaps they should be.
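
To make "payload validation" concrete, here is a rough sketch of the kind of
producer-side check being discussed: reject a message before it is published
if it does not match the declared schema. Everything in it is an assumption
made for illustration. The schema, the field names (event_id, tracking_id,
amount) and the helper names (validate, encode_or_reject) are hypothetical,
and the validator is a hand-rolled subset covering only the "type",
"required" and "properties" keywords so that it runs on the standard library
alone. It is not a compliant JSON Schema implementation and it calls no
Pulsar or Kafka API.

```python
import json

# Hypothetical schema for a tracking event; field names are illustrative only.
TRACKING_SCHEMA = {
    "type": "object",
    "required": ["event_id", "tracking_id"],
    "properties": {
        "event_id": {"type": "string"},
        "tracking_id": {"type": "string"},
        "amount": {"type": "number"},
    },
}

# Maps JSON Schema type names to the Python types produced by json.loads.
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}


def validate(value, schema, path="$"):
    """Return a list of violations; supports only type/required/properties."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, but JSON keeps them distinct.
        if expected in ("number", "integer") and isinstance(value, bool):
            ok = False
        if not ok:
            errors.append(f"{path}: expected {expected}")
            return errors
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}: missing required property '{name}'")
        for name, sub in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub, f"{path}.{name}"))
    return errors


def encode_or_reject(payload, schema):
    """Serialize the payload for sending, or raise before it is published."""
    errors = validate(payload, schema)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(payload).encode("utf-8")
```

With a check like this at the producer, an omitted tracking value fails at the
point of send rather than surfacing far downstream. The bytes returned by
encode_or_reject would be handed to whatever client sends the message, and a
real deployment would use a full JSON Schema validator in place of the subset
above.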

Elliot.


On Fri, 11 Nov 2022 at 14:56, Elliot West <elliot.w...@streamnative.io>
wrote:

> Hey Devin,
>
> *"Kafka conforms to the JSON Schema specification"*
> Only when using Confluent's Schema Registry.
>
> *"if a producer makes a change or omission, such as in a value used for
> tracking, it might not surface until way down the line"*
> So let me understand this: although the producer has a schema, it does not
> use it for validation of JSON (as would implicitly occur for Avro)? Is this
> correct?
>
> I agree that robust support for schema, certainly at the edges, is a
> cornerstone for a data system. I also agree that it would be better to
> adopt existing standards rather than implement them in a bespoke manner.
>
> I'd be interested to hear your thoughts on concrete improvements that you
> believe would be necessary - for example:
>
> * Producer validation of JSON occurs using "JSON Schema"
> * Evolutions of JSON Schema conform to ...
> * Users can declare topic schema using a JSON Schema document
> * Users can query topic schema and have a JSON schema document returned to
> them
>
> Thanks,
>
> Elliot.
>
> On Thu, 10 Nov 2022 at 16:51, Devin Bost <devin.b...@gmail.com> wrote:
>
>> One of the areas where Kafka has an advantage over Pulsar is around data
>> quality. Kafka conforms to the JSON Schema specification, which enables
>> integration with any technology that conforms to the standard, such as for
>> data validation, discoverability, lineage, versioning, etc.
>> Pulsar's implementation is non-compliant with the standard, and producers
>> and consumers have no built-in way in Pulsar to validate that values in
>> their messages match expectations. As a consequence, if a producer makes a
>> change or omission, such as in a value used for tracking, it might not
>> surface until way down the line, and then it can be very difficult to
>> track down the source of the problem, which kills the agility of teams
>> responsible for maintaining apps using Pulsar. It's also bad PR because
>> then incidents are associated with Pulsar, even though the business might
>> not understand that the data problem wasn't necessarily caused by Pulsar.
>>
>> What's the right way for us to address this problem?
>>
>> --
>> Devin Bost
>> Sent from mobile
>> Cell: 801-400-4602
>>
>
>
> --
>
> Elliot West
>
> Senior Platform Engineer
>
> elliot.w...@streamnative.io
>
> streamnative.io
>
> <https://github.com/streamnative>
> <https://www.linkedin.com/company/streamnative>
> <https://twitter.com/streamnativeio>
>


