Good point. It looks like the AUTO_PRODUCE schema is semantically
similar to the BYTES schema.
So can we give the BYTES schema the features of AUTO_PRODUCE?
There are several reasons to do this.
Firstly, it does not cause compatibility issues. Currently, topics that
have messages sent with the BYTES schema can only be consumed with the
BYTES schema or the AUTO_CONSUME schema, so we can make them consumable
with other schemas without affecting user logic that already uses the
schema.
Secondly, the BYTES schema is easier to understand than the AUTO_PRODUCE
schema.
Finally, the BYTES schema is currently an exclusive schema like all the
others, yet it has special logic in Pulsar that differs from the other
schemas. If we give the BYTES schema the features of AUTO_PRODUCE, it
becomes a special schema whose special logic is justified, and we can
remove a seemingly redundant schema, AUTO_PRODUCE.
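To illustrate the semantics the two schemas would share, here is a
minimal, self-contained Java sketch. The validator below is a
hypothetical stand-in for decoding against the topic's registered
schema; it is not real Pulsar code:

```java
import java.nio.charset.StandardCharsets;

public class AutoProduceSketch {
    // Hypothetical stand-in for decoding with the topic's registered schema.
    // The real AUTO_PRODUCE schema decodes the bytes with the topic's
    // Avro/JSON schema and throws on failure.
    static void validateAgainstTopicSchema(byte[] payload) {
        String s = new String(payload, StandardCharsets.UTF_8);
        if (!(s.startsWith("{") && s.endsWith("}"))) {
            throw new IllegalArgumentException("payload does not match topic schema");
        }
    }

    // What AUTO_PRODUCE does today, and what BYTES would do under this
    // proposal: validate first, then hand the unchanged bytes to the producer.
    static byte[] send(byte[] payload) {
        validateAgainstTopicSchema(payload);
        return payload; // in real code: producer.send(payload)
    }

    public static void main(String[] args) {
        byte[] ok = "{\"name\":\"alice\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(send(ok), StandardCharsets.UTF_8)); // accepted unchanged
        try {
            send("garbage".getBytes(StandardCharsets.UTF_8));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // invalid payload never reaches the producer
        }
    }
}
```

Today only AUTO_PRODUCE performs this check before sending; under this
proposal, a BYTES producer on a topic with a registered schema would
behave the same way.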

I may not fully understand all of this. If there are any questions, feel
free to point them out.

Sincerely,
Xiangying

On Wed, Dec 14, 2022 at 3:12 PM 丛搏 <congbobo...@gmail.com> wrote:

> >
> > > the user only creates one producer to send all Kafka topic data; if
> > using Pulsar schema, the user needs to create all schema producers in
> > a map
> >
> > It doesn't make sense to me. If the source topic has messages of
> > multiple schemas, why did you try to sink them into the same topic
> > with a schema? The key point of AUTO_PRODUCE schema is to download the
> > schema to validate the source messages. But if the schema of the topic
> > evolved, the remaining messages from the source topic could not be sent to
> > the topic.
> >
> Let me give you an example. An AvroSchema can have multiple versions:
> version(0):
> Student {
>     String name;
> }
> version(1):
> Student {
>     String name;
>     int age;
> }
> How can you create two Student classes in one Java process and use
> the same namespace? It's not only the schema type that changes; there
> can also be multiple schema versions.
> In this case, how do you create two producers, one with version(0)
> and one with version(1)?
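The constraint above can be sketched in self-contained Java. The
encoders below are hypothetical stand-ins for version-specific Avro
serializers, since two generated `Student` classes cannot coexist in
one process; this is not real Pulsar code:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class MultiVersionSketch {
    // Hypothetical stand-ins for Avro encoders of Student version(0) {name}
    // and version(1) {name, age}. Records are generic maps because two
    // Student classes cannot both be loaded in one Java process.
    static final Map<Integer, Function<Map<String, Object>, byte[]>> ENCODERS = new HashMap<>();
    static {
        ENCODERS.put(0, r -> ("v0|" + r.get("name")).getBytes(StandardCharsets.UTF_8));
        ENCODERS.put(1, r -> ("v1|" + r.get("name") + "|" + r.get("age")).getBytes(StandardCharsets.UTF_8));
    }

    // One byte[]-producing path serves every schema version; a typed
    // Producer<Student> would require one Java class per version.
    static byte[] encode(int version, Map<String, Object> record) {
        return ENCODERS.get(version).apply(record);
    }

    public static void main(String[] args) {
        System.out.println(new String(encode(0, Map.<String, Object>of("name", "alice")), StandardCharsets.UTF_8)); // prints "v0|alice"
        System.out.println(new String(encode(1, Map.<String, Object>of("name", "alice", "age", 20)), StandardCharsets.UTF_8)); // prints "v1|alice|20"
    }
}
```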
>
> > The most confusing part is that the AUTO_PRODUCE schema performs
> > message format validation before sending. It happens transparently to
> > users, which is not intuitive. IMO, it's better to call validate
> > explicitly, like
> >
> > ```java
> > producer.newMessage().value(bytes).validate().sendAsync();
> > ```
> >
> > There are two benefits:
> > 1. It's clear that the message validation happens before sending.
> > 2. If users don't want to validate before sending, they can choose to
> > send the bytes directly and validate the message during consumption.
> Using `schema.validate()` alone is enough; data validation does not
> belong to the Pulsar message, and we can add a usage description to
> the schema docs.
> >
> > The performance problem of the AUTO_PRODUCE schema is that the
> > validation happens twice and it cannot be controlled.
>
> Our data verification is a client-side behavior, not a broker-side
> behavior. Therefore, we cannot effectively verify that the bytes were
> generated by a specific schema. I think this is something users should
> consider rather than something Pulsar should guarantee, because with
> client-side-only verification you cannot control whether the data a
> user sends was really generated by the declared schema. So we don't
> need to verify twice. We could verify in the broker instead, but that
> is an overhead; we could add a config to control it, but is it really
> necessary?
>
> Thanks,
> Bo
>
> Yunze Xu <y...@streamnative.io.invalid> wrote on Wed, Dec 14, 2022 at 12:40:
> >
> > > the user only creates one producer to send all Kafka topic data; if
> > using Pulsar schema, the user needs to create all schema producers in
> > a map
> >
> > It doesn't make sense to me. If the source topic has messages of
> > multiple schemas, why did you try to sink them into the same topic
> > with a schema? The key point of AUTO_PRODUCE schema is to download the
> > schema to validate the source messages. But if the schema of the topic
> > evolved, the remaining messages from the source topic could not be sent to
> > the topic.
> >
> > The most confusing part is that the AUTO_PRODUCE schema performs
> > message format validation before sending. It happens transparently to
> > users, which is not intuitive. IMO, it's better to call validate
> > explicitly, like
> >
> > ```java
> > producer.newMessage().value(bytes).validate().sendAsync();
> > ```
> >
> > There are two benefits:
> > 1. It's clear that the message validation happens before sending.
> > 2. If users don't want to validate before sending, they can choose to
> > send the bytes directly and validate the message during consumption.
> >
> > The performance problem of the AUTO_PRODUCE schema is that the
> > validation happens twice and it cannot be controlled.
> >
> > Thanks,
> > Yunze
> >
> > On Wed, Dec 14, 2022 at 12:01 PM 丛搏 <bog...@apache.org> wrote:
> > >
> > > Hi, Yunze:
> > >
> > > Yunze Xu <y...@streamnative.io.invalid> wrote on Wed, Dec 14, 2022 at 02:26:
> > >
> > > > First, how do you guarantee the schema can be used to encode the raw
> > > > bytes whose format is unknown?
> > > I think this is something the user needs to ensure: the user must
> > > know all the schemas in the Kafka topic and that the data (byte[])
> > > can be sent with a Pulsar schema.
> > > >
> > > > Second, messages that cannot be encoded by the schema can only be
> > > > discarded, i.e. message lost.
> > > If the encoding fails, it shows that the user does not know how to
> > > convert the Kafka data's schema to a Pulsar schema, which is the
> > > user's own problem.
> > > >
> > > > Third, schema in Pulsar is convenient because it supports sending
> > > > any object of type `T`, with the Pulsar client responsible for
> > > > serializing `T` to bytes. However, when using the AUTO_PRODUCE
> > > > schema, the producer still sends raw bytes.
> > > the user only creates one producer to send all Kafka topic data; if
> > > using the Pulsar schema, the user needs to create all the schema
> > > producers in a map and get the right one to send each message.
> > >
> > >
> > > In my understanding, AUTO_PRODUCE mainly reduces the number of
> > > producers created by the client, which brings convenience to users
> > > migrating data; it is not for dealing with unknown schema data. To
> > > use it correctly, you must know the schema of all the data, which
> > > can then be converted into a Pulsar schema. Otherwise, it would be
> > > best to handle it yourself using the BYTES schema.
> > >
> > > Thanks,
> > > Bo
>
