> the user only creates one producer to send all Kafka topic data, if using Pulsar schema, the user needs to create all schema producers in a map
It doesn't make sense to me. If the source topic has messages of
multiple schemas, why would you try to sink them into the same topic
with a single schema? The key point of the AUTO_PRODUCE schema is to
download the schema to validate the source messages. But if the schema
of the topic has evolved, the remaining messages from the source topic
could not be sent to the topic.

The most confusing part is that the AUTO_PRODUCE schema performs
message format validation before sending. It's transparent to users
but not intuitive. IMO, it's better to call the validation explicitly,
like:

```java
producer.newMessage().value(bytes).validate().sendAsync();
```

There are two benefits:
1. It's clear that the message validation happens before sending.
2. If users don't want to validate before sending, they can choose to
   send the bytes directly and validate the message during consumption.

The performance problem of the AUTO_PRODUCE schema is that the
validation happens twice and it cannot be controlled.

Thanks,
Yunze

On Wed, Dec 14, 2022 at 12:01 PM 丛搏 <bog...@apache.org> wrote:
>
> Hi, Yunze:
>
> On Wed, Dec 14, 2022 at 02:26, Yunze Xu <y...@streamnative.io.invalid> wrote:
>
> > First, how do you guarantee the schema can be used to encode the raw
> > bytes whose format is unknown?
>
> I think this is what the user needs to ensure: the user knows all the
> schemas from the Kafka topic and the data (byte[]) that the user can
> send with a Pulsar schema.
>
> > Second, messages that cannot be encoded by the schema can only be
> > discarded, i.e. message lost.
>
> If the encoding fails, it proves that the user does not know how to
> convert the Kafka data's schema to a Pulsar schema, which is the
> user's own problem.
>
> > Third, schema in Pulsar is convenient because it can support sending
> > any object of type `T` and the Pulsar client is responsible to
> > serialize `T` to the bytes. However, when using AUTO_PRODUCE schema,
> > the producer still sends raw bytes.
> the user only creates one producer to send all Kafka topic data, if
> using Pulsar schema, the user needs to create all schema producers in
> a map, and get the schema producer to send a message.
>
> In my understanding, AUTO_PRODUCE mainly reduces the number of
> producers created by the client, which brings convenience to users
> migrating data, rather than dealing with unknown-schema data. If you
> want to use it correctly, you must know the schema of all the data,
> which can be converted into a Pulsar schema. Otherwise, it would be
> best to handle it yourself using the BYTES schema.
>
> Thanks,
> Bo
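Bo's "all schema producers in a map" alternative can be sketched as a lazily populated per-schema producer cache. The snippet below is a minimal, self-contained illustration of just the lookup pattern: the `Producer` interface here is a hypothetical stand-in so the example runs without a broker; in real code the map values would be `org.apache.pulsar.client.api.Producer` instances built per schema from the Pulsar client.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SchemaProducerCache {
    // Hypothetical stand-in for a Pulsar producer so the sketch is
    // runnable; real code would hold org.apache.pulsar.client.api.Producer
    // instances created with the matching schema.
    interface Producer {
        String send(byte[] payload);
    }

    final Map<String, Producer> producers = new ConcurrentHashMap<>();

    // One producer per schema name, created lazily on first use and
    // reused for every later message with the same schema.
    Producer forSchema(String schemaName) {
        return producers.computeIfAbsent(schemaName,
                name -> payload -> name + ":" + payload.length);
    }

    public static void main(String[] args) {
        SchemaProducerCache cache = new SchemaProducerCache();
        // Messages with two different schemas go through two producers.
        System.out.println(cache.forSchema("user-avro").send(new byte[4]));
        System.out.println(cache.forSchema("order-json").send(new byte[8]));
        // The cache holds exactly one producer per schema.
        System.out.println(cache.producers.size());
    }
}
```

This is the trade-off the thread describes: with plain schemas the client keeps one producer per schema, whereas AUTO_PRODUCE lets a single producer send raw bytes for all of them at the cost of broker-side validation on every message.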