[DISCUSSIONS] Should we use AUTO_PRODUCE schema?

Yunze Xu Tue, 13 Dec 2022 10:26:39 -0800

Hi all,

Pulsar supports the AUTO_PRODUCE schema, but this feature was
introduced early on [1], before the PIP process existed. I read the
documentation [2] and found the following example scenario.


> Suppose that:
> - You have a producer processing messages from a Kafka topic K.
> - You have a Pulsar topic P, and you do not know its schema type.
> - Your application reads the messages from K and writes the messages to P.

The scenario seems to assume that the format of messages from the
source topic (`K`) is **unknown**, yet it uses a **known schema** from
an existing topic (`P`) to encode the bytes. This is a strange thing
to do.

First, how do you guarantee the schema can be used to encode the raw
bytes whose format is unknown?

Second, messages that cannot be encoded by the schema can only be
discarded, i.e. those messages are lost.

Third, schema in Pulsar is convenient because it lets you send any
object of type `T`, and the Pulsar client is responsible for
serializing `T` to bytes. However, when using the AUTO_PRODUCE schema,
the producer still sends raw bytes.

It looks like the AUTO_PRODUCE schema is used when you assume most of
the source messages can be decoded via a known schema and you can
tolerate discarding other messages.

BTW, the documentation doesn't describe how to handle the exception.
You need to catch the SchemaSerializationException thrown by
`sendAsync`. This changes the usual way `sendAsync` is used, because
an asynchronous method should not throw any exception in regular
cases. The exception message might look like:

> java.lang.ArrayIndexOutOfBoundsException: Index -39 out of bounds for length 2

This does not help explain why a specific message cannot be encoded by
the existing schema, and it makes the problem hard to diagnose.
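
To illustrate the point about `sendAsync`, here is a minimal sketch of
a wrapper that turns a synchronous throw into a failed future, so
callers only need one error path. It does not use the Pulsar client:
`AsyncSender`, `EncodeException`, `TOY_SENDER` and `safeSend` are
hypothetical stand-ins for `Producer#sendAsync` and
SchemaSerializationException, and the toy sender's rejection rule
(empty payload) is an assumption made only for illustration.

```java
import java.util.concurrent.CompletableFuture;

class SendAsyncGuard {
    /** Hypothetical stand-in for SchemaSerializationException. */
    static class EncodeException extends RuntimeException {
        EncodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for a producer with an asynchronous send. */
    interface AsyncSender {
        CompletableFuture<Long> sendAsync(byte[] payload);
    }

    /**
     * Toy sender (assumption): validates on the calling thread and throws
     * for an empty payload, otherwise completes with the payload length.
     */
    static final AsyncSender TOY_SENDER = payload -> {
        if (payload.length == 0) {
            throw new EncodeException(
                    "payload cannot be encoded by the topic schema",
                    new ArrayIndexOutOfBoundsException("toy decode failure"));
        }
        return CompletableFuture.completedFuture((long) payload.length);
    };

    /** Routes a synchronous throw from sendAsync into the returned future. */
    static CompletableFuture<Long> safeSend(AsyncSender sender, byte[] payload) {
        try {
            return sender.sendAsync(payload);
        } catch (RuntimeException e) {
            CompletableFuture<Long> failed = new CompletableFuture<>();
            failed.completeExceptionally(e);
            return failed;
        }
    }
}
```

With such a wrapper, a caller can handle both kinds of failure in a
single `exceptionally` or `whenComplete` callback instead of adding a
try/catch around every `sendAsync` call.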

I cannot think of a scenario where the `AUTO_PRODUCE` schema is
useful. It just forces the producers to validate messages, rather than
consumers. With AUTO_PRODUCE schema, the exception is thrown from
`Producer#sendAsync`, while without it the exception will be thrown
from `Message#getValue`.

When we want to use a schema, the producer side should know the format
of the messages it sends. A schema is meant for the case where you
know the format of the messages to send and want a topic that doesn't
accept this format to reject them [3].

In conclusion, I think it's a very bad feature and we should not
encourage users to use it, i.e. we should mark it as deprecated and
remove it from the documentation. Feel free to share your thoughts!

[1] https://github.com/apache/pulsar/pull/2685
[2] https://pulsar.apache.org/docs/2.10.x/schema-understand/#auto_produce
[3] https://pulsar.apache.org/docs/2.10.x/schema-get-started/#why-use-schema

Thanks,
Yunze
