BewareMyPower commented on code in PR #24328:
URL: https://github.com/apache/pulsar/pull/24328#discussion_r2191960133
##########
pip/pip-420.md:
##########
@@ -0,0 +1,277 @@
+# PIP-420: Provides an ability for Pulsar clients to integrate with
third-party schema registry service
+
+# Motivation
+
+Apache Pulsar currently provides a built-in schema management system tightly
coupled with the broker.
+Pulsar clients interact with this system implicitly when creating producers
and consumers.
+
+However, many organizations already have independent schema registry services
(such as Confluent Schema Registry)
+and wish to reuse their existing schema governance processes across multiple
messaging systems, including Pulsar.
+
+By enabling Pulsar clients to integrate with third-party schema registry
services:
+- Users can unify schema management across different platforms.
+- Pulsar brokers can be decoupled from schema storage and validation
responsibilities.
+- Pulsar users can more easily integrate with ecosystems that rely on external
schema registries.
+
+This flexibility is particularly valuable for enterprises with strict schema
validation, versioning,
+and governance workflows already centralized in external registries.
+
+# Goals
+
+## In Scope
+
+- Provide the ability for Pulsar clients to leverage third-party schema
registry services for schema operations.
+
+## Out of Scope
+
+- Providing built-in implementations for third-party schemas.
+- Supporting `AutoProduceBytesSchema` and `AutoConsumeSchema`.
+- Migrating existing Pulsar-managed schemas to external schema registries.
+
+# High Level Design
+
+- Provide a mechanism to configure the Pulsar client to use either:
+ - The existing Pulsar schema registry (default)
+ - Third-party schema registry implementations
+
+# Detailed Design
+
+## Design & Implementation Details
+
+This PIP aims to enable the Pulsar client to directly integrate with external
schema registry services for schema management.
+In this model, the external schema registry is fully responsible for schema
storage, retrieval, and validation.
+The Pulsar broker will no longer manage schema data for topics using external
schemas.
+
+### SchemaType: EXTERNAL
+
+Pulsar will introduce a new schema type: **SchemaType.EXTERNAL**.
+
+- All schemas that integrate with external schema registries must declare
`SchemaType.EXTERNAL`.
+- When using `EXTERNAL` schema type, the Pulsar client will provide empty
schema data to the broker.
+- The broker will only record the schema type for topics.
+- Compatibility restrictions:
+ - Introduce a new compatibility check on the broker side.
+ - The schema type `SchemaType.EXTERNAL` is incompatible with all other
Pulsar schema types.
+ - This prevents accidental data corruption or schema conflicts between
internal and external schema management systems.
+- The Pulsar geo-replicator needs to transfer the schema type
`SchemaType.EXTERNAL` to the remote cluster.
+
+This design isolates external schema management and protects existing topics
using Pulsar’s native schema system.
+
+### Extensibility via Client Interfaces
+
+To integrate with external schema registries, users can:
+- Implement the `Schema` interface to define custom schema encoding and
decoding logic.
+
+#### Key `Schema` Interface Methods:
+- byte[] encode(T message)
+ - Serializes the message using the external schema.
+ - Implementations should throw `SchemaSerializationException` if the
serialization fails.
+
+- T decode(byte[] bytes)
+ - Deserializes the message using the external schema.
+ - Users should handle any exceptions themselves when getting the value.
+
+- void setSchemaInfoProvider(SchemaInfoProvider schemaInfoProvider)
+ - Called when the schema is created.
+ - External schema implementations can initialize themselves in this method.
+
+- close() **(New addition)**
+ - Called when the producer or consumer is closed.
+ - Allows external schema implementations to release resources, such as
schema registry connections or caches.
+
+#### Example Workflow:
+
+- During producer or consumer initialization:
+ The external schema info will be registered to Pulsar schema storage.
+
+- During message send or receive:
+ The `encode` and `decode` methods handle the schema-aware serialization and
deserialization using the external schema registry.
+
+#### Schema ID & Schema Version
+
+Unlike Pulsar, which uses **schema version** to identify schemas, many
external schema registry systems use **schema ID** as the primary schema
identifier.
+
+When integrating with external schema registries:
+- The `schemaVersion` field in Pulsar message metadata is used in some places;
it is **set to `-1` to flag that the message uses an external schema system**.
Review Comment:
```protobuf
optional bytes schema_version = 16;
```
The type of the schema version is `bytes`. It would be better to clearly define
the specific bytes that represent an external schema registry.
In addition, could you explain more about when the schema version is used? I
took a quick look and found that the following API would be affected:
```java
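public interface LookupService extends AutoCloseable {
    // Looks up the schema registered for a topic at a specific schema version.
    CompletableFuture<Optional<SchemaInfo>> getSchema(TopicName topicName, byte[] version);
    // ... other methods omitted
}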
```
It's used in `MultiVersionSchemaInfoProvider#loadSchema`.
It would be better to add a **background** section that explains all the related
details; otherwise, setting -1 as the schema version might be confusing. For
example, why not use an empty byte array as the schema version?
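To make the question concrete, here is a minimal sketch of the two candidate markers. It assumes that regular Pulsar schema versions are serialized as 8-byte big-endian longs, so `-1` would become eight `0xFF` bytes. The class and method names (`ExternalSchemaVersionSketch`, `minusOneAsLongBytes`, `emptyMarker`, `isExternalMarker`) are hypothetical and not part of the Pulsar API.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of Pulsar: compares two possible byte-level
// markers for "this message uses an external schema registry".
public class ExternalSchemaVersionSketch {

    // Candidate A: -1 encoded as an 8-byte big-endian long.
    // Assumption: this mirrors how regular long schema versions are serialized.
    public static byte[] minusOneAsLongBytes() {
        return ByteBuffer.allocate(Long.BYTES).putLong(-1L).array();
    }

    // Candidate B: an empty byte array.
    public static byte[] emptyMarker() {
        return new byte[0];
    }

    // Returns true when the schema version carried by a message equals the
    // chosen marker. A null version (field absent) never matches.
    public static boolean isExternalMarker(byte[] schemaVersion, byte[] marker) {
        return schemaVersion != null && Arrays.equals(schemaVersion, marker);
    }

    public static void main(String[] args) {
        byte[] candidateA = minusOneAsLongBytes();
        byte[] candidateB = emptyMarker();

        System.out.println("Candidate A: " + Arrays.toString(candidateA));
        System.out.println("Candidate B: " + Arrays.toString(candidateB));

        // A message stamped with candidate A is recognized only by candidate A.
        System.out.println(isExternalMarker(candidateA, minusOneAsLongBytes()));
        System.out.println(isExternalMarker(candidateA, emptyMarker()));
    }
}
```

Either marker could work in principle. The point is that the PIP should state the exact bytes and describe how `getSchema` is expected to behave when it receives them.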