Hi,

Broken schema ledgers and the resulting topic unavailability are issues we
have been trying to resolve for multiple years now. We have added a lot of
patches to fix them, but we still find new use cases in which a topic
becomes unavailable when its schema ledger is not recoverable (the ledger
was deleted, or bookies are not available to read it).
To address non-recoverable managed ledgers, we introduced a mechanism that
identifies non-recoverable ledgers and allows the broker to skip them and,
in some cases, to forcefully skip corrupt ledgers to avoid topic
unavailability. Since then it has been working fine with managed ledgers.
We followed the same approach for schema ledgers, to address the ongoing
corrupt schema ledger issues fundamentally, with PR #9212, which introduced
non-recoverable exceptions and their handling. But we keep adding
enhancements that miss the handling of non-recoverable errors and swallow
them at the Topic/ServerCnx level, which causes topic unavailability.
#22469 tried to address one such case, where consumer creation ignored the
non-recoverable (deleted-ledger) error and failed, making the topic
unavailable for the consumer application.
We can't afford such topic unavailability in production clusters, where it
impacts business-critical use cases. Therefore, when adding new
enhancements, we should always make sure to handle non-recoverable errors,
always keep the topic available when such errors occur, and allow
administrators to control forceful recovery of topics when the cluster is
facing schema corruption.

Thanks,
Rajan


On Tue, Apr 16, 2024 at 6:52 AM SiNan Liu <liusinan1...@gmail.com> wrote:

> #17221 <https://github.com/apache/pulsar/issues/17221> describes a
> situation in which multiple bookie copies are corrupted, or a ledger has
> been deleted. The loss of the schema ledger means that new producers and
> consumers cannot even be created, let alone work properly.
>
> At present, if the integrity of the schema is damaged, it cannot be
> repaired, because no such function exists.
> However, the current behavior is that even if the schema is lost, the
> already connected producers and consumers can keep working normally.
>
> *So we need to discuss solutions for the schema that has been lost:*
>
>
> *1. The first is to skip the non-recoverable ledger error.*
> - Description in https://github.com/apache/pulsar/pull/18010: If
> autoSkipNonRecoverableData is enabled, when the schema ledger is lost, the
> consumer and producer can add new schemas without a compatibility check
> (because the original schema definition cannot be found).
>
> - Description in https://github.com/apache/pulsar/pull/22469: Schema
> should be recovered if schema ledger is failing to open due to
> non-recoverable ledger error.
>
> The second PR has been merged, and as a result producers and consumers
> that are already connected may not work properly:
> https://github.com/apache/pulsar/pull/22469#issuecomment-2057198666
> Compared with #18010 <https://github.com/apache/pulsar/pull/18010>, there
> is no configuration to control this behavior. The default behavior is to
> automatically skip the error when the integrity of the schema is destroyed.
>
> *2. Instead of just skipping the error, we can repair the schema in some
> way to maintain the integrity of the schema versions. Even if this requires
> the user to manually handle the missing schema, and the topic cannot be
> used in the meantime, it is still better than just skipping the error.
> Skipping errors will bring more problems.*
>
> - https://github.com/apache/pulsar/pull/20415 (
> https://github.com/apache/pulsar/issues/20414)
> Currently this PR tries to fix the missing schema.
>
>
>
> *I hope you can discuss these two approaches and what to do with #22469
> <https://github.com/apache/pulsar/pull/22469>, which has already been merged.*
>
> *If you are +1 on the second solution, we can discuss
> https://github.com/apache/pulsar/issues/20414 further. A way to manually
> fix the missing schema is described in its `Alternatives` section. I think
> we can add this function to the `upload schema` admin API
> (https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema).*
>
>
> Thanks,
> sinan
>
