Re: [DISCUSS] Solution after missing schema

Rajan Dhabalia Tue, 16 Apr 2024 19:06:59 -0700

>>  If the schema is lost and the automatic skip is selected (and there is
no configuration to control whether this behavior is turned on, it is on by
default)
Skipping on any failure must be controlled by a flag, so that the admin of
the system can use it on demand to fix issues across topics in bulk. That
should be handled by PIP-327. However, non-recoverable errors can be
addressed either by skipping the schema or by handling them in the
schema-registry implementation, which would create a newer schema and start
enforcing it.
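
To make the flag idea concrete, here is a minimal sketch in Python (not
Pulsar code). The names open_schema, read_ledger, lost_ledger and
NonRecoverableLedgerError are hypothetical. Only the flag name is borrowed
from autoSkipNonRecoverableData in #18010, and the sketch does not claim to
match how the broker implements it.

```python
class NonRecoverableLedgerError(Exception):
    """Hypothetical: the schema ledger was deleted or no bookie can serve it."""


def open_schema(read_ledger, auto_skip_non_recoverable_data):
    """Return the stored schema, or None when a lost ledger is skipped.

    read_ledger is a hypothetical callable that returns the stored schema
    or raises NonRecoverableLedgerError.
    """
    try:
        return read_ledger()
    except NonRecoverableLedgerError:
        if not auto_skip_non_recoverable_data:
            # Flag off: surface the failure so an admin can decide what to do.
            raise
        # Flag on: treat the topic as having no stored schema.
        return None


def lost_ledger():
    """Hypothetical reader that simulates a deleted schema ledger."""
    raise NonRecoverableLedgerError("schema ledger deleted")
```

With the flag off, open_schema(lost_ledger, False) raises and the topic
stays blocked until someone acts. With the flag on, it returns None and the
caller can continue without a stored schema.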


>> If there are three versions of schema: v1, v2, v3. After losing the
schema, the schema of v3 is used when using this topic for the first time.
What else is wrong with using v2? I don't think these situations are tested
in https://github.com/apache/pulsar/pull/22469
Hmm, I am not sure it's a good idea to fall back to a deprecated schema
version. That again challenges the integrity of the schema and can impact
existing clients: it is the same as reading dirty entries from a database,
which is not useful. Losing the schema ledger means data loss, and it
should be handled without impacting topic availability or existing
clients. Right now, a broken ledger breaks consumer applications, and
PR#22469 <https://github.com/apache/pulsar/pull/22469> simply adds safe
handling of non-recoverable errors at the top level to avoid
unavailability, which is fundamentally correct and should stay. However, we
should additionally enhance the schema registry to handle this smartly by
creating a fresh schema if one doesn't exist, because the top-level module
can't make that decision.
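
As a rough illustration of the registry-side handling, here is a minimal
Python sketch (not Pulsar code). SchemaRegistrySketch, get_or_create,
ledger_lost and the is_compatible callback are all hypothetical names. The
assumption is that when the stored versions are unreadable, the registry
starts a fresh history from the incoming schema instead of falling back to
an older version, and enforces compatibility again from that point on.

```python
class SchemaRegistrySketch:
    """Hypothetical in-memory stand-in for a schema registry."""

    def __init__(self, versions, ledger_lost=False):
        self.versions = list(versions)  # stored schema definitions, oldest first
        self.ledger_lost = ledger_lost  # True when the schema ledger is non-recoverable

    def get_or_create(self, incoming, is_compatible):
        """Register the incoming schema and return it."""
        if self.ledger_lost:
            # Stored versions are unreadable: do not fall back to an older
            # version, start a fresh history from the incoming schema.
            self.versions = [incoming]
            self.ledger_lost = False
            return incoming
        latest = self.versions[-1] if self.versions else None
        if latest is not None and not is_compatible(latest, incoming):
            # Normal path: the compatibility check is enforced.
            raise ValueError("incompatible schema")
        if incoming != latest:
            self.versions.append(incoming)
        return incoming
```

The first producer or consumer after the loss defines the new baseline, and
later schemas are checked against it, so the decision stays inside the
registry rather than in the top-level module.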

Thanks,
Rajan

On Tue, Apr 16, 2024 at 6:47 PM SiNan Liu <liusinan1...@gmail.com> wrote:

> The two solutions are conflicting.
>
> If the schema is lost and the automatic skip is selected (and there is no
> configuration to control whether this behavior is turned on, it is on by
> default), the next time you use this topic, the schema must be the lost
> schema.
>
>
> If there are three versions of schema: v1, v2, v3.
>
> At this time, the v2 schema is lost. What are the consequences of skipping
> the error? After losing the schema, the schema of v3 is used when using
> this topic for the first time. What else is wrong with using v2? I don't
> think these situations are tested in
> https://github.com/apache/pulsar/pull/22469.
>
>
> If solution 2 is added later, we can repair it manually. We only need to
> upload the lost schema once, and then use this topic normally. But skipping
> the error is the default behavior at this time, so there is no way to fix
> it manually. The integrity of the schema has been destroyed.
>
>
> There is also content in PIP-327 that decides to skip processing when the
> schema is lost.
>
>
> Thanks,
> sinan
>
>
> On Wed, Apr 17, 2024 at 02:34, Rajan Dhabalia <rdhaba...@apache.org> wrote:
>
> > Hi,
> >
> > Broken schema ledgers and topic unavailability are issues we have been
> > trying to resolve for multiple years now. We have been adding a lot of
> > patches to fix them, but we still find different use cases which can
> > make a topic unavailable when the schema ledger is not recoverable
> > (ledger is deleted / bookies are not available to read the ledger).
> > In order to address non-recoverable managed ledgers, we introduced a
> > mechanism to identify non-recoverable ledgers and allow the broker to
> > skip those ledgers, and in some cases to skip corrupt ledgers forcefully
> > to avoid topic unavailability. Since then it has been working fine with
> > managed ledgers.
> > We followed the same approach for the schema ledger to address ongoing
> > corrupt schema ledger issues fundamentally, with PR#9212 introducing
> > non-recoverable exceptions and handling them. But we keep adding
> > enhancements that miss the handling of non-recoverable errors and
> > swallow them at the Topic/ServerCnx level, which causes topic
> > unavailability. #22469 tried to address the same issue, where the
> > consumer path was ignoring the non-recoverable error/deleted ledger,
> > failing to create the consumer, and making the topic unavailable for the
> > consumer application.
> > We can't afford such topic unavailability in production clusters, where
> > it impacts business-critical use cases. Therefore, while adding new
> > enhancements, we should always make sure to handle non-recoverable
> > errors, always keep the topic available in case of non-recoverable
> > errors, and allow administrators to control forceful recovery of topics
> > in case the cluster is facing corruption of schema topics.
> >
> > Thanks,
> > Rajan
> >
> >
> > On Tue, Apr 16, 2024 at 6:52 AM SiNan Liu <liusinan1...@gmail.com> wrote:
> >
> > > #17221 <https://github.com/apache/pulsar/issues/17221> describes an
> > > environment where multiple bookie copies are corrupted, or a ledger
> > > has been deleted. The loss of the schema ledger means new producers
> > > and consumers cannot even be created or work properly.
> > >
> > > At present, if the integrity of the schema is damaged, it cannot be
> > > repaired because no such function exists.
> > > But the current behavior is that even if the schema is lost, the
> > > connected producers and consumers can work normally.
> > >
> > > *So we need to discuss solutions for the schema that has been lost:*
> > >
> > >
> > > *1. The first is to skip the non-recoverable ledger error.*
> > > - Description in https://github.com/apache/pulsar/pull/18010: If
> > > autoSkipNonRecoverableData is enabled, when the schema ledger is lost,
> > > the consumer and producer can add new schemas without a compatibility
> > > check (because the original schema definition cannot be found).
> > >
> > > - Description in https://github.com/apache/pulsar/pull/22469: Schema
> > > should be recovered if schema ledger is failing to open due to
> > > non-recoverable ledger error.
> > >
> > > The second PR has been merged, which means producers and consumers
> > > that are already connected may not work properly.
> > > https://github.com/apache/pulsar/pull/22469#issuecomment-2057198666
> > > Compared with #18010 <https://github.com/apache/pulsar/pull/18010>,
> > > there is no configuration to control this behavior. The default
> > > behavior is to automatically skip when the integrity of the schema is
> > > destroyed.
> > >
> > > *2. If we don't just skip the error, we can fix the schema in some way
> > > to maintain the integrity of the schema versions, even if this
> > > requires the user to manually handle the missing schema and the topic
> > > cannot be used during this period. This is still better than just
> > > skipping the error. Skipping errors will bring more problems.*
> > >
> > > - https://github.com/apache/pulsar/pull/20415
> > > (https://github.com/apache/pulsar/issues/20414)
> > > Currently this PR tries to fix the missing schema.
> > >
> > >
> > >
> > > *I hope you can discuss these two solutions and what to do with
> > > #22469 <https://github.com/apache/pulsar/pull/22469>, which has been
> > > merged.*
> > >
> > > *If the second solution gets a +1, we can talk about
> > > https://github.com/apache/pulsar/issues/20414. The way to manually fix
> > > the missing schema is described in the `Alternatives`. I think we can
> > > add this function to the `upload schema` admin API
> > > (https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema)*
> > >
> > >
> > > Thanks,
> > > sinan
> > >
> >
>
