bump, need more people to discuss.

Thanks,
sinan


Rajan Dhabalia <rdhaba...@apache.org> wrote on Wed, Apr 17, 2024 at 10:06:

> >>  If the schema is lost and the automatic skip is selected (and there is
> no configuration to control whether this behavior is turned on, it is on by
> default)
> Skipping on any failure must be controlled by a flag, so that the admin of
> the system can use it on demand to fix issues across many topics in bulk.
> That should be handled by PIP-327. However, non-recoverable errors can be
> addressed either by skipping the schema or by having the
> schema-registry implementation create a newer schema and start enforcing
> it.
>
> >> If there are three versions of schema: v1, v2, v3. After losing the
> schema, the schema of v3 used when using this topic for the first time.
> What else is wrong with using v2? I don't think these situations are tested
> in https://github.com/apache/pulsar/pull/22469
> Umm... I am not sure it's a good idea to fall back to a deprecated schema
> version. That again challenges the integrity of the schema and can impact
> existing clients, because it's the same as reading dirty entries from the
> database, which is not useful. So, losing a schema ledger means data loss,
> and it should be handled without impacting topic availability and existing
> clients. Right now, a broken ledger breaks consumer applications, and
> PR#22469 <https://github.com/apache/pulsar/pull/22469> simply adds safe
> handling of non-recoverable errors at the top level to avoid
> unavailability, which is fundamentally correct and should stay. However,
> we should additionally enhance the schema-registry to handle this smartly
> by creating a fresh schema if one doesn't exist, because the top-level
> module can't decide that.
>
> Thanks,
> Rajan
>
> On Tue, Apr 16, 2024 at 6:47 PM SiNan Liu <liusinan1...@gmail.com> wrote:
>
> > The two solutions are conflicting.
> >
> > If the schema is lost and the automatic skip is selected (and there is no
> > configuration to control whether this behavior is turned on, it is on by
> > default), the next time you use this topic, the schema must be the lost
> > schema.
> >
> >
> > If there are three versions of schema: v1, v2, v3.
> >
> > At this time, the v2 schema is lost. What are the consequences of
> > skipping the error? After losing the schema, the v3 schema is used when
> > this topic is used for the first time. What else is wrong with using v2?
> > I don't think these situations are tested in
> > https://github.com/apache/pulsar/pull/22469.
> >
> >
> > If solution 2 is added later, we can repair it manually. We only need to
> > upload the lost schema once, and then use this topic normally. But
> > skipping the error is the default behavior at this time, so there is no
> > way to fix it manually. The integrity of the schema has been destroyed.
> >
> >
> > There is also content in PIP-327 that decides to skip processing when the
> > schema is lost.
> >
> >
> > Thanks,
> > sinan
> >
> >
> > Rajan Dhabalia <rdhaba...@apache.org> wrote on Wed, Apr 17, 2024 at 02:34:
> >
> > > Hi,
> > >
> > > Broken schema ledgers and topic unavailability are issues we have been
> > > trying to resolve for multiple years now. We have added a lot of
> > > patches to fix them, but we still find different use cases that can
> > > make a topic unavailable when the schema ledger is not recoverable (the
> > > ledger is deleted / bookies are not available to read the ledger).
> > > In order to address non-recoverable managed ledgers, we introduced a
> > > mechanism to derive non-recoverable ledgers and allow the broker to
> > > skip those ledgers, and in some cases to skip corrupt ledgers
> > > forcefully to avoid topic unavailability. Since then it has been
> > > working fine with managed ledgers.
> > > We followed the same approach for the schema ledger, addressing the
> > > ongoing corrupt schema ledger issues fundamentally with PR#9212, which
> > > introduced non-recoverable exceptions and handled them. But we keep
> > > adding enhancements that miss the handling of the non-recoverable error
> > > and swallow it at the Topic/ServerCnx level, which causes topic
> > > unavailability. #22469 tried to address the same issue, where the
> > > consumer was ignoring the non-recoverable error/deleted ledger, failing
> > > to create the consumer, and making the topic unavailable for the
> > > consumer application.
> > > We can't afford such topic unavailability in production clusters, or
> > > its impact on business-critical use cases. Therefore, while adding new
> > > enhancements, we should always make sure to handle non-recoverable
> > > errors, always keep the topic available in case of non-recoverable
> > > errors, and allow administrators to control forceful recovery of topics
> > > in case the cluster is facing corruption for schema topics.
> > >
> > > Thanks,
> > > Rajan
> > >
> > >
> > > On Tue, Apr 16, 2024 at 6:52 AM SiNan Liu <liusinan1...@gmail.com> wrote:
> > >
> > > > #17221 <https://github.com/apache/pulsar/issues/17221> describes an
> > > > environment where multiple bookie copies are corrupted, or a ledger
> > > > has been deleted. The loss of the schema ledger means new producers
> > > > and consumers cannot even be created and work properly.
> > > >
> > > > At present, if the integrity of the schema is damaged, it cannot be
> > > > repaired, because no such function exists.
> > > > But the current behavior is that even if the schema is lost, the
> > > > connected producers and consumers can work normally.
> > > >
> > > > *So we need to discuss solutions for the schema that has been lost:*
> > > >
> > > >
> > > > *1. The first is to skip the non-recoverable ledger error.*
> > > > - Description in https://github.com/apache/pulsar/pull/18010: If
> > > > autoSkipNonRecoverableData is enabled, when the schema ledger is
> > > > lost, the consumer and producer can add new schemas without a
> > > > compatibility check (because the original schema definition cannot
> > > > be found).
> > > >
> > > > - Description in https://github.com/apache/pulsar/pull/22469: Schema
> > > > should be recovered if schema ledger is failing to open due to
> > > > non-recoverable ledger error.
> > > >
> > > > The second PR has been merged, which means producers and consumers
> > > > that are already connected may not work properly.
> > > > https://github.com/apache/pulsar/pull/22469#issuecomment-2057198666
> > > > Compared with #18010 <https://github.com/apache/pulsar/pull/18010>,
> > > > there is no configuration to control this behavior. The default
> > > > behavior is to automatically skip when the integrity of the schema is
> > > > destroyed.
> > > >
> > > > *2. If we don't just skip the error, we can fix the schema in some
> > > > way to maintain the integrity of the schema versions, even if this
> > > > requires the user to manually handle the missing schema and the topic
> > > > cannot be used during this period. This is still better than just
> > > > skipping the error. Skipping errors will bring more problems.*
> > > >
> > > > - https://github.com/apache/pulsar/pull/20415 (
> > > > https://github.com/apache/pulsar/issues/20414)
> > > > Currently this PR tries to fix the missing schema.
> > > >
> > > >
> > > >
> > > > *I hope you can discuss these two solutions and what to do with
> > > > #22469 <https://github.com/apache/pulsar/pull/22469>, which has been
> > > > merged.*
> > > >
> > > > *If you +1 the second solution, we can talk about
> > > > https://github.com/apache/pulsar/issues/20414
> > > > <https://github.com/apache/pulsar/issues/20414>. The way to manually
> > > > fix the missing schema is described in the `Alternatives`. I think we
> > > > can add this function to the `upload schema` admin API
> > > > (https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema
> > > > <https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema>)*
> > > >
> > > >
> > > > Thanks,
> > > > sinan
> > > >
> > >
> >
>
