The two solutions conflict. If a schema is lost and the automatic skip applies (there is no configuration to control this behavior; it is on by default), then the next time the topic is used, the schema must be the lost schema.
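To make the conflict concrete, here is a toy Python model of a topic whose schema history has lost one version. It is only a sketch under stated assumptions, not Pulsar code: `ToySchemaStore`, `SchemaLostError`, and every method name are hypothetical and do not correspond to any Pulsar class or API. It contrasts skipping the lost version on read (solution 1) with re-uploading the lost definition once (solution 2).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class SchemaLostError(Exception):
    """Raised when a stored schema version can no longer be read."""


@dataclass
class ToySchemaStore:
    """Hypothetical in-memory stand-in for one topic's schema history."""

    # version number -> definition; None marks a version whose data is lost
    versions: Dict[int, Optional[str]] = field(default_factory=dict)

    def add(self, definition: str) -> int:
        """Append a new schema version and return its number."""
        version = max(self.versions, default=0) + 1
        self.versions[version] = definition
        return version

    def lose(self, version: int) -> None:
        """Simulate a deleted or unreadable schema ledger for one version."""
        self.versions[version] = None

    def read_all(self, skip_lost: bool) -> Dict[int, str]:
        """Read the history, either skipping lost versions or failing."""
        readable: Dict[int, str] = {}
        for version, definition in sorted(self.versions.items()):
            if definition is None:
                if skip_lost:
                    continue  # solution 1: the gap is silently ignored
                raise SchemaLostError(f"schema v{version} is lost")
            readable[version] = definition
        return readable

    def repair(self, version: int, definition: str) -> None:
        """Solution 2: an operator re-uploads the lost definition once."""
        if version not in self.versions or self.versions[version] is not None:
            raise ValueError(f"schema v{version} is not marked as lost")
        self.versions[version] = definition


store = ToySchemaStore()
for name in ("v1-def", "v2-def", "v3-def"):
    store.add(name)
store.lose(2)

# Skipping hides the gap: readers see only v1 and v3.
skipped = store.read_all(skip_lost=True)

# Manual repair restores the full history, so a strict read succeeds again.
store.repair(2, "v2-def")
repaired = store.read_all(skip_lost=False)
```

In this toy model the skipping read never reports that v2 is missing, while the strict read fails until the lost definition is uploaded again. Whether the real broker behaves the same way in every code path is exactly the open question in this thread.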
Suppose a topic has three schema versions, v1, v2, and v3, and the v2 schema is lost. What are the consequences of skipping the error? What happens if the v3 schema is used the first time the topic is used after the loss? What else goes wrong if v2 is used? I don't think these situations are tested in https://github.com/apache/pulsar/pull/22469.

If solution 2 is added later, we can repair the schema manually: we only need to upload the lost schema once, and then the topic can be used normally. But skipping the error is currently the default behavior, so there is no way to fix it manually, and the integrity of the schema has already been destroyed.

PIP-327 also contains content that decides to skip processing when the schema is lost.

Thanks,
sinan

On Wed, Apr 17, 2024 at 2:34 AM Rajan Dhabalia <rdhaba...@apache.org> wrote:

> Hi,
>
> Broken schema ledgers and topic unavailability are issues we have been
> trying to resolve for multiple years now. We have added a lot of patches
> to fix them, but we still find different use cases that can make a topic
> unavailable when the schema ledger is not recoverable (the ledger is
> deleted, or bookies are not available to read the ledger).
> To address non-recoverable managed ledgers, we introduced a mechanism to
> identify non-recoverable ledgers and allow the broker to skip them, and
> in some cases to skip corrupt ledgers forcefully, to avoid topic
> unavailability. Since then it has been working fine with managed ledgers.
> We followed the same approach for the schema ledger, addressing the
> ongoing corrupt schema ledger issues fundamentally with PR #9212, which
> introduced non-recoverable exceptions and their handling. But we keep
> adding enhancements that miss the handling of non-recoverable errors and
> swallow them at the Topic/ServerCnx level, which causes topic
> unavailability. #22469 tried to address the same issue, where the
> consumer was ignoring the non-recoverable error (deleted ledger) and
> failing to create the consumer, making the topic unavailable for the
> consumer application.
> We can't afford such topic unavailability in production clusters, where
> it impacts business-critical use cases. Therefore, when adding a new
> enhancement, we should always make sure to handle non-recoverable errors,
> always keep the topic available in case of non-recoverable errors, and
> allow administrators to control forceful recovery of topics in case the
> cluster is facing corruption for schema topics.
>
> Thanks,
> Rajan
>
>
> On Tue, Apr 16, 2024 at 6:52 AM SiNan Liu <liusinan1...@gmail.com> wrote:
>
> > #17221 <https://github.com/apache/pulsar/issues/17221> describes an
> > environment where multiple bookie copies are corrupted, or a ledger has
> > been deleted. Once the schema ledger is lost, new producers and
> > consumers cannot even be created, let alone work properly.
> >
> > At present, if the integrity of the schema is damaged, it cannot be
> > repaired, because no such function exists.
> > But the current behavior is that even if the schema is lost, the
> > connected producers and consumers can work normally.
> >
> > *So we need to discuss solutions for the schema that has been lost:*
> >
> >
> > *1. The first is to skip the non-recoverable ledger error.*
> > - Description in https://github.com/apache/pulsar/pull/18010: If
> > autoSkipNonRecoverableData is enabled, when the schema ledger is lost,
> > the consumer and producer can add new schemas without a compatibility
> > check (because the original schema definition cannot be found).
> >
> > - Description in https://github.com/apache/pulsar/pull/22469: Schema
> > should be recovered if schema ledger is failing to open due to
> > non-recoverable ledger error.
> >
> > The second PR has been merged, and as a result producers and consumers
> > that are already connected may not work properly.
> > https://github.com/apache/pulsar/pull/22469#issuecomment-2057198666
> > Unlike #18010 <https://github.com/apache/pulsar/pull/18010>, there is
> > no configuration to control this behavior. The default behavior is to
> > skip automatically when the integrity of the schema is destroyed.
> >
> > *2. If we don't just skip the error, we can fix the schema in some way
> > to maintain the integrity of the schema versions. Even if this requires
> > the user to handle the missing schema manually, and the topic cannot be
> > used during this period, it is still better than just skipping the
> > error. Skipping errors will bring more problems.*
> >
> > - https://github.com/apache/pulsar/pull/20415
> > (https://github.com/apache/pulsar/issues/20414)
> > Currently this PR tries to fix the missing schema.
> >
> >
> >
> > *I hope you can discuss these two solutions and what to do with #22469
> > <https://github.com/apache/pulsar/pull/22469>, which has been merged.*
> >
> > *If you are +1 for the second solution, we can talk about
> > https://github.com/apache/pulsar/issues/20414. The way to manually fix
> > the missing schema is described in the `Alternatives`. I think we can
> > add this function to the `upload schema` admin api
> > (https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema)*
> >
> >
> > Thanks,
> > sinan