Bump, we need more people to join this discussion.

Thanks,
sinan
Rajan Dhabalia <rdhaba...@apache.org> wrote on Wed, Apr 17, 2024 at 10:06:
> >> If the schema is lost and the automatic skip is selected (and there
> >> is no configuration to control whether this behavior is turned on,
> >> it is on by default)
>
> Skipping on any failure must be controlled by a flag, to support fixing
> issues with topics in bulk, and the admin of the system should be able
> to use it on demand. That should be handled by PIP-327. However,
> non-recoverable errors can be addressed either by skipping the schema
> or by having the schema-registry implementation create a newer schema
> and start enforcing it.
>
> >> If there are three versions of schema: v1, v2, v3. After losing the
> >> schema, the schema of v3 is used when using this topic for the first
> >> time. What else is wrong with using v2? I don't think these
> >> situations are tested in https://github.com/apache/pulsar/pull/22469
>
> Umm.. I am not sure it's a good idea to fall back to a deprecated
> schema version. That again challenges the integrity of the schema and
> can impact existing clients, because it is the same as reading dirty
> entries from a database, which is not useful. So losing a schema ledger
> means data loss, and it should be handled without impacting topic
> availability or existing clients. Right now, a broken ledger breaks
> consumer applications, and PR#22469
> <https://github.com/apache/pulsar/pull/22469> simply adds safe handling
> of non-recoverable errors at the top level to avoid unavailability,
> which is fundamentally correct and should stay there. However, we
> should additionally enhance the schema registry to handle this smartly
> by creating a fresh schema if one doesn't exist, because the top-level
> module can't decide that.
>
> Thanks,
> Rajan
>
> On Tue, Apr 16, 2024 at 6:47 PM SiNan Liu <liusinan1...@gmail.com> wrote:
>
> > The two solutions are conflicting.
> >
> > If the schema is lost and the automatic skip is selected (and there is
> > no configuration to control whether this behavior is turned on, it is
> > on by default), the next time you use this topic, the schema must be
> > the lost schema.
> >
> > Suppose there are three versions of the schema: v1, v2, v3.
> >
> > At this point, the v2 schema is lost. What are the consequences of
> > skipping the error? After losing the schema, the schema of v3 is used
> > when using this topic for the first time. What else is wrong with
> > using v2? I don't think these situations are tested in
> > https://github.com/apache/pulsar/pull/22469.
> >
> > If solution 2 is added later, we can repair it manually. We only need
> > to upload the lost schema once, and then use this topic normally. But
> > skipping the error is the default behavior at this time, so there is
> > no way to fix it manually. The integrity of the schema has been
> > destroyed.
> >
> > There is also content in PIP-327 that decides to skip processing when
> > the schema is lost.
> >
> > Thanks,
> > sinan
> >
> > Rajan Dhabalia <rdhaba...@apache.org> wrote on Wed, Apr 17, 2024 at 02:34:
> >
> > > Hi,
> > >
> > > Broken schema ledgers and topic unavailability are issues we have
> > > been trying to resolve for multiple years now. We have added a lot
> > > of patches to fix them, but we still find different use cases that
> > > can make a topic unavailable when the schema ledger is not
> > > recoverable (the ledger is deleted / bookies are not available to
> > > read the ledger).
> > > In order to address non-recoverable managed ledgers, we introduced a
> > > mechanism to derive non-recoverable ledgers and allow the broker to
> > > skip those ledgers, and in some cases to skip corrupt ledgers
> > > forcefully to avoid topic unavailability. Since then it has been
> > > working fine with managed ledgers.
> > > We followed the same approach for schema ledgers, addressing the
> > > ongoing corrupt schema ledger issues fundamentally with PR#9212,
> > > which introduced non-recoverable exceptions and handles them. But we
> > > keep adding enhancements that miss the handling of non-recoverable
> > > errors and swallow them at the Topic/ServerCnx level, which causes
> > > topic unavailability. #22469 tried to address the same issue, where
> > > the consumer was ignoring the non-recoverable error / deleted ledger
> > > and failing to create the consumer, making the topic unavailable
> > > for the consumer application.
> > > We can't afford such topic unavailability in production clusters,
> > > where it impacts business-critical use cases. Therefore, while
> > > adding new enhancements, we should always make sure to handle
> > > non-recoverable errors, always keep the topic available in case of
> > > non-recoverable errors, and allow administrators to control forceful
> > > recovery of topics in case the cluster is facing corruption of
> > > schema topics.
> > >
> > > Thanks,
> > > Rajan
> > >
> > >
> > > On Tue, Apr 16, 2024 at 6:52 AM SiNan Liu <liusinan1...@gmail.com>
> > > wrote:
> > >
> > > > #17221 <https://github.com/apache/pulsar/issues/17221> describes
> > > > an environment where multiple bookie copies are corrupted, or a
> > > > ledger has been deleted. The loss of the schema ledger results in
> > > > new producers and consumers not even being created or working
> > > > properly.
> > > >
> > > > At present, if the integrity of the schema is damaged, it cannot
> > > > be repaired, because no such function exists.
> > > > But the current behavior is that even if the schema is lost, the
> > > > connected producers and consumers can work normally.
> > > >
> > > > *So we need to discuss solutions for the schema that has been
> > > > lost:*
> > > >
> > > >
> > > > *1. The first is to skip the non-recoverable ledger error.*
> > > > - Description in https://github.com/apache/pulsar/pull/18010: If
> > > > autoSkipNonRecoverableData is enabled, when the schema ledger is
> > > > lost, the consumer and producer can add new schemas without a
> > > > compatibility check (because the original schema definition cannot
> > > > be found).
> > > >
> > > > - Description in https://github.com/apache/pulsar/pull/22469:
> > > > Schema should be recovered if schema ledger is failing to open due
> > > > to non-recoverable ledger error.
> > > >
> > > > The second PR has been merged, which means producers and consumers
> > > > that are already connected may not work properly.
> > > > https://github.com/apache/pulsar/pull/22469#issuecomment-2057198666
> > > > Compared with #18010 <https://github.com/apache/pulsar/pull/18010>,
> > > > there is no configuration to control this behavior. The default
> > > > behavior is to automatically skip when the integrity of the schema
> > > > is destroyed.
> > > >
> > > > *2. If we don't just skip the error, we can fix the schema in some
> > > > way to maintain the integrity of the schema versions. Even if this
> > > > requires the user to manually handle the missing schema, and the
> > > > topic cannot be used during this period, this is still better than
> > > > just skipping the error. Skipping errors will bring more
> > > > problems.*
> > > >
> > > > - https://github.com/apache/pulsar/pull/20415
> > > > (https://github.com/apache/pulsar/issues/20414)
> > > > Currently this PR tries to fix the missing schema.
> > > >
> > > >
> > > >
> > > > *I hope you can discuss these two solutions and what to do with
> > > > #22469 <https://github.com/apache/pulsar/pull/22469>, which has
> > > > already been merged.*
> > > >
> > > > *If you are +1 for the second solution, we can talk about
> > > > https://github.com/apache/pulsar/issues/20414. The way to manually
> > > > fix the missing schema is described in the `Alternatives`. I think
> > > > we can add this function to the `upload schema` admin api
> > > > (https://pulsar.apache.org/docs/3.2.x/admin-api-schemas/#upload-a-schema)*
> > > >
> > > >
> > > > Thanks,
> > > > sinan