Ivan, just to make sure... The discussed approach will fully solve issue [1] only if we also add some strategy for rejecting partitions with missed updates (updateCnt == OK, hash != OK). For example, we may use a quorum strategy, where the majority wins. Does that sound correct?

[1] https://issues.apache.org/jira/browse/IGNITE-10078
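A minimal sketch of that majority-wins idea, assuming the check has already produced one partition hash per owner for the same update counter; the class and method names below are hypothetical and not part of Ignite:

    // Hypothetical sketch: owners whose hash differs from the hash reported by the
    // strict majority are candidates for rejection. With no strict majority we cannot
    // decide which copy is correct, so nothing is rejected.
    import java.util.*;

    public class QuorumHashResolver {
        /** @return ids of owners whose partition copy should be rejected (empty if no quorum). */
        public static Set<UUID> rejectedOwners(Map<UUID, Integer> hashByOwner) {
            Map<Integer, List<UUID>> ownersByHash = new HashMap<>();
            hashByOwner.forEach((owner, hash) ->
                ownersByHash.computeIfAbsent(hash, h -> new ArrayList<>()).add(owner));

            // The largest group of owners agreeing on one hash value.
            List<UUID> majority = ownersByHash.values().stream()
                .max(Comparator.comparingInt(List::size))
                .orElse(Collections.emptyList());

            if (majority.size() * 2 <= hashByOwner.size())
                return Collections.emptySet(); // no strict majority, no decision

            Set<UUID> rejected = new HashSet<>(hashByOwner.keySet());
            rejected.removeAll(majority);
            return rejected;
        }
    }

An even split (for example, one primary and one backup disagreeing) gives no quorum, so such partitions would still need some other resolution rule.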
On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org> wrote:
> Ivan,
>
> Thanks for the detailed explanation.
> I'll try to implement the PoC to check the idea.
>
> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>
>> > But how to keep this hash?
>> I think we can just adopt the way partition update counters are stored.
>> Update counters are:
>> 1) Kept and updated in heap, see IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during regular cache operations, no page replacement latency issues)
>> 2) Synchronized with page memory (and with disk) on every checkpoint, see GridCacheOffheapManager#saveStoreMetadata
>> 3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
>> 4) On node restart, we initialize the on-heap counter with the value from disk (as of the last checkpoint) and bring it up to the latest value during replay of WAL logical records
>>
>> > 2) PME is a rare operation on a production cluster, but it seems we have to check consistency in a regular way.
>> > Since we have to finish all operations before the check, should we have a fake PME for a maintenance check in this case?
>> From my experience, PME happens on prod clusters from time to time (several times per week), which can be enough. If consistency needs to be checked more often than regular PMEs occur, we can implement a command that triggers a fake PME for consistency checking.
>>
>> Best Regards,
>> Ivan Rakov
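To make the parallel concrete, here is a simplified sketch, assuming a per-partition hash would follow the same lifecycle Ivan describes for the update counter: updated in heap on every cache operation, written into partition metadata on checkpoint, and restored on restart from the checkpointed value plus WAL replay. All names below are illustrative only, not Ignite's actual internals.

    // Hypothetical holder for a per-partition hash kept next to the update counter.
    // None of these names exist in Ignite; this only mirrors the lifecycle above.
    import java.util.concurrent.atomic.AtomicLong;

    public class PartitionHashHolder {
        private final AtomicLong updCntr = new AtomicLong();  // in-heap update counter
        private final AtomicLong partHash = new AtomicLong(); // in-heap running partition hash

        /** 1) Updated in heap on every cache operation applied to the partition. */
        public void onUpdate(long newCntr, long entryHashDelta) {
            updCntr.set(newCntr);
            partHash.addAndGet(entryHashDelta);
        }

        /** 2), 3) Read at checkpoint time to be written into the partition meta page. */
        public long[] checkpointSnapshot() {
            return new long[] {updCntr.get(), partHash.get()};
        }

        /** 4) On restart: init from the last checkpointed meta, then replay WAL logical records. */
        public void restore(long cntrFromMeta, long hashFromMeta) {
            updCntr.set(cntrFromMeta);
            partHash.set(hashFromMeta);
        }
    }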
>> On 29.04.2019 18:53, Anton Vinogradov wrote:
>> > Ivan, thanks for the analysis!
>> >
>> > >> With a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME.
>> > Great idea, it seems this covers all broken sync cases.
>> >
>> > It will check the alive nodes immediately in case the primary fails, and will check a rejoining node once it has finished rebalance (PME on becoming an owner).
>> > A recovered cluster will be checked on the activation PME (or even before that?).
>> > Also, a warmed-up cluster will still be warm after the check.
>> >
>> > Have I missed any cases leading to broken sync, except bugs?
>> >
>> > 1) But how to keep this hash?
>> > - It should be automatically persisted on each checkpoint (it should not require recalculation on restore; snapshots should be covered too) (and covered by WAL?).
>> > - It should always be available in RAM for every partition (even for cold partitions never updated/read on this node) to be immediately usable once all operations are done on PME.
>> >
>> > Can we have special pages to keep such hashes and never allow their eviction?
>> >
>> > 2) PME is a rare operation on a production cluster, but it seems we have to check consistency in a regular way.
>> > Since we have to finish all operations before the check, should we have a fake PME for a maintenance check in this case?
>> >
>> > On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> >
>> > Hi Anton,
>> >
>> > Thanks for sharing your ideas.
>> > I think your approach should work in general. I'll just share my concerns about possible issues that may come up.
>> >
>> > 1) Equality of update counters doesn't imply equality of partition content under load.
>> > For every update, the primary node generates an update counter, and then the update is delivered to the backup node and gets applied with the corresponding update counter.
>> > For example, there are two transactions (A and B) that update partition X by the following scenario:
>> > - A updates key1 in partition X on the primary node and increments the counter to 10
>> > - B updates key2 in partition X on the primary node and increments the counter to 11
>> > - While A is still updating other keys, B is finally committed
>> > - The update of key2 arrives at the backup node and sets the update counter to 11
>> > An observer will see equal update counters (11), but the update of key1 is still missing in the backup partition.
>> > This is a fundamental problem which is being solved here: https://issues.apache.org/jira/browse/IGNITE-10078
>> > "Online verify" should operate with the new complex update counters which take such "update holes" into account. Otherwise, online verify may produce false-positive inconsistency reports.
>> >
>> > 2) Acquisition and comparison of update counters are fast, but partition hash calculation is long. We should check that the update counter remains unchanged after every K keys handled.
>> >
>> > 3)
>> >> Another hope is that we'll be able to pause/continue the scan, for example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days we'll have checked the whole cluster.
>> > Totally makes sense.
>> > We may find ourselves in a situation where some "hot" partitions are still unprocessed, and every next attempt to calculate the partition hash fails due to another concurrent update. We should be able to track the progress of validation (% of calculation time wasted due to concurrent operations may be a good metric, 100% being the worst case) and provide an option to stop/pause the activity.
>> > I think pause should return an "intermediate results report" with information about which partitions have been successfully checked. With such a report, we can resume the activity later: partitions from the report will simply be skipped.
>> >
>> > 4)
>> >> Since "Idle verify" uses regular page memory, I assume it replaces hot data with persisted data.
>> >> So, we have to warm up the cluster after each check.
>> >> Are there any chances to check without cooling the cluster?
>> > I don't see an easy way to achieve it with our page memory architecture. We definitely can't just read pages from disk directly: we need to synchronize page access with concurrent update operations and checkpoints.
>> > From my point of view, the correct way to solve this issue is to improve our page replacement [1] mechanics by making it truly scan-resistant.
>> >
>> > P.S. There's another possible way of achieving online verify: instead of on-demand hash calculation, we can always keep an up-to-date hash value for every partition. We'll need to update the hash on every insert/update/remove operation, but there will be no reordering issues, since the function that we use for aggregating hash results (+) is commutative. With a pre-calculated partition hash value, we can automatically detect inconsistent partitions on every PME. What do you think?
>> >
>> > [1] - https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
>> >
>> > Best Regards,
>> > Ivan Rakov
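A sketch of that incremental maintenance, assuming hashes are aggregated with plain addition and per-entry contributions follow the same key/value hash formula idle verify already uses (quoted in Anton's original message below); the class itself is hypothetical. Each insert adds the entry's contribution, each remove subtracts it, and each update applies the difference, so the running value does not depend on the order in which concurrent operations are applied.

    // Hypothetical illustration of keeping an always-up-to-date partition hash.
    // Because addition is commutative and each operation contributes a self-contained
    // delta, concurrent updates may be applied in any order and still converge.
    import java.util.Arrays;

    public final class RunningPartitionHash {
        private long partHash;

        private static long entryHash(Object key, byte[] valBytes) {
            return key.hashCode() + Arrays.hashCode(valBytes);
        }

        public synchronized void onInsert(Object key, byte[] val) {
            partHash += entryHash(key, val);
        }

        public synchronized void onRemove(Object key, byte[] oldVal) {
            partHash -= entryHash(key, oldVal);
        }

        public synchronized void onUpdate(Object key, byte[] oldVal, byte[] newVal) {
            partHash += entryHash(key, newVal) - entryHash(key, oldVal);
        }

        public synchronized long value() {
            return partHash;
        }
    }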
>> > On 29.04.2019 12:20, Anton Vinogradov wrote:
>> >> Igniters and especially Ivan Rakov,
>> >>
>> >> "Idle verify" [1] is a really cool tool to make sure that the cluster is consistent.
>> >>
>> >> 1) But it requires operations to be paused during the cluster check.
>> >> On some clusters, this check takes hours (3-4 hours in cases I saw).
>> >> I've checked the code of "idle verify" and it seems possible to make it "online" with some assumptions.
>> >>
>> >> Idea:
>> >> Currently "Idle verify" checks that the partition hashes, generated this way:
>> >>
>> >> while (it.hasNextX()) {
>> >>     CacheDataRow row = it.nextX();
>> >>     partHash += row.key().hashCode();
>> >>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
>> >> }
>> >>
>> >> are the same.
>> >>
>> >> What if we generate the same updateCounter-partitionHash pairs but compare hashes only in case the counters are the same?
>> >> So, for example, we ask the cluster to generate pairs for 64 partitions, then find that 55 have the same counters (were not updated during the check) and check them.
>> >> The rest (64 - 55 = 9) partitions will be re-requested and rechecked together with an additional 55.
>> >> This way we'll be able to check that the cluster is consistent even in case operations are in progress (just retrying the modified ones).
>> >>
>> >> Risks and assumptions:
>> >> Using this strategy we'll check the cluster's consistency ... eventually, and the check will take more time even on an idle cluster.
>> >> In case operationsPerTimeToGeneratePartitionHashes > partitionsCount we'll definitely make no progress.
>> >> But if the load is not high, we'll be able to check the whole cluster.
>> >>
>> >> Another hope is that we'll be able to pause/continue the scan, for example, we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days we'll have checked the whole cluster.
>> >>
>> >> Have I missed something?
>> >>
>> >> 2) Since "Idle verify" uses regular page memory, I assume it replaces hot data with persisted data.
>> >> So, we have to warm up the cluster after each check.
>> >> Are there any chances to check without cooling the cluster?
>> >>
>> >> [1] https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
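For illustration, a rough sketch of the proposed retry loop, under the assumption that the cluster can return one (updateCounter, partitionHash) pair per owner for a requested set of partitions; ReportSource and all other names here are hypothetical, not Ignite APIs. Hashes are compared only for partitions whose owners all report the same counter; the rest are re-requested on the next round.

    // Hypothetical sketch of the "compare hashes only when counters match, retry the rest" idea.
    import java.util.*;

    public class OnlineVerifySketch {
        /** One owner's report for a partition. */
        static final class Report {
            final long updCntr;
            final int partHash;

            Report(long updCntr, int partHash) {
                this.updCntr = updCntr;
                this.partHash = partHash;
            }
        }

        interface ReportSource {
            /** Collects one report per owner for each requested partition. */
            Map<Integer, List<Report>> collect(Set<Integer> parts);
        }

        /** Returns partitions whose owners reported equal counters but different hashes. */
        static Set<Integer> verify(ReportSource src, Set<Integer> parts, int maxRounds) {
            Set<Integer> inconsistent = new HashSet<>();
            Set<Integer> pending = new HashSet<>(parts);

            for (int round = 0; round < maxRounds && !pending.isEmpty(); round++) {
                Map<Integer, List<Report>> snap = src.collect(pending);

                for (Iterator<Integer> it = pending.iterator(); it.hasNext(); ) {
                    int p = it.next();
                    List<Report> reports = snap.get(p);

                    // Counters differ across owners: the partition was updated during the check,
                    // so keep it pending and re-request it on the next round.
                    if (reports.stream().map(r -> r.updCntr).distinct().count() != 1)
                        continue;

                    // Counters are equal, so the hashes are comparable.
                    if (reports.stream().map(r -> r.partHash).distinct().count() > 1)
                        inconsistent.add(p);

                    it.remove(); // checked, no retry needed
                }
            }
            return inconsistent; // partitions still pending were never quiet enough to check
        }
    }

Note that Ivan's point 1) above still applies to this sketch: with plain update counters, a backup that has applied a later counter but is missing an earlier in-flight update can be reported as inconsistent, which is why the IGNITE-10078-style complex counters are needed for a reliable online check.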