Igniters and especially Ivan Rakov,

"Idle verify" [1] is a really cool tool for making sure that the cluster is
consistent.

1) However, it requires operations to be paused during the cluster check.
On some clusters this check takes hours (3-4 hours in the cases I saw).
I've checked the code of "idle verify" and it seems possible to make it
"online" under some assumptions.

Idea:
Currently, "Idle verify" checks that the partition hashes, generated this
way:

    // Sum the key hash and the value-bytes hash of every row in the partition.
    while (it.hasNextX()) {
        CacheDataRow row = it.nextX();

        partHash += row.key().hashCode();
        partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
    }

are the same.

What if we generate the same updateCounter-partitionHash pairs, but compare
the hashes only when the counters are the same?
For example, we ask the cluster to generate pairs for 64 partitions, then
find that 55 of them have the same counters (they were not updated during
the check) and check those.
The remaining 9 (64 - 55) partitions will be re-requested and rechecked, in
addition to the 55 already checked.
This way we'll be able to check that the cluster is consistent even while
operations are in progress (just by retrying the modified partitions).
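
To make the retry loop concrete, here is a minimal sketch of it. Everything
in it is hypothetical and not an Ignite API: the class names, the `snapshot`
callback that returns a counter/hash pair for a (node, partition), and the
`maxRounds` limit. It also assumes that each node captures its counter and
hash consistently with each other, which the real implementation would have
to guarantee somehow.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BiFunction;

/** Sketch only: all names are illustrative, none of them exist in Ignite. */
class OnlineVerifySketch {
    /** Update counter and partition hash reported by one node. */
    static final class PartState {
        final long updateCounter;
        final int partHash;

        PartState(long updateCounter, int partHash) {
            this.updateCounter = updateCounter;
            this.partHash = partHash;
        }
    }

    /** Partitions that matched, that diverged, and that never became stable. */
    static final class Result {
        final Set<Integer> verified = new TreeSet<>();
        final Set<Integer> conflicts = new TreeSet<>();
        final Set<Integer> unresolved = new TreeSet<>();
    }

    /**
     * @param partitions Partition ids to check.
     * @param nodes Number of copies per partition (simplified: nodes 0..nodes-1).
     * @param snapshot Hypothetical callback: (node, partition) -> counter/hash pair.
     * @param maxRounds Give up after this many rounds.
     */
    static Result verify(Set<Integer> partitions, int nodes,
        BiFunction<Integer, Integer, PartState> snapshot, int maxRounds) {
        Result res = new Result();
        Set<Integer> pending = new TreeSet<>(partitions);

        for (int round = 0; round < maxRounds && !pending.isEmpty(); round++) {
            Set<Integer> retry = new TreeSet<>();

            for (int part : pending) {
                PartState first = snapshot.apply(0, part);
                boolean sameCntr = true;
                boolean sameHash = true;

                for (int node = 1; node < nodes; node++) {
                    PartState other = snapshot.apply(node, part);

                    sameCntr &= other.updateCounter == first.updateCounter;
                    sameHash &= other.partHash == first.partHash;
                }

                if (!sameCntr)
                    retry.add(part);          // Updated during the scan: ask again.
                else if (sameHash)
                    res.verified.add(part);   // Same counters, same hashes.
                else
                    res.conflicts.add(part);  // Same counters, different hashes.
            }

            pending = retry;
        }

        res.unresolved.addAll(pending);       // Still changing after maxRounds.

        return res;
    }
}
```

Hashes are compared only when the counters match, so a partition that is
constantly updated ends up in `unresolved` rather than being reported as a
conflict.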

Risks and assumptions:
With this strategy we'll check the cluster's consistency ... eventually,
and the check will take more time even on an idle cluster.
If operationsPerTimeToGeneratePartitionHashes > partitionsCount, we'll
definitely make no progress.
But if the load is not high, we'll be able to check the whole cluster.

Another hope is that we'll be able to pause/continue the scan: for example,
we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days
we'll have checked the whole cluster.
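
A pause/continue scan could be as simple as remembering which partitions
have already been verified. The sketch below assumes progress is kept as a
plain set of partition ids; `ResumableScanSketch` and `nextBatch` are
made-up names, not existing Ignite code. Note that a partition verified on
day one may be modified later, so the result only says that each partition
was consistent at the moment it was checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch only: picks the next portion of partitions that are not verified yet. */
class ResumableScanSketch {
    /**
     * @param partitionsCount Total number of partitions (ids 0..partitionsCount-1).
     * @param verified Ids already verified in earlier runs.
     * @param batchSize Maximum number of partitions to check in this run.
     * @return Ids to check now, in ascending order; empty when everything is done.
     */
    static List<Integer> nextBatch(int partitionsCount, Set<Integer> verified, int batchSize) {
        List<Integer> batch = new ArrayList<>();

        for (int part = 0; part < partitionsCount && batch.size() < batchSize; part++) {
            if (!verified.contains(part))
                batch.add(part);
        }

        return batch;
    }
}
```

With 9 partitions and a batch size of 3, three consecutive runs (each adding
its batch to `verified`) cover partitions 0-2, 3-5 and 6-8.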

Have I missed something?

2) Since "Idle verify" uses regular page memory, I assume it replaces hot
data with persisted data.
So we have to warm up the cluster after each check.
Is there any chance to run the check without cooling the cluster?

[1]
https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
