Hi all,
I have a 6-node production cluster with a 1.5 TiB load (RF=3) and a
single-node DC dedicated as a "remote disaster recovery copy" with 2.7 TiB.
Running repairs on the production cluster alone takes a semi-decent time
(24h for the biggest keyspace, which takes 90% of the space), but
repairing across the two DCs takes forever, and segments often fail
even though I increased Reaper's segment time limit to 2h.
In trying to debug the issue, I noticed that "compactionstats -H" on the
DR node shows huge (and very very slow) validations:
compaction   completed    total      unit    progress
Validation   2.78 GiB     8.11 GiB   bytes   34.33%
Validation   0 bytes      2.67 TiB   bytes    0.00%
Validation   1.7 TiB      2.43 TiB   bytes   69.75%
Validation   124.26 GiB   2.67 TiB   bytes    4.55%
Validation   536.67 GiB   2.67 TiB   bytes   19.63%
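
To put numbers on how much each validation still has to read, here is a
rough Python sketch that parses the rows above and prints the recomputed
progress and the remaining size. The helper names (to_bytes,
parse_validation) are made up for illustration and are not part of
nodetool; since the sizes shown by -H are already rounded, the recomputed
percentages may differ slightly from the progress column.

```python
# Binary unit multipliers, matching the suffixes shown in the output above.
UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def to_bytes(value, unit):
    """Convert a human-readable size such as ('2.67', 'TiB') to bytes."""
    return float(value) * UNITS[unit]


def parse_validation(line):
    """Return (completed_bytes, total_bytes) for one Validation row.

    Assumes the column layout pasted above:
    type, completed value, completed unit, total value, total unit, ...
    """
    parts = line.split()
    completed = to_bytes(parts[1], parts[2])
    total = to_bytes(parts[3], parts[4])
    return completed, total


# Rows copied from the compactionstats output above.
rows = [
    "Validation 2.78 GiB 8.11 GiB bytes 34.33%",
    "Validation 0 bytes 2.67 TiB bytes 0.00%",
    "Validation 1.7 TiB 2.43 TiB bytes 69.75%",
    "Validation 124.26 GiB 2.67 TiB bytes 4.55%",
    "Validation 536.67 GiB 2.67 TiB bytes 19.63%",
]

for row in rows:
    completed, total = parse_validation(row)
    percent = 100 * completed / total
    remaining_gib = (total - completed) / UNITS["GiB"]
    print(f"{percent:6.2f}% done, {remaining_gib:9.1f} GiB still to validate")
```

Four of the five validations have a total of 2.4 to 2.7 TiB, i.e. close to
the whole load of the DR node.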
Such validations take a few hours to complete, and as far as I
understand, segment repair always fails on the first try due to those, and
only succeeds after a few tries, once the validation started by the
first try has ended.
My question is this: is it normal to have to validate all of the
keyspace content on each segment's validation?
Is the DB in a "strange" state?
Would it be useful to issue a "rebuild" on that node, in order to send
all missing data anyway, thus skipping the lengthy validations?
Thanks!
--
Lapo Luchini
l...@lapo.it