Hi all,
I have a 6-node production cluster with 1.5 TiB load (RF=3) and a single-node DC dedicated as a "remote disaster recovery copy" holding 2.7 TiB.

Repairing only the production cluster takes a semi-decent time (24h for the biggest keyspace, which takes 90% of the space), but repairing across the two DCs takes forever, and segments often fail even though I increased the Reaper segment time limit to 2h.

While trying to debug the issue, I noticed that "nodetool compactionstats -H" on the DR node shows huge (and very slow) validations:

compaction  completed   total     unit   progress
Validation  2.78 GiB    8.11 GiB  bytes  34.33%
Validation  0 bytes     2.67 TiB  bytes  0.00%
Validation  1.7 TiB     2.43 TiB  bytes  69.75%
Validation  124.26 GiB  2.67 TiB  bytes  4.55%
Validation  536.67 GiB  2.67 TiB  bytes  19.63%
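
To put a number on how much validation work is queued, here is a rough Python sketch that parses rows shaped like the ones above and sums the bytes still to be read. It assumes the trimmed five-column layout shown here (the real output also has id, keyspace and table columns); the names parse_row and remaining_bytes are made up for illustration and are not part of nodetool or Reaper.

```python
import re

# Multipliers for the binary units printed by "compactionstats -H".
UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Assumed row layout: type, completed size, total size, the literal
# "bytes" unit column, progress percentage.
ROW = re.compile(
    r"^(?P<kind>\w+)\s+"
    r"(?P<done>[\d.]+)\s+(?P<done_unit>bytes|[KMGT]iB)\s+"
    r"(?P<total>[\d.]+)\s+(?P<total_unit>bytes|[KMGT]iB)\s+"
    r"bytes\s+(?P<pct>[\d.]+)%$"
)

SAMPLE = """\
compaction completed  total      unit  progress
Validation 2.78 GiB   8.11 GiB   bytes 34.33%
Validation 0 bytes    2.67 TiB   bytes 0.00%
Validation 1.7 TiB    2.43 TiB   bytes 69.75%
Validation 124.26 GiB 2.67 TiB   bytes 4.55%
Validation 536.67 GiB 2.67 TiB   bytes 19.63%
"""


def parse_row(line):
    """Return (kind, completed_bytes, total_bytes), or None for non-data lines."""
    m = ROW.match(line.strip())
    if m is None:
        return None  # header or anything that does not fit the assumed layout
    done = float(m["done"]) * UNITS[m["done_unit"]]
    total = float(m["total"]) * UNITS[m["total_unit"]]
    return m["kind"], done, total


def remaining_bytes(text):
    """Sum (total - completed) over every row that parses."""
    rows = [r for r in map(parse_row, text.splitlines()) if r is not None]
    return sum(total - done for _, done, total in rows)


if __name__ == "__main__":
    print(f"{remaining_bytes(SAMPLE) / UNITS['TiB']:.2f} TiB still to validate")
```

For the five rows above this comes to roughly 8.1 TiB still to be read, on a node that holds 2.7 TiB in total. The sizes are parsed from rounded human-readable values, so the result is only approximate.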

Such validations take a few hours to complete and, as far as I understand, segment repair always fails on the first try due to those, and only succeeds after a few tries, once the original validation started in the first try has ended.

My question is this: is it normal for each segment's validation to have to go through the entire keyspace content?
Is the DB in a "strange" state?
Would it be useful to issue a "rebuild" on that node, in order to send all missing data anyway, thus skipping the lengthy validations?

Thanks!

--
Lapo Luchini
l...@lapo.it
