Repair on a slow node (or is it?)

Lapo Luchini Mon, 29 Mar 2021 02:47:17 -0700

Hi all,

I have a 6-node production cluster with 1.5 TiB load (RF=3) and a single-node DC, dedicated as a "remote disaster recovery copy", with 2.7 TiB.

Doing repairs only on the production cluster takes a semi-decent time (24h for the biggest keyspace, which takes 90% of the space), but doing repair across the two DCs takes forever, and segments often fail even though I increased the Reaper segment time limit to 2h.

While trying to debug the issue, I noticed that "compactionstats -H" on the DR node shows huge (and very, very slow) validations:


compaction  completed   total     unit   progress
Validation  2.78 GiB    8.11 GiB  bytes  34.33%
Validation  0 bytes     2.67 TiB  bytes  0.00%
Validation  1.7 TiB     2.43 TiB  bytes  69.75%
Validation  124.26 GiB  2.67 TiB  bytes  4.55%
Validation  536.67 GiB  2.67 TiB  bytes  19.63%
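
To put the listing above in perspective, here is a minimal Python sketch that parses rows in the trimmed layout shown (type, completed, total, unit, progress) and totals how much data the pending validations still have to read. It is not part of nodetool or Reaper: the names ROW, parse_rows and remaining_bytes are hypothetical, and the row layout is assumed to match the pasted excerpt rather than the full "compactionstats" output.

```python
import re

# Binary unit multipliers as printed with the -H (human-readable) flag.
UNITS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Assumed row layout, taken from the excerpt above (not the full nodetool output):
#   <type> <completed> <unit>  <total> <unit>  bytes  <progress>%
ROW = re.compile(
    r"^(?P<kind>\S+)\s+"
    r"(?P<done>[\d.]+)\s+(?P<done_unit>\S+)\s+"
    r"(?P<total>[\d.]+)\s+(?P<total_unit>\S+)\s+"
    r"bytes\s+(?P<pct>[\d.]+)%$"
)

SAMPLE = """\
compaction completed  total      unit  progress
Validation 2.78 GiB   8.11 GiB   bytes 34.33%
Validation 0 bytes    2.67 TiB   bytes 0.00%
Validation 1.7 TiB    2.43 TiB   bytes 69.75%
Validation 124.26 GiB 2.67 TiB   bytes 4.55%
Validation 536.67 GiB 2.67 TiB   bytes 19.63%
"""


def parse_rows(text):
    """Return one dict per data row; the header and unparseable lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue
        rows.append({
            "kind": m["kind"],
            "done": float(m["done"]) * UNITS[m["done_unit"]],
            "total": float(m["total"]) * UNITS[m["total_unit"]],
        })
    return rows


def remaining_bytes(rows):
    """Sum of (total - completed) over all rows, in bytes."""
    return sum(r["total"] - r["done"] for r in rows)


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    left = remaining_bytes(rows)
    print(f"{len(rows)} validations, {left / UNITS['TiB']:.2f} TiB still to validate")
```

Since the values printed with -H are rounded, the result is only approximate, but it is enough to compare the outstanding validation work against the 2.7 TiB held on the DR node.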

Such validations take a few hours to complete, and as far as I understand, segment repair always fails on the first try due to those, and only succeeds after a few tries, once the original validation started in the first try has ended.

My question is this: is it normal to have to validate all of the keyspace content on each segment's validation?

Is the DB in a "strange" state?

Would it be useful to issue a "rebuild" on that node, in order to send all missing data anyway, thus skipping the lengthy validations?


thanks!

--
Lapo Luchini
l...@lapo.it
