Thanks for all your suggestions!
I'm looking into it, and so far it seems to be mainly a disk I/O
problem: the host runs on spinning disks, and as the DR node for an
entire cluster it has a lot of changes to keep up with.
My first (easy) attempt will be to add an SSD as ZFS cache (ZIL + L2ARC).
That should already make a big difference.
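Roughly what I have in mind, as a small Python sketch that only prints
the two `zpool add` commands rather than running them. The pool name
"tank" and the device paths are placeholders, not my real setup, and I
am assuming one SSD split into two partitions. Keep in mind that a
separate log device only helps synchronous writes and L2ARC only helps
reads, so the actual gain remains to be measured.

```python
import shlex


def zpool_add_commands(pool, log_dev, cache_dev):
    """Build the command lines that attach a separate log device (for the
    ZIL) and an L2ARC cache device to an existing pool. Nothing is run."""
    return [
        shlex.join(["zpool", "add", pool, "log", log_dev]),
        shlex.join(["zpool", "add", pool, "cache", cache_dev]),
    ]


# Placeholder names: replace with the real pool and SSD partitions.
for cmd in zpool_add_commands("tank", "/dev/sdx1", "/dev/sdx2"):
    print(cmd)
```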
I will look into Medusa/tablesnap later on, thanks.
On 2021-03-29 12:32, Kane Wilson wrote:
Check what your compactionthroughput is set to, as it will impact the
validation compactions. Also, what kind of disks does the DR node have?
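The current value can be read with `nodetool getcompactionthroughput`
and changed at runtime with `nodetool setcompactionthroughput`. For a
rough feel of how much the throttle matters, here is a back-of-the-envelope
sketch in Python. The 2 TiB data size and the throughput values are
made-up assumptions, and the estimate ignores everything except the
throttle (disk seeks, concurrent compactions, and so on), so treat it as
a lower bound at best.

```python
def validation_hours(data_gib, throughput_mib_s):
    """Lower-bound estimate of the hours needed to read data_gib of
    SSTables when reads are capped at throughput_mib_s."""
    if throughput_mib_s <= 0:
        # In Cassandra a throughput of 0 disables throttling, so the
        # estimate is meaningless in that case.
        raise ValueError("throughput must be positive")
    seconds = data_gib * 1024 / throughput_mib_s
    return seconds / 3600


# Hypothetical node holding 2 TiB, at a few example throttle settings.
for mib_s in (16, 64, 128):
    print(f"{mib_s:>4} MiB/s -> {validation_hours(2048, mib_s):.1f} h")
```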
The validation compaction sizes are likely fine. I'm not sure of the
exact details, but it's normal to expect very large validations.
Rebuilding would not be an ideal mechanism for repairing, and would
likely be slower and chew up a lot of disk space. It's also not
guaranteed to give you data that will be consistent with the other DC,
as replicas will only be streamed from one node.
I think you're better off looking at setting up regular backups and, if
you really need them, commitlog backups. The storage would be cheaper and
more reliable, plus less impactful on your production DC. Restoring will
also be a lot easier and faster, as restoring from a single-node DC will
be bottlenecked by the network. There are various tools around that do
this for you, such as Medusa or tablesnap.