> And the more spread out the out-of-sync data is, the worse it will be. Though
> in general we can expect it to not be too spread out, for the same reason
> that caches work.

(Speaking generally now, not addressing the OP's issue)

I'm not sure I buy that. Unless your data is such that hotness is
somehow related to the order of the data, even with a workload quite
susceptible to caching you very quickly spread requests out across
lots of merkle ring segments.

For example, suppose you have a CF with lots of smallish rows - say
250 bytes per row on average. Given a 50 GB load and 2^15 ranges
(let's assume the 0.8 improvements), each chunk/merkle segment will be
1.6 MB in size. You don't need to hit a lot of unique keys in order to
make the next repair transfer most content. 2^15 is only 32k; if a
node is e.g. overloaded or some other event causes more than just
random sporadic message dropping for any non-trivial period, it is not
difficult at all to hit >> 32k unique keys being written to under
pretty plausible workloads.
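
To make the arithmetic concrete, here's a rough back-of-envelope
sketch (illustrative Python only; the 50 GB load, 250-byte rows and
2^15 segments are just the assumptions above):

    # Back-of-envelope numbers for the granularity argument above.
    load_bytes   = 50 * 1024**3   # 50 GB of data on the node
    row_size     = 250            # ~250 bytes per row on average
    num_segments = 2**15          # merkle tree ranges (32k)

    bytes_per_segment = load_bytes / num_segments     # ~1.6 MB
    rows_per_segment  = bytes_per_segment / row_size  # ~6500 rows

    print(bytes_per_segment / 1024**2, rows_per_segment)

    # With a random partitioner, writes to >> 32k unique keys land in
    # essentially all 32k segments, so the next repair ends up
    # streaming most of the data even if only a small fraction of the
    # rows actually differ.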

It's somewhat of a concern particularly because, if the original
reason for the dropped keys is overload and you have lots of data and
are bottlenecking on I/O, the "over-repair" will then cause further
data bloat (even if only temporarily), causing more dropped messages,
causing the next repair to do the same, etc.
> So maybe it is worth improving the granularity of the tree. It is true that
> having a fixed granularity as we do is probably a bit naive. In any case, it
> will still be a trade-off between the precision of the tree and its size
> (which impacts the memory used and what we have to transfer (though the
> transfer part could be done by tree level -- right now the whole tree is
> always transferred, which is not so optimal)).

Longer term I'm thinking something more incremental and less "bulky"
is desired, and such a solution might have the side-effect of avoiding
the granularity problem.

For example, consider a repair process where the originating node
iterates over its own range in a compaction-like fashion, calculating
small merkle trees as required. The node could ask neighbors to
deliver relevant data in a fashion more similar to hinted hand-off or
read-repair. You'd now have an incremental process that behaves more
like normal writes in terms of the impact on data size, compaction, etc.,
instead of a sudden spike in sstable sizes and large long-running
compaction/streaming jobs.
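
To be concrete about what I mean, here is a very rough sketch in
Python-ish pseudocode -- none of these functions exist in Cassandra
today, the names are made up purely for illustration:

    # Hypothetical sketch only; build_merkle_tree, replica_neighbors,
    # etc. are invented names, not existing Cassandra APIs.
    def incremental_repair(node, column_family):
        for subrange in node.local_token_ranges(column_family):
            # Build a small merkle tree over just this sub-range by
            # scanning local data in token order (compaction-like I/O).
            local_tree = build_merkle_tree(node, column_family, subrange)
            for neighbor in node.replica_neighbors(subrange):
                remote_tree = neighbor.request_merkle_tree(column_family, subrange)
                for diff in local_tree.difference(remote_tree):
                    # Ask the neighbor to send the differing rows as
                    # ordinary mutations (hinted hand-off / read-repair
                    # style) instead of streaming sstable sections.
                    neighbor.send_rows_as_mutations(column_family, diff, node)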

Further, the process should be fairly insensitive to issues like total
data size, the average number of rows out of sync, etc. It could grab
one chunk at a time, with a "chunk" being some fixed reasonable size,
like say 1 GB, spread out over the appropriate sstables (i.e., take
whatever segment of the ring corresponds to 1 GB). There would be no
need to keep long-term refs to obsolete sstables during this process,
since the iteration would be entirely in token order, so both the
receiver and the sender in the synchronization process should see
minimal impact in terms of sstable retention, backed-up compactions,
etc. By doing chunks, however, you avoid the cost of traversing the
data in a seek-bound fashion, since each chunk of reasonable size
would have compaction-like performance characteristics.
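
As a sketch of the chunking, again with made-up helper names
(estimate_bytes_in_range and split_at_bytes are hypothetical):

    # Hypothetical sketch; the helpers below do not exist in Cassandra.
    CHUNK_BYTES = 1 * 1024**3   # ~1 GB per chunk, tunable

    def chunks_in_token_order(node, cf, owned_range):
        remaining = owned_range
        while estimate_bytes_in_range(node, cf, remaining) > CHUNK_BYTES:
            # Carve off the next ~1 GB worth of the range, in token order.
            chunk, remaining = split_at_bytes(node, cf, remaining, CHUNK_BYTES)
            yield chunk   # repair this sub-range, then drop sstable refs
        yield remaining

    # Processing one chunk at a time in strict token order means
    # obsolete sstables only need to be retained for the duration of
    # one chunk, not the whole repair.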

Hmmm, I'm starting to like this idea more and more the more I think of it ;)

-- 
/ Peter Schuller
