> And the more spread out the out of sync is, the worse it will be. Though in
> general we can expect that to not be too spread out, for the same reason that
> caches work.
(Speaking generally now, not addressing the OP's issue.) I'm not sure I buy that. Unless your data is such that hotness is expected to be related somehow to the order of the data, even with a workload quite susceptible to caching you very quickly spread requests out over lots of Merkle ring segments.

For example, suppose you have a CF with lots of smallish rows - say 250 bytes per row on average. Given a 50 GB load and 2^15 for the range (let's assume the 0.8 improvements), each chunk/Merkle segment will be 1.6 MB in size. You don't need to hit a lot of unique keys in order to make the next repair transfer most of the content (quick back-of-the-envelope numbers at the end of this mail). 2^15 is only 32k; if a node is e.g. overloaded, or some other event causes more than just random sporadic message dropping for any non-trivial period, it is not difficult at all to hit >> 32k unique keys being written to for pretty plausible workloads.

It's somewhat of a concern particularly because if the original reason for the dropped keys is overload, and you have lots of data and are bottlenecking on I/O, the "over-repair" will then cause further data bloat (even if only temporarily), causing more dropped messages, causing the next repair to do the same, and so on.

> So maybe it is worth improving the granularity of the tree. It is true that
> having a fixed granularity as we do is probably a bit naive. In any case, it
> will still be a trade-off between the precision of the tree and its size
> (which impacts memory used, and we have to transfer it (though the transfer
> part could be done by tree level -- right now the whole tree is always
> transferred, which is not so optimal)).

Longer term I'm thinking something more incremental and less "bulky" is desired, and such a solution might have the side effect of avoiding the granularity problem.

For example, consider a repair process where the originating node iterates over its own range in a compaction-like fashion, calculating small Merkle trees as required. The node could ask neighbors to deliver the relevant data in a fashion more similar to hinted handoff or read repair. You'd now have an incremental process that behaves more like normal writes in terms of the impact on data size, compaction, etc., instead of a sudden spike in sstable sizes and large, long-running compaction/streaming jobs. Further, the process should be fairly insensitive to issues like total data size, the average number of rows out of sync, etc.

It could grab one chunk at a time, with a "chunk" being some fixed reasonable size - say 1 GB - spread out over the appropriate sstables (i.e., take whatever segment of the ring corresponds to 1 GB). There would be no need to keep long-term references to obsolete sstables during this process, since the iteration would be entirely in token order, so both the receiver and the sender in the synchronization process should see minimal impact in terms of sstable retention, backed-up compactions, etc. By doing chunks, however, you avoid the cost of traversing the data in a seek-bound fashion, since each chunk of reasonable size would have compaction-like performance characteristics. (A rough sketch of the loop I have in mind is at the end of this mail.)

Hmmm, I'm starting to like this idea more and more the more I think of it ;)

-- 
/ Peter Schuller
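
P.S. To spell out the arithmetic behind the 1.6 MB figure, a throwaway back-of-the-envelope calculation (the 50 GB, 2^15 and 250 bytes/row numbers are just the example figures above, nothing measured):

    public class MerkleSegmentMath {
        public static void main(String[] args) {
            long loadBytes = 50L * 1024 * 1024 * 1024;  // 50 GB of data on the node
            long segments = 1L << 15;                   // 2^15 merkle tree leaves
            long avgRowBytes = 250;                     // smallish rows, ~250 bytes each

            long bytesPerSegment = loadBytes / segments;         // ~1.6 MB per leaf
            long rowsPerSegment = bytesPerSegment / avgRowBytes; // ~6500 rows per leaf

            System.out.printf("bytes per segment: %d (~%.1f MB)%n",
                              bytesPerSegment, bytesPerSegment / (1024.0 * 1024.0));
            System.out.printf("rows per segment:  %d%n", rowsPerSegment);
            // A single out-of-sync row dirties its entire segment, so on the order
            // of 32k scattered dirty keys is enough to flag essentially every
            // segment for streaming.
        }
    }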
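
P.P.S. And to make the shape of the incremental process a bit more concrete, a very rough sketch of the per-chunk loop; Range, ChunkTree, LocalStore and Neighbor are made-up stand-ins for illustration, not existing Cassandra classes:

    import java.util.List;

    // Rough sketch of the chunk-at-a-time repair idea described above.
    // Only the shape of the loop matters, not the names.
    public class IncrementalRepairSketch {

        interface Range {}      // a contiguous slice of the ring, in token order
        interface ChunkTree {}  // a small merkle tree covering just one chunk

        interface LocalStore {
            List<Range> splitIntoChunks(long targetBytes); // ~1 GB of data per chunk
            ChunkTree buildTree(Range chunk);              // sequential, compaction-like scan
            void apply(Object mutations);                  // incoming rows handled as normal writes
        }

        interface Neighbor {
            // The neighbor diffs its data for the chunk against our tree and sends
            // back only the differing rows, hinted-handoff / read-repair style.
            Object fetchDifferingRows(Range chunk, ChunkTree ours);
        }

        static void repair(LocalStore store, List<Neighbor> neighbors) {
            long targetChunkBytes = 1L << 30; // grab ~1 GB of the ring at a time
            for (Range chunk : store.splitIntoChunks(targetChunkBytes)) {
                ChunkTree ours = store.buildTree(chunk); // short-lived; only this chunk's
                                                         // sstable refs need to be held
                for (Neighbor neighbor : neighbors) {
                    store.apply(neighbor.fetchDifferingRows(chunk, ours));
                }
            }
        }
    }

The point being that differences arrive as ordinary writes, one modest chunk at a time, rather than as a burst of streamed sstables followed by a pile of compactions.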