On Thu, Sep 27, 2012 at 9:52 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>> I don't understand why it copied data twice. In worst case scenario it
>> should copy everything (~90G)
>
> Sadly no, repair is currently peer-to-peer based (there is a ticket to
> fix it: https://issues.apache.org/jira/browse/CASSANDRA-3200, but
> that's not trivial). This means that you can end up with RF times the
> data after a repair. Obviously that should be a worst case scenario as
> it implies everything is repaired, but at least the triplicate part is
> a problem, though a known and not so easy to fix one.

I see. That explains why I get 85G + 85G instead of 90G. But after the next
repair I have six extra files of 75G each. How is that possible? It looks like
repair is done per sstable, not per CF. Is that possible?

>
> Is it possible that each time you've run repair, one of the nodes in
> the cluster was very out of sync with the other nodes? Maybe a node
> that has crashed for a long time?
>

No, nodes go down from time to time (OOM), but I restart them automatically.
What is specific to my setup is that I use the order-preserving partitioner
and intensively update every 5th or 10th row. As far as I understand, this
means that when the Merkle tree is calculated, every range contains several
"hot" rows. These rows are good candidates to be inconsistent.

There is one thing I don't understand. Does the Merkle tree calculation use
only the sstables flushed to disk, or does it use memtables as well? Let's say
I have a "hot" row which sits in memory on one node but has been flushed on
another. Is there any difference in the Merkle trees?

Thank you,
  Andrey
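
P.S. To make the "hot rows" point concrete, here is a minimal Python sketch of what I have in mind. It is a toy model and not Cassandra's actual repair code: it only compares per-range hashes (the leaf level of a Merkle tree) between two replicas that hold the same keys. The function names, the 1000 rows and the 20 ranges are all made up for illustration.

```python
import hashlib

def leaf_hashes(rows, num_leaves):
    """Hash rows into contiguous key ranges (toy stand-in for Merkle tree leaves)."""
    keys = sorted(rows)                      # key order, as with an order-preserving partitioner
    per_leaf = -(-len(keys) // num_leaves)   # ceiling division
    leaves = []
    for start in range(0, len(keys), per_leaf):
        h = hashlib.sha256()
        for key in keys[start:start + per_leaf]:
            h.update(f"{key}={rows[key]};".encode())
        leaves.append(h.hexdigest())
    return leaves

def mismatched_leaves(replica_a, replica_b, num_leaves):
    """Return the indexes of the ranges whose hashes differ between two replicas."""
    hashes_a = leaf_hashes(replica_a, num_leaves)
    hashes_b = leaf_hashes(replica_b, num_leaves)
    return [i for i, (x, y) in enumerate(zip(hashes_a, hashes_b)) if x != y]

# Hypothetical data: 1000 rows, replica B is stale on every 10th ("hot") row.
replica_a = {f"row{i:04d}": ("v2" if i % 10 == 0 else "v1") for i in range(1000)}
replica_b = {f"row{i:04d}": "v1" for i in range(1000)}

bad = mismatched_leaves(replica_a, replica_b, num_leaves=20)
print(len(bad), "of 20 ranges differ")  # prints "20 of 20 ranges differ"
```

In this model only 10% of the rows are out of date, yet every range is flagged because each one contains at least one hot row. If repair then streams whole ranges, that would explain copying nearly everything each time.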