We enabled a major repair on every node every 7 days.

I think you mean there are 2 cases of a "failed" write. One is a replication failure on a writer. Duplication generated from this kind of "failure" should be very small in my case, because I only parse the data from 12 nodes, which should NOT contain any replica nodes. If one node persists a write, plus a "hint" for the failed replication write, that write will still be stored as one write in its SSTable files, right? Why would it need to be stored as 2 duplicate copies in the SSTable files?

The other case is what you describe: the client retries the write when a timeout exception happens. That would explain the duplication reasonably. Here are the duplication counts found in our SSTable files. You can see a lot of data duplicated 2 times, but also some with even higher counts. The max duplication count is 27, though; can one client really retry 27 times?

duplication_count  duplication_occurrence
 2                 123615348
 3                 6446783
 4                 21102
 5                 1054
 6                 2496
 7                 47
 8                 726
 9                 52
10                 12
11                 3
12                 7
13                 9
14                 7
15                 3
16                 2
17                 2
18                 1
19                 5
20                 5
22                 1
23                 3
25                 2
27                 99

Another question: do you have any guess what could cause case 2 from my original email to happen?

Thanks

Date: Tue, 22 Oct 2013 17:52:24 -0700
Subject: Re: Questions related to the data in SSTable files
From: rc...@eventbrite.com
To: user@cassandra.apache.org

On Tue, Oct 22, 2013 at 5:17 PM, java8964 java8964 <java8...@hotmail.com> wrote:

> Any way I can verify how often the system is being "repaired"? I can ask another group who maintains the Cassandra cluster. But do you mean that even the failed writes will be stored in the SSTable files?

"repair" sessions are logged in system.log, and the "best practice" is to run a repair once every gc_grace_seconds, which defaults to 10 days.

A "failed" write means only that it "failed" to meet its ConsistencyLevel within the request_timeout. It does not mean that it failed to write everywhere it tried to write. There is no rollback, so in practice with RF>1 it is likely that a "failed" write succeeded at least somewhere. And if any failure is noted, Cassandra will generate a hint for hinted handoff and attempt to redeliver the "failed" write. Also, many/most client applications will respond to a TimedOutException by attempting to re-write the "failed" write, using the same client timestamp.

Repair has a fixed granularity, so the larger the size of your dataset, the more "over-repair" any given "repair" will cause. Duplicates occur as a natural consequence of this: if you have 1 row which differs within a merkle tree chunk, and the merkle tree chunk is, for example, 1000 rows, you will "repair" one row and "duplicate" the other 999.

=Rob
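The over-repair arithmetic in Rob's last paragraph can be sketched as a tiny calculation. This is only an illustration of the granularity argument; the 1000-row chunk size and the function name are assumptions for the example, not Cassandra internals:

```python
# Sketch of why coarse merkle-tree granularity "over-repairs" (illustrative only).
# Assumption: a mismatched chunk is streamed in full, so every row in it that
# already matched arrives as a duplicate on the receiving replica.

CHUNK_ROWS = 1000  # rows covered by one merkle tree chunk (assumed for the example)

def rows_streamed(differing_rows: int, chunk_rows: int = CHUNK_ROWS) -> tuple[int, int]:
    """Return (repaired, duplicated) row counts for one chunk.

    If the chunk hashes match, nothing is streamed. If even one row differs,
    the whole chunk is streamed: differing_rows are genuine repairs, the
    remaining chunk_rows - differing_rows are duplicates of existing data.
    """
    if differing_rows == 0:
        return 0, 0
    return differing_rows, chunk_rows - differing_rows

repaired, duplicated = rows_streamed(1)
print(repaired, duplicated)  # 1 row actually repaired, 999 duplicated
```

So with a 7-day repair cycle on a large dataset, even a handful of mismatched rows per chunk can account for the long tail of duplicate counts seen in the table above.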