Is there any way I can verify how often the system is being "repaired"? I can ask the other group, who maintain the Cassandra cluster. But do you mean that even the failed writes will be stored in the SSTable files? I thought Cassandra would use separate storage for that kind of data, rather than storing it like the regular good data, which goes into the memtable and then into the SSTable files.

Yong

Date: Tue, 22 Oct 2013 14:50:07 -0700
Subject: Re: Questions related to the data in SSTable files
From: rc...@eventbrite.com
To: user@cassandra.apache.org

On Tue, Oct 22, 2013 at 2:29 PM, java8964 java8964 <java8...@hotmail.com> wrote:

> 1) In the full snapshot data, I see more than 10% duplicated data. By
> duplication I mean that there are event_activities with the same
> (entity_1_id, entity_2_id, entity_3_id, entity_4_id, created_on_timestamp,
> column_timestamp). I am surprised to see such a high level of duplication,
> especially when the column_timestamp is included in the comparison. As I
> understand it, the column_timestamp is provided by the client when
> Cassandra stores the column under the row key. So if there were a small
> amount of duplication, I could explain it as an application bug, or as
> duplication coming from replication. But more than 10% is too much to
> explain this way.

Have you run "repair"? Do you regularly have hinted handoff kicking in due to down nodes or dropped messages, such that failed writes are re-delivered as hints?

=Rob
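
To put a number on the duplication discussed above, here is a minimal Python sketch. It assumes the snapshot's SSTables have already been exported into a list of dicts, one per column; that export step is not shown. The function names and the sample values are hypothetical, and only the six field names come from the original message. This is a rough illustration, not a Cassandra API.

```python
from collections import Counter

# Fields that together define a "duplicate" in the original message.
KEY_FIELDS = (
    "entity_1_id",
    "entity_2_id",
    "entity_3_id",
    "entity_4_id",
    "created_on_timestamp",
    "column_timestamp",
)


def key_of(record):
    """Build the comparison key for one exported record (hypothetical helper)."""
    return tuple(record[field] for field in KEY_FIELDS)


def duplicate_fraction(records):
    """Return the share of records that repeat an already-seen key.

    For a key seen n times, n - 1 of those records count as duplicates.
    """
    counts = Counter(key_of(r) for r in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


def duplicated_keys(records):
    """Return a dict of key -> occurrence count for keys seen more than once."""
    counts = Counter(key_of(r) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample data: the first two records share the same key.
    base = dict(entity_1_id=1, entity_2_id=2, entity_3_id=3, entity_4_id=4,
                created_on_timestamp=1000, column_timestamp=5000)
    sample = [
        dict(base),
        dict(base),
        dict(base, column_timestamp=5001),
        dict(base, entity_1_id=9),
    ]
    print(duplicate_fraction(sample))
    print(duplicated_keys(sample))
```

If the exported records also carried the name of the SSTable file they came from, grouping the duplicated keys by file would show whether the copies live in different SSTables (as one might expect after repair or hint replay) or within a single one.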