Hi, I have some questions related to the data in the SSTable files.
Our production environment has 36 nodes, so in theory 12 of them should hold
one complete copy of the data without replication.
Right now, I have all the SSTable files from 12 nodes of the cluster (based on
my understanding, these 12 nodes form one replication group, and they were NOT
randomly picked by our Cassandra admin), taken from one full snapshot plus one
incremental backup made after the snapshot, for one column family.
This column family stores time-series data only, so there are no update or
delete operations in Cassandra, only inserts. But when I used sstable2json to
parse all the data out of both the snapshot and the incremental backup, I
found the following cases which I cannot explain. The column family has the
following schema:
- row key: composite of (entity_1_id, entity_2_id)
- column name: composite of (entity_3_id, entity_4_id,
  reverse(created_on_timestamp))
- column value: the JSON data
I used sstable2json to parse all the data out, and I also parsed the column
timestamp in the output, just to understand the data better. I also exploded
the data: if one row has 10 columns, I flatten it into 10 rows, so I can check
for duplication. But when I checked the output from all 12 nodes, I found the
following cases, which I cannot explain from the SSTable file data:
1) In the full-snapshot data, I see more than 10% duplicate data. By
duplicates I mean event activities with the same (entity_1_id, entity_2_id,
entity_3_id, entity_4_id, created_on_timestamp, column_timestamp). I am
surprised to see such a high level of duplication, especially when even the
column_timestamp matches. As I understand it, the column timestamp is provided
by the client when Cassandra stores the column under the row key. A small
amount of duplication I could explain as an application bug, or as duplicates
coming from replication, but more than 10% is too much to explain that way.
2) The output of the incremental backup is even more puzzling. In it, I found
a lot of records of the following form:
(entity_1_id, entity_2_id, entity_3_id, entity_4_id, created_on_timestamp as
(Dec-22-2012), column_timestamp as (Oct-14-2013)).
The snapshot was taken on Oct 12, 2013, and the incremental backup on Oct 15,
2013. So based on the column_timestamp, it makes sense that these records
appear in the incremental backup, since it falls between those two dates. But
the event-activity date is far too old: it means the event happened in Dec
2012, more than 10 months earlier. First, I searched the snapshot output for
these records, and I cannot find the event activities based on the given
UUIDs; yet I cannot imagine events that happened 10 months ago being flushed
to SSTable files only now. There are not just a few such records but quite a
lot, with created_on dates varying from Dec 2012 to Oct 11, 2013. Why is that?
From the business point of view, I know there are NO updates to any existing
records in Cassandra. I also checked the JSON output: there are NO delete-type
records, which confirms my understanding that there are no delete operations
in this Cassandra system. But "no updates" is only based on our understanding
of the business.
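For what it is worth, this is roughly how I checked for delete markers (a
sketch; it assumes the Cassandra 1.x sstable2json convention that a deleted
column is emitted with an extra "d" flag, e.g. [name, value, timestamp, "d"],
while live columns have just three elements; verify that flag layout against
your version):

```python
import json

def count_tombstones(path):
    """Count columns carrying a delete marker in sstable2json output.

    Assumption: a deleted column is emitted as [name, value, timestamp, "d"];
    live columns are plain [name, value, timestamp] triples.
    """
    deleted = 0
    with open(path) as f:
        rows = json.load(f)
    for row in rows:
        for col in row.get("columns", []):
            if len(col) > 3 and col[3] == "d":
                deleted += 1
    return deleted
```

This returned 0 across every dump I checked.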
I cannot explain why these two cases appear in the data parsed out of the
snapshot and backups. One possible reason is that the wrong nodes were given
to me, so replication made the duplication count so high. Even so, that still
does not explain why case 2 shows up with so many occurrences. Does anyone
have a hint about what could cause case 2?
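In case it helps, this is roughly the filter I used to isolate the case-2
records (a sketch; decoding created_on from the composite column name is
omitted because it depends on the serializers used, so the function takes
already-decoded datetime pairs):

```python
from datetime import datetime, timedelta

def old_events_with_recent_writes(records, min_gap_days=30):
    """Select records whose event date is much older than the write time.

    `records` is an iterable of (created_on, column_timestamp) pairs as
    datetime objects; pairs where the write happened more than
    `min_gap_days` after the event are returned.
    """
    gap = timedelta(days=min_gap_days)
    return [(created, written) for created, written in records
            if written - created > gap]
```

Every pair this returns is one of the puzzling records described in case 2.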
Thanks
Yong