Thanks for the answer, but it is already set to 0 since I don't do any deletes.
Cem

On Tue, May 28, 2013 at 9:03 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> You need to change the gc_grace time of the column family. It defaults
> to 10 days. By default the tombstones will not go away for 10 days.
>
> On Tue, May 28, 2013 at 2:46 PM, cem <cayiro...@gmail.com> wrote:
>
>> Hi Experts,
>>
>> We have a general problem with cleaning up data from the disk. I need
>> to free the disk space after the retention period, and the customer
>> wants to dimension the disk space based on that.
>>
>> After running multiple performance tests with a TTL of 1 day, we saw
>> that compaction couldn't keep up with the request rate. Disks were
>> getting full after 3 days, and there were still a lot of sstables
>> older than 1 day on disk after 3 days.
>>
>> Things that we tried:
>>
>> - Change the compaction strategy to leveled. (helped a bit but not much)
>>
>> - Use a big sstable size (10G) with leveled compaction to get more
>> aggressive compaction. (helped a bit but not much)
>>
>> - Upgrade Cassandra from 1.0 to 1.2 to use TTL histograms. (didn't
>> help at all, since the key-overlap estimation algorithm generates a
>> 100% match, although we don't have...)
>>
>> Our column family structure is like this:
>>
>> Event_data_cf: (we store event data. Event_id is randomly generated,
>> and each event has attributes like location=london)
>>
>> row       | data
>> event id  | data blob
>>
>> timeseries_cf: (the key is the attribute that we want to index. It can
>> be location=london; we didn't use secondary indexes because the
>> indexes are dynamic.)
>>
>> row       | data
>> index key | time series of event ids (event1_id, event2_id, ...)
>>
>> timeseries_inv_cf: (this is used for removing an event by its row key.)
>>
>> row       | data
>> event id  | set of index keys
>>
>> Candidate solution: implementing time range partitions.
>>
>> Each partition will have its own set of column families and will be
>> managed by the client.
>>
>> Suppose that you want a 7-day retention period. Then you can configure
>> the partition size as 1 day and have 7 active partitions at any time,
>> and drop inactive partitions (older than 7 days). Dropping immediately
>> removes the data from the disk (with the proper cassandra.yaml
>> configuration).
>>
>> Storing an event:
>>
>> Find the current partition p1
>>
>> store the event data to Event_data_cf_p1
>>
>> store the indexes to timeseries_cf_p1
>>
>> store the inverted indexes to timeseries_inv_cf_p1
>>
>> A time range query with an index:
>>
>> Find all partitions that belong to that time range
>>
>> Read starting from the first partition until you reach the limit
>>
>> .....
>>
>> Could you please provide your comments and concerns?
>>
>> Is there any other option that we can try?
>>
>> What do you think about the candidate solution?
>>
>> Does anyone have the same issue? How would you solve it another way?
>>
>> Thanks in advance!
>>
>> Cem
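To make the candidate solution above a bit more concrete, here is a minimal Java sketch of the client-side bookkeeping, assuming day-sized partitions identified by a pYYYYMMDD suffix appended to each column family name. The suffix format and all class/method names are my own illustration, not an existing implementation:

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Client-managed time range partitions, one per day.
// Naming scheme (assumed): Event_data_cf_p20130528, timeseries_cf_p20130528, ...
public class PartitionManager {

    static final int RETENTION_DAYS = 7; // 7 active day-sized partitions

    static final String[] BASE_CFS = {
        "Event_data_cf", "timeseries_cf", "timeseries_inv_cf"
    };

    // Partition suffix for an event timestamp, e.g. "p20130528".
    static String partitionSuffix(Instant eventTime) {
        LocalDate day = eventTime.atZone(ZoneOffset.UTC).toLocalDate();
        return String.format("p%04d%02d%02d",
                day.getYear(), day.getMonthValue(), day.getDayOfMonth());
    }

    // Column families one event write fans out to; the caller issues the
    // three writes (data, index, inverted index) against these names.
    static List<String> writeTargets(Instant eventTime) {
        String suffix = partitionSuffix(eventTime);
        List<String> targets = new ArrayList<>();
        for (String cf : BASE_CFS) {
            targets.add(cf + "_" + suffix);
        }
        return targets;
    }

    // Given the suffixes of all existing partitions, return those that
    // fell out of the retention window. Dropping their column families
    // removes the data from disk immediately.
    static List<String> expired(List<String> existingSuffixes, Instant now) {
        // Today plus the previous 6 days stay active (7 partitions total).
        String oldestKept = partitionSuffix(
                now.minus(RETENTION_DAYS - 1, ChronoUnit.DAYS));
        List<String> result = new ArrayList<>();
        for (String suffix : existingSuffixes) {
            // Fixed-width pYYYYMMDD suffixes sort chronologically.
            if (suffix.compareTo(oldestKept) < 0) {
                result.add(suffix);
            }
        }
        return result;
    }
}

The partition choice is pure naming arithmetic on the client; Cassandra only ever sees independent column families, which is what makes dropping a whole day's worth of data cheap.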
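And the read side of the query flow, again only a sketch under the same naming assumption; readIndexRow below is a placeholder for whatever client call actually fetches a time series row from one partition:

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Walk the partitions that cover the queried time range and stop once
// enough events are collected.
public class RangeQuery {

    // All partition suffixes covered by [from, to], oldest first.
    static List<String> suffixesInRange(Instant from, Instant to) {
        List<String> suffixes = new ArrayList<>();
        LocalDate day = from.atZone(ZoneOffset.UTC).toLocalDate();
        LocalDate last = to.atZone(ZoneOffset.UTC).toLocalDate();
        while (!day.isAfter(last)) {
            suffixes.add(String.format("p%04d%02d%02d",
                    day.getYear(), day.getMonthValue(), day.getDayOfMonth()));
            day = day.plusDays(1);
        }
        return suffixes;
    }

    // Query an index key (e.g. "location=london") over a time range,
    // reading partition by partition until the limit is reached.
    static List<String> query(String indexKey, Instant from, Instant to, int limit) {
        List<String> eventIds = new ArrayList<>();
        for (String suffix : suffixesInRange(from, to)) {
            eventIds.addAll(readIndexRow("timeseries_cf_" + suffix, indexKey));
            if (eventIds.size() >= limit) {
                return eventIds.subList(0, limit);
            }
        }
        return eventIds;
    }

    // Placeholder: a real client would fetch the event-id columns of the
    // given row from the given column family here.
    static List<String> readIndexRow(String columnFamily, String rowKey) {
        return new ArrayList<>();
    }
}

Walking the partitions oldest first lets the query stop as soon as the limit is reached; a newest-first walk would work the same way if queries usually want the most recent events.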