Our approach for this scenario is to run a Hadoop job that periodically
cleans old entries, but I admit it's far from ideal. It would be nice to
have a more native way to perform these kinds of tasks.

There's a legend about a compaction strategy that keeps only the first N
entries of a partition key; I don't think it has been implemented yet, but
if I remember correctly there's a JIRA ticket about it.


On Tue, Jun 10, 2014 at 3:39 PM, Redmumba <redmu...@gmail.com> wrote:

> Honestly, this has been by far my single biggest obstacle with Cassandra
> for time-based data--cleaning up old data when the deletion criterion
> (i.e., date) isn't the primary key.  I've asked about a few different
> approaches, but I haven't really seen any feasible options that can be
> implemented easily.  I've seen the following:
>
>    1. Use date-based tables, then drop old tables, à la
>    "audit_table_20140610", "audit_table_20140609", etc.
>    But then I run into the issue of having to query every table--I would
>    have to execute queries against every day to get the data, and then merge
>    the data myself.  Unless there's something in the binary driver I'm
>    missing, it doesn't sound like this would be practical.
>    2. Use a TTL (see the sketch after this list).
>    But then I have to basically decide on a value that works for
>    everything and, if it ever turns out I overestimated, I'm basically SOL,
>    because my cluster will be out of space.
>    3. Maintain a separate index of days to keys, and use this index as
>    the reference for which keys to delete.
>    But then this requires maintaining another index and a relatively
>    manual delete.
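>
> For reference, a minimal sketch of approach 2, assuming a 30-day window
> and the audit schema quoted below (values are made up):
>
> INSERT INTO audit (id, region, date, pid)
> VALUES (1234, 'us-east', '2014-06-10', 42)
> USING TTL 2592000;  -- 30 days in seconds; expired data is purged at compaction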
>
> I can't help but feel that I am just way over-engineering this, or that
> I'm missing something basic in my data model--except for the last
> approach, it seems like I must be overlooking something obvious.
>
> Andrew
>
>
> Of course, Jonathan, I'll do my best!
>
> It's an auditing table that, right now, uses a primary key consisting of a
> composite partition key (the region and the object ID), plus the date and
> the process ID.  Each event in our system creates anywhere from 1 to 20
> rows, and multiple parts of the system might be working on the same
> "object ID".  So the CF is constantly being appended to, but reads are
> rare.
>
> CREATE TABLE audit (
>>     id bigint,
>>     region ascii,
>>     date timestamp,
>>     pid int,
>>     PRIMARY KEY ((id, region), date, pid)
>> );
>
>
> Data is queried on a specific object ID and region.  Optionally, users can
> restrict their query to a specific date range, which the above data model
> provides.
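>
> For example, a typical read looks something like this (values made up):
>
> SELECT * FROM audit
> WHERE id = 1234 AND region = 'us-east'
>   AND date >= '2014-05-01' AND date < '2014-05-08';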
>
> However, we generate quite a bit of data, and we want a convenient way to
> get rid of the oldest data.  Since our system scales with the time of year,
> we might get 50GB a day during peak, and 5GB a day off peak.  We could
> pick the safest number--let's say, 30 days--and set the TTL using that.
> The problem there is that, for 90% of the year, we'd be using only a very
> small percentage of our available space.
>
> What I'd like to be able to do is drop old tables as needed--i.e., let's
> say when we hit 80% load across the cluster (or some such metric that takes
> the cluster-wide load into account), I want to drop the oldest day's
> records until we're under 80%.  That way, we're always using the maximum
> amount of space we can, without having to worry about getting to the point
> where we run out of space cluster-wide.
>
> My thoughts are--we could always make the date part of the partition key,
> but then we'd either a) have to query the entire range of dates, or b)
> have to force a small date range when querying.  What are the penalties?
> Do you have any other suggestions?
>
>
> On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield <
> jlacefi...@datastax.com> wrote:
>
>> Hello,
>>
>>   Will you please describe the use case and what you are trying to model?
>>  What are some questions/queries that you would like to serve via
>> Cassandra?  This will help the community help you a little better.
>>
>> Jonathan Lacefield
>> Solutions Architect, DataStax
>> (404) 822 3487
>>  <http://www.linkedin.com/in/jlacefield>
>>
>>
>>
>>
>> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba <redmu...@gmail.com> wrote:
>>
>>> I've been trying to work around using "date-based tables" because I'd
>>> like to avoid the overhead.  It seems, however, that this is just not going
>>> to work.
>>>
>>> So here's a question--for these date-based tables (i.e., a table per
>>> day/week/month/whatever), how are they queried?  If I keep 60 days' worth
>>> of auditing data, for example, I'd need to query all 60 tables--can I do
>>> that smoothly?  Or do I have to have 60 different select statements?  Is
>>> there a way for me to run the same query against all the tables?
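>>>
>>> As far as I can tell, there's no multi-table SELECT in CQL, so it would
>>> be one statement per table, merged client-side--something like (values
>>> made up):
>>>
>>> SELECT * FROM audit_table_20140610 WHERE id = 1234 AND region = 'us-east';
>>> SELECT * FROM audit_table_20140609 WHERE id = 1234 AND region = 'us-east';
>>> -- ...and so on for each of the 60 days, merging results in the application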
>>>
>>>
>>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba <redmu...@gmail.com> wrote:
>>>
>>>> Ah, so the secondary indices are really secondary against the primary
>>>> key.  That makes sense.
>>>>
>>>> I'm beginning to see why the whole "date-based table" approach is the
>>>> only one I've been able to find... thanks for the quick responses, guys!
>>>>
>>>>
>>>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
>>>> michal.michal...@boxever.com> wrote:
>>>>
>>>>> Secondary indexes are internally just CFs that map the indexed value
>>>>> to the row key that value belongs to, so you can only query these
>>>>> indexes using "=", not ">", ">=", etc.
>>>>>
>>>>> However, your query does not require an index *IF* you provide a row
>>>>> key - you can use "<" or ">" on the date column like you did, as long
>>>>> as you refer to a single row.  If you don't provide the row key, it's
>>>>> not going to work.
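>>>>>
>>>>> For example, something like this works without an index, because the
>>>>> row key is provided (values made up):
>>>>>
>>>>> SELECT * FROM audit WHERE id = 1234 AND date < '2014-05-01';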
>>>>>
>>>>> M.
>>>>>
>>>>> Kind regards,
>>>>> MichaƂ Michalski,
>>>>> michal.michal...@boxever.com
>>>>>
>>>>>
>>>>> On 9 June 2014 21:18, Redmumba <redmu...@gmail.com> wrote:
>>>>>
>>>>>> I have a table with a timestamp column on it; however, when I try to
>>>>>> query based on it, it fails saying that I must use ALLOW FILTERING--
>>>>>> which, to me, means it's not using the secondary index.  The table
>>>>>> definition is (snipping out irrelevant parts)...
>>>>>>
>>>>>> CREATE TABLE audit (
>>>>>>>     id bigint,
>>>>>>>     date timestamp,
>>>>>>> ...
>>>>>>>     PRIMARY KEY (id, date)
>>>>>>> );
>>>>>>> CREATE INDEX date_idx ON audit (date);
>>>>>>>
>>>>>>
>>>>>> There are other fields, but they are not relevant to this example.
>>>>>> The date is part of the primary key, and I have a secondary index on it.
>>>>>> When I run a SELECT against it, I get an error:
>>>>>>
>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>>>>>> Bad Request: Cannot execute this query as it might involve data
>>>>>>> filtering and thus may have unpredictable performance. If you want to
>>>>>>> execute this query despite the performance unpredictability, use ALLOW
>>>>>>> FILTERING
>>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date <
>>>>>>> '2014-05-01' ALLOW FILTERING;
>>>>>>> Request did not complete within rpc_timeout.
>>>>>>>
>>>>>>
>>>>>> How can I force it to use the index?  I've seen rebuild_index tasks
>>>>>> running, but can I verify the "health" of the index?
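>>>>>>
>>>>>> I assume a manual rebuild would be something like the following
>>>>>> (keyspace, table, and index names as above; I understand the exact
>>>>>> index-name argument varies by version):
>>>>>>
>>>>>> nodetool rebuild_index asinauditing asinaudit date_idx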
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
*Paulo Motta*

Chaordic | *Platform*
*www.chaordic.com.br <http://www.chaordic.com.br/>*
+55 48 3232.3200
