Our approach for this scenario is to run a Hadoop job that periodically cleans old entries, but I admit it's far from ideal. It would be nice to have a more native way to perform these kinds of tasks.
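For illustration -- taking the audit table from your message below as the example, with made-up key values -- what a cleanup job like ours boils down to is issuing plain CQL deletes for whatever it decides is stale; this is only a sketch of the idea, not our actual job:

    -- drop an entire stale partition in one statement (values made up)
    DELETE FROM audit WHERE id = 1234567 AND region = 'us-east';

    -- or, when only part of a partition is old, delete rows one by one
    -- by their full primary key (again, made-up values)
    DELETE FROM audit WHERE id = 1234567 AND region = 'us-east'
       AND date = '2014-03-01 00:00:00+0000' AND pid = 42;

The painful part is not the deletes themselves but figuring out which keys are old enough, which is why it ends up as a batch job.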
There's a legend about a compaction strategy that keeps only the first N entries of a partition key. I don't think it has been implemented yet, but if I remember correctly there's a JIRA ticket about it.

On Tue, Jun 10, 2014 at 3:39 PM, Redmumba <redmu...@gmail.com> wrote:

> Honestly, this has been by far my single biggest obstacle with Cassandra
> for time-based data--cleaning up the old data when the deletion criteria
> (i.e., date) isn't the primary key. I've asked about a few different
> approaches, but I haven't really seen any feasible options that can be
> implemented easily. I've seen the following:
>
>    1. Use date-based tables, then drop old tables, a la
>    "audit_table_20140610", "audit_table_20140609", etc.
>    But then I run into the issue of having to query every table--I would
>    have to execute queries against every day to get the data, and then
>    merge the data myself. Unless there's something in the binary driver
>    I'm missing, it doesn't sound like this would be practical.
>    2. Use a TTL.
>    But then I have to basically decide on a value that works for
>    everything and, if it ever turns out I overestimated, I'm basically
>    SOL, because my cluster will be out of space.
>    3. Maintain a separate index of days to keys, and use this index as
>    the reference for which keys to delete.
>    But then this requires maintaining another index and a relatively
>    manual delete.
>
> I can't help but feel that I am just way over-engineering this, or that
> I'm missing something basic in my data model. Aside from the last
> approach, I keep feeling like I'm overlooking something obvious.
>
> Andrew
>
>
> Of course, Jonathan, I'll do my best!
>
> It's an auditing table that, right now, uses a primary key consisting of
> a composite partition key (the object ID and the region), plus the date
> and the process ID. Each event in our system will create anywhere from
> 1-20 rows, for example, and multiple parts of the system might be
> working on the same "object ID". So the CF is constantly being appended
> to, but reads are rare.
>
> CREATE TABLE audit (
>> id bigint,
>> region ascii,
>> date timestamp,
>> pid int,
>> PRIMARY KEY ((id, region), date, pid)
>> );
>
>
> Data is queried on a specific object ID and region. Optionally, users
> can restrict their query to a specific date range, which the above data
> model provides.
>
> However, we generate quite a bit of data, and we want a convenient way
> to get rid of the oldest data. Since our system scales with the time of
> year, we might get 50GB a day during peak, and 5GB of data off peak. We
> could pick the safest number--let's say, 30 days--and set the TTL using
> that. The problem there is that, for 90% of the year, we'll be using a
> very small percentage of our available space.
>
> What I'd like to be able to do is drop old data as needed--i.e., let's
> say when we hit 80% load across the cluster (or some such metric that
> takes the cluster-wide load into account), I want to drop the oldest
> day's records until we're under 80%. That way, we're always using the
> maximum amount of space we can, without having to worry about getting
> to the point where we run out of space cluster-wide.
>
> My thoughts are--we could always make the date part of the primary key,
> but then we'd either a) have to query the entire range of dates, or b)
> we'd have to force a small date range when querying. What are the
> penalties? Do you have any other suggestions?
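Replying inline on that last question: one variant I've seen -- just a sketch, not something we run, and the table/column names below are made up -- is to fold a coarse day bucket into the partition key. Old data can then be dropped a whole day at a time per (id, region), and reads are forced to name the day(s) they want:

    -- sketch: same audit data, but with a day bucket in the partition key
    CREATE TABLE audit_by_day (
        id bigint,
        region ascii,
        day text,            -- e.g. '2014-06-10'
        date timestamp,
        pid int,
        PRIMARY KEY ((id, region, day), date, pid)
    );

    -- reads must pin the bucket (values are made up):
    SELECT * FROM audit_by_day
     WHERE id = 1234567 AND region = 'us-east' AND day = '2014-06-10'
       AND date >= '2014-06-10 08:00:00+0000';

    -- cleanup removes one whole day's partition per (id, region):
    DELETE FROM audit_by_day
     WHERE id = 1234567 AND region = 'us-east' AND day = '2014-06-01';

The penalty is roughly what you guessed in (b): a query spanning N days becomes N partition reads that the client has to fan out and merge, and cleanup still needs to know which (id, region) pairs exist for a given day (e.g. from a small manifest table), so it isn't free either.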
>
>
> On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield <jlacefi...@datastax.com> wrote:
>
>> Hello,
>>
>> Will you please describe the use case and what you are trying to model?
>> What are some questions/queries that you would like to serve via
>> Cassandra? This will help the community help you a little better.
>>
>> Jonathan Lacefield
>> Solutions Architect, DataStax
>> (404) 822 3487
>> <http://www.linkedin.com/in/jlacefield>
>>
>> <http://www.datastax.com/cassandrasummit14>
>>
>>
>>
>> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba <redmu...@gmail.com> wrote:
>>
>>> I've been trying to work around using "date-based tables" because I'd
>>> like to avoid the overhead. It seems, however, that this is just not
>>> going to work.
>>>
>>> So here's a question--for these date-based tables (i.e., a table per
>>> day/week/month/whatever), how are they queried? If I keep 60 days'
>>> worth of auditing data, for example, I'd need to query all 60
>>> tables--can I do that smoothly? Or do I have to have 60 different
>>> select statements? Is there a way for me to run the same query against
>>> all the tables?
>>>
>>>
>>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba <redmu...@gmail.com> wrote:
>>>
>>>> Ah, so the secondary indices are really secondary against the primary
>>>> key. That makes sense.
>>>>
>>>> I'm beginning to see why the whole "date-based table" approach is the
>>>> only one I've been able to find... thanks for the quick responses,
>>>> guys!
>>>>
>>>>
>>>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <michal.michal...@boxever.com> wrote:
>>>>
>>>>> Secondary indexes internally are just CFs that map the indexed value
>>>>> to the row key that value belongs to, so you can only query these
>>>>> indexes using "=", not ">", ">=", etc.
>>>>>
>>>>> However, your query does not require an index *IF* you provide a row
>>>>> key - you can use "<" or ">" like you did for the date column, as
>>>>> long as you refer to a single row. However, if you don't provide it,
>>>>> it's not going to work.
>>>>>
>>>>> M.
>>>>>
>>>>> Kind regards,
>>>>> Michał Michalski,
>>>>> michal.michal...@boxever.com
>>>>>
>>>>>
>>>>> On 9 June 2014 21:18, Redmumba <redmu...@gmail.com> wrote:
>>>>>
>>>>>> I have a table with a timestamp column on it; however, when I try to
>>>>>> query based on it, it fails saying that I must use ALLOW
>>>>>> FILTERING--which, to me, means it's not using the secondary index.
>>>>>> The table definition is (snipping out irrelevant parts)...
>>>>>>
>>>>>> CREATE TABLE audit (
>>>>>>> id bigint,
>>>>>>> date timestamp,
>>>>>>> ...
>>>>>>> PRIMARY KEY (id, date)
>>>>>>> );
>>>>>>> CREATE INDEX date_idx ON audit (date);
>>>>>>
>>>>>>
>>>>>> There are other fields, but they are not relevant to this example.
>>>>>> The date is part of the primary key, and I have a secondary index on
>>>>>> it. When I run a SELECT against it, I get an error:
>>>>>>
>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>>>>>> Bad Request: Cannot execute this query as it might involve data
>>>>>>> filtering and thus may have unpredictable performance. If you want
>>>>>>> to execute this query despite the performance unpredictability, use
>>>>>>> ALLOW FILTERING
>>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date <
>>>>>>> '2014-05-01' ALLOW FILTERING;
>>>>>>> Request did not complete within rpc_timeout.
>>>>>>
>>>>>>
>>>>>> How can I force it to use the index? I've seen rebuild_index tasks
>>>>>> running, but can I verify the "health" of the index?
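To make Michal's point above concrete: with PRIMARY KEY (id, date), date is a clustering column, so the date slice works with no secondary index and no ALLOW FILTERING as soon as the partition key is pinned (the id value below is made up):

    -- slice within one partition; no index needed
    SELECT * FROM asinauditing.asinaudit
     WHERE id = 1234567 AND date < '2014-05-01';

The date_idx index only helps equality lookups (WHERE date = ...), so it can't serve a "<" predicate on its own; that's why cqlsh falls back to asking for ALLOW FILTERING when no partition key is given.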
--
Paulo Motta
Chaordic | Platform
www.chaordic.com.br <http://www.chaordic.com.br/>
+55 48 3232.3200