You could just separate the history data from the current data. Then, when a user's result is updated, write to both tables.
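For example (just a sketch against the two tables defined below; the uuid is the one from the sample output, the answer values are placeholders, and now() assumes the server-side timeuuid function is available -- otherwise generate the timeuuid on the client), each update is one logged batch, and you read whichever table fits the question:

BEGIN BATCH
  INSERT INTO all_answers (user_id, created, question_id, result)
  VALUES (11b1e59c-ddfa-11e2-a28f-0800200c9a66, now(), 1, 'no');
  -- same key (user_id, question_id) overwrites the previous "current" answer
  INSERT INTO current_answers (user_id, question_id, created, result)
  VALUES (11b1e59c-ddfa-11e2-a28f-0800200c9a66, 1, now(), 'no');
APPLY BATCH;

-- full history for a user, oldest first:
SELECT * FROM all_answers
 WHERE user_id = 11b1e59c-ddfa-11e2-a28f-0800200c9a66;

-- only the latest answer per question:
SELECT * FROM current_answers
 WHERE user_id = 11b1e59c-ddfa-11e2-a28f-0800200c9a66;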
CREATE TABLE all_answers (
    user_id uuid,
    created timeuuid,
    result text,
    question_id varint,
    PRIMARY KEY (user_id, created)
);

CREATE TABLE current_answers (
    user_id uuid,
    question_id varint,
    created timeuuid,
    result text,
    PRIMARY KEY (user_id, question_id)
);

> select * FROM current_answers ;

 user_id                              | question_id | result | created
--------------------------------------+-------------+--------+--------------------------------------
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 |           1 |     no | f9893ee0-ddfa-11e2-b74c-35d7be46b354
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 |           2 |   blah | f7af75d0-ddfa-11e2-b74c-35d7be46b354

> select * FROM all_answers ;

 user_id                              | created                              | question_id | result
--------------------------------------+--------------------------------------+-------------+--------
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f0141234-ddfa-11e2-b74c-35d7be46b354 |           1 |    yes
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f7af75d0-ddfa-11e2-b74c-35d7be46b354 |           2 |   blah
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f9893ee0-ddfa-11e2-b74c-35d7be46b354 |           1 |     no

This way you can get the full history of answers if you want it, and there is a
simple way to get the most current answers.

Just a thought.

-Colin B.

On 06/24/2013 03:28 PM, Christopher J. Bottaro wrote:
> Yes, that makes sense and that article helped a lot, but I still have
> a few questions...
>
> The created_at in our answers table is basically used as a version id.
> When a user updates his answer, we don't overwrite the old answer,
> but rather insert a new answer with a more recent timestamp (the version).
>
> answers
> -------
> user_id | created_at | question_id | result
> -------------------------------------------
> 1       | 2013-01-01 | 1           | yes
> 1       | 2013-01-01 | 2           | blah
> 1       | 2013-01-02 | 1           | no
>
> So the queries we really want to run are "find me all the answers for
> a given user at a given time."  So given the date of 2013-01-02 and
> user_id 1, we would want rows 2 and 3 returned (since row 3 obsoletes
> row 1).  Is it possible to do this with CQL given the current schema?
>
> As an aside, we can do this in PostgreSQL using window functions; not
> standard SQL, but pretty neat.
>
> We can alter our schema like so...
>
> answers
> -------
> user_id | start_at | end_at | question_id | result
>
> Where start_at and end_at denote when an answer is active.  So the
> example above would become:
>
> answers
> -------
> user_id | start_at   | end_at     | question_id | result
> --------------------------------------------------------
> 1       | 2013-01-01 | 2013-01-02 | 1           | yes
> 1       | 2013-01-01 | null       | 2           | blah
> 1       | 2013-01-02 | null       | 1           | no
>
> Now we can query "SELECT * FROM answers WHERE user_id = 1 AND start_at
> <= '2013-01-02' AND (end_at > '2013-01-02' OR end_at IS NULL)".
>
> How would one define the partition key and clustering columns in CQL
> to accomplish this?  Is it as simple as PRIMARY KEY (user_id,
> start_at, end_at, question_id) (remembering that we sometimes want to
> limit by question_id)?
>
> Also, we are a bit worried about race conditions.  Consider two
> separate processes updating an answer for a given user_id /
> question_id.  There will be a race condition between the two to update
> the correct row's end_at field.  Does that make sense?  I can draw it
> out with ASCII tables, but I feel like this email is already too
> long... :P
>
> Thanks for the help.
>
> On Wed, Jun 19, 2013 at 2:28 PM, David McNelis <dmcne...@gmail.com> wrote:
>
> > So, if you want to grab by the created_at and occasionally limit
> > by question id, that is why you'd use created_at.
> >
> > The way the primary keys work is that the first part of the primary
> > key is the partition key; that field is what essentially becomes the
> > single cassandra row.  The second key is the order-preserving key, so
> > you can sort by that key.  If you have a third piece, then that is
> > the secondary order-preserving key.
> >
> > The reason you'd want to do (user_id, created_at, question_id) is
> > because when you do a query on the keys, you MUST use the
> > preceding pieces of the primary key.  So in your case, you could
> > not do a query with just user_id and question_id with the
> > user-created-question key.  Alternatively, if you went with
> > (user_id, question_id, created_at), you would not be able to
> > include a range of created_at unless you were also filtering on
> > the question_id.
> >
> > Does that make sense?
> >
> > As for the large rows, 10k is unlikely to cause you too many
> > issues (unless the answer is potentially a big blob of text).
> > Newer versions of cassandra deal with a lot of things in far,
> > far superior ways to < 1.0.
> >
> > For a really good primer on keys in CQL and how to potentially
> > avoid hot rows, a really good article to read is this one:
> > http://thelastpickle.com/2013/01/11/primary-keys-in-cql/
> > Aaron did a great job of laying out the subtleties of primary
> > keys in CQL.
> >
> > On Wed, Jun 19, 2013 at 2:21 PM, Christopher J. Bottaro
> > <cjbott...@academicworks.com> wrote:
> >
> > > Interesting, thank you for the reply.
> > >
> > > Two questions though...
> > >
> > > Why should created_at come before question_id in the primary
> > > key?  In other words, why (user_id, created_at, question_id)
> > > instead of (user_id, question_id, created_at)?
> > >
> > > Given this setup, all a user's answers (all 10k) will be
> > > stored in a single C* (internal, not CQL) row?  I thought
> > > having "fat" or "big" rows was bad.  I worked with Cassandra
> > > 0.6 at my previous job and, given the nature of our work, we
> > > would sometimes generate these "fat" rows... at which point
> > > Cassandra would basically shit the bed.
> > >
> > > Thanks for the help.
> > >
> > > On Wed, Jun 19, 2013 at 12:26 PM, David McNelis
> > > <dmcne...@gmail.com> wrote:
> > >
> > > > I think you'd be better served with just a slightly different
> > > > primary key.
> > > >
> > > > If your primary key was (user_id, created_at) or
> > > > (user_id, created_at, question_id), then you'd be able to
> > > > run the above query without a problem.
> > > >
> > > > This will mean that the entire pantheon of a specific
> > > > user_id will be stored as a 'row' (in the old-style C*
> > > > vernacular), and then the information would be ordered by
> > > > the 2nd piece of the primary key (or 2nd, then 3rd, if you
> > > > included question_id).
> > > >
> > > > You would certainly want to include any field that makes a
> > > > record unique in the primary key.  Another thing to note
> > > > is that if a field is part of the primary key you cannot
> > > > create a secondary index on that field.  You can work
> > > > around that by storing the field twice, but you might want
> > > > to rethink your structure if you find yourself doing that
> > > > often.
> > > >
> > > > On Wed, Jun 19, 2013 at 12:05 PM, Christopher J. Bottaro
> > > > <cjbott...@academicworks.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We are considering using Cassandra and I want to make
> > > > > sure our use case fits Cassandra's strengths.
> > > > > We have a table like:
> > > > >
> > > > > answers
> > > > > -------
> > > > > user_id | question_id | result | created_at
> > > > >
> > > > > Our most common query will be something like:
> > > > >
> > > > > SELECT * FROM answers WHERE user_id = 123 AND
> > > > > created_at > '01/01/2012' AND created_at < '01/01/2013'
> > > > >
> > > > > Sometimes we will also limit by a question_id or a
> > > > > list of question_ids.
> > > > >
> > > > > Secondary indexes will be created on user_id and
> > > > > question_id.  We expect the upper bound of the number of
> > > > > answers for a given user to be around 10,000.
> > > > >
> > > > > Now my understanding of how Cassandra will run the
> > > > > aforementioned query is that it will load all the
> > > > > answers for a given user into memory using the
> > > > > secondary index, then scan over that set, filtering
> > > > > based on the dates.
> > > > >
> > > > > Considering that that will be our most used query and
> > > > > it will happen very often, is this a bad use case for
> > > > > Cassandra?
> > > > >
> > > > > Thanks for the help.
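P.S. A rough sketch of the key-ordering point David made above, reusing the column
names from the thread (untested; created_at is modelled here as a plain timestamp
rather than a timeuuid):

CREATE TABLE answers (
    user_id uuid,
    created_at timestamp,
    question_id varint,
    result text,
    PRIMARY KEY (user_id, created_at, question_id)
);

-- Fine: partition key plus a range on the first clustering column.
SELECT * FROM answers
 WHERE user_id = 11b1e59c-ddfa-11e2-a28f-0800200c9a66
   AND created_at > '2012-01-01' AND created_at < '2013-01-01';

-- Not allowed: question_id can only be restricted after created_at is,
-- because clustering columns must be used in declaration order.
-- SELECT * FROM answers WHERE user_id = ... AND question_id = 1;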