You could just separate the history data from the current data. Then, when a user's result is updated, write to both tables.
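For example (just a sketch against the two tables defined below; the uuid is the one from the sample output, the answer values are placeholders, and now() assumes the server-side timeuuid function is available -- otherwise generate the timeuuid on the client), each update is one logged batch, and you read whichever table fits the question:

BEGIN BATCH
  INSERT INTO all_answers (user_id, created, question_id, result)
  VALUES (11b1e59c-ddfa-11e2-a28f-0800200c9a66, now(), 1, 'no');
  -- same key (user_id, question_id) overwrites the previous "current" answer
  INSERT INTO current_answers (user_id, question_id, created, result)
  VALUES (11b1e59c-ddfa-11e2-a28f-0800200c9a66, 1, now(), 'no');
APPLY BATCH;

-- full history for a user, oldest first:
SELECT * FROM all_answers
 WHERE user_id = 11b1e59c-ddfa-11e2-a28f-0800200c9a66;

-- only the latest answer per question:
SELECT * FROM current_answers
 WHERE user_id = 11b1e59c-ddfa-11e2-a28f-0800200c9a66;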
CREATE TABLE all_answers (
    user_id uuid,
    created timeuuid,
    result text,
    question_id varint,
    PRIMARY KEY (user_id, created)
);

CREATE TABLE current_answers (
    user_id uuid,
    question_id varint,
    created timeuuid,
    result text,
    PRIMARY KEY (user_id, question_id)
);

> select * FROM current_answers ;

 user_id                              | question_id | result | created
--------------------------------------+-------------+--------+--------------------------------------
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 |           1 |     no | f9893ee0-ddfa-11e2-b74c-35d7be46b354
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 |           2 |   blah | f7af75d0-ddfa-11e2-b74c-35d7be46b354

> select * FROM all_answers ;

 user_id                              | created                              | question_id | result
--------------------------------------+--------------------------------------+-------------+--------
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f0141234-ddfa-11e2-b74c-35d7be46b354 |           1 |    yes
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f7af75d0-ddfa-11e2-b74c-35d7be46b354 |           2 |   blah
 11b1e59c-ddfa-11e2-a28f-0800200c9a66 | f9893ee0-ddfa-11e2-b74c-35d7be46b354 |           1 |     no

This way you can get the full history of answers if you want it, and there is a
simple way to get the most current answers.

Just a thought.

-Colin B.

On 06/24/2013 03:28 PM, Christopher J. Bottaro wrote:
> Yes, that makes sense and that article helped a lot, but I still have
> a few questions...
>
> The created_at in our answers table is basically used as a version id.
> When a user updates his answer, we don't overwrite the old answer,
> but rather insert a new answer with a more recent timestamp (the version).
>
> answers
> -------
> user_id | created_at | question_id | result
> -------------------------------------------
> 1       | 2013-01-01 | 1           | yes
> 1       | 2013-01-01 | 2           | blah
> 1       | 2013-01-02 | 1           | no
>
> So the queries we really want to run are "find me all the answers for
> a given user at a given time."  So given the date of 2013-01-02 and
> user_id 1, we would want rows 2 and 3 returned (since row 3 obsoletes
> row 1).  Is it possible to do this with CQL given the current schema?
>
> As an aside, we can do this in PostgreSQL using window functions; not
> standard SQL, but pretty neat.
>
> We can alter our schema like so...
>
> answers
> -------
> user_id | start_at | end_at | question_id | result
>
> Where start_at and end_at denote when an answer is active.  So the
> example above would become:
>
> answers
> -------
> user_id | start_at   | end_at     | question_id | result
> --------------------------------------------------------
> 1       | 2013-01-01 | 2013-01-02 | 1           | yes
> 1       | 2013-01-01 | null       | 2           | blah
> 1       | 2013-01-02 | null       | 1           | no
>
> Now we can query "SELECT * FROM answers WHERE user_id = 1 AND start_at
> <= '2013-01-02' AND (end_at > '2013-01-02' OR end_at IS NULL)".
>
> How would one define the partition key and clustering columns in CQL
> to accomplish this?  Is it as simple as PRIMARY KEY (user_id,
> start_at, end_at, question_id) (remembering that we sometimes want to
> limit by question_id)?
>
> Also, we are a bit worried about race conditions.  Consider two
> separate processes updating an answer for a given user_id /
> question_id.  There will be a race condition between the two to update
> the correct row's end_at field.  Does that make sense?  I can draw it
> out with ASCII tables, but I feel like this email is already too
> long... :P
>
> Thanks for the help.
>
> On Wed, Jun 19, 2013 at 2:28 PM, David McNelis <dmcne...@gmail.com> wrote:
>
> > So, if you want to grab by the created_at and occasionally limit
> > by question id, that is why you'd use created_at.
> >
> > The way the primary keys work is that the first part of the primary
> > key is the partition key; that field is what essentially becomes the
> > single cassandra row.  The second key is the order-preserving key, so
> > you can sort by that key.  If you have a third piece, then that is
> > the secondary order-preserving key.
> >
> > The reason you'd want to do (user_id, created_at, question_id) is
> > because when you do a query on the keys, you MUST use the
> > preceding pieces of the primary key.  So in your case, you could
> > not do a query with just user_id and question_id with the
> > user-created-question key.  Alternatively, if you went with
> > (user_id, question_id, created_at), you would not be able to
> > include a range of created_at unless you were also filtering on
> > the question_id.
> >
> > Does that make sense?
> >
> > As for the large rows, 10k is unlikely to cause you too many
> > issues (unless the answer is potentially a big blob of text).
> > Newer versions of cassandra deal with a lot of things in far,
> > far superior ways to < 1.0.
> >
> > For a really good primer on keys in CQL and how to potentially
> > avoid hot rows, a really good article to read is this one:
> > http://thelastpickle.com/2013/01/11/primary-keys-in-cql/
> > Aaron did a great job of laying out the subtleties of primary
> > keys in CQL.
> >
> > On Wed, Jun 19, 2013 at 2:21 PM, Christopher J. Bottaro
> > <cjbott...@academicworks.com> wrote:
> >
> > > Interesting, thank you for the reply.
> > >
> > > Two questions though...
> > >
> > > Why should created_at come before question_id in the primary
> > > key?  In other words, why (user_id, created_at, question_id)
> > > instead of (user_id, question_id, created_at)?
> > >
> > > Given this setup, all a user's answers (all 10k) will be
> > > stored in a single C* (internal, not CQL) row?  I thought
> > > having "fat" or "big" rows was bad.  I worked with Cassandra
> > > 0.6 at my previous job and, given the nature of our work, we
> > > would sometimes generate these "fat" rows... at which point
> > > Cassandra would basically shit the bed.
> > >
> > > Thanks for the help.
> > >
> > > On Wed, Jun 19, 2013 at 12:26 PM, David McNelis
> > > <dmcne...@gmail.com> wrote:
> > >
> > > > I think you'd be better served with just a slightly different
> > > > primary key.
> > > >
> > > > If your primary key was (user_id, created_at) or
> > > > (user_id, created_at, question_id), then you'd be able to
> > > > run the above query without a problem.
> > > >
> > > > This will mean that the entire pantheon of a specific
> > > > user_id will be stored as a 'row' (in the old-style C*
> > > > vernacular), and then the information would be ordered by
> > > > the 2nd piece of the primary key (or 2nd, then 3rd, if you
> > > > included question_id).
> > > >
> > > > You would certainly want to include any field that makes a
> > > > record unique in the primary key.  Another thing to note
> > > > is that if a field is part of the primary key you cannot
> > > > create a secondary index on that field.  You can work
> > > > around that by storing the field twice, but you might want
> > > > to rethink your structure if you find yourself doing that
> > > > often.
> > > >
> > > > On Wed, Jun 19, 2013 at 12:05 PM, Christopher J. Bottaro
> > > > <cjbott...@academicworks.com> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We are considering using Cassandra and I want to make
> > > > > sure our use case fits Cassandra's strengths.
> > > > > We have a table like:
> > > > >
> > > > > answers
> > > > > -------
> > > > > user_id | question_id | result | created_at
> > > > >
> > > > > Our most common query will be something like:
> > > > >
> > > > > SELECT * FROM answers WHERE user_id = 123 AND
> > > > > created_at > '01/01/2012' AND created_at < '01/01/2013'
> > > > >
> > > > > Sometimes we will also limit by a question_id or a
> > > > > list of question_ids.
> > > > >
> > > > > Secondary indexes will be created on user_id and
> > > > > question_id.  We expect the upper bound of the number of
> > > > > answers for a given user to be around 10,000.
> > > > >
> > > > > Now my understanding of how Cassandra will run the
> > > > > aforementioned query is that it will load all the
> > > > > answers for a given user into memory using the
> > > > > secondary index, then scan over that set, filtering
> > > > > based on the dates.
> > > > >
> > > > > Considering that that will be our most used query and
> > > > > it will happen very often, is this a bad use case for
> > > > > Cassandra?
> > > > >
> > > > > Thanks for the help.
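P.S. A rough sketch of the key-ordering point David made above, reusing the column
names from the thread (untested; created_at is modelled here as a plain timestamp
rather than a timeuuid):

CREATE TABLE answers (
    user_id uuid,
    created_at timestamp,
    question_id varint,
    result text,
    PRIMARY KEY (user_id, created_at, question_id)
);

-- Fine: partition key plus a range on the first clustering column.
SELECT * FROM answers
 WHERE user_id = 11b1e59c-ddfa-11e2-a28f-0800200c9a66
   AND created_at > '2012-01-01' AND created_at < '2013-01-01';

-- Not allowed: question_id can only be restricted after created_at is,
-- because clustering columns must be used in declaration order.
-- SELECT * FROM answers WHERE user_id = ... AND question_id = 1;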