Re: Data Modeling Conundrum

William Ashley Sat, 08 May 2010 07:45:55 -0700

Daniel,
Partitioning applies to row keys, not column sorting. You could take both of my 
Cassandra solutions and refactor them to use row keys containing userId:time or 
userId:guid, but you ultimately wind up with the same compromises on update or 
retrieval efficiency, plus then you have to use an ordered preserving 
partitioner and therefore have to take additional steps to ensure even 
distribution of keys across the cluster.


To answer your second question, one user can log in from his home machine, 
phone and work desktop. Therefore there would be three tracking guids 
associated with his account. Remember, guids always identify a (computer, 
browser) independent of who may or may not be logged in.

I hope that clears things up for you.
- William
 

On May 7, 2010, at 11:39 PM, vineet daniel wrote:

> Query : Why are you sorting AFAIK cassandra sorts the keys by itself if you 
> are using ordered partitioning. And how do you store data pertaining to 
> single user but having several GUID's to attach with.
> 
> 
> _______________________________________
> Vineet Daniel
> _______________________________________
> 
> Let your email find you....
> 
> 
> On Sat, May 8, 2010 at 9:01 AM, William Ashley <wash...@gmail.com> wrote:
> List,
> I have a case where visitors to a site are tracked via a persistent cookie 
> containing a guid. This cookie is created and set when missing. Some of these 
> visitors are logged in, meaning a userId may also be available. What I’m 
> looking to do is have a way to associate each userId with all of the guids 
> that it has been seen with. Conceptually, this would identify the unique 
> (device, browser) pairs for each userId. The catch is that I want to be able 
> to retrieve the most-recently-seen N guids for a userId.
> 
> 
> One possible solution to this problem in SQL looks like this (made up on the 
> fly):
> # Table schema
> CREATE TABLE UserGuid ( userId INT, guid VARCHAR, when TIMESTAMP, PRIMARY 
> KEY( userId, guid ), INDEX( userId, when ) );
> 
> # For each request with guid G and userId U at time T
> INSERT INTO UserGuid ( userId, guid, when ) VALUES ( U, G, T ) ON DUPLICATE 
> KEY UPDATE SET when = T;
> 
> # To get most recent N guids for userId U
> SELECT guid FROM UserGuid WHERE userId = U ORDER BY when DESC LIMIT N;
> 
> 
> Hopefully I’ve sufficiently explained what I’m trying to do. Now on to 
> solving this problem in Cassandra. I’ve been trying to find a way that allows 
> both of the above operations to be performed efficiently. Updates are a 
> breeze with a structure like this:
> 
> // Row key is userId
> 12345 : {
>  // Column name is guid
>  ‘256fb890-5a4b-11df-a08a-0800200c9a66’ : {
>    // Column timestamp is last time guid was seen
>    timestamp : 387587235233
>  }
> }
> 
> but getting the last N recently seen guids requires pulling all columns and 
> sorting by timestamp. Retrievals can be done efficiently with a structure 
> taking advantage of column sorting:
> 
> // Row key is userId
> 12345 : {
>  // Column name is last time guid was seen
>  387587235233 : {
>    // Column value is guid
>    value: ‘256fb890-5a4b-11df-a08a-0800200c9a66’
>  }
> }
> 
> where we use a slice get on the row with limit N (and reverse order). 
> However, updates involve pulling all columns to de-duplicate guid values. 
> Neither solution is ideal, and so I present this to you fine gentlemen who 
> have more experience modeling data in Cassandra than I.
> 
> I would much prefer to avoid any solutions that require pulling an 
> indeterminate amount of data for either operation. For the time being I am 
> using the first method and only pulling the first M columns, sorting, and 
> taking the top N (M >= N).
> 
> One thing I was thinking would be nice (if possible), is to have a column 
> family where columns are either sorted by their timestamp, or by the time the 
> column was created/updated (which may be equivalent to not sorting at all, 
> but I have not looked at the implementation).
> 
> I appreciate any feedback or suggestions you might have.
> - William
> 
>

Re: Data Modeling Conundrum

Reply via email to