Data Modeling Conundrum

William Ashley Fri, 07 May 2010 20:32:24 -0700

List,
I have a case where visitors to a site are tracked via a persistent cookie 
containing a guid. This cookie is created and set when missing. Some of these 
visitors are logged in, meaning a userId may also be available. What I’m 
looking to do is have a way to associate each userId with all of the guids that 
it has been seen with. Conceptually, this would identify the unique (device, 
browser) pairs for each userId. The catch is that I want to be able to retrieve 
the most-recently-seen N guids for a userId.



One possible solution to this problem in SQL looks like this (made up on the 
fly):
# Table schema
CREATE TABLE UserGuid ( userId INT, guid VARCHAR, when TIMESTAMP, PRIMARY KEY( 
userId, guid ), INDEX( userId, when ) ); 

# For each request with guid G and userId U at time T
INSERT INTO UserGuid ( userId, guid, when ) VALUES ( U, G, T ) ON DUPLICATE KEY 
UPDATE SET when = T;

# To get most recent N guids for userId U
SELECT guid FROM UserGuid WHERE userId = U ORDER BY when DESC LIMIT N;


Hopefully I’ve sufficiently explained what I’m trying to do. Now on to solving 
this problem in Cassandra. I’ve been trying to find a way that allows both of 
the above operations to be performed efficiently. Updates are a breeze with a 
structure like this:

// Row key is userId
12345 : {
  // Column name is guid
  ‘256fb890-5a4b-11df-a08a-0800200c9a66’ : {
    // Column timestamp is last time guid was seen
    timestamp : 387587235233
  }
}

but getting the last N recently seen guids requires pulling all columns and 
sorting by timestamp. Retrievals can be done efficiently with a structure 
taking advantage of column sorting:

// Row key is userId
12345 : {
  // Column name is last time guid was seen
  387587235233 : {
    // Column value is guid
    value: ‘256fb890-5a4b-11df-a08a-0800200c9a66’
  }
}

where we use a slice get on the row with limit N (and reverse order). However, 
updates involve pulling all columns to de-duplicate guid values. Neither 
solution is ideal, and so I present this to you fine gentlemen who have more 
experience modeling data in Cassandra than I.

I would much prefer to avoid any solutions that require pulling an 
indeterminate amount of data for either operation. For the time being I am 
using the first method and only pulling the first M columns, sorting, and 
taking the top N (M >= N).

One thing I was thinking would be nice (if possible), is to have a column 
family where columns are either sorted by their timestamp, or by the time the 
column was created/updated (which may be equivalent to not sorting at all, but 
I have not looked at the implementation).

I appreciate any feedback or suggestions you might have.
- William

Data Modeling Conundrum

Reply via email to