>
> I'm wondering if anyone has used cassandra as a datastore for a
> user-profile service.  I'm thinking of applications like behavioral
> targeting, where there are lots & lots of users (10s to 100s of millions),
> and lots & lots of data about them intermixed in, say, weblogs (probably TBs
> worth).  The idea would be to use Cassandra as a datastore for distributed
> parallel processing of the TBs of files (say on hadoop).  Then the resulting
> user-profiles would be query-able quickly.
>

Just to be clear, you're primarily interested in storing the processed data
(which you give examples of below) in Cassandra?


> Anyone know of that sort of application of Cassandra?  I'm trying to puzzle
> out just what the column family might look like.  Seems like a mix of
> time-oriented information (user x visits site y at time z), location
> information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
> 120.10923), and derived information (because user x visited site y 15 times
> within a 10 day window, user x must be interested in buying a car).
>

For the time-oriented data, you generally want to dedicate one row as a
timeline per user, using timestamps as column names.  I wouldn't expect any
of these to create extremely large rows, but if that's a possibility, you
should consider splitting the timelines into one row per year (or a smaller
time period).  If you have any need for an aggregate timeline with a higher
volume of data, different strategies apply.
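
As a rough sketch of the per-user timeline with yearly splitting, here is
the layout modeled with plain Python dicts standing in for a column family.
The names (timelines, timeline_row_key, record_visit, visits_in_year) and the
"user:year" key format are made up for illustration; none of this is a
Cassandra or pycassa call.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# In-memory stand-in for a column family: row key -> {column name -> value}.
# Hypothetical illustration only; a real store would sit behind this.
timelines = defaultdict(dict)


def timeline_row_key(user_id, when):
    """Assumed key format: one row per user per year, e.g. 'user42:2011'."""
    return f"{user_id}:{when.year}"


def record_visit(user_id, when, site):
    """Add one event to the user's timeline row for that year."""
    # The column name is the event time in whole microseconds since the
    # epoch, so sorting column names gives chronological order.
    column_name = (when - EPOCH) // timedelta(microseconds=1)
    timelines[timeline_row_key(user_id, when)][column_name] = site


def visits_in_year(user_id, year):
    """Return the sites a user visited in a given year, oldest first."""
    row = timelines.get(f"{user_id}:{year}", {})
    return [row[name] for name in sorted(row)]
```

Splitting by year keeps any single row bounded; a smaller bucket (month or
day) would only change what timeline_row_key appends to the user id.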

How you store the location data depends on what you want to do with it.  If
you're only interested in going from user -> locations, not from location ->
users, then a couple of possibilities come to mind.  You might want a
timeline of locations that a user has appeared from, or you might want a
counter for each location a user has appeared from.  What would you like to
do with these?
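
To make the two user -> locations options concrete, here is a small sketch,
again using plain Python containers in place of column families.  The names
(location_timeline, location_counts, record_location) are hypothetical, and
which layout fits depends on the queries you need.

```python
from collections import Counter, defaultdict

# Option 1: a timeline row per user, timestamp -> location seen at that time.
location_timeline = defaultdict(dict)

# Option 2: a counter row per user, location -> number of appearances.
location_counts = defaultdict(Counter)


def record_location(user_id, timestamp, location):
    """Record one appearance in both layouts (illustrative stand-ins only)."""
    location_timeline[user_id][timestamp] = location
    location_counts[user_id][location] += 1


def locations_in_order(user_id):
    """Option 1 query: locations a user appeared from, oldest first."""
    row = location_timeline.get(user_id, {})
    return [row[ts] for ts in sorted(row)]


def most_common_location(user_id):
    """Option 2 query: the location seen most often, or None if unknown."""
    counts = location_counts.get(user_id)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The timeline keeps the full history, so it can answer "where was this user
last seen"; the counter is much smaller but only answers "how often".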

As for the derived information, I think you would need to decide a little
more concretely exactly what data you'll have and what you want to be able
to do with it.


> I don't have specifics as yet... just some general thoughts.
>

Let me know what specifics you can come up with and I'll try to give you
some more specific answers.  The devil is in the details when it comes to
data modeling in Cassandra!

-- 
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library
