> I'm wondering if anyone has used cassandra as a datastore for a
> user-profile service. I'm thinking of applications like behavioral
> targeting, where there are lots & lots of users (10s to 100s of millions),
> and lots & lots of data about them intermixed in, say, weblogs (probably TBs
> worth). The idea would be to use Cassandra as a datastore for distributed
> parallel processing of the TBs of files (say on hadoop). Then the resulting
> user-profiles would be query-able quickly.
>
Just to be clear, you're primarily interested in storing the processed data
(which you give examples of below) in Cassandra?

> Anyone know of that sort of application of Cassandra? I'm trying to puzzle
> out just what the column family might look like. Seems like a mix of
> time-oriented information (user x visits site y at time z), location
> information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
> 120.10923), and derived information (because user x visited site y 15 times
> within a 10 day window, user x must be interested in buying a car).
>

For the time-oriented data, you generally want to dedicate one row as a
timeline per user, using timestamps as column names. I wouldn't expect any
of these timelines to produce extremely large rows, but if that's a
possibility, consider splitting each timeline into one row per year (or a
smaller time period). If you need an aggregate timeline with a higher
volume of data, different strategies apply.

How you store the location data depends on what you want to do with it. If
you're only interested in going from user -> locations, not from location ->
users, then a couple of possibilities come to mind. You might want a
timeline of locations that a user has appeared from, or you might want a
counter for each location a user has appeared from. What would you like to
do with these?

As for the derived information, I think you need to decide a little more
concretely exactly what data you'll have and what you want to be able to do
with it.

> I don't have specifics as yet... just some general thoughts.
>

Let me know what specifics you can come up with and I'll try to give you
some more specific answers. The devil is in the details when it comes to
data modeling in Cassandra!

--
Tyler Hobbs
Software Engineer, DataStax <http://datastax.com/>
Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
Python client library
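
As a rough illustration of the timeline and counter layouts described in the
reply above, here is a minimal Python sketch. It uses plain in-memory dicts
as a stand-in for column families and is not the pycassa API. The function
names (timeline_row_key, record_visit, record_location, visits_in_year) and
the "user_id:year" row-key format are hypothetical choices made only for
illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-ins for two column families: row key -> {column -> value}.
# Assumption: these dicts only mimic the shape of the data; a real client
# would write to Cassandra instead.
timelines = defaultdict(dict)
location_counts = defaultdict(lambda: defaultdict(int))


def timeline_row_key(user_id, ts):
    """Bucket a user's timeline into one row per calendar year."""
    return f"{user_id}:{ts.year}"


def record_visit(user_id, ts, site):
    """Store a visit, using the event timestamp as the column name."""
    timelines[timeline_row_key(user_id, ts)][ts] = site


def record_location(user_id, location):
    """Bump a per-user counter for a location (user -> locations only)."""
    location_counts[user_id][location] += 1


def visits_in_year(user_id, year):
    """Return one year's visits in time order, like a sorted column slice."""
    row = timelines.get(f"{user_id}:{year}", {})
    return sorted(row.items())
```

Reading a user's history for a given year then touches a single row, for
example visits_in_year("user_x", 2011). A smaller bucket (month or day) would
only change timeline_row_key. Going from location -> users would need a
separate, inverted layout that this sketch does not cover.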