Your requirement is still not quite clear - are you counting users or reports, or reports of a type for each user, or...?
You can have a separate table, with the partition key being the user type and the user id as a clustering column - provided that the number of users is only in the thousands, or no more than the low millions. Then write a row whenever a report is generated for a given type and user id. Do you need to count multiple instances of the same report for a given user? If so, you can use a timestamp as an additional clustering column. Rough sketches of that table, and of the options in your list, are below the quoted message.

-- Jack Krupansky

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson <matt.john...@algomi.com>
wrote:

> Hi all,
>
> I am trying to store some data (user actions in our application) for
> future analysis (probably using Spark). I understand best practice is to
> store it in denormalized form, and this will definitely make some of our
> future queries much easier. But I have a problem with denormalizing the
> data.
>
> For example, let’s say one of my queries is “the number of reports
> generated by user type”. In the part of the application that the user
> connects to in order to generate reports, we only have access to the user
> id. In a traditional RDBMS this is fine, because at query time you join
> the user id onto the users table and get all the user data associated
> with that user. But how do I populate extra fields like user type on the
> fly?
>
> My ideas so far:
>
> 1. I try to maintain an in-memory cache of data such as “user”, and do a
> lookup to this cache for every user action, storing the user data with
> it. PROS: fast. CONS: not scalable; will run out of memory if data sets
> grow.
>
> 2. For each user action, I make a call to the RDBMS and look up the data
> for the user in question, then store the user action plus the user data
> as a single row. PROS: easy to scale. CONS: slow.
>
> 3. I write only the user id and the action straight away, and have a
> separate batch process that periodically goes through my table looking
> for rows without user data, looks up the user data from the RDBMS, and
> populates it.
>
> None of these solutions seems ideal to me. Does Cassandra have something
> like ‘triggers’, where I can set up a table to automatically populate
> some rows based on a lookup from another table? Or perhaps Spark or some
> other library has built-in functionality that solves exactly this
> problem?
>
> Any suggestions much appreciated.
>
> Thanks,
> Matthew
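Here is a rough sketch of that table using the DataStax Python driver - the contact point, the "analytics" keyspace, the reports_by_user_type name, and the 'trader'/'user-123' values are all made up for illustration:

# A rough sketch of the table described above, using the DataStax
# Python driver. Contact point, keyspace, table and column names are
# illustrative assumptions, not anything you already have.
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('analytics')

# Partition key = user type, user id as a clustering column. The
# report_time clustering column is only needed if the same user can
# generate a report more than once and each instance should count.
session.execute("""
    CREATE TABLE IF NOT EXISTS reports_by_user_type (
        user_type   text,
        user_id     text,
        report_time timestamp,
        PRIMARY KEY ((user_type), user_id, report_time)
    )
""")

# Write one row whenever a report is generated:
session.execute(
    "INSERT INTO reports_by_user_type (user_type, user_id, report_time) "
    "VALUES (%s, %s, %s)",
    ('trader', 'user-123', datetime.utcnow()))

# "Number of reports generated by user type" is then a read of a
# single partition:
rows = session.execute(
    "SELECT COUNT(*) FROM reports_by_user_type WHERE user_type = %s",
    ('trader',))
print(rows[0].count)

Since you mention Spark, the aggregation can equally be done there by reading the whole table; the COUNT(*) above is just to show that the query touches one partition (and note it can get slow for very large partitions).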
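And on options 1 and 2 in your list: they combine reasonably well - a bounded cache in front of a per-action RDBMS lookup avoids the unbounded memory growth you are worried about in option 1, while removing most of the round trips that make option 2 slow. A rough sketch, with lookup_user_type() standing in for whatever query your RDBMS needs:

# Rough sketch of options 1 + 2 combined: enrich each action at write
# time, with a bounded LRU cache in front of the RDBMS lookup.
# lookup_user_type() is a placeholder for your actual RDBMS query.
from datetime import datetime
from functools import lru_cache

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('analytics')

@lru_cache(maxsize=100000)   # bounded, so memory cannot grow without limit
def lookup_user_type(user_id):
    # Placeholder: e.g. SELECT user_type FROM users WHERE id = ?
    raise NotImplementedError

def record_report(user_id):
    user_type = lookup_user_type(user_id)   # cache hit for active users
    session.execute(
        "INSERT INTO reports_by_user_type (user_type, user_id, report_time) "
        "VALUES (%s, %s, %s)",
        (user_type, user_id, datetime.utcnow()))

One caveat with any cache: a user whose type changes may keep being recorded under the old type until the cached entry is evicted.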