Your requirement is still not quite clear - are you counting users or reports, or reports of a type for each user, or...?
You can have a separate table, with the partition key being the user type and the user id as a clustering column - provided that the number of users is only in the thousands, or no more than the low millions. Then write a row whenever a report is generated for a given type and user id. Do you need to count multiple instances of the same report for a given user? If so, you can use a timestamp as an additional clustering column. Rough sketches of that table, and of the options in your list, are below the quoted message.

-- Jack Krupansky

On Wed, Jun 3, 2015 at 10:48 AM, Matthew Johnson <matt.john...@algomi.com>
wrote:

> Hi all,
>
> I am trying to store some data (user actions in our application) for
> future analysis (probably using Spark). I understand best practice is to
> store it in denormalized form, and this will definitely make some of our
> future queries much easier. But I have a problem with denormalizing the
> data.
>
> For example, let’s say one of my queries is “the number of reports
> generated by user type”. In the part of the application that the user
> connects to in order to generate reports, we only have access to the user
> id. In a traditional RDBMS this is fine, because at query time you join
> the user id onto the users table and get all the user data associated
> with that user. But how do I populate extra fields like user type on the
> fly?
>
> My ideas so far:
>
> 1. I try to maintain an in-memory cache of data such as “user”, and do a
> lookup to this cache for every user action, storing the user data with
> it. PROS: fast. CONS: not scalable; will run out of memory if data sets
> grow.
>
> 2. For each user action, I make a call to the RDBMS and look up the data
> for the user in question, then store the user action plus the user data
> as a single row. PROS: easy to scale. CONS: slow.
>
> 3. I write only the user id and the action straight away, and have a
> separate batch process that periodically goes through my table looking
> for rows without user data, looks up the user data from the RDBMS, and
> populates it.
>
> None of these solutions seems ideal to me. Does Cassandra have something
> like ‘triggers’, where I can set up a table to automatically populate
> some rows based on a lookup from another table? Or perhaps Spark or some
> other library has built-in functionality that solves exactly this
> problem?
>
> Any suggestions much appreciated.
>
> Thanks,
> Matthew
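Here is a rough sketch of that table using the DataStax Python driver - the contact point, the "analytics" keyspace, the reports_by_user_type name, and the 'trader'/'user-123' values are all made up for illustration:

# A rough sketch of the table described above, using the DataStax
# Python driver. Contact point, keyspace, table and column names are
# illustrative assumptions, not anything you already have.
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('analytics')

# Partition key = user type, user id as a clustering column. The
# report_time clustering column is only needed if the same user can
# generate a report more than once and each instance should count.
session.execute("""
    CREATE TABLE IF NOT EXISTS reports_by_user_type (
        user_type   text,
        user_id     text,
        report_time timestamp,
        PRIMARY KEY ((user_type), user_id, report_time)
    )
""")

# Write one row whenever a report is generated:
session.execute(
    "INSERT INTO reports_by_user_type (user_type, user_id, report_time) "
    "VALUES (%s, %s, %s)",
    ('trader', 'user-123', datetime.utcnow()))

# "Number of reports generated by user type" is then a read of a
# single partition:
rows = session.execute(
    "SELECT COUNT(*) FROM reports_by_user_type WHERE user_type = %s",
    ('trader',))
print(rows[0].count)

Since you mention Spark, the aggregation can equally be done there by reading the whole table; the COUNT(*) above is just to show that the query touches one partition (and note it can get slow for very large partitions).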
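And on options 1 and 2 in your list: they combine reasonably well - a bounded cache in front of a per-action RDBMS lookup avoids the unbounded memory growth you are worried about in option 1, while removing most of the round trips that make option 2 slow. A rough sketch, with lookup_user_type() standing in for whatever query your RDBMS needs:

# Rough sketch of options 1 + 2 combined: enrich each action at write
# time, with a bounded LRU cache in front of the RDBMS lookup.
# lookup_user_type() is a placeholder for your actual RDBMS query.
from datetime import datetime
from functools import lru_cache

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('analytics')

@lru_cache(maxsize=100000)   # bounded, so memory cannot grow without limit
def lookup_user_type(user_id):
    # Placeholder: e.g. SELECT user_type FROM users WHERE id = ?
    raise NotImplementedError

def record_report(user_id):
    user_type = lookup_user_type(user_id)   # cache hit for active users
    session.execute(
        "INSERT INTO reports_by_user_type (user_type, user_id, report_time) "
        "VALUES (%s, %s, %s)",
        (user_type, user_id, datetime.utcnow()))

One caveat with any cache: a user whose type changes may keep being recorded under the old type until the cached entry is evicted.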