Re: Data Model Question

aaron morton Sun, 22 Jan 2012 11:53:44 -0800

In general if you are collecting data over time you should consider 
partitioning the row's to avoid creating very large rows. Also if you have a 
common request you want to support consider modeling it directly rather than 
using secondary indexes.


Assuming my understanding of the problem is in some one way correct I would 
consider this for the Action Groups…

Pick a time partition, this is the minimum time resolution you are interested 
in (your T Minutes). 

CF: DimensionUpdates
Stores which dimensions (tags, categories) were updated in the time partition. 

key: <time_partition>  is the start of the partition e.g. 2011-01-23T08:30  
col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" or 
"category" and <dimension_value> is a value from that domain. e.g. <tag:foo>
col_value: empty


CF: DimensionFacts
Stores the facts that included the dimension in a time partition. 

key: <time_partition:dimension_name:dimension_value> definitions as above.
col_names: ActionGroupID. 
col_values: empty

So to…

> Find all the recent ActionGroups (those who were updated with actions 
> performed during the last T minutes), who has at list one of the new action’s 
> categories AND at list one of the new action’s tags. 

1) Query the DimensionUpdates CF with the  current time partition as the key, 
and the tags and columns the action group has. 
2) For each column returned from (1) query the rows in DimensionFacts to get 
the ActionGroups.
3) Filter the unique set of ActionGroups client side. 


Some notes:
1) Row size in all cases are bound to the time partition size. This will make 
your life easier when it comes to repair and compaction. By default rows sizes 
of 64MB will take a slower 2 pass approach that will cost you IO. 

2) All queries are bound. Query 1 will only want request 1 to 35 columns from a 
row that contains 0 to 35 columns. Query 2 can be done as either a multi get 
(select with lots of KEY clauses) or a series of multi gets, and can be further 
bound by limiting the number of columns in each request. Making queries that 
take for a lot of rows at once can harm overall query throughput.   

3) Overwrites (writing the to the same row) are bound by the time partition. 
Depending on load this *may* mean that rows are only physically written to one 
SSTable. 

4) You will also want to partition the list of actions in an actiongroup.

Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote:

> Hi!
> I am a newbie to Cassandra and seeking some advice regarding the data model I 
> should use to best address my needs.
> 
> For simplicity, what I want to accomplish is:
> 
> I have a system that has users (potentially ~10,000 per day) and they perform 
> actions in the system (total of ~50,000 a day).
> 
> Each User’s action is taking place in a certain point in time, and is also 
> classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s 
> Categories and Tags has a score associated with it, the score is between 0 to 
> 1 (let’s assume precision of 0.0001).
> 
> I want to be able to identify similar actions in the system (performed 
> usually by more than one user). Similarity of actions is calculated based on 
> their common Categories and Tags taking scores into account.
> 
> I need the system to store:
> 
> The list of my users with attributes like name, age etc
> For each action – the categories and tags associated with it and their score, 
> the time of the action, and the user who performed it.
> Groups of similar actions (ActionGroups) – the id’s of actions in the group, 
> the categories and tags describing the group, with their scores. Those are 
> calculated using an algorithm that takes into account the categories and tags 
> of the actions in the group.
> When a user performs a new action in the system, I want to add it to a 
> fitting ActionGroups (with similar categories and tags).
> 
> For this I need to be able to perform the following:
> 
> Find all the recent ActionGroups (those who were updated with actions 
> performed during the last T minutes), who has at list one of the new action’s 
> categories AND at list one of the new action’s tags.
> 
>  
> I thought of two ways to address the issue and I would appreciate your 
> insights.
> 
>  
> First one using secondary indexes
> 
> Column Family: Users
> 
> Key: userId
> 
> Compare with Bytes Type
> 
> Columns: name: <>, age: <> etc…
> 
>  
> Column Family: Actions
> 
> Key: actionId
> 
> Compare with Bytes Type
> 
> Columns:  Category1 : <Score> ….
> 
>           CategoriN: <Score>,
> 
>           Tag1 : <Score>, ….
> 
>           TagK:<Score>
> 
>           Time: timestamp
> 
>           user: userId
> 
>  
> Column Family: ActionGroups
> 
> Key: actionGroupId
> 
> Compare with Bytes Type
> 
> Columns: Category1 : <Score> ….
> 
>          CategoriN: <Score>,
> 
>          Tag1 : <Score> ….
> 
>          TagK:<Score>
> 
>          lastUpdateTime: timestamp
> 
>          actionId1: null, … ,
> 
>          actionIdM: null
> 
>  
> I will then define secondary index on each tag columns, category columns, and 
> the update time column.
> 
> Let’s assume the new action I want to add to ActionGroup has 
> NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – 
> NewActionTagN. I will perform the following query:
> 
> Select  * From ActionGroups where
> 
>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
> 
>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
> 
>    lastUpdateTime > T;
> 
>  
> Second solution
> 
> Have the same CF as in the first solution without the secondary index , and 
> have two additional CF-ies:
> 
> Column Family: CategoriesToActionGroupId
> 
> Key: categoryId
> 
> Compare with ByteType
> 
> Columns: {Timestamp, ActionGroupsId1 } : null
> 
>          {Timestamp, ActionGroupsId2} : null
> 
>          ...
> 
> *timestamp is the update time for the ActionGroup
> 
>  
> A similar CF will be defined for tags.
> 
>  
> I will then be able to run several queries on CategoriesToActionGroupId (one 
> for each of the new story Categories), with column slice for the right update 
> time of the ActionGroup.
> 
> I will do the same for the TagsToActionGroupId.
> 
> I will then use my client code to remove duplicates (ActionGroups who are 
> associated with more than one Tag or Category).
> 
>  
> My questions are:
> 
> Are the two solutions viable? If yes, which is better
> Is there any better way of doing this?
> Can I use jdbc and CQL with both method, or do I have to use Hector (I am 
> using Java).
> Thanks
> 
> Tamar
> 
>  
>

Re: Data Model Question

Reply via email to