Re: Data Model Question

Jean-Nicolas Boulay Desjardins Sat, 21 Jan 2012 15:37:23 -0800

But what about Rainbird?


On Sat, Jan 21, 2012 at 10:52 AM, R. Verlangen <ro...@us2.nl> wrote:
>
> A couple of days ago I came across Countandra ( http://countandra.org/ ). It 
> seems that it might be a solution for you.
>
> Gr. Robin
>
>
> 2012/1/20 Tamar Fraenkel <ta...@tok-media.com>
>>
>> Hi!
>>
>> I am a newbie to Cassandra and seeking some advice regarding the data model 
>> I should use to best address my needs.
>>
>> For simplicity, what I want to accomplish is:
>>
>> I have a system that has users (potentially ~10,000 per day) and they 
>> perform actions in the system (total of ~50,000 a day).
>>
>> Each user’s action takes place at a certain point in time, is classified 
>> into 1 to 5 categories, and is tagged with 1-30 tags. Each of an action’s 
>> categories and tags has a score associated with it; the score is between 0 
>> and 1 (let’s assume a precision of 0.0001).
>>
>> I want to be able to identify similar actions in the system (performed 
>> usually by more than one user). Similarity of actions is calculated based on 
>> their common Categories and Tags taking scores into account.
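
The similarity measure itself isn't spelled out here. One common choice would be cosine similarity over the score vectors, which rewards shared categories and tags in proportion to their scores. That choice is purely my assumption, as are the function name and the idea of keeping categories and tags in one dict; Python is used only for brevity.

```python
import math

def similarity(a, b):
    """Cosine similarity between two actions (an assumed measure).

    a, b: dicts mapping a category or tag name to its score in [0, 1].
    Only labels present in both actions contribute to the dot product.
    """
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Whatever measure is actually used, it only needs to run on the candidate ActionGroups returned by the lookup described further down, not on every group.
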
>>
>> I need the system to store:
>>
>> 1. The list of my users, with attributes like name, age, etc.
>> 2. For each action: the categories and tags associated with it and their 
>>    scores, the time of the action, and the user who performed it.
>> 3. Groups of similar actions (ActionGroups): the IDs of the actions in the 
>>    group, and the categories and tags describing the group, with their 
>>    scores. Those are calculated using an algorithm that takes into account 
>>    the categories and tags of the actions in the group.
>>
>> When a user performs a new action in the system, I want to add it to a 
>> fitting ActionGroup (one with similar categories and tags).
>>
>> For this I need to be able to perform the following:
>>
>> Find all the recent ActionGroups (those that were updated with actions 
>> performed during the last T minutes) that have at least one of the new 
>> action’s categories AND at least one of the new action’s tags.
>>
>>
>>
>> I thought of two ways to address the issue and I would appreciate your 
>> insights.
>>
>>
>>
>> First one using secondary indexes
>>
>> Column Family: Users
>>
>> Key: userId
>>
>> Compare with BytesType
>>
>> Columns: name: <>, age: <> etc…
>>
>>
>>
>> Column Family: Actions
>>
>> Key: actionId
>>
>> Compare with BytesType
>>
>> Columns:  Category1 : <Score> ….
>>
>>           CategoryN: <Score>,
>>
>>           Tag1 : <Score>, ….
>>
>>           TagK:<Score>
>>
>>           Time: timestamp
>>
>>           user: userId
>>
>>
>>
>> Column Family: ActionGroups
>>
>> Key: actionGroupId
>>
>> Compare with BytesType
>>
>> Columns: Category1 : <Score> ….
>>
>>          CategoryN: <Score>,
>>
>>          Tag1 : <Score> ….
>>
>>          TagK:<Score>
>>
>>          lastUpdateTime: timestamp
>>
>>          actionId1: null, … ,
>>
>>          actionIdM: null
>>
>>
>>
>> I will then define a secondary index on each tag column, each category 
>> column, and the update time column.
>>
>> Let’s assume the new action I want to add to ActionGroup has 
>> NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – 
>> NewActionTagN. I will perform the following query:
>>
>> Select  * From ActionGroups where
>>
>>    (NewActionCategory1 > 0  … or NewActionCategoryK > 0) and
>>
>>    (NewActionTag1 > 0  … or NewActionTagN > 0) and
>>
>>    lastUpdateTime > T;
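
To pin down what this query is meant to select, here is the same predicate applied to a single ActionGroups row in memory. It is only a sketch of the intended semantics, not a Cassandra call: the `matches` name, the dict-per-row representation, and treating the query's T as a cutoff timestamp (now minus T minutes) are my assumptions. As far as I know CQL has no OR operator, so the OR groups would have to be split into separate queries or checked client-side along these lines.

```python
def matches(group, new_categories, new_tags, cutoff):
    """True if the row is recent and shares a category and a tag.

    group: dict of column name -> value for one ActionGroups row.
    cutoff: oldest acceptable lastUpdateTime (assumed: now - T minutes).
    """
    has_category = any(group.get(c, 0) > 0 for c in new_categories)
    has_tag = any(group.get(t, 0) > 0 for t in new_tags)
    return has_category and has_tag and group["lastUpdateTime"] > cutoff


# Hypothetical row: one category, one tag, updated at time 500.
row = {"Category1": 0.4, "Tag7": 0.9, "lastUpdateTime": 500}
print(matches(row, ["Category1", "Category3"], ["Tag7"], cutoff=400))
```
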
>>
>>
>>
>> Second solution
>>
>> Have the same CFs as in the first solution, without the secondary indexes, 
>> and add two more CFs:
>>
>> Column Family: CategoriesToActionGroupId
>>
>> Key: categoryId
>>
>> Compare with BytesType
>>
>> Columns: {Timestamp, ActionGroupsId1 } : null
>>
>>          {Timestamp, ActionGroupsId2} : null
>>
>>          ...
>>
>> *timestamp is the update time for the ActionGroup
>>
>>
>>
>> A similar CF will be defined for tags.
>>
>>
>>
>> I will then be able to run several queries on CategoriesToActionGroupId (one 
>> for each of the new action’s categories), with a column slice covering the 
>> relevant update-time range of the ActionGroups.
>>
>> I will do the same for the TagsToActionGroupId.
>>
>> I will then use my client code to remove duplicates (ActionGroups that are 
>> associated with more than one tag or category).
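
The client-side part of this second approach could look roughly like the sketch below. Each wide row is modeled in memory as a list of (timestamp, groupId) pairs kept in sorted order, which is how I read the composite column names; the function names and the in-memory dicts standing in for the two CFs are hypothetical. Collecting ids into sets removes the duplicates, and intersecting the category result with the tag result gives the "at least one category AND at least one tag" condition.

```python
from bisect import bisect_left

def recent_groups(index, keys, cutoff):
    """Union of group ids with an update time >= cutoff, over the given rows.

    index: dict mapping a row key (category or tag id) to a list of
           (timestamp, group_id) tuples sorted ascending, standing in
           for the sorted composite column names of one wide row.
    """
    found = set()
    for key in keys:
        columns = index.get(key, [])
        # (cutoff,) sorts before every (cutoff, group_id), so this is the
        # position of the first column whose timestamp is >= cutoff.
        start = bisect_left(columns, (cutoff,))
        found.update(group_id for _, group_id in columns[start:])
    return found

def candidate_groups(cat_index, tag_index, categories, tags, cutoff):
    """Groups recently updated under some category AND under some tag."""
    by_category = recent_groups(cat_index, categories, cutoff)
    by_tag = recent_groups(tag_index, tags, cutoff)
    return by_category & by_tag
```

In a real client each `recent_groups` loop iteration would be one column-slice query against the corresponding row.
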
>>
>>
>>
>> My questions are:
>>
>> 1. Are the two solutions viable? If yes, which is better?
>> 2. Is there a better way of doing this?
>> 3. Can I use JDBC and CQL with both methods, or do I have to use Hector (I 
>>    am using Java)?
>>
>> Thanks
>>
>> Tamar
>>
>>
>>
>>
>
>
