Hi! Thank you very much for your response! I have couple of questions regarding it, some are just to make sure I understood you: 1. regarding time slicing, ifat any point of time I am interested in what happened in the last T minutes, then I will need to query more than one row of the DimentionUpdates, right?
2. What did you mean by "You will also want to partition the list of actions in an actiongroup" 3. For insert of a new ActionGroup, I will need to a) add (in some cases overwrite) columns for all the new ActionGroup catetories and tags in the time_partition row of DimensionUpdates. b) add to all the rows in DimentionFacts that are relevant (time, categories and tags) a column for the ActionGroupId. c) add all the information to the ActionGroup CF and Action CF, including the tags\categories as column names and their scores as values. Is there any problem of synchronization there? 4. regarding the row size, wouldn't that be solved by giving TTL to the columns 5. After reading a bit I also thought that built in secondary indices won't work. There is also another complication, that having tagA with CategoryA is different than having tagA with CategoryB. But I still thought that something like CF: ActionGroupByCategoryTag Stores which action groups are tagged by each tag. key: <category_name:tag_name> col_names: <action_group_creation_time> col_value: <action_group_id> And then I will get categories*tags rows from this index table, and get sliced columns by the time range i am interested in. I can define some TTL on the columns to get only ActionGroups created in the last T minutes. Does that makes sense? Do you think your solution is still better? Tamar On January 22, 2012 at 9:53 PM aaron morton <aa...@thelastpickle.com> wrote: > In general if you are collecting data over time you should consider > partitioning the row's to avoid creating very large rows. Also if you have a > common request you want to support consider modeling it directly rather than > using secondary indexes. > > Assuming my understanding of the problem is in some one way correct I would > consider this for the Action Groups… > > Pick a time partition, this is the minimum time resolution you are interested > in (your T Minutes). > > CF: DimensionUpdates > Stores which dimensions (tags, categories) were updated in the time > partition. > > key: <time_partition> is the start of the partition e.g. 2011-01-23T08:30 > col_names: <dimension_name:dimension_value> where <dimension_name> is "tag" or > "category" and <dimension_value> is a value from that domain. e.g. <tag:foo> > col_value: empty > > > CF: DimensionFacts > Stores the facts that included the dimension in a time partition. > > key: <time_partition:dimension_name:dimension_value> definitions as above. > col_names: ActionGroupID. > col_values: empty > > So to… > > > > > > Find all the recent ActionGroups (those who were updated with actions > > performed during the last T minutes), who has at list one of the new > > action’s categories AND at list one of the new action’s tags. > > 1) Query the DimensionUpdates CF with the current time partition as the key, > and the tags and columns the action group has. > 2) For each column returned from (1) query the rows in DimensionFacts to get > the ActionGroups. > 3) Filter the unique set of ActionGroups client side. > > > Some notes: > 1) Row size in all cases are bound to the time partition size. This will make > your life easier when it comes to repair and compaction. By default rows sizes > of 64MB will take a slower 2 pass approach that will cost you IO. > > 2) All queries are bound. Query 1 will only want request 1 to 35 columns from > a row that contains 0 to 35 columns. Query 2 can be done as either a multi get > (select with lots of KEY clauses) or a series of multi gets, and can be > further bound by limiting the number of columns in each request. Making > queries that take for a lot of rows at once can harm overall query throughput. > > > > 3) Overwrites (writing the to the same row) are bound by the time partition. > Depending on load this *may* mean that rows are only physically written to one > SSTable. > > 4) You will also want to partition the list of actions in an actiongroup. > > Hope that helps. > > > > > > > > ----------------- > Aaron Morton > Freelance Developer > @aaronmorton > http://www.thelastpickle.com > > > On 20/01/2012, at 10:11 PM, Tamar Fraenkel wrote: > > > > > > > Hi!I am a newbie to Cassandra and seeking some advice regarding the data > > model I should use to best address my needs. > > For simplicity, what I want to accomplish is: > > I have a system that has users (potentially ~10,000 per day) and they > > perform actions in the system (total of ~50,000 a day). > > Each User’s action is taking place in a certain point in time, and is also > > classified into categories (1 to 5) and tagged by 1-30 tags. Each action’s > > Categories and Tags has a score associated with it, the score is between 0 > > to 1 (let’s assume precision of 0.0001). > > I want to be able to identify similar actions in the system (performed > > usually by more than one user). Similarity of actions is calculated based on > > their common Categories and Tags taking scores into account. > > I need the system to store: > > > > The list of my users with attributes like name, age etc > > For each action – the categories and tags associated with it and their > > score, the time of the action, and the user who performed it. > > Groups of similar actions (ActionGroups) – the id’s of actions in the group, > > the categories and tags describing the group, with their scores. Those are > > calculated using an algorithm that takes into account the categories and > > tags of the actions in the group. > > > > When a user performs a new action in the system, I want to add it to a > > fitting ActionGroups (with similar categories and tags). > > For this I need to be able to perform the following: > > Find all the recent ActionGroups (those who were updated with actions > > performed during the last T minutes), who has at list one of the new > > action’s categories AND at list one of the new action’s tags. > > > > I thought of two ways to address the issue and I would appreciate your > > insights. > > > > First one using secondary indexes > > Column Family:Users > > Key: userId > > Compare with Bytes Type > > Columns: name: <>, age: <> etc… > > > > Column Family:Actions > > Key: actionId > > Compare with Bytes Type > > Columns: Category1 : <Score> …. > > CategoriN: <Score>, > > Tag1 : <Score>, …. > > TagK:<Score> > > Time: timestamp > > user: userId > > > > Column Family:ActionGroups > > Key: actionGroupId > > Compare with Bytes Type > > Columns: Category1 : <Score> …. > > CategoriN: <Score>, > > Tag1 : <Score> …. > > TagK:<Score> > > lastUpdateTime: timestamp > > actionId1: null, … , > > actionIdM: null > > > > I will then define secondary index on each tag columns, category columns, > > and the update time column. > > Let’s assume the new action I want to add to ActionGroup has > > NewActionCategory1 - NewActionCategoryK, and has NewActionTag1 – > > NewActionTagN. I will perform the following query: > > Select * From ActionGroups where > > (NewActionCategory1 > 0 … or NewActionCategoryK > 0) and > > (NewActionTag1 > 0 … or NewActionTagN > 0) and > > lastUpdateTime > T; > > > > Second solution > > Have the same CF as in the first solutionwithout the secondaryindex, and > > have two additional CF-ies: > > Column Family:CategoriesToActionGroupId > > Key: categoryId > > Compare with ByteType > > Columns: {Timestamp, ActionGroupsId1 } : null > > {Timestamp, ActionGroupsId2} : null > > ... > > *timestamp is the update time for the ActionGroup > > > > A similar CF will be defined for tags. > > > > I will then be able to run several queries on CategoriesToActionGroupId > > (one for each of the new story Categories), with column slice for the right > > update time of the ActionGroup. > > I will do the same for the TagsToActionGroupId. > > I will then use my client code to remove duplicates (ActionGroups who are > > associated with more than one Tag or Category). > > > > My questions are: > > > > Are the two solutions viable? If yes, which is better > > Is there any better way of doing this? > > Can I use jdbc and CQL with both method, or do I have to use Hector (I am > > using Java). > > > > Thanks > > Tamar > > > > > >