On Wed, Oct 26, 2011 at 7:35 PM, Ben Gambley <ben.gamb...@intoscience.com> wrote:
> Our requirement is to store per user, many unique results (which is
> basically an attempt at some questions ..) so I had thought of having the
> userid as the row key and the result id as columns.
>
> The keys for the result ids are maintained in a separate location so are
> known without having to perform any additional lookups.
>
> My concern is that over time reading a single result would incur the
> overhead of reading the entire row from disk so gradually slow things down.
>
> So I was considering if changing the row key to *userid + result id* would
> be a better solution ?
>

This is a clustering choice. Assuming your dataset is too big to fit in
system memory, some considerations that should drive your decision are
locality of access, cache efficiency, and worst-case performance.

1) If you access many results for a user-id around the same time, then
putting them close together will get you better lookup throughput, since
once you pay the disk seek to access one result, the neighboring results
will already be in memory (until they fall out of cache). This is generally
going to be somewhat true whether you use a key (userid/resultid) or a key
(userid) and column (resultid); see the first sketch at the end of this
mail.

2) If some (or one) result-id is accessed much more frequently than the
others, across all users, then you might want that result-id to get a good
cache hit rate, while the less frequently accessed result-ids do not need
to be cached as much. To get that behavior, you'd want the hottest (most
frequently accessed) result-id to be packed together, so you'd either use a
key such as "result-id/user-id", or use a separate column family for "hot"
result-ids (second sketch below).

3) How many results can a user have? If every user can have an unbounded
number of results (say > 40k), but you generally only need one result at a
time, then you probably want to use a compound key (userid/resultid) rather
than a key (userid) + column (resultid), because you don't want to have to
read a large row just to get a small piece.

That said, it seems that Cassandra's handling of wide rows is improving, so
perhaps in the future this will not be as large an issue.
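
For concreteness, here is a rough sketch of the two layouts using the
pycassa Thrift client. The keyspace and column family names ('Quiz',
'ResultsByUser', 'Results') are placeholders I made up for illustration,
not anything from your schema:

    # Rough sketch, not production code. Assumes pycassa and a Cassandra
    # node on localhost; keyspace/CF names are made up for illustration.
    import pycassa

    pool = pycassa.ConnectionPool('Quiz', ['localhost:9160'])

    # Layout A: row key = userid, one column per resultid (wide row per user).
    by_user = pycassa.ColumnFamily(pool, 'ResultsByUser')
    by_user.insert('user123', {'result456': 'serialized result blob'})
    # Fetching a single result only asks for one column, but it still lands
    # in that user's (potentially very wide) row:
    single = by_user.get('user123', columns=['result456'])

    # Layout B: compound row key = "userid/resultid" (one small row per result).
    by_result = pycassa.ColumnFamily(pool, 'Results')
    by_result.insert('user123/result456', {'data': 'serialized result blob'})
    single = by_result.get('user123/result456')

In layout A, all of a user's results share one row, which is what gives you
the locality from point 1; layout B keeps every row small no matter how
many results a user accumulates, which addresses point 3.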
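
And a sketch of the point-2 idea of packing a hot result-id together across
users by inverting the key (the 'HotResults' name is again made up):

    # Rough sketch for point 2; same assumptions as above.
    import pycassa

    pool = pycassa.ConnectionPool('Quiz', ['localhost:9160'])

    # Row key = resultid, one column per userid: every user's copy of a hot
    # result sits in the same row, so repeated reads of that result stay
    # warm in cache instead of being scattered across per-user rows.
    hot = pycassa.ColumnFamily(pool, 'HotResults')
    hot.insert('result456', {'user123': 'serialized result blob'})
    single = hot.get('result456', columns=['user123'])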