On Wed, Oct 26, 2011 at 7:35 PM, Ben Gambley <ben.gamb...@intoscience.com> wrote:

> Our requirement is to store, per user, many unique results (each is
> basically an attempt at some questions ..), so I had thought of having
> the userid as the row key and the result ids as columns.
>
> The keys for the result ids are maintained in a separate location, so
> they are known without having to perform any additional lookups.
>
> My concern is that, over time, reading a single result would incur the
> overhead of reading the entire row from disk and so gradually slow
> things down.
>
> So I was considering whether changing the row key to *userid + result
> id* would be a better solution?
>

This is a clustering choice. Assuming your dataset is too big to fit in
system memory, some considerations which should drive your decision are
locality of access, cache efficiency, and worst-case performance.

1) If you access many results for a user-id around the same time, then
putting them close together will get you better lookup throughput: once you
pay a disk seek to read one result, the neighboring results will already be
in memory (until they fall out of cache). This is generally going to be
somewhat true whether you use a compound key (userid/resultid) or a key
(userid) with a column per (resultid).
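
To make the two layouts concrete, here is a minimal sketch using pycassa
(a Thrift-based Python client); the keyspace, column family, key, and value
names are made up for illustration, not taken from your schema:

    import pycassa

    pool = pycassa.ConnectionPool('Quiz', ['localhost:9160'])

    # Layout A: row key = userid, one column per resultid (one wide row per user).
    by_user = pycassa.ColumnFamily(pool, 'ResultsByUser')
    by_user.insert('user42', {'result7': 'serialized-attempt'})
    one = by_user.get('user42', columns=['result7'])    # fetch a single result
    some = by_user.get('user42', column_count=100)      # fetch a batch of results

    # Layout B: compound row key "userid:resultid", one skinny row per result.
    by_user_result = pycassa.ColumnFamily(pool, 'ResultsByUserResult')
    by_user_result.insert('user42:result7', {'data': 'serialized-attempt'})
    one = by_user_result.get('user42:result7')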

2) If some result-id (or one in particular) is accessed much more frequently
than the others across all users, then you might want reads of that
result-id to get a good cache hit rate, while result-ids that are read less
often need not be cached as aggressively. To get this behavior, you'd want
the hottest result-id's data to be packed together, so you'd either use a
key such as "result-id/user-id" or put the "hot" result-ids in a separate
column family.
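
As a sketch of that second option (again assuming pycassa and made-up
names), the write path could route a known hot result-id to its own column
family so its rows compete for cache space separately from everything else:

    import pycassa

    pool = pycassa.ConnectionPool('Quiz', ['localhost:9160'])
    by_user = pycassa.ColumnFamily(pool, 'ResultsByUser')   # general CF, keyed by userid
    hot = pycassa.ColumnFamily(pool, 'HotResults')          # CF reserved for the hot result-id

    HOT_RESULT_ID = 'result7'   # assumed to be the frequently read result

    def write_result(userid, resultid, payload):
        # Hot results go to their own column family; the rest stay in the
        # wide row keyed by userid.
        if resultid == HOT_RESULT_ID:
            hot.insert(userid, {resultid: payload})
        else:
            by_user.insert(userid, {resultid: payload})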

3) How many results can a user have? If every user can have an unbounded
number of results (say > 40k), but you generally only need one result at a
time, then you probably want a compound key (userid/resultid) rather than a
key (userid) + column (resultid), because you don't want to read a large row
just to get one small piece. That said, Cassandra's handling of wide rows is
improving, so in the future this may not be as large an issue.
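
For completeness, a sketch of the read path under the compound-key layout
from point 3 (names are again illustrative); the row fetched stays small no
matter how many results the user accumulates:

    import pycassa

    pool = pycassa.ConnectionPool('Quiz', ['localhost:9160'])
    by_user_result = pycassa.ColumnFamily(pool, 'ResultsByUserResult')

    def read_result(userid, resultid):
        # The row key is the concatenation, so this reads one small row
        # rather than slicing into a potentially huge row keyed by userid.
        return by_user_result.get('%s:%s' % (userid, resultid))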
