Ed, I could be completely wrong about this working (I haven't looked specifically at how the counts are executed), but I think this makes sense.
You could potentially shard across several rows, using a hash of the username combined with the time period as the row key. Run a count across each row and then add them up. If your cluster is large enough, this could spread the computation enough to make each count query a bit faster. Depending on how often this query would be hit, I would still recommend caching, but you could recalculate the actual count a little more often. A rough sketch of the idea is below the quoted message.

Zach

On Mon, Oct 31, 2011 at 12:22 PM, Ed Anuff <e...@anuff.com> wrote:
> I'm looking at the scenario of how to keep track of the number of
> unique visitors within a given time period. Inserting user ids into a
> wide row would allow me to have a list of every user within the time
> period that the row represented. My experience in the past was that
> using get_count on a row to get the column count got slow pretty
> quickly, but that might still be the easiest way to get the count of
> unique users, with some sort of caching of the count so that it's not
> expensive subsequently. Using Hadoop is overkill for this scenario.
> Any other approaches?
>
> Ed
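To make the sharding idea concrete, here is a minimal Python sketch using pycassa. The keyspace name, column family name, and shard count are all made up for illustration, and in practice you would probably fan the per-shard counts out in parallel rather than looping over them.

    # Sketch of sharded unique-visitor counting. Assumes a keyspace
    # 'MyKeyspace' with a column family 'UniqueVisitors' (both names
    # are illustrative, not from the original thread).
    import zlib
    import pycassa

    NUM_SHARDS = 16  # arbitrary; tune to your cluster size

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    visitors = pycassa.ColumnFamily(pool, 'UniqueVisitors')

    def shard_key(period, user_id):
        # Row key = time period plus a stable hash of the user id,
        # so each period's users are spread across NUM_SHARDS rows.
        # zlib.crc32 is used because Python's built-in hash() is not
        # stable across processes.
        shard = zlib.crc32(user_id.encode('utf-8')) % NUM_SHARDS
        return '%s:%d' % (period, shard)

    def record_visit(period, user_id):
        # Using the user id as the column name makes the write
        # idempotent: repeat visits overwrite the same column, so
        # each user is counted once per period.
        visitors.insert(shard_key(period, user_id), {user_id: ''})

    def unique_visitors(period):
        # Run get_count against each shard row and sum the results.
        return sum(visitors.get_count('%s:%d' % (period, shard))
                   for shard in range(NUM_SHARDS))

Note that each get_count still has to read its whole row, so caching the summed result is still worthwhile; the sharding just makes each individual count smaller and spreads the work across the cluster.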