Thanks, good point. Splitting wide rows via sharding is a good optimization for the get_count approach.
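
Something like this is roughly what I had in mind, sketched with pycassa; the
column family name, shard count, and period/key format below are just
placeholders, not anything we've settled on:

    import hashlib
    import pycassa

    NUM_SHARDS = 16  # assumed shard count; tune to cluster size

    pool = pycassa.ConnectionPool('MyKeyspace')
    visitors = pycassa.ColumnFamily(pool, 'UniqueVisitors')

    def shard_key(period, user_id):
        # Row key = time period plus a shard derived from a hash of the user id,
        # so one period's visitors are spread across NUM_SHARDS rows.
        shard = int(hashlib.md5(user_id.encode('utf-8')).hexdigest(), 16) % NUM_SHARDS
        return '%s:%d' % (period, shard)

    def record_visit(period, user_id):
        # The user id is the column name, so repeat visits in the same
        # period overwrite the same column and don't inflate the count.
        visitors.insert(shard_key(period, user_id), {user_id: ''})

    def unique_visitors(period):
        # Run get_count against each shard row and add the results up.
        return sum(visitors.get_count('%s:%d' % (period, shard))
                   for shard in range(NUM_SHARDS))

Each get_count is still linear in the size of its row on the server side, so
caching the summed result, as Zach suggests, would still be worthwhile.
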
On Mon, Oct 31, 2011 at 10:58 AM, Zach Richardson
<j.zach.richard...@gmail.com> wrote:
> Ed,
>
> I could be completely wrong about this working--I haven't specifically
> looked at how the counts are executed, but I think this makes sense.
>
> You could potentially shard across several rows, based on a hash of
> the username combined with the time period as the row key. Run a
> count across each row and then add them up. If your cluster is large
> enough this could spread the computation enough to make each query for
> the count a bit faster.
>
> Depending on how often this query would be hit, I would still
> recommend caching, but you could calculate reality a little more
> often.
>
> Zach
>
>
> On Mon, Oct 31, 2011 at 12:22 PM, Ed Anuff <e...@anuff.com> wrote:
>> I'm looking at the scenario of how to keep track of the number of
>> unique visitors within a given time period. Inserting user ids into a
>> wide row would allow me to have a list of every user within the time
>> period that the row represented. My experience in the past was that
>> using get_count on a row to get the column count got slow pretty quick
>> but that might still be the easiest way to get the count of unique
>> users with some sort of caching of the count so that it's not
>> expensive subsequently. Using Hadoop is overkill for this scenario.
>> Any other approaches?
>>
>> Ed
>>