Thanks, good point. Splitting wide rows via sharding is a good
optimization for the get_count approach.
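For anyone following along, here's a minimal sketch of the scheme as I
understand it (pure Python, with an in-memory dict standing in for the
column family; the shard count and row-key format are just illustrative):

import hashlib

NUM_SHARDS = 16  # illustrative; pick based on cluster size

def shard_row_key(user_id, period):
    # Hash the user id to pick a shard and combine it with the time
    # period to form the row key.  Because the shard is derived from
    # the user id, repeat visits by the same user always land in the
    # same shard row, so per-shard unique counts can be summed safely.
    shard = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return "%s:%02d" % (period, shard)

rows = {}  # stand-in for the column family: row key -> set of column names

def record_visit(user_id, period):
    # Insert the user id as a column in its shard row for the period.
    rows.setdefault(shard_row_key(user_id, period), set()).add(user_id)

def unique_visitors(period):
    # Count each shard row and add the results up; against a real
    # cluster these per-row counts could be issued in parallel.
    return sum(len(rows.get("%s:%02d" % (period, s), set()))
               for s in range(NUM_SHARDS))

# Three visits by two users in the same hourly bucket -> 2
for uid in ("alice", "bob", "alice"):
    record_visit(uid, "2011-10-31-12")
print(unique_visitors("2011-10-31-12"))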

On Mon, Oct 31, 2011 at 10:58 AM, Zach Richardson
<j.zach.richard...@gmail.com> wrote:
> Ed,
>
> I could be completely wrong about this working--I haven't specifically
> looked at how the counts are executed, but I think this makes sense.
>
> You could potentially shard across several rows, based on a hash of
> the username combined with the time period as the row key.  Run a
> count across each row and then add them up.  If your cluster is large
> enough, this could spread the computation out enough to make each
> individual count query a bit faster.
>
> Depending on how often this query would be hit, I would still
> recommend caching, but you could recompute the actual count a little
> more often.
>
> Zach
>
>
> On Mon, Oct 31, 2011 at 12:22 PM, Ed Anuff <e...@anuff.com> wrote:
>> I'm looking at the scenario of how to keep track of the number of
>> unique visitors within a given time period.  Inserting user ids into a
>> wide row would allow me to have a list of every user within the time
>> period that the row represented.  My experience in the past was that
>> using get_count on a row to get the column count got slow pretty
>> quickly, but it might still be the easiest way to get the number of
>> unique users, with some caching of the count so that subsequent
>> lookups aren't expensive.  Using Hadoop is overkill for this scenario.
>> Any other approaches?
>>
>> Ed
>>
>
