Re: Distinct Counter Proposal for Cassandra

Yuki Morishita Wed, 13 Jun 2012 10:01:11 -0700

You can open JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA 
with your proposal.


Just for the input:

I had once implemented HyperLogLog counter to use internally in Cassandra, but 
it turned out I didn't need it so I just put it to gist. You can find it here: 
https://gist.github.com/2597943

The above implementation and most of the other ones (including stream-lib) 
implement the optimized version of the algorithm which counts up to 10^9, so 
may need some work.

Other alternative is self-learning bitmap 
(http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, 
is more memory efficient when counting small values.

Yuki


On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:

> Hi All,
>  
> Let's assume we have a use case where we need to count the number of columns 
> for a given key. Let's say the key is the URL and the column-name is the IP 
> address or any cardinality identifier.
>  
> The straight forward implementation seems to be simple, just inserting the IP 
> Adresses as columns under the key defined by the URL and using get_count to 
> count them back. However the problem here is in case of large rows (where too 
> many IP addresses are in); the get_count method has to de-serialize the whole 
> row and calculate the count. As also defined in the user guides, it's not an 
> O(1) operation and it's quite costly.
>  
> However, this problem seems to have better solutions if you don't have a 
> strict requirement for the count to be exact. There are streaming algorithms 
> that will provide good cardinality estimations within a predefined failure 
> rate, I think the most popular one seems to be the (Hyper)LogLog algorithm, 
> also there's an optimal one developed recently, please check 
> http://dl.acm.org/citation.cfm?doid=1807085.1807094
>  
> If you want to take a look at the Java implementation for LogLog, Clearspring 
> has both LogLog and space optimized HyperLogLog available at 
> https://github.com/clearspring/stream-lib
>  
> I don't see a reason why this can't be implemented in Cassandra. The 
> distributed nature of all these algorithms can easily be adapted to 
> Cassandra's model. I think most of us would love to see come cardinality 
> estimating columns in Cassandra.
>  
> Regards,
> Utku

Re: Distinct Counter Proposal for Cassandra

Reply via email to