You can open JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
Just for the input: I had once implemented HyperLogLog counter to use internally in Cassandra, but it turned out I didn't need it so I just put it to gist. You can find it here: https://gist.github.com/2597943 The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work. Other alternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values. Yuki On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote: > Hi All, > > Let's assume we have a use case where we need to count the number of columns > for a given key. Let's say the key is the URL and the column-name is the IP > address or any cardinality identifier. > > The straight forward implementation seems to be simple, just inserting the IP > Adresses as columns under the key defined by the URL and using get_count to > count them back. However the problem here is in case of large rows (where too > many IP addresses are in); the get_count method has to de-serialize the whole > row and calculate the count. As also defined in the user guides, it's not an > O(1) operation and it's quite costly. > > However, this problem seems to have better solutions if you don't have a > strict requirement for the count to be exact. There are streaming algorithms > that will provide good cardinality estimations within a predefined failure > rate, I think the most popular one seems to be the (Hyper)LogLog algorithm, > also there's an optimal one developed recently, please check > http://dl.acm.org/citation.cfm?doid=1807085.1807094 > > If you want to take a look at the Java implementation for LogLog, Clearspring > has both LogLog and space optimized HyperLogLog available at > https://github.com/clearspring/stream-lib > > I don't see a reason why this can't be implemented in Cassandra. The > distributed nature of all these algorithms can easily be adapted to > Cassandra's model. I think most of us would love to see come cardinality > estimating columns in Cassandra. > > Regards, > Utku