Would it be possible to support this in a more general case by providing a distributed |= operator over arbitrary byte strings (like the + operator on counter columns), which would allow distributed bloom filters as well?
Tim Wintle On Fri, Jun 29, 2012 at 6:31 AM, Chris Burroughs <chris.burrou...@gmail.com>wrote: > Well I obviously think it would be handy. If this get's proposed and > end's up using stream-lib don't be shy about asking for help. > > On a more general note, it would be great to see the special case > Counter code become more general atomic operation code. > > On 06/13/2012 01:15 PM, Utku Can Topçu wrote: > > Hi Yuki, > > > > I think I should have used the word discussion instead of proposal for > the > > mailing subject. I have quite some of a design in my mind but I think > it's > > not yet ripe enough to formalize. I'll try to simplify it and open a Jira > > ticket. > > But first I'm wondering if there would be any excitement in the community > > for such a feature. > > > > Regards, > > Utku > > > > On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.y...@gmail.com> > wrote: > > > >> You can open JIRA ticket at > >> https://issues.apache.org/jira/browse/CASSANDRA with your proposal. > >> > >> Just for the input: > >> > >> I had once implemented HyperLogLog counter to use internally in > Cassandra, > >> but it turned out I didn't need it so I just put it to gist. You can > find > >> it here: https://gist.github.com/2597943 > >> > >> The above implementation and most of the other ones (including > stream-lib) > >> implement the optimized version of the algorithm which counts up to > 10^9, > >> so may need some work. > >> > >> Other alternative is self-learning bitmap ( > >> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my > >> understanding, is more memory efficient when counting small values. > >> > >> Yuki > >> > >> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote: > >> > >> Hi All, > >> > >> Let's assume we have a use case where we need to count the number of > >> columns for a given key. Let's say the key is the URL and the > column-name > >> is the IP address or any cardinality identifier. > >> > >> The straight forward implementation seems to be simple, just inserting > the > >> IP Adresses as columns under the key defined by the URL and using > get_count > >> to count them back. However the problem here is in case of large rows > >> (where too many IP addresses are in); the get_count method has to > >> de-serialize the whole row and calculate the count. As also defined in > the > >> user guides, it's not an O(1) operation and it's quite costly. > >> > >> However, this problem seems to have better solutions if you don't have a > >> strict requirement for the count to be exact. There are streaming > >> algorithms that will provide good cardinality estimations within a > >> predefined failure rate, I think the most popular one seems to be the > >> (Hyper)LogLog algorithm, also there's an optimal one developed recently, > >> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094 > >> > >> If you want to take a look at the Java implementation for LogLog, > >> Clearspring has both LogLog and space optimized HyperLogLog available at > >> https://github.com/clearspring/stream-lib > >> > >> I don't see a reason why this can't be implemented in Cassandra. The > >> distributed nature of all these algorithms can easily be adapted to > >> Cassandra's model. I think most of us would love to see come cardinality > >> estimating columns in Cassandra. > >> > >> Regards, > >> Utku > >> > >> > >> > > > >