Would it be possible to support this in a more general case by providing a
distributed |= operator over arbitrary byte strings (like the + operator on
counter columns), which would allow distributed bloom filters as well?

Tim Wintle

On Fri, Jun 29, 2012 at 6:31 AM, Chris Burroughs
<chris.burrou...@gmail.com>wrote:

> Well I obviously think it would be handy.  If this get's proposed and
> end's up using stream-lib don't be shy about asking for help.
>
> On a more general note, it would be great to see the special case
> Counter code become more general atomic operation code.
>
> On 06/13/2012 01:15 PM, Utku Can Topçu wrote:
> > Hi Yuki,
> >
> > I think I should have used the word discussion instead of proposal for
> the
> > mailing subject. I have quite some of a design in my mind but I think
> it's
> > not yet ripe enough to formalize. I'll try to simplify it and open a Jira
> > ticket.
> > But first I'm wondering if there would be any excitement in the community
> > for such a feature.
> >
> > Regards,
> > Utku
> >
> > On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita <mor.y...@gmail.com>
> wrote:
> >
> >> You can open JIRA ticket at
> >> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
> >>
> >> Just for the input:
> >>
> >> I had once implemented HyperLogLog counter to use internally in
> Cassandra,
> >> but it turned out I didn't need it so I just put it to gist. You can
> find
> >> it here: https://gist.github.com/2597943
> >>
> >> The above implementation and most of the other ones (including
> stream-lib)
> >> implement the optimized version of the algorithm which counts up to
> 10^9,
> >> so may need some work.
> >>
> >> Other alternative is self-learning bitmap (
> >> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
> >> understanding, is more memory efficient when counting small values.
> >>
> >> Yuki
> >>
> >> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
> >>
> >> Hi All,
> >>
> >> Let's assume we have a use case where we need to count the number of
> >> columns for a given key. Let's say the key is the URL and the
> column-name
> >> is the IP address or any cardinality identifier.
> >>
> >> The straight forward implementation seems to be simple, just inserting
> the
> >> IP Adresses as columns under the key defined by the URL and using
> get_count
> >> to count them back. However the problem here is in case of large rows
> >> (where too many IP addresses are in); the get_count method has to
> >> de-serialize the whole row and calculate the count. As also defined in
> the
> >> user guides, it's not an O(1) operation and it's quite costly.
> >>
> >> However, this problem seems to have better solutions if you don't have a
> >> strict requirement for the count to be exact. There are streaming
> >> algorithms that will provide good cardinality estimations within a
> >> predefined failure rate, I think the most popular one seems to be the
> >> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
> >> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
> >>
> >> If you want to take a look at the Java implementation for LogLog,
> >> Clearspring has both LogLog and space optimized HyperLogLog available at
> >> https://github.com/clearspring/stream-lib
> >>
> >> I don't see a reason why this can't be implemented in Cassandra. The
> >> distributed nature of all these algorithms can easily be adapted to
> >> Cassandra's model. I think most of us would love to see come cardinality
> >> estimating columns in Cassandra.
> >>
> >> Regards,
> >> Utku
> >>
> >>
> >>
> >
>
>

Reply via email to