Re: Bitmap indexes - reviving CASSANDRA-1472

Jason Rutherglen Fri, 12 Apr 2013 07:27:28 -0700

Brian,

The Solr StatsComponent performs aggregations.


http://wiki.apache.org/solr/StatsComponent

I recommend using Datastax DSE Search...
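
As a rough sketch of what such a request might look like, the snippet below builds a StatsComponent query URL with Python's standard library. The host, the core name (collection1) and the field name (amount) are placeholders and do not come from this thread; stats=true and stats.field are the StatsComponent request parameters, and rows=0 just suppresses the document list.

```python
from urllib.parse import urlencode


def stats_query_url(base_url, query, field):
    """Build a Solr select URL that asks StatsComponent for stats on one field.

    base_url, query and field are supplied by the caller; nothing here
    contacts a server, it only assembles the request string.
    """
    params = {
        "q": query,            # which documents to aggregate over
        "rows": 0,             # skip returning the documents themselves
        "stats": "true",       # enable StatsComponent
        "stats.field": field,  # numeric field to compute min/max/sum/count/mean on
        "wt": "json",          # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core and field names, for illustration only.
    print(stats_query_url("http://localhost:8983/solr/collection1", "*:*", "amount"))
```

The response would carry the aggregates for the named field; whether this performs acceptably on large result sets is a separate question that would need measuring.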


On Fri, Apr 12, 2013 at 10:09 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:

> @Jason,
>
> I have a lot of experience with SOLR + ES, but mainly for search.  (i.e.
> Finding the most relevant records given a query)
> That's been working well, but now we have requirements to support
> dashboards.  Those dashboards have aggregations in them (sum, average,
> count(s), etc).  I have limited experience using filter functions and
> facets to achieve similar things w/ Lucene, but they never seemed to
> perform well when the sets were large.
>
> If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use
> it instead. (Let me know!)
>
> When we looked around, Druid seemed to fit the bill exactly: (and it was
> open source)
> http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/
>
> BTW, here is more information on the compression that Druid uses:
> http://metamarkets.com/2012/druid-bitmap-compression/
>
>
> To echo Matt's sentiment, we'd love to leverage a C* native capability for
> this.
> (Acunu provides most of the capability, but it isn't open source)
>
> I think once we have the "conditional write" semantics that are coming, we
> could layer this on top of C*. (extending the secondary indexes
> functionality)
>
> -brian
>
>
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
>
> On 4/12/13 12:46 AM, "Matt Stump" <mrevilgn...@gmail.com> wrote:
>
> >You could embed Lucene, but then you pretty much have DSE search, and there
> >are people on this list in a better position than I to describe the
> >difficulty in making that scale. By rolling your own you get simplicity
> >and control. If you use a uniform index size you can just assign chunks of
> >it to the Cassandra ring, making it easy to distribute queries. I think that
> >using Lucene in this way would cause most of the benefit of the library to
> >be lost, and add unnecessary complexity. If Lucene were easy, then I think
> >given the team's experience with both Lucene and C* it would have been done
> >already.
> >
> >Sorry if it's a fuzzy answer, but I haven't run down every technical angle
> >on the integration with C* yet. The idea was still very much at the
> >"wouldn't it be very cool if this thing lived in Cassandra" stage. It would
> >be the nail in the coffin for Impala, Redshift, et al.
> >
> >
> >On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen <
> >jason.rutherg...@gmail.com> wrote:
> >
> >> What's the advantage over Lucene?
> >>
> >>
> >> On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump <mrevilgn...@gmail.com>
> >> wrote:
> >>
> >> > Druid was our inspiration to layer bitmap indexes on top of Cassandra.
> >> > Druid doesn't work for us because our data set is too large. We would
> >> > need many hundreds of nodes just for the pre-processed data. What I
> >> > envisioned was the ability to perform Druid-style queries (no
> >> > aggregation) without the limitations imposed by having the entire
> >> > dataset in memory. I primarily need to query whether a user performed
> >> > some event, but I also intend to add trigram indexes for LIKE, ILIKE or
> >> > possibly regex-style matching.
> >> >
> >> > I wasn't aware of CONCISE, thanks for the pointer. We are currently
> >> > evaluating fastbit, which is a very similar project:
> >> > https://sdm.lbl.gov/fastbit/
> >> >
> >> >
> >> > On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill <b...@alumni.brown.edu> wrote:
> >> >
> >> > >
> >> > > How does this compare with Druid?
> >> > > https://github.com/metamx/druid
> >> > >
> >> > > We're currently evaluating Acunu, Vertica and Druid...
> >> > >
> >> > >
> >> > > http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html
> >> > >
> >> > > With its bitmapped indexes, Druid appears to have the most potential.
> >> > > They boast some pretty impressive stats, especially WRT handling
> >> > > "real-time" updates and adding new dimensions.
> >> > >
> >> > > They also use a compression algorithm, CONCISE, to cut down on the
> >> > > space requirements.
> >> > > http://ricerca.mat.uniroma3.it/users/colanton/concise.html
> >> > >
> >> > > I haven't looked too deep into the Druid code, but I've been meaning
> >> > > to see if it could be backed by C*.
> >> > >
> >> > > We'd be game to join the hunt if you pursue such a beast. (with your
> >> > > code, or with portions of Druid)
> >> > >
> >> > > -brian
> >> > >
> >> > >
> >> > > On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:
> >> > >
> >> > > > What do you think about set manipulation via indexes in Cassandra?
> >> > > > I'm interested in answering queries such as "give me all users that
> >> > > > performed events 1, 2, and 3, but not 4". If the answer is yes, then
> >> > > > I can make a case for spending my time on C*. The only downside for
> >> > > > us would be that our current prototype is in C++, so we would lose
> >> > > > some performance and the ability to dedicate an entire machine to
> >> > > > caching/performing queries.
> >> > > >
> >> > > >
> >> > > > On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >> > > >
> >> > > >> If you mean, "Can someone help me figure out how to get started
> >> > > >> updating these old patches to trunk and cleaning out the Avro?" then
> >> > > >> yes, I've been knee-deep in indexing code recently.
> >> > > >>
> >> > > >>
> >> > > >> On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome <mrevilgn...@gmail.com> wrote:
> >> > > >>
> >> > > >>> I'm currently building a distributed cluster on top of Cassandra to
> >> > > >>> perform fast set manipulation via bitmap indexes. This gives me the
> >> > > >>> ability to perform unions, intersections, and set subtraction across
> >> > > >>> sub-queries. Currently I'm storing index information for thousands
> >> > > >>> of dimensions as Cassandra rows, and my cluster keeps this
> >> > > >>> information cached, distributed and replicated in order to answer
> >> > > >>> queries.
> >> > > >>>
> >> > > >>> Every couple of days I think to myself this should really exist in
> >> > > >>> C*. Given all the benefits, would there be any interest in
> >> > > >>> reviving CASSANDRA-1472?
> >> > > >>>
> >> > > >>> Some downsides are that this is very memory intensive, even for
> >> > > >>> sparse bitmaps.
> >> > > >>>
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> --
> >> > > >> Jonathan Ellis
> >> > > >> Project Chair, Apache Cassandra
> >> > > >> co-founder, http://www.datastax.com
> >> > > >> @spyced
> >> > > >>
> >> > >
> >> > > --
> >> > > Brian ONeill
> >> > > Lead Architect, Health Market Science (http://healthmarketscience.com)
> >> > > mobile:215.588.6024
> >> > > blog: http://weblogs.java.net/blog/boneill42/
> >> > > blog: http://brianoneill.blogspot.com/
> >> > >
> >> > >
> >> >
> >>
>
>
>
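
The set query described earlier in the thread (all users that performed events 1, 2 and 3, but not 4) can be sketched with plain, uncompressed bitmaps. This is only a minimal illustration, not the CASSANDRA-1472 patch and not the C++ prototype mentioned above: it assumes small non-negative integer user ids, uses Python integers as bitmaps, and every function name and sample value below is hypothetical. A real system would most likely store a compressed format such as CONCISE rather than raw bits.

```python
from functools import reduce


def build_index(events):
    """Map each event id to a bitmap (a Python int) in which bit u is set
    when user u performed that event. `events` is an iterable of
    (user_id, event_id) pairs."""
    index = {}
    for user_id, event_id in events:
        index[event_id] = index.get(event_id, 0) | (1 << user_id)
    return index


def users_matching(index, required, excluded=()):
    """Return the bitmap of users that performed every event in `required`
    and none of the events in `excluded`."""
    if not required:
        raise ValueError("at least one required event is needed")
    # Intersection (AND) across the required events.
    include = reduce(lambda a, b: a & b, (index.get(e, 0) for e in required))
    # Union (OR) across the excluded events.
    exclude = reduce(lambda a, b: a | b, (index.get(e, 0) for e in excluded), 0)
    # Set subtraction: keep included bits that are not excluded.
    return include & ~exclude


def bitmap_to_ids(bitmap):
    """Expand a bitmap into a sorted list of user ids."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


if __name__ == "__main__":
    # Made-up sample data: (user_id, event_id) pairs.
    sample = [
        (0, 1), (0, 2), (0, 3),
        (1, 1), (1, 2), (1, 3), (1, 4),
        (2, 1), (2, 3),
        (3, 1), (3, 2), (3, 3),
    ]
    index = build_index(sample)
    print(bitmap_to_ids(users_matching(index, required=[1, 2, 3], excluded=[4])))
```

Each sub-query reduces to one bitwise operation per bitmap word, which is why this style of index is fast; the memory cost mentioned in the original message comes from keeping one bitmap per dimension value.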
