Brian,

The Solr StatsComponent performs aggregations:
http://wiki.apache.org/solr/StatsComponent

I recommend using Datastax DSE Search...

On Fri, Apr 12, 2013 at 10:09 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:

> @Jason,
>
> I have a lot of experience with SOLR + ES, but mainly for search (i.e.,
> finding the most relevant records given a query).
> That's been working well, but now we have requirements to support
> dashboards. Those dashboards have aggregations in them (sum, average,
> count(s), etc). I have limited experience using filter functions and
> facets to achieve similar things w/ Lucene, but they never seemed to
> perform well when the sets were large.
>
> If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use
> it instead. (Let me know!)
>
> When we looked around, Druid seemed to fit the bill exactly (and it was
> open source):
> http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/
>
> BTW, here is more information on the compression that Druid uses:
> http://metamarkets.com/2012/druid-bitmap-compression/
>
> To echo Matt's sentiment, we'd love to leverage a C* native capability for
> this.
> (Acunu provides most of the capability, but it isn't open source)
>
> I think once we have the "conditional write" semantics that are coming, we
> could layer this on top of C* (extending the secondary indexes
> functionality).
>
> -brian
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
>
> On 4/12/13 12:46 AM, "Matt Stump" <mrevilgn...@gmail.com> wrote:
>
>> You could embed Lucene, but then you pretty much have DSE search, and there
>> are people on this list in a better position than I to describe
>> the difficulty in making that scale. By rolling your own you get simplicity
>> and control. If you use a uniform index size you can just assign chunks of
>> it to the cassandra ring, making it easy to distribute queries. I think that
>> using Lucene in this way would cause most of the benefit of the library to
>> be lost, and add unnecessary complexity. If Lucene were easy, then I think
>> given the team's experience with both Lucene and C* it would have been done
>> already.
>>
>> Sorry if it's a fuzzy answer, but I haven't run down every technical angle
>> on the integration with C* yet. The idea was still very much in the
>> "wouldn't it be very cool if this thing lived in Cassandra" stage. It would
>> be the nail in the coffin for impala, redshift, et al.
>>
>> On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>>
>>> What's the advantage over Lucene?
>>>
>>> On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump <mrevilgn...@gmail.com> wrote:
>>>
>>>> Druid was our inspiration to layer bitmap indexes on top of Cassandra.
>>>> Druid doesn't work for us because our data set is too large. We would need
>>>> many hundreds of nodes just for the pre-processed data. What I envisioned
>>>> was the ability to perform druid style queries (no aggregation) without the
>>>> limitations imposed by having the entire dataset in memory. I primarily
>>>> need to query whether a user performed some event, but I also intend to add
>>>> trigram indexes for LIKE, ILIKE or possibly regex style matching.
>>>>
>>>> I wasn't aware of CONCISE, thanks for the pointer. We are currently
>>>> evaluating fastbit, which is a very similar project:
>>>> https://sdm.lbl.gov/fastbit/
>>>>
>>>> On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill <b...@alumni.brown.edu> wrote:
>>>>
>>>>> How does this compare with Druid?
>>>>> https://github.com/metamx/druid
>>>>>
>>>>> We're currently evaluating Acunu, Vertica and Druid...
>>>>> http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html
>>>>>
>>>>> With its bitmapped indexes, Druid appears to have the most potential.
>>>>> They boast some pretty impressive stats, especially WRT handling
>>>>> "real-time" updates and adding new dimensions.
>>>>>
>>>>> They also use a compression algorithm, CONCISE, to cut down on the space
>>>>> requirements.
>>>>> http://ricerca.mat.uniroma3.it/users/colanton/concise.html
>>>>>
>>>>> I haven't looked too deep into the Druid code, but I've been meaning to
>>>>> see if it could be backed by C*.
>>>>>
>>>>> We'd be game to join the hunt if you pursue such a beast. (with your code,
>>>>> or with portions of Druid)
>>>>>
>>>>> -brian
>>>>>
>>>>> On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:
>>>>>
>>>>>> What do you think about set manipulation via indexes in Cassandra? I'm
>>>>>> interested in answering queries such as "give me all users that performed
>>>>>> event 1, 2, and 3, but not 4". If the answer is yes then I can make a case
>>>>>> for spending my time on C*. The only downside for us would be our current
>>>>>> prototype is in C++ so we would lose some performance and the ability to
>>>>>> dedicate an entire machine to caching/performing queries.
>>>>>>
>>>>>> On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>>
>>>>>>> If you mean, "Can someone help me figure out how to get started updating
>>>>>>> these old patches to trunk and cleaning out the Avro?" then yes, I've been
>>>>>>> knee-deep in indexing code recently.
>>>>>>>
>>>>>>> On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome <mrevilgn...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I'm currently building a distributed cluster on top of cassandra to perform
>>>>>>>> fast set manipulation via bitmap indexes. This gives me the ability to
>>>>>>>> perform unions, intersections, and set subtraction across sub-queries.
>>>>>>>> Currently I'm storing index information for thousands of dimensions as
>>>>>>>> cassandra rows, and my cluster keeps this information cached, distributed
>>>>>>>> and replicated in order to answer queries.
>>>>>>>>
>>>>>>>> Every couple of days I think to myself this should really exist in C*.
>>>>>>>> Given all the benefits would there be any interest in
>>>>>>>> reviving CASSANDRA-1472?
>>>>>>>>
>>>>>>>> Some downsides are that this is very memory intensive, even for sparse
>>>>>>>> bitmaps.
>>>>>>>
>>>>>>> --
>>>>>>> Jonathan Ellis
>>>>>>> Project Chair, Apache Cassandra
>>>>>>> co-founder, http://www.datastax.com
>>>>>>> @spyced
>>>>>
>>>>> --
>>>>> Brian ONeill
>>>>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>>>>> mobile:215.588.6024
>>>>> blog: http://weblogs.java.net/blog/boneill42/
>>>>> blog: http://brianoneill.blogspot.com/
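
For anyone following the thread, the kind of query discussed above ("all users that performed event 1, 2, and 3, but not 4") can be sketched with plain bitmaps. This is only a toy illustration, not the prototype described in the thread, not Druid's or fastbit's implementation, and not anything in CASSANDRA-1472. It assumes user ids are small non-negative integers, keeps one uncompressed bitmap per event in a Python int, and uses made-up helper names (build_index, query, to_user_ids). A real system would use compressed bitmaps (e.g. CONCISE) and distribute them across nodes.

```python
from functools import reduce


def build_index(events):
    """Build one bitmap per event id; bit u is set if user u performed the event.

    `events` is an iterable of (user_id, event_id) pairs (hypothetical input format).
    """
    index = {}
    for user_id, event_id in events:
        index[event_id] = index.get(event_id, 0) | (1 << user_id)
    return index


def query(index, all_of, none_of=()):
    """Intersect the bitmaps in `all_of`, then subtract every bitmap in `none_of`."""
    bitmaps = [index.get(event_id, 0) for event_id in all_of]
    result = reduce(lambda a, b: a & b, bitmaps) if bitmaps else 0
    for event_id in none_of:
        # Set subtraction: clear the bits of users who performed this event.
        result &= ~index.get(event_id, 0)
    return result


def to_user_ids(bitmap):
    """Decode a bitmap back into a sorted list of user ids."""
    user_ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            user_ids.append(position)
        bitmap >>= 1
        position += 1
    return user_ids


# Toy data: (user_id, event_id) pairs.
events = [
    (0, 1), (0, 2), (0, 3),
    (1, 1), (1, 2), (1, 3), (1, 4),
    (2, 1), (2, 2),
    (3, 1), (3, 2), (3, 3),
]
index = build_index(events)
matching = to_user_ids(query(index, all_of=[1, 2, 3], none_of=[4]))
print(matching)
```

Unions work the same way with `|`. The memory concern raised in the original post shows up here too: an uncompressed bitmap needs one bit per possible user id for every event, however few users actually performed it.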