Not terribly large... ~50 million rows, each row has ~100-300 columns. But big enough that a map/reduce job takes longer than users would like.
Actually, maybe that is another question... Does anyone have any benchmarks running map/reduce against Cassandra? (Even a simple count or copy-CF benchmark would be helpful.)

-brian

On Fri, Jan 20, 2012 at 12:41 PM, Zach Richardson <j.zach.richard...@gmail.com> wrote:

> How much data do you think you will need ad hoc query ability for?
>
> On Fri, Jan 20, 2012 at 11:28 AM, Brian O'Neill <b...@alumni.brown.edu> wrote:
>
>> I can't remember if I asked this question before, but...
>>
>> We're using Cassandra as our transactional system, and building up quite
>> a library of map/reduce jobs that perform data quality analysis,
>> statistics, etc. (> 100 jobs now)
>>
>> But... we are still struggling to provide an "ad-hoc" query mechanism for
>> our users.
>>
>> To fill that gap, I believe we still need to materialize our data in an
>> RDBMS.
>>
>> Anyone have any ideas? Better ways to support ad-hoc queries?
>>
>> Effectively, our users want to be able to select count(distinct Y) from X
>> group by Z, where Y and Z are arbitrary columns of rows in X.
>>
>> We believe we can create column families with different key structures
>> (using Y and Z as row keys), but some column names we don't know / can't
>> predict ahead of time.
>>
>> Are people doing bulk exports?
>> Anyone trying to keep an RDBMS in sync in real time?
>>
>> -brian
>>
>> --
>> Brian O'Neill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile: 215.588.6024
>> blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/

--
Brian O'Neill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
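[Editor's note: for readers unfamiliar with the query pattern under discussion, here is a minimal sketch of `select count(distinct Y) from X group by Z` expressed in map/reduce style. The row data and column names (`state`, `provider_id`) are hypothetical, standing in for the arbitrary Y and Z columns Brian describes; in a real job the rows would come from a Cassandra input split rather than an in-memory list.]

```python
from collections import defaultdict

# Hypothetical sparse rows from column family X. Rows in Cassandra need not
# share a column set, so any column may be absent from any row.
rows = [
    {"state": "PA", "provider_id": "p1"},
    {"state": "PA", "provider_id": "p2"},
    {"state": "PA", "provider_id": "p1"},  # duplicate Y within a group
    {"state": "NJ", "provider_id": "p3"},
    {"state": "NJ"},                       # missing Y: skipped
]

def count_distinct_by(rows, y, z):
    """Map phase: emit (Z-value, Y-value) pairs.
    Reduce phase: count the distinct Y-values seen for each Z-value."""
    groups = defaultdict(set)
    for row in rows:
        # Skip sparse rows that lack either the group-by or the counted column.
        if y in row and z in row:
            groups[row[z]].add(row[y])
    return {group: len(values) for group, values in groups.items()}

print(count_distinct_by(rows, y="provider_id", z="state"))
# {'PA': 2, 'NJ': 1}
```

Because Y and Z are chosen at query time, a scan like this touches every row — which is why the thread turns to materializing the data (inverted column families, or an RDBMS copy) instead of scanning on demand.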