Hi Brian,

We're trying to do the exact same thing and I find myself asking very
similar questions.

Our solution though has been to find what kind of queries we need to
satisfy on a preemptive basis and leverage cassandra's built-in indexing
features to build those result sets beforehand.  The whole point here then
is that our gain in cost efficiency comes from the fact that disk space is
really cheap and serving up result sets from disk is fast provided that
those result sets are pre-calculated and reasonable in size (even if we
don't know all the values upfront).  For example, when you're writing to
your CF "X", you could also make writes to column family "A" like this:

- write A[Z][Y] = 1
where A = CF, Z = key, Y = column

Answering the question "select count(distinct Y) from X group by Z" then is
as simple as getting a list of rows for CF A and counting the distinct
values of Y and grouping them by Z on the client side.

Alternatively, there are much better ways to do this with composite
keys/columns and distributed counters but it's hard for me to tell what
makes the most sense without knowing more about your data / product
requirements.

Either way, I feel your pain in getting things like this to work with
Cassandra when the domain of values for a particular key or column is
unknown and secondary indexing doesn't apply, but I'm positive there's a
much cheaper way to make it work than paying for Oracle if you have at
least a decent idea about what kinds of queries you need to satisfy (which
it sounds like you do).  To Maxim's "death by index" point, you could
certainly go overboard with this concept and cross a pricing threshold with
some other database technology, but I can't imagine you're even close to
being in that boat given how concise your query needs seem to be.

If you're interested, I'd be happy to share how we do these things to save
lots of money over commercial databases and try to relate that to your use
case, but if not, then I hope at least some of that this useful for you.

Good luck either way!

On Fri, Jan 20, 2012 at 9:27 PM, Maxim Potekhin <potek...@bnl.gov> wrote:

> I certainly agree with "difficult to predict". There is a Danish
> proverb, which goes "it's difficult to make predictions, especially
> about the future".
>
> My point was that it's equally difficult with noSQL and RDBMS.
> The latter requires indexing to operate well, and that's a potential
> performance problem.
>
>
> On 1/20/2012 7:55 PM, Mohit Anchlia wrote:
>
>> I think the problem stems when you have data in a column that you need
>> to run adhoc query on which is not denormalized. In most cases it's
>> difficult to predict the type of query that would be required.
>>
>> Another way of solving this could be to index the fields in search engine.
>>
>> On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin<potek...@bnl.gov>  wrote:
>>
>>> What makes you think that RDBMS will give you acceptable performance?
>>>
>>> I guess you will try to index it to death (because otherwise the "ad hoc"
>>> queries won't work well if at all), and at this point you may be hit
>>> with a
>>> performance penalty.
>>>
>>> It may be a good idea to interview users and build denormalized views in
>>> Cassandra, maybe on a separate "look-up" cluster. A few percent of users
>>> will be unhappy, but you'll find it hard to do better. I'm talking from
>>> my
>>> experience with an industrial strength RDBMS which doesn't scale very
>>> well
>>> for what you call "ad-hoc" queries.
>>>
>>> Regards,
>>> Maxim
>>>
>>>
>>>
>>>
>>>
>>> On 1/20/2012 9:28 AM, Brian O'Neill wrote:
>>>
>>>>
>>>> I can't remember if I asked this question before, but....
>>>>
>>>> We're using Cassandra as our transactional system, and building up
>>>> quite a
>>>> library of map/reduce jobs that perform data quality analysis,
>>>> statistics,
>>>> etc.
>>>> (>  100 jobs now)
>>>>
>>>> But... we are still struggling to provide an "ad-hoc" query mechanism
>>>> for
>>>> our users.
>>>>
>>>> To fill that gap, I believe we still need to materialize our data in an
>>>> RDBMS.
>>>>
>>>> Anyone have any ideas?  Better ways to support ad-hoc queries?
>>>>
>>>> Effectively, our users want to be able to select count(distinct Y) from
>>>> X
>>>> group by Z.
>>>> Where Y and Z are arbitrary columns of rows in X.
>>>>
>>>> We believe we can create column families with different key structures
>>>> (using Y an Z as row keys), but some column names we don't know / can't
>>>> predict ahead of time.
>>>>
>>>> Are people doing bulk exports?
>>>> Anyone trying to keep an RDBMS in synch in real-time?
>>>>
>>>> -brian
>>>>
>>>> --
>>>> Brian ONeill
>>>> Lead Architect, Health Market Science 
>>>> (http://healthmarketscience.**com<http://healthmarketscience.com>
>>>> )
>>>> mobile:215.588.6024
>>>> blog: 
>>>> http://weblogs.java.net/blog/**boneill42/<http://weblogs.java.net/blog/boneill42/>
>>>> blog: http://brianoneill.blogspot.**com/<http://brianoneill.blogspot.com/>
>>>>
>>>>
>

Reply via email to