Use a message bus with a transactional get: get the message, send it to Cassandra and, upon write success, submit it to the ESP and commit the get on the bus. Messaging systems like RabbitMQ support these semantics.
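A minimal sketch of that flow with the Java RabbitMQ and DataStax drivers (the queue name, events table, and EspClient stub are invented for illustration):

    import java.nio.charset.StandardCharsets;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;
    import com.rabbitmq.client.AMQP;
    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.DefaultConsumer;
    import com.rabbitmq.client.Envelope;

    public class BusToCassandraToEsp
    {
        public static void main(String[] args) throws Exception
        {
            Session session = Cluster.builder()
                    .addContactPoint("127.0.0.1").build().connect("my_ks");

            Channel channel = new ConnectionFactory().newConnection().createChannel();
            channel.basicQos(1); // at most one unacknowledged message in flight

            // autoAck=false: the "get" is committed only by the basicAck below
            channel.basicConsume("mutations", false, new DefaultConsumer(channel)
            {
                @Override
                public void handleDelivery(String tag, Envelope env,
                        AMQP.BasicProperties props, byte[] body) throws java.io.IOException
                {
                    String payload = new String(body, StandardCharsets.UTF_8);
                    try
                    {
                        // 1) persist in Cassandra first
                        session.execute("INSERT INTO events (id, payload) VALUES (uuid(), ?)",
                                payload);
                        // 2) write succeeded: hand the event to the ESP
                        EspClient.submit(payload);
                        // 3) commit the get: the broker may now drop the message
                        channel.basicAck(env.getDeliveryTag(), false);
                    }
                    catch (Exception e)
                    {
                        // leave the message on the bus for redelivery
                        channel.basicNack(env.getDeliveryTag(), false, true);
                    }
                }
            });
        }

        // stand-in for whatever client the ESP actually exposes
        static class EspClient
        {
            static void submit(String event) { /* forward to the ESP */ }
        }
    }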
Using Cassandra as a queuing mechanism is an anti-pattern.

--
Colin Clark
+1-320-221-9531

> On Jan 3, 2015, at 6:07 PM, Hugo José Pinto <hugo.pi...@inovaworks.com> wrote:
>
> Thank you all for your answers.
>
> It seems I'll have to go with some event-driven processing before/during the Cassandra write path.
>
> My concern is that I'd love to first guarantee the Cassandra disk write and only then do the event processing (which is mostly CRUD intercepts at this point), even if slightly delayed; doing that via triggers would probably bog down the whole processing pipeline.
>
> What I'd probably do is write, in a trigger, to a separate key table holding all the CRUDed elements and have the ESP process that table (see the trigger sketch at the end of this thread).
>
> Thank you for your contribution. Should anyone else have experience with these scenarios, I'm obviously all ears as well.
>
> Best,
>
> Hugo
>
> Sent from my iPhone
>
> On 03/01/2015, at 11:09, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> Hello Hugo
>>
>> I was facing the same kind of requirement from some users. Long story short, below are the possible strategies, with the advantages and drawbacks of each:
>>
>> 1) Put Spark in front of the back end: every incoming modification/update/insert goes into Spark first, and Spark then forwards it to Cassandra for persistence. With Spark you can perform pre- or post-processing and notify external clients of mutations.
>>
>> The drawback of this solution is that all incoming mutations must go through Spark. You may set up a Kafka queue as temporary storage to distribute the load and consume mutations with Spark (see the Kafka sketch at the end of this thread), but that adds to the architectural complexity with additional components and technologies.
>>
>> 2) For high availability and resilience, you probably want all mutations saved into Cassandra first and the notifications then processed with Spark. In this case the only way to get notifications out of Cassandra, as of version 2.1, is to rely on manually coded triggers, which are still an experimental feature.
>>
>> With triggers you can notify whatever clients you want, not only Spark.
>>
>> The big drawback of this solution is that playing with triggers is dangerous if you are not familiar with Cassandra internals. The trigger sits on the write path and may hurt performance if it performs complex or blocking tasks.
>>
>> Those are the two solutions I can see; maybe other ML members will propose other innovative choices.
>>
>> Regards
>>
>>> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto <hugo.pi...@inovaworks.com> wrote:
>>> Hello.
>>>
>>> We're currently using Hazelcast (http://hazelcast.org/) as a distributed in-memory data grid. That's been working sort-of-well for us, but going solely in-memory has reached its limits in our use case, and we're considering porting our application to a NoSQL persistent store. After the usual comparisons and evaluations, we're borderline close to picking Cassandra, plus eventually Spark for analytics.
>>>
>>> Nonetheless, there is a gap in our architectural needs that we're still not grasping how to solve in Cassandra (with or without Spark): Hazelcast allows us to create a Continuous Query so that, whenever a row is added to, removed from, or modified in the clause's result set, Hazelcast calls us back with the corresponding notification (see the Hazelcast sketch at the end of this thread). We use this to continuously update the clients via AJAX streaming with the new/changed rows.
>>>
>>> This is probably a conceptual mismatch on our part, so: how do we best address this use case in Cassandra (with or without Spark's help)? Is there something in the API that allows for Continuous Queries on key/clause changes (we haven't found it)? Is there some other way to get a stream of key/clause updates? Events of some sort?
>>>
>>> I'm aware that we could, eventually, poll Cassandra periodically, but in our use case the client is potentially interested in a large number of table clause notifications (think "all changes to Ship positions on California's coastline"), and iterating over the store would kill the streamer's scalability.
>>>
>>> Hence, the magic question: what are we missing? Is Cassandra the wrong tool for the job? Are we not aware of a particular part of the API, or of an external library in or outside the Apache realm, that would allow for this?
>>>
>>> Many thanks for any assistance!
>>>
>>> Hugo
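For what it's worth, Hugo's trigger-plus-key-table idea could look roughly like the sketch below against the Cassandra 2.1 trigger API, modeled on the InvertedIndex example shipped with Cassandra. The my_ks keyspace and crud_log table are invented, and crud_log is assumed to share the source table's comparator:

    import java.nio.ByteBuffer;
    import java.util.Collection;
    import java.util.Collections;

    import org.apache.cassandra.db.Cell;
    import org.apache.cassandra.db.ColumnFamily;
    import org.apache.cassandra.db.Mutation;
    import org.apache.cassandra.triggers.ITrigger;

    // Logs every mutated row into a separate crud_log table,
    // which the ESP can then drain asynchronously.
    public class CrudLogTrigger implements ITrigger
    {
        public Collection<Mutation> augment(ByteBuffer partitionKey, ColumnFamily update)
        {
            // One log row per mutated partition, keyed by the same partition key
            Mutation log = new Mutation("my_ks", partitionKey);
            // Reuse the incoming cell names and values (the same trick
            // the InvertedIndex example uses)
            for (Cell cell : update)
                log.add("crud_log", cell.name(), cell.value(), System.currentTimeMillis());
            return Collections.singletonList(log);
        }
    }

The jar goes into the node's triggers directory and is attached with CREATE TRIGGER crud_log ON my_ks.my_table USING 'CrudLogTrigger'. As DuyHai notes, augment() runs on the write path, so it must stay simple and non-blocking.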
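On DuyHai's option 1, the ingress side of the Kafka buffer is just an ordinary producer; the topic name and payload below are invented, and the Spark consumer on the other side would do the Cassandra write and the client notifications:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MutationPublisher
    {
        public static void main(String[] args)
        {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Mutations are published to the "mutations" topic instead of being
            // written to Cassandra directly; Spark consumes and persists them.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props))
            {
                producer.send(new ProducerRecord<>("mutations",
                        "ship-42", "{\"lat\": 36.6, \"lon\": -121.9}"));
            }
        }
    }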
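And for anyone unfamiliar with the Hazelcast feature being replaced, a continuous query is registered roughly as below; the map name, predicate, and String values are invented for illustration:

    import com.hazelcast.core.EntryAdapter;
    import com.hazelcast.core.EntryEvent;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import com.hazelcast.query.SqlPredicate;

    public class ShipWatcher
    {
        public static void main(String[] args)
        {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> ships = hz.getMap("ships");

            // The listener fires only for entries matching the predicate;
            // this callback is what feeds the AJAX streamer today.
            ships.addEntryListener(new EntryAdapter<String, String>()
            {
                @Override
                public void entryAdded(EntryEvent<String, String> event)
                {
                    push(event);
                }

                @Override
                public void entryUpdated(EntryEvent<String, String> event)
                {
                    push(event);
                }
            }, new SqlPredicate("region = 'CA-coast'"), true);
        }

        // stand-in for the AJAX streaming push
        static void push(EntryEvent<String, String> event)
        {
            System.out.println(event.getKey() + " -> " + event.getValue());
        }
    }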