Re: Best approach in Cassandra (+ Spark?) for Continuous Queries?

Hugo José Pinto Sun, 04 Jan 2015 05:44:22 -0800

Many thanks once again.

I rethought the target data structure, and things started coming together to 
allow for really elegant, compact ESP preprocessing and storage.


Best.

Sent from my iPhone

On 3 Jan 2015, at 23:53, Peter Lin <wool...@gmail.com> wrote:

> 
> If you like a SQL dialect, try out products that use StreamSQL to do continuous 
> queries. Esper comes to mind. Search to see what other products support 
> StreamSQL.
> 
>> On Sat, Jan 3, 2015 at 6:48 PM, Hugo José Pinto <hugo.pi...@inovaworks.com> 
>> wrote:
>> Thanks :)
>> 
>> Duly noted - this is all uncharted territory for us, hence the value of 
>> seasoned advice.
>> 
>> 
>> Best
>> 
>> --
>> Hugo José Pinto
>> 
>> On 3 Jan 2015, at 23:43, Peter Lin <wool...@gmail.com> wrote:
>> 
>>> 
>>> Listen to Colin's advice; avoid the temptation of anti-patterns.
>>> 
>>>> On Sat, Jan 3, 2015 at 6:10 PM, Colin <colpcl...@gmail.com> wrote:
>>>> Use a message bus with a transactional get: get the message, send it to 
>>>> Cassandra and, upon write success, submit it to the ESP, then commit the 
>>>> get on the bus. Messaging systems like RabbitMQ support these semantics.
>>>> 
>>>> Using Cassandra as a queuing mechanism is an anti-pattern.
>>>> 
>>>> --
>>>> Colin Clark 
>>>> +1-320-221-9531
>>>>  
>>>> 
>>>>> On Jan 3, 2015, at 6:07 PM, Hugo José Pinto <hugo.pi...@inovaworks.com> 
>>>>> wrote:
>>>>> 
>>>>> Thank you all for your answers.
>>>>> 
>>>>> It seems I'll have to go with some event-driven processing before/during 
>>>>> the Cassandra write path. 
>>>>> 
>>>>> My concern would be that I'd love to first guarantee the disk write of 
>>>>> the Cassandra persistence and then do the event processing (which is 
>>>>> mostly CRUD intercepts at this point), even if slightly delayed, and 
>>>>> doing so via triggers would probably bog down the whole processing 
>>>>> pipeline. 
>>>>> 
>>>>> What I'd probably do is write, in the trigger, a separate key table with 
>>>>> all the CRUDed elements and have the ESP process that table.
>>>>> 
>>>>> Thank you for your contribution. Should anyone else have any experience 
>>>>> in these scenarios, I'm obviously all ears as well. 
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Hugo 
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>> On 3 Jan 2015, at 11:09, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>> 
>>>>>> Hello Hugo
>>>>>> 
>>>>>>  I was facing the same kind of requirement from some users. Long story 
>>>>>> short, below are the possible strategies with the advantages and 
>>>>>> drawbacks of each.
>>>>>> 
>>>>>> 1) Put Spark in front of the back-end: every incoming 
>>>>>> modification/update/insert goes into Spark first, then Spark 
>>>>>> forwards it to Cassandra for persistence. With Spark, you can perform 
>>>>>> pre- or post-processing and notify external clients of mutations.
>>>>>> 
>>>>>>  The drawback of this solution is that all the incoming mutations must 
>>>>>> go through Spark. You may set up a Kafka queue as temporary storage to 
>>>>>> distribute the load and consume mutations with Spark, but that adds to 
>>>>>> the architecture's complexity with additional components & technologies.
>>>>>> 
>>>>>> 2) For high availability and resilience, you probably want to have all 
>>>>>> mutations saved first into Cassandra, then process notifications with 
>>>>>> Spark. In this case the only way to get notifications from Cassandra, 
>>>>>> as of version 2.1, is to rely on manually coded triggers (which are 
>>>>>> still an experimental feature).
>>>>>> 
>>>>>> With the triggers you can notify whatever clients you want, not only 
>>>>>> Spark.
>>>>>> 
>>>>>> The big drawback of this solution is that playing with triggers is 
>>>>>> dangerous if you are not familiar with Cassandra internals. Indeed, the 
>>>>>> trigger is on the write path and may hurt performance if you are doing 
>>>>>> complex and blocking tasks.
>>>>>> 
>>>>>> Those are the two solutions I can see; maybe the ML members will propose 
>>>>>> other innovative choices.
>>>>>> 
>>>>>>  Regards
>>>>>> 
>>>>>>> On Sat, Jan 3, 2015 at 11:46 AM, Hugo José Pinto 
>>>>>>> <hugo.pi...@inovaworks.com> wrote:
>>>>>>> Hello.
>>>>>>> 
>>>>>>> We're currently using Hazelcast (http://hazelcast.org/) as a 
>>>>>>> distributed in-memory data grid. That's been working sort-of-well for 
>>>>>>> us, but going solely in-memory has exhausted its path in our use case, 
>>>>>>> and we're considering porting our application to a NoSQL persistent 
>>>>>>> store. After the usual comparisons and evaluations, we're borderline 
>>>>>>> close to picking Cassandra, plus eventually Spark for analytics.
>>>>>>> 
>>>>>>> Nonetheless, there is a gap in our architectural needs that we're still 
>>>>>>> not grasping how to solve in Cassandra (with or without Spark): 
>>>>>>> Hazelcast allows us to create a Continuous Query: whenever a row is 
>>>>>>> added to, removed from, or modified in the clause's result set, Hazelcast 
>>>>>>> calls us back with the corresponding notification. We use this to 
>>>>>>> continuously update the clients via AJAX streaming with the new/changed 
>>>>>>> rows.
>>>>>>> 
>>>>>>> This is probably a conceptual mismatch we're making, so - how to best 
>>>>>>> address this use case in Cassandra (with or without Spark's help)? Is 
>>>>>>> there something in the API that allows for Continuous Queries on 
>>>>>>> key/clause changes (haven't found it)? Is there some other way to get a 
>>>>>>> stream of key/clause updates? Events of some sort?
>>>>>>> 
>>>>>>> I'm aware that we could, eventually, periodically poll Cassandra, but 
>>>>>>> in our use case, the client is potentially interested in a large number 
>>>>>>> of table clause notifications (think "all changes to Ship positions on 
>>>>>>> California's coastline"), and iterating out of the store would kill the 
>>>>>>> streamer's scalability.
>>>>>>> 
>>>>>>> Hence, the magic question: what are we missing? Is Cassandra the wrong 
>>>>>>> tool for the job? Are we not aware of a particular part of the API or 
>>>>>>> external library in/outside the Apache realm that would allow for this?
>>>>>>> 
>>>>>>> Many thanks for any assistance!
>>>>>>> 
>>>>>>> Hugo
>>>>>>> 
>
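
The flow Colin describes (transactional get, write to Cassandra, submit to the ESP on write success, then commit the get) can be sketched roughly as below. This is only an illustration under stated assumptions: InMemoryBus, process_one, and the callables passed in are hypothetical stand-ins, not the RabbitMQ or Cassandra driver APIs, and a real broker would take care of redelivery itself.

```python
from collections import deque


class InMemoryBus:
    """Hypothetical stand-in for a broker with acknowledged delivery."""

    def __init__(self, messages):
        self._pending = deque(messages)
        self._in_flight = None

    def get(self):
        # Hand out one message; it stays in flight until commit() or rollback().
        if self._in_flight is None and self._pending:
            self._in_flight = self._pending.popleft()
        return self._in_flight

    def commit(self):
        # Acknowledge the in-flight message so it is never delivered again.
        self._in_flight = None

    def rollback(self):
        # Put the unacknowledged message back at the head for redelivery.
        if self._in_flight is not None:
            self._pending.appendleft(self._in_flight)
            self._in_flight = None


def process_one(bus, store_write, esp_submit):
    """Get, persist, submit to the ESP, then commit. True if committed."""
    message = bus.get()
    if message is None:
        return False
    try:
        store_write(message)  # stand-in for the Cassandra write
        esp_submit(message)   # only reached once the write has succeeded
    except Exception:
        bus.rollback()        # nothing is acknowledged; the message comes back
        return False
    bus.commit()
    return True


# Usage: the first write fails, so the message is redelivered and then committed.
stored, events = [], []
failures = {"remaining": 1}


def flaky_write(message):
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise IOError("simulated write failure")
    stored.append(message)


bus = InMemoryBus([{"ship": "A", "lat": 36.6, "lon": -121.9}])
first = process_one(bus, flaky_write, events.append)
second = process_one(bus, flaky_write, events.append)
```

Because a message is redelivered whenever either step fails, the same write may be attempted more than once, so the store write in such a loop should be safe to repeat.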

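As a rough illustration of the continuous-query behaviour asked about in the original post, the sketch below tracks which keys currently match a clause and emits added/updated/removed notifications as mutations arrive from whatever component sits in front of the store. ContinuousQuery, on_mutation, and the bounding-box numbers are hypothetical and chosen for illustration only; this is not a Cassandra, Spark, or Hazelcast API.

```python
class ContinuousQuery:
    """Tracks the keys that satisfy a predicate and reports membership changes."""

    def __init__(self, predicate, listener):
        self._predicate = predicate
        self._listener = listener
        self._members = set()

    def on_mutation(self, key, row):
        # row is None for a delete; otherwise it is the new state of the row.
        matches = row is not None and self._predicate(row)
        if matches and key not in self._members:
            self._members.add(key)
            self._listener("added", key, row)
        elif matches:
            self._listener("updated", key, row)
        elif key in self._members:
            self._members.remove(key)
            self._listener("removed", key, row)


def near_california(row):
    # Illustrative bounding box only, not a real coastline test.
    return 32.5 <= row["lat"] <= 42.0 and -125.0 <= row["lon"] <= -117.0


notifications = []
query = ContinuousQuery(
    near_california,
    lambda kind, key, row: notifications.append((kind, key)),
)

query.on_mutation("ship-1", {"lat": 36.6, "lon": -121.9})  # enters the result set
query.on_mutation("ship-1", {"lat": 36.7, "lon": -122.0})  # still matches
query.on_mutation("ship-1", {"lat": 20.0, "lon": -150.0})  # leaves the result set
query.on_mutation("ship-2", {"lat": 10.0, "lon": 10.0})    # never matched
```

Only rows that enter, change within, or leave the result set produce a notification, so clients are pushed changes instead of polling the store.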