If you absolutely have to use Cassandra as the source of your data, I agree
with Dor.

That said, if you're going to be doing a lot of analytics, I recommend
using something other than Cassandra as the data source for Spark.  The
performance isn't particularly wonderful, and you'll likely see anywhere
from a 10-50x improvement by putting the data in an analytics-friendly
format (Parquet) on a block / blob store (a DFS or S3) instead.
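
For what it's worth, a rough sketch of that kind of one-off export job
could look like the following, assuming the DataStax
spark-cassandra-connector is on the classpath; the host list, keyspace,
table, and bucket names are just placeholders:

import org.apache.spark.sql.SparkSession

object CassandraToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-to-parquet")
      // Point the connector at the (separate) Cassandra cluster.
      .config("spark.cassandra.connection.host", "cassandra-host-1,cassandra-host-2")
      .getOrCreate()

    // One full scan of the table through the connector...
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // ...then persist it in an analytics-friendly layout on blob storage.
    df.write
      .mode("overwrite")
      .parquet("s3a://my-analytics-bucket/my_table/")

    spark.stop()
  }
}

Downstream Spark jobs can then read the Parquet copy
(spark.read.parquet(...)) instead of repeatedly scanning Cassandra.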

On Fri, Jan 4, 2019 at 1:43 PM Goutham reddy <goutham.chiru...@gmail.com>
wrote:

> Thank you very much, Dor, for the detailed information. Yes, that is
> the primary reason why we have to isolate Spark from Cassandra.
>
> Thanks and Regards,
> Goutham Reddy
>
>
> On Fri, Jan 4, 2019 at 1:29 PM Dor Laor <d...@scylladb.com> wrote:
>
>> I strongly recommend option B, separate clusters. Reasons:
>>  - Node-to-node networking overhead is negligible compared to the
>> traffic within a node.
>>  - Different scaling considerations
>>    Your workload may require 10 Spark nodes and 20 database nodes, so why
>> bundle them?
>>    This ratio may also change over time as your application evolves and
>> the amount of data changes.
>>  - Isolation - If Spark has a spike in CPU/IO utilization, you wouldn't
>> want it to affect Cassandra, and vice versa.
>>    If you isolate them with cgroups, you may end up with too much idle
>> capacity when such spikes aren't happening.
>>
>>
>> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy <goutham.chiru...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> We have a requirement for heavy data lifting and analytics and have
>>> decided to go with Apache Spark. In the process we have come up with two
>>> patterns:
>>> a. Apache Spark and Apache Cassandra co-located, sharing the same nodes.
>>> b. Apache Spark and Apache Cassandra as two independent clusters.
>>>
>>> We need a good pattern for using the analytics engine with Cassandra.
>>> Thanks in advance.
>>>
>>> Regards
>>> Goutham.
>>>
>>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
