If you absolutely have to use Cassandra as the source of your data, I agree with Dor.
That being said, if you're going to be doing a lot of analytics, I recommend using something other than Cassandra with Spark. The performance isn't particularly good, and you'll likely see anywhere from a 10-50x improvement by putting the data in an analytics-friendly format (Parquet) on a block/blob store (a DFS or S3) instead.

On Fri, Jan 4, 2019 at 1:43 PM Goutham reddy <goutham.chiru...@gmail.com> wrote:

> Thank you very much, Dor, for the detailed information. Yes, that should be
> the primary reason why we have to isolate from Cassandra.
>
> Thanks and Regards,
> Goutham Reddy
>
>
> On Fri, Jan 4, 2019 at 1:29 PM Dor Laor <d...@scylladb.com> wrote:
>
>> I strongly recommend option B, separate clusters. Reasons:
>> - Node-to-node networking is negligible compared to networking within
>>   the node.
>> - Different scaling considerations. Your workload may require 10 Spark
>>   nodes and 20 database nodes, so why bundle them? This ratio may also
>>   change over time as your application evolves and the amount of data
>>   changes.
>> - Isolation. If Spark has a spike in CPU/IO utilization, you wouldn't
>>   want it to affect Cassandra, and the opposite. If you isolate it with
>>   cgroups, you may have too much idle time when the above doesn't happen.
>>
>>
>> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy <goutham.chiru...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> We have a requirement for heavy data lifting and analytics, and we have
>>> decided to go with Apache Spark. In the process we have come up with two
>>> patterns:
>>> a. Apache Spark and Apache Cassandra co-located and shared on the same
>>> nodes.
>>> b. Apache Spark as one independent cluster and Apache Cassandra as one
>>> independent cluster.
>>>
>>> We need a good pattern for how to use the analytics engine with
>>> Cassandra. Thanks in advance.
>>>
>>> Regards,
>>> Goutham.
>>>

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
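[Editor's note: the Parquet-on-blob-store approach suggested above can be sketched roughly as the following one-time export job. This is a hedged illustration, not Jon's code: it assumes the DataStax Spark Cassandra Connector is on the classpath, and the host, keyspace, table, and bucket names are all placeholders.]

```python
from pyspark.sql import SparkSession

# Connector configuration: point Spark at the (separate) Cassandra cluster.
# "cassandra-node-1" is a placeholder contact point.
spark = (
    SparkSession.builder
    .appName("cassandra-to-parquet")
    .config("spark.cassandra.connection.host", "cassandra-node-1")
    .getOrCreate()
)

# Read the table once through the Spark Cassandra Connector.
# "my_keyspace" and "my_table" are hypothetical names.
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .load()
)

# Persist it in an analytics-friendly format on blob storage; subsequent
# Spark analytics jobs then scan the Parquet copy instead of Cassandra.
df.write.mode("overwrite").parquet("s3a://my-bucket/my_table/")
```

Reading Parquet from S3/DFS lets Spark do column pruning and predicate pushdown against compact columnar files, which is where the claimed 10-50x speedup over repeatedly scanning Cassandra tends to come from.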