Hi all,

I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed
workload Cassandra + Spark installation would look like, especially on
AWS.  What I gather is that you use OpsCenter to set up the following:


   - One "virtual data center" for real-time processing (e.g., ingestion of
   time-series data, replying to requests for an interactive application)
   - Another "virtual data center" for batch analytics (Spark, possibly for
   machine learning)


If I understand this correctly, if I estimate that I need a five-node
cluster to handle all of my data, under the system described above, I would
have five nodes serving real-time traffic and all of the data replicated in
another five nodes that I use for batch processing.  Cassandra would take
care of keeping the data synced between the two sets of five nodes.  Is
that correct?
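
To make my understanding concrete, here is a minimal sketch of how I picture the replication being declared: one keyspace using NetworkTopologyStrategy, with a replica count per data center. The data center names ("Cassandra" and "Analytics"), the keyspace name, and the replication factor of 3 are all assumptions on my part for illustration, and the helper function is hypothetical, not something taken from the DSE documentation.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CQL CREATE KEYSPACE statement using NetworkTopologyStrategy.

    replicas_per_dc maps a data center name to its replication factor.
    """
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc_name, replication_factor in replicas_per_dc.items():
        parts.append(f"'{dc_name}': {replication_factor}")
    options = ", ".join(parts)
    return f"CREATE KEYSPACE {keyspace} WITH replication = {{{options}}};"


# Assumed names: "Cassandra" for the real-time data center and
# "Analytics" for the Spark data center; "sensor_data" is a made-up keyspace.
statement = create_keyspace_cql("sensor_data", {"Cassandra": 3, "Analytics": 3})
print(statement)
```

If that is right, every write to the keyspace ends up with replicas in both data centers, which is what I mean by Cassandra keeping the two sets of nodes in sync.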

I assume the motivation for such a dual-virtual-data-center architecture is
to prevent the Spark jobs (which are going to do lots of scans from
Cassandra, and maybe run computation on the same machines hosting
Cassandra) from disrupting the real-time performance.  But doing so means
that we need 2x as many nodes as we need for the real-time cluster alone.

*Could someone confirm that my interpretation of the DSE documentation, as
described above, is correct?*

If my application needs to run analytics on Spark only a few hours a day,
might we be better off spending our money on a bigger Cassandra cluster
and then just spinning up Spark jobs on EMR for a few hours at night?  (I know
this is a hard question to answer, since it all depends on the
application; I'm just curious whether anyone else here has had to make
similar tradeoffs.)  For example, instead of having a five-node real-time
cluster, we could have an eight-node real-time cluster and spend our
remaining budget on EMR jobs.
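
The arithmetic I have in mind looks roughly like the sketch below. Every number in it is a made-up placeholder (the hourly prices, the EMR cluster size, and the four hours of nightly batch work are not real AWS or DSE figures), and it ignores data transfer, storage, and licensing costs, so please read it as an illustration of the comparison rather than an estimate.

```python
HOURS_PER_DAY = 24


def daily_cost_dual_dc(realtime_nodes, analytics_nodes, node_price_per_hour):
    """Daily cost when both virtual data centers run around the clock."""
    total_nodes = realtime_nodes + analytics_nodes
    return total_nodes * node_price_per_hour * HOURS_PER_DAY


def daily_cost_with_emr(realtime_nodes, node_price_per_hour,
                        emr_nodes, emr_price_per_hour, emr_hours_per_day):
    """Daily cost of one always-on cluster plus a short-lived EMR cluster."""
    cassandra_cost = realtime_nodes * node_price_per_hour * HOURS_PER_DAY
    emr_cost = emr_nodes * emr_price_per_hour * emr_hours_per_day
    return cassandra_cost + emr_cost


# Placeholder prices in dollars per node-hour; substitute real quotes.
NODE_PRICE = 1.00
EMR_PRICE = 0.50

dual_dc = daily_cost_dual_dc(5, 5, NODE_PRICE)
bigger_plus_emr = daily_cost_with_emr(8, NODE_PRICE, 5, EMR_PRICE, 4)

print(f"5 + 5 nodes, dual data center: ${dual_dc:.2f} per day")
print(f"8 nodes + nightly EMR:         ${bigger_plus_emr:.2f} per day")
```

What the sketch cannot capture is how much the nightly scans from EMR would hurt real-time latency on the eight-node cluster, which is really the part I am unsure about.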

I am curious if anyone has any thoughts / experience about this.

Best regards,
Clint
