Re: Running Cassandra + Spark on AWS - architecture questions

Eric Stevens Sun, 22 Feb 2015 08:03:43 -0800

I'm not sure whether this fits your use case, but you might also consider
setting up several keyspaces: one for data you want available for analytics
(such as business object tables), one for data you don't want to run
analytics on (such as custom secondary indices), and perhaps a third for
data that should exist only in the analytics space, such as temporary
rollup data.


Because replication is configured per keyspace, this can reduce the amount
of data you replicate into your analytics space and allow you to run a
smaller analytics cluster than your production cluster.
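
As a rough sketch of what I mean (not a tested setup), the snippet below
builds the CREATE KEYSPACE statements for that layout using
NetworkTopologyStrategy. The data center names "realtime" and "analytics",
the keyspace names, and the replication factors are all made up for
illustration; substitute whatever your cluster actually uses.

```python
def keyspace_cql(name, rf_by_dc):
    """Build a CREATE KEYSPACE statement with per-data-center replication.

    rf_by_dc maps a data center name to its replication factor. A data
    center that is left out (or given 0) holds no replicas of the keyspace.
    """
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc, rf in rf_by_dc.items():
        if rf > 0:
            parts.append("'%s': %d" % (dc, rf))
    options = "{" + ", ".join(parts) + "}"
    return "CREATE KEYSPACE IF NOT EXISTS %s WITH replication = %s;" % (name, options)


# Hypothetical layout: all names and replication factors are assumptions.
LAYOUT = {
    # Data you want available for analytics: replicated to both DCs.
    "business": {"realtime": 3, "analytics": 1},
    # Data you don't want in analytics: real-time DC only.
    "indexes": {"realtime": 3},
    # Temporary rollup data: analytics DC only.
    "rollups": {"analytics": 1},
}

if __name__ == "__main__":
    for keyspace, rf_by_dc in LAYOUT.items():
        print(keyspace_cql(keyspace, rf_by_dc))
```

Only the first keyspace is stored in both data centers, so the analytics
nodes never have to hold the index data, and the real-time nodes never hold
the rollups.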

On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan <doanduy...@gmail.com> wrote:

> "Cassandra would take care of keeping the data synced between the two
> sets of five nodes.  Is that correct?"
>
> Correct
>
> "But doing so means that we need 2x as many nodes as we need for the
> real-time cluster alone"
>
> Not necessarily. With multiple DCs you can configure the replication
> factor per DC, meaning that you can have RF=3 for the real-time DC and
> RF=1 or RF=2 for the analytics DC. The number of nodes can therefore
> differ between the DCs.
>
> In addition, you can also tune the hardware. If the real-time DC mostly
> handles writes of incoming data and reads from aggregated tables, it is
> less IO-intensive than the analytics DC, which does lots of reads and
> writes to compute aggregations.
>
>
>
> On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly <clint.ke...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I read the DSE 4.6 documentation and I'm still not 100% sure what a mixed
>> workload Cassandra + Spark installation would look like, especially on
>> AWS.  What I gather is that you use OpsCenter to set up the following:
>>
>>
>>    - One "virtual data center" for real-time processing (e.g., ingestion
>>    of time-series data, replying to requests for an interactive application)
>>    - Another "virtual data center" for batch analytics (Spark, possibly
>>    for machine learning)
>>
>>
>> If I understand this correctly, if I estimate that I need a five-node
>> cluster to handle all of my data, under the system described above, I would
>> have five nodes serving real-time traffic and all of the data replicated in
>> another five nodes that I use for batch processing.  Cassandra would take
>> care of keeping the data synced between the two sets of five nodes.  Is
>> that correct?
>>
>> I assume the motivation for such a dual-virtual-data-center architecture
>> is to prevent the Spark jobs (which are going to do lots of scans from
>> Cassandra, and maybe run computation on the same machines hosting
>> Cassandra) from disrupting the real-time performance.  But doing so means
>> that we need 2x as many nodes as we need for the real-time cluster alone.
>>
>> *Could someone confirm that my interpretation above of what I read about
>> in the DSE documentation is correct?*
>>
>> If my application needs to run analytics on Spark only a few hours a day,
>> might we be better off spending our money to get a bigger Cassandra cluster
>> and then just spin up Spark jobs on EMR for a few hours at night?  (I know
>> this is a hard question to answer, since it all depends on the
>> application---just curious if anyone else here has had to make similar
>> tradeoffs.)  e.g., maybe instead of having a five-node real-time cluster,
>> we would have an eight-node real-time cluster, and use our remaining budget
>> on EMR jobs.
>>
>> I am curious if anyone has any thoughts / experience about this.
>>
>> Best regards,
>> Clint
>>
>
>
