These are both good suggestions, thanks!

I thought I remembered reading that different virtual datacenters
should always have the same number of nodes, but I think I was mistaken about
that.  In the past we had major issues running huge analytics jobs on data
stored in HBase (it would bring down our real-time performance), so this
capability of Cassandra is great!
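
For anyone following along, here is a minimal sketch of how I understand the
two suggestions below (a replication factor per DC, plus separate keyspaces)
fitting together. The keyspace names, the DC names "realtime" and "analytics",
and the RF values are placeholders I made up for illustration, not anything
from the DSE docs; the helper only builds CQL strings and does not talk to a
cluster.

```python
def create_keyspace_cql(keyspace, rf_by_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    rf_by_dc maps a datacenter name to its replication factor. A datacenter
    that is left out of the map holds no replicas of this keyspace.
    """
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in sorted(rf_by_dc.items()))
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def total_copies(rf_by_dc):
    """Number of copies of each row kept across all datacenters."""
    return sum(rf_by_dc.values())


# Hypothetical layout: all names and replication factors are assumptions.
layout = {
    # Data wanted in both places, with fewer replicas on the analytics side.
    "business_objects": {"realtime": 3, "analytics": 1},
    # Data that should never reach the analytics DC.
    "secondary_indices": {"realtime": 3},
    # Temporary rollups that only live in the analytics DC.
    "rollups": {"analytics": 2},
}

statements = [create_keyspace_cql(ks, rfs) for ks, rfs in layout.items()]

if __name__ == "__main__":
    for keyspace, statement in zip(layout, statements):
        print(statement)
        print(f"-- {keyspace}: {total_copies(layout[keyspace])} copies per row")
```

The statements it prints could be pasted into cqlsh once the DC names match
what the snitch actually reports.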

Best regards,
Clint


On Sun, Feb 22, 2015 at 8:02 AM, Eric Stevens <migh...@gmail.com> wrote:

> I'm not sure if this is a good use case for you, but you might also
> consider setting up several keyspaces, one for any data you want available
> for analytics (such as business object tables), and one for data you don't
> want to do analytics on (such as custom secondary indices).  Maybe a third
> one for data which should only exist in the analytics space, such as for
> temporary rollup data.
>
> This can reduce the amount of data you replicate into your analytics
> space, and allow you to run a smaller analytics cluster than your
> production cluster.
>
> On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> "Cassandra would take care of keeping the data synced between the two
>> sets of five nodes.  Is that correct?"
>>
>> Correct
>>
>> "But doing so means that we need 2x as many nodes as we need for the
>> real-time cluster alone"
>>
>> Not necessarily. With multi DC you can configure the replication factor
>> value per DC, meaning that you can have RF=3 for the real-time DC and
>> RF=1 or RF=2 for the analytics DC. Thus the number of nodes can be
>> different for each DC.
>>
>> In addition, you can also tune the hardware. If the real-time DC mostly
>> handles writes of incoming data and reads from aggregated tables, it is
>> less IO intensive than the analytics DC, which does lots of reads & writes
>> to compute aggregations.
>>
>>
>>
>> On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly <clint.ke...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I read the DSE 4.6 documentation and I'm still not 100% sure what a
>>> mixed workload Cassandra + Spark installation would look like, especially
>>> on AWS.  What I gather is that you use OpsCenter to set up the following:
>>>
>>>
>>>    - One "virtual data center" for real-time processing (e.g.,
>>>    ingestion of time-series data, replying to requests for an interactive
>>>    application)
>>>    - Another "virtual data center" for batch analytics (Spark, possibly
>>>    for machine learning)
>>>
>>>
>>> If I understand this correctly, if I estimate that I need a five-node
>>> cluster to handle all of my data, under the system described above, I would
>>> have five nodes serving real-time traffic and all of the data replicated in
>>> another five nodes that I use for batch processing.  Cassandra would take
>>> care of keeping the data synced between the two sets of five nodes.  Is
>>> that correct?
>>>
>>> I assume the motivation for such a dual-virtual-data-center architecture
>>> is to prevent the Spark jobs (which are going to do lots of scans from
>>> Cassandra, and maybe run computation on the same machines hosting
>>> Cassandra) from disrupting the real-time performance.  But doing so means
>>> that we need 2x as many nodes as we need for the real-time cluster alone.
>>>
>>> *Could someone confirm that my interpretation above of what I read about
>>> in the DSE documentation is correct?*
>>>
>>> If my application needs to run analytics on Spark only a few hours a
>>> day, might we be better off spending our money to get a bigger Cassandra
>>> cluster and then just spin up Spark jobs on EMR for a few hours at night?
>>> (I know this is a hard question to answer, since it all depends on the
>>> application; I'm just curious whether anyone else here has had to make
>>> similar tradeoffs.)  For example, instead of having a five-node real-time
>>> cluster, we might have an eight-node real-time cluster and spend our
>>> remaining budget on EMR jobs.
>>>
>>> I am curious if anyone has any thoughts / experience about this.
>>>
>>> Best regards,
>>> Clint
>>>
>>
>>
>
