These are both good suggestions, thanks! I thought I remembered reading that different virtual datacenters should always have the same number of nodes, but it looks like I was mistaken about that. In the past we had major issues running huge analytics jobs on data stored in HBase (they would bring down our real-time performance), so this capability of Cassandra is great!
Best regards,
Clint

On Sun, Feb 22, 2015 at 8:02 AM, Eric Stevens <migh...@gmail.com> wrote:

> I'm not sure if this is a good use case for you, but you might also
> consider setting up several keyspaces, one for any data you want available
> for analytics (such as business object tables), and one for data you don't
> want to do analytics on (such as custom secondary indices). Maybe a third
> one for data which should only exist in the analytics space, such as for
> temporary rollup data.
>
> This can reduce the amount of data you replicate into your analytics
> space, and allow you to run a smaller analytics cluster than your
> production cluster.
>
> On Fri, Feb 20, 2015 at 2:43 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> "Cassandra would take care of keeping the data synced between the two
>> sets of five nodes. Is that correct?"
>>
>> Correct
>>
>> "But doing so means that we need 2x as many nodes as we need for the
>> real-time cluster alone"
>>
>> Not necessarily. With multi DC you can configure the replication factor
>> value per DC, meaning that you can have RF = 3 for the real time DC and
>> RF=1 or RF=2 for the analytics DC. Thus the number of nodes can be
>> different for each DC
>>
>> In addition, you can also tune the hardware. If the realtime DC is mostly
>> write only for incoming data and read-only from aggregated table, it is
>> less IO intensive than the analytics DC with lot of read & write to compute
>> aggregations.
>>
>> On Fri, Feb 20, 2015 at 10:17 PM, Clint Kelly <clint.ke...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I read the DSE 4.6 documentation and I'm still not 100% sure what a
>>> mixed workload Cassandra + Spark installation would look like, especially
>>> on AWS. What I gather is that you use OpsCenter to set up the following:
>>>
>>>    - One "virtual data center" for real-time processing (e.g.,
>>>    ingestion of time-series data, replying to requests for an interactive
>>>    application)
>>>    - Another "virtual data center" for batch analytics (Spark, possibly
>>>    for machine learning)
>>>
>>> If I understand this correctly, if I estimate that I need a five-node
>>> cluster to handle all of my data, under the system described above, I would
>>> have five nodes serving real-time traffic and all of the data replicated in
>>> another five nodes that I use for batch processing. Cassandra would take
>>> care of keeping the data synced between the two sets of five nodes. Is
>>> that correct?
>>>
>>> I assume the motivation for such a dual-virtual-data-center architecture
>>> is to prevent the Spark jobs (which are going to do lots of scans from
>>> Cassandra, and maybe run computation on the same machines hosting
>>> Cassandra) from disrupting the real-time performance. But doing so means
>>> that we need 2x as many nodes as we need for the real-time cluster alone.
>>>
>>> *Could someone confirm that my interpretation above of what I read about
>>> in the DSE documentation is correct?*
>>>
>>> If my application needs to run analytics on Spark only a few hours a
>>> day, might we be better off spending our money to get a bigger Cassandra
>>> cluster and then just spin up Spark jobs on EMR for a few hours at night?
>>> (I know this is a hard question to answer, since it all depends on the
>>> application---just curious if anyone else here has had to make similar
>>> tradeoffs.) e.g., maybe instead of having a five-node real-time cluster,
>>> we would have an eight-node real-time cluster, and use our remaining budget
>>> on EMR jobs.
>>>
>>> I am curious if anyone has any thoughts / experience about this.
>>>
>>> Best regards,
>>> Clint
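For anyone reading this thread later, the per-DC replication factor point from DuyHai's reply can be sketched roughly as follows. This is only an illustration under stated assumptions: the keyspace name `events`, the datacenter names `realtime` and `analytics`, and the node counts are hypothetical (the DC names would have to match whatever your snitch reports), and the helper functions are made up for this example. The only Cassandra-specific piece is the `NetworkTopologyStrategy` replication map, which takes one replication factor per datacenter.

```python
# Minimal sketch: build a CREATE KEYSPACE statement with a different
# replication factor (RF) per datacenter, and estimate how much of the
# dataset each node holds. All names and node counts are hypothetical.


def create_keyspace_cql(keyspace, rf_per_dc):
    """Return a CQL statement using NetworkTopologyStrategy with per-DC RF."""
    options = ", ".join(
        "'{}': {}".format(dc, rf) for dc, rf in rf_per_dc.items()
    )
    return (
        "CREATE KEYSPACE " + keyspace + " WITH replication = "
        "{'class': 'NetworkTopologyStrategy', " + options + "};"
    )


def per_node_share(rf, nodes):
    """Average fraction of the full dataset stored on each node of a DC."""
    if rf > nodes:
        raise ValueError("RF cannot exceed the number of nodes in the DC")
    return rf / nodes


# RF=3 in the real-time DC and RF=1 in the analytics DC, as suggested above.
statement = create_keyspace_cql("events", {"realtime": 3, "analytics": 1})
print(statement)

# Hypothetical sizing: five real-time nodes, two analytics nodes.
realtime_share = per_node_share(rf=3, nodes=5)
analytics_share = per_node_share(rf=1, nodes=2)
```

With these assumed numbers, each analytics node stores a share of the data comparable to a real-time node even though the analytics DC has fewer than half as many nodes, which is why the two DCs do not need to be the same size. Whether two nodes could actually keep up with the Spark scans is a separate question that depends on the workload.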