If you use containers like Docker, Plan A can work provided you do the resource and capacity planning. I tend to think that Plan B is more standard and easier, although you may want to wait for a second opinion from others.
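For reference, whichever plan you pick, a single Spark job can read both sources and join them. Below is a rough, untested sketch using the DataStax spark-cassandra-connector; the keyspace, table, column and HDFS path names are just placeholders:

import org.apache.spark.sql.SparkSession

object MixedSourceJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cassandra-plus-hdfs")
      // Point the connector at your Cassandra contact point(s).
      .config("spark.cassandra.connection.host", "cassandra-node-1")
      .getOrCreate()

    // Read from Cassandra via the DataStax spark-cassandra-connector
    // (artifact: com.datastax.spark:spark-cassandra-connector_2.12).
    val cassandraDf = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "events"))
      .load()

    // Read from HDFS (Parquet here; any Spark-supported format works).
    val hdfsDf = spark.read.parquet("hdfs:///data/reference/users")

    // Join the two sources; locality only helps the Cassandra scan,
    // the HDFS side is read wherever those blocks happen to live.
    val joined = cassandraDf.join(hdfsDf, Seq("user_id"))
    joined.write.parquet("hdfs:///data/output/joined_events")

    spark.stop()
  }
}

The point is that co-location mainly affects where the scans run, not how you write the job, so the analyst team's code looks the same under Plan A or Plan B.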
Caution: data locality only makes sense if disk throughput is significantly higher than network throughput (not every deployment has that profile).

On Thu, Jun 8, 2017 at 1:25 AM, 한 승호 <shha...@outlook.com> wrote:
> Hello,
>
> I am Seung-ho and I work as a data engineer in Korea. I need some advice.
>
> My company is recently considering replacing an RDBMS-based system with Cassandra and Hadoop.
> The purpose of this system is to analyze Cassandra and HDFS data with Spark.
>
> It seems many use cases put emphasis on data locality, for instance, both Cassandra and the Spark executor should be on the same node.
>
> The thing is, my company's data analyst team wants to analyze heterogeneous data sources, Cassandra and HDFS, using Spark.
> So, I wonder what the best practices for using Cassandra and Hadoop would be in such a case.
>
> Plan A: Both HDFS and Cassandra with NodeManager (Spark executor) on the same node
>
> Plan B: Cassandra + NodeManager / HDFS + NodeManager on separate nodes but in the same cluster
>
> Which would be better or more correct, or is there a better way?
>
> I appreciate your advice in advance :)
>
> Best Regards,
> Seung-Ho Han
>
> Sent from Mail for Windows 10 <https://go.microsoft.com/fwlink/?LinkId=550986>