What benefit do I really get if I colocate my Spark worker process and Cassandra server process on each node?
I understand the concept of moving compute to the data instead of moving data to the compute, but it sounds like that mainly optimizes for network latency. Most of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes per second) of network throughput, while the disk throughput for m4.xlarge is 93.75 MB/s (see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html). Since the disk is slower than the network, I don't see how colocation can help, even with a one-to-one mapping from each Spark worker to a colocated Cassandra node, when we are doing, say, a table scan of a billion rows. Thanks!
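
P.S. To make the comparison concrete, here is a rough back-of-envelope sketch of the arithmetic behind my question. The two throughput figures are the ones quoted above; the 100-byte average row size is purely a hypothetical number I picked for illustration, and the estimate looks at a single node and ignores CPU, serialization, replication, and caching.

```python
# Back-of-envelope scan-time estimate for one node.
# Throughput figures come from the question; row size is a made-up assumption.

NETWORK_MB_PER_S = 1000 / 8   # 1 Gbps expressed in MB/s (= 125.0)
DISK_MB_PER_S = 93.75         # m4.xlarge figure quoted from the AWS page

ROWS = 1_000_000_000          # "table scan of a billion rows"
ASSUMED_ROW_BYTES = 100       # hypothetical average row size, not measured


def scan_seconds(total_bytes: float, mb_per_s: float) -> float:
    """Time to move total_bytes through a link sustaining mb_per_s (1 MB = 10^6 bytes)."""
    return total_bytes / (mb_per_s * 1_000_000)


total_bytes = ROWS * ASSUMED_ROW_BYTES
disk_s = scan_seconds(total_bytes, DISK_MB_PER_S)
network_s = scan_seconds(total_bytes, NETWORK_MB_PER_S)

# Whichever path takes longer is the limiting factor in this simplified model.
bottleneck = "disk" if disk_s > network_s else "network"

print(f"disk read time:    {disk_s:.0f} s")
print(f"network send time: {network_s:.0f} s")
print(f"slower path:       {bottleneck}")
```

Under these assumptions the disk is the slower path, which is why I am unsure what colocation buys me for a full scan.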