I wonder what benefits I really get if I colocate my Spark worker
process and Cassandra server process on each node?

I understand the concept of moving compute towards the data instead of
moving data towards computation, but it sounds more like one is trying to
optimize for network latency.

The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes per
second) of network throughput,

and the disk throughput for m4.xlarge is 93.75 MB/s (link below):

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

So in this case I don't see how colocation can help, even if there is a
one-to-one mapping from each Spark worker node to a colocated Cassandra node,
where, say, we are doing a table scan of a billion rows?
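
A rough sketch of the arithmetic behind this question, using only the two
throughput figures quoted above. The 100-byte row size and the
"slowest stage wins" model are assumptions made purely for illustration;
real scans also depend on factors this sketch ignores.

```python
# Back-of-the-envelope comparison of a local vs. remote table scan.
# Only the two throughput figures come from the post; the rest is assumed.

NETWORK_MB_PER_S = 1000 / 8   # 1 Gbps expressed in MB/s (125.0)
DISK_MB_PER_S = 93.75         # m4.xlarge disk throughput quoted above

ROWS = 1_000_000_000          # "a billion rows"
ASSUMED_ROW_BYTES = 100       # hypothetical row size, not from the post


def scan_seconds(total_mb, *stage_mb_per_s):
    """Time to stream total_mb through pipelined stages.

    Simplifying assumption: the slowest stage alone sets the rate.
    """
    return total_mb / min(stage_mb_per_s)


total_mb = ROWS * ASSUMED_ROW_BYTES / 1_000_000

# Colocated: Spark reads from the Cassandra node's disk on the same host.
local = scan_seconds(total_mb, DISK_MB_PER_S)

# Not colocated: the same bytes cross the disk and then the network.
remote = scan_seconds(total_mb, DISK_MB_PER_S, NETWORK_MB_PER_S)

print(f"data size: {total_mb:,.0f} MB")
print(f"local scan:  {local:,.0f} s")
print(f"remote scan: {remote:,.0f} s")
```

Under these assumptions both cases come out the same (roughly 1,067
seconds), because the disk at 93.75 MB/s is slower than the network at
125 MB/s. That is the reason for the question.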

Thanks!
