I'm sure he meant that this is a downside to not colocating.
You are asking the right question. While networking is traditionally much
slower than disk, that changes a bit in the cloud, where attached storage
is remote too.
The disk throughput here is mostly achievable in normal workloads. However,
I think you'll find it's going to be much harder to get 1Gbps out of
network transfers. That's just the speed of the local interface, and of
course the actual transfer speed depends on the hops across the network
beyond that. Network latency is going to be higher than disk too, though
that's not as much of an issue in this context.
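
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. Only the two throughput figures and the billion-row count come from
this thread. The 100-byte row size and the 60% achievable fraction of line
rate are made-up assumptions for illustration, as is the simplification that
disk and network are pipelined so the slower stage sets the pace.

```python
# Back-of-the-envelope scan-time estimate for one node.
# Only the two throughput figures and the row count come from this thread;
# everything marked ASSUMPTION is a placeholder, not a measurement.

NETWORK_MB_S = 125.0   # 1 Gbps NIC line rate, from the thread
DISK_MB_S = 93.75      # m4.xlarge EBS-optimized throughput, from the thread
ROWS = 1_000_000_000   # "table scan of billion rows"

ROW_BYTES = 100        # ASSUMPTION: average row size in bytes
NET_EFFICIENCY = 0.6   # ASSUMPTION: fraction of line rate actually achieved


def scan_seconds(rows, row_bytes, mb_per_s):
    """Seconds to move rows * row_bytes at a sustained mb_per_s (1 MB = 10**6 bytes)."""
    return rows * row_bytes / (mb_per_s * 1_000_000)


def remote_scan_seconds(rows, row_bytes, disk_mb_s, net_mb_s, efficiency):
    """Non-colocated read, assuming disk and network are pipelined,
    so the slower of the two stages determines the total time."""
    bottleneck = min(disk_mb_s, net_mb_s * efficiency)
    return scan_seconds(rows, row_bytes, bottleneck)


local = scan_seconds(ROWS, ROW_BYTES, DISK_MB_S)
remote = remote_scan_seconds(ROWS, ROW_BYTES, DISK_MB_S, NETWORK_MB_S, NET_EFFICIENCY)
print(f"colocated (disk-bound):    {local:.0f} s")
print(f"remote (bottleneck-bound): {remote:.0f} s")
```

With NET_EFFICIENCY set to 1.0 the two estimates come out the same, which is
the case you described: on paper the NIC is not the bottleneck. The gap only
appears once the network delivers less than the disk does, so it is worth
measuring real transfer rates between your nodes before deciding.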

On Sat, Dec 3, 2016 at 8:42 AM kant kodali <kanth...@gmail.com> wrote:

> Wait, how is that a benefit? Isn't that a bad thing if you are saying
> colocating leads to more latency and overall execution time is longer?
>
> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> You get more latency on reads so overall execution time is longer
>
> On Dec 3, 2016 7:39 AM, "kant kodali" <kanth...@gmail.com> wrote:
>
>
> I wonder what benefits I really get if I colocate my Spark worker
> process and Cassandra server process on each node?
>
> I understand the concept of moving compute towards the data instead of
> moving data towards computation, but it sounds more like one is trying to
> optimize for network latency.
>
> The majority of my nodes (m4.xlarge) have 1Gbps = 125 MB/s (megabytes per
> second) network throughput,
>
> and the disk throughput for m4.xlarge is 93.75 MB/s (link below):
>
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>
> So in this case I don't see how colocation can help, even if there is a
> one-to-one mapping from Spark worker node to a colocated Cassandra node,
> where, say, we are doing a table scan of a billion rows?
>
> Thanks!
>
>
>
