On 3 Dec 2016, at 09:16, Manish Malhotra 
<manish.malhotra.w...@gmail.com<mailto:manish.malhotra.w...@gmail.com>> wrote:

Thanks for sharing the numbers as well!

Nowadays even the network can have very high throughput and might outperform 
the disk, but as Sean mentioned, data on the network has other dependencies 
such as network hops: traffic that crosses racks, for example, goes through at 
least one switch in between.

But yes, people are discussing Mesos plus a high-performance network and, for 
various use cases, are not worried about colocation.

AWS ephemeral (instance store) storage is not good for a reliable storage file 
system; EBS is the expensive alternative :)


If you are working with HDFS, then on Linux HDFS can bypass the entire network 
stack for local reads (short-circuit local reads): after opening a block for an 
authenticated user, HDFS passes the open file handle back to the caller so it 
can talk directly to the filesystem. You can't get any faster than that.
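
To make the "passes the open file handle back" step concrete, here is a minimal 
sketch of the general OS mechanism (file-descriptor passing over a Unix domain 
socket). It is not HDFS code: the temp file standing in for a block and the 
in-process "server"/"client" pair are assumptions for illustration, and it 
needs Python 3.9+ on a Unix-like system.

```python
import os
import socket
import tempfile

# Hypothetical stand-in for a block file owned by the "server" side.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"block contents")
    path = f.name

# A connected pair of Unix domain sockets (both ends in one process here).
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Server": open the file and hand the open descriptor to the peer.
fd = os.open(path, os.O_RDONLY)
socket.send_fds(server, [b"fd"], [fd])
os.close(fd)  # the in-flight copy keeps the file open for the receiver

# "Client": receive the descriptor and read the file directly,
# with no further data flowing through the socket.
msg, fds, flags, addr = socket.recv_fds(client, 1024, 1)
data = os.read(fds[0], 1024)

# Clean up.
os.close(fds[0])
server.close()
client.close()
os.unlink(path)

print(data)
```

Once the descriptor has been received, the reads go straight to the local 
filesystem, which is why no network stack is involved.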

On AWS, well, your life is complex as networking is now something you get to 
pay for in your choice of VM and storage options; it is going to generally 
offer lower performance than a physical cluster.

Me? I'd recommend using HDFS for transient storage and then S3 for persistent 
storage of the final data.


On Sat, Dec 3, 2016 at 1:12 AM, kant kodali 
<kanth...@gmail.com<mailto:kanth...@gmail.com>> wrote:
Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive 
throughput) on my Spark worker machine when I run `sudo iftop -B`.
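
As a rough sanity check, that reading can be compared with the nominal 1 Gbps 
link speed mentioned in this thread. The sketch below assumes decimal units 
(1 MB = 10^6 bytes); iftop's own unit convention may differ slightly, and the 
function name is just illustrative.

```python
# Compare an observed receive rate with the nominal link speed.
# ASSUMPTION: decimal units (1 Gbps = 1000 Mbps, 1 MB = 10**6 bytes).
LINK_GBPS = 1.0        # nominal NIC speed quoted in this thread
observed_mb_s = 95.0   # RX figure reported above


def link_utilization(observed_mb_per_s: float, link_gbps: float) -> float:
    """Return the fraction of the nominal link rate actually in use."""
    link_mb_per_s = link_gbps * 1000 / 8  # megabits/s -> megabytes/s
    return observed_mb_per_s / link_mb_per_s


utilization = link_utilization(observed_mb_s, LINK_GBPS)
print(f"{utilization:.0%} of nominal line rate")
```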

The problem with instance store volumes on AWS is that they are all ephemeral, 
so placing Cassandra on top doesn't make a lot of sense. In short, AWS doesn't 
seem to be the right place for colocating, in theory. I would still give you 
the benefit of the doubt and colocate :) but the numbers just don't reflect 
significant performance gains on AWS.


On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen 
<so...@cloudera.com<mailto:so...@cloudera.com>> wrote:
I'm sure he meant that this is a downside of not colocating.
You are asking the right question. While networking is traditionally much 
slower than disk, that changes a bit in the cloud, where attached storage is 
remote too.
The disk throughput here is mostly achievable in normal workloads. However, I 
think you'll find it's going to be much harder to get 1 Gbps out of network 
transfers. That's just the speed of the local interface, and of course the 
transfer speed depends on hops across the network beyond that. Network latency 
is going to be higher than disk too, though that's not as much an issue in this 
context.

On Sat, Dec 3, 2016 at 8:42 AM kant kodali 
<kanth...@gmail.com<mailto:kanth...@gmail.com>> wrote:
Wait, how is that a benefit? Isn't that a bad thing if you are saying 
colocating leads to more latency and overall execution time is longer?

On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski 
<vincent.gromakow...@gmail.com<mailto:vincent.gromakow...@gmail.com>> wrote:

You get more latency on reads, so overall execution time is longer.

On 3 Dec 2016 at 7:39 AM, "kant kodali" 
<kanth...@gmail.com<mailto:kanth...@gmail.com>> wrote:

I wonder what benefits I really get if I colocate my Spark worker process and 
Cassandra server process on each node.

I understand the concept of moving compute towards the data instead of moving 
data towards computation, but it sounds more like one is trying to optimize for 
network latency.

The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes per 
second) network throughput,

and the disk throughput for m4.xlarge is 93.75 MB/s (link below):

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

So in this case I don't see how colocation can help, even with a one-to-one 
mapping from each Spark worker node to a colocated Cassandra node, where, say, 
we are doing a table scan of a billion rows.
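
A back-of-the-envelope sketch of that comparison, using the two throughput 
figures above. The row size is a made-up assumption (100 bytes per row), the 
helper names are illustrative, and sustained transfer at the nominal rate is 
assumed, so this only shows how close the two ceilings are rather than 
predicting real scan times.

```python
# Rough comparison of the network and disk ceilings quoted above.
# ASSUMPTION: bytes_per_row is a hypothetical value, not a measurement.

def mbps_to_mb_per_s(megabits_per_s: float) -> float:
    """Convert megabits/s to megabytes/s (8 bits per byte, decimal units)."""
    return megabits_per_s / 8.0


def scan_seconds(rows: int, bytes_per_row: int, mb_per_s: float) -> float:
    """Time to move rows * bytes_per_row bytes at a sustained mb_per_s."""
    total_mb = rows * bytes_per_row / 1_000_000
    return total_mb / mb_per_s


network_mb_s = mbps_to_mb_per_s(1000)  # 1 Gbps NIC -> 125.0 MB/s
ebs_mb_s = mbps_to_mb_per_s(750)       # 750 Mbps -> 93.75 MB/s

rows = 1_000_000_000
bytes_per_row = 100  # hypothetical

for name, rate in (("network", network_mb_s), ("disk", ebs_mb_s)):
    secs = scan_seconds(rows, bytes_per_row, rate)
    print(f"{name}: {rate} MB/s, about {secs:.0f} s for the full scan")
```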

Thanks!




