On 3 Dec 2016, at 09:16, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote:
Thanks for sharing the numbers as well! Nowadays even the network can have very high throughput and might outperform the disk, but as Sean mentioned, data over the network has other dependencies such as network hops, e.g. if it's across racks, which can have a switch in between. But yes, people are discussing Mesos + high-performance networking and not worrying about colocation for various use cases. AWS ephemeral storage is not good as a reliable storage filesystem; EBS is the expensive alternative :)

If you're working with HDFS, then on Linux HDFS can bypass the entire network stack: after opening a block for an authenticated user, HDFS passes the open file handle back to the caller so they can talk directly to the filesystem. You can't get any faster than that. On AWS, well, your life is more complex, as networking is now something you pay for in your choice of VM and storage options; it is generally going to offer lower performance than a physical cluster. Me? I'd recommend using HDFS for transient storage and then S3 for persistent storage of the final data.
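[Editor's note: the handle-passing described above is HDFS's short-circuit local read feature. Below is a minimal client-side sketch, assuming a standard Hadoop client, that `fs.defaultFS` already points at the cluster, and that the same two properties are also set in hdfs-site.xml on every DataNode; the socket path and the file path are placeholders.]

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ShortCircuitReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Let the DFS client skip the DataNode's TCP path and read the block file
    // directly through a file descriptor passed over a Unix domain socket.
    conf.setBoolean("dfs.client.read.shortcircuit", true)
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket") // placeholder path

    val fs = FileSystem.get(conf)                       // assumes fs.defaultFS is the HDFS cluster
    val in = fs.open(new Path("/data/part-00000"))      // placeholder file
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n > 0) { n = in.read(buf) }                  // reads stay local when the block is colocated
    in.close()
    fs.close()
  }
}
```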
On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:
Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive throughput) on my Spark worker machine when I run `sudo iftop -B`. The problem with instance store on AWS is that it is all ephemeral, so placing Cassandra on top of it doesn't make a lot of sense. So in short, AWS doesn't seem to be the right place for colocating, in theory. I would still give you the benefit of the doubt and colocate :) but the numbers just don't show significant margins in terms of performance gains on AWS.

On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen <so...@cloudera.com> wrote:
I'm sure he meant that this is the downside to not colocating. You are asking the right question. While networking is traditionally much slower than disk, that changes a bit in the cloud, where attached storage is remote too. The disk throughput here is mostly achievable in normal workloads. However, I think you'll find it's going to be much harder to get 1 Gbps out of network transfers. That's just the speed of the local interface, and of course the transfer speed depends on hops across the network beyond that. Network latency is going to be higher than disk too, though that's not as much of an issue in this context.

On Sat, Dec 3, 2016 at 8:42 AM, kant kodali <kanth...@gmail.com> wrote:
Wait, how is that a benefit? Isn't that a bad thing, if you are saying colocating leads to more latency and overall execution time is longer?

On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
You get more latency on reads, so overall execution time is longer.

On 3 Dec 2016 at 7:39 AM, "kant kodali" <kanth...@gmail.com> wrote:
I wonder what benefits I really get if I colocate my Spark worker process and Cassandra server process on each node? I understand the concept of moving compute towards the data instead of moving data towards the compute, but it sounds more like one is trying to optimize for network latency. The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes per second) of network throughput, and the disk throughput for m4.xlarge is 93.75 MB/s (link below):
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
So in this case I don't see how colocation can help, even if there is a one-to-one mapping from a Spark worker node to a colocated Cassandra node where, say, we are doing a table scan of a billion rows? Thanks!
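[Editor's note: a back-of-the-envelope sketch of that last question, using only the throughput numbers quoted in the thread. The 200-byte average row size is an assumption, and compression, serialization and CPU cost are ignored.]

```scala
object ScanTimeSketch {
  def main(args: Array[String]): Unit = {
    val rows        = 1e9      // "a table scan of a billion rows"
    val bytesPerRow = 200.0    // assumed average row size
    val totalMB     = rows * bytesPerRow / 1e6

    val networkMBps = 125.0    // 1 Gbps, as quoted for m4.xlarge
    val diskMBps    = 93.75    // EBS-optimized throughput, as quoted

    def minutes(mbps: Double): Double = totalMB / mbps / 60
    println(f"Pulling ~${totalMB / 1000}%.0f GB over the network: ${minutes(networkMBps)}%.0f min")
    println(f"Reading the same data from EBS:         ${minutes(diskMBps)}%.0f min")
    // With the two rates this close, colocation buys little raw bandwidth on this
    // instance type; any gain comes from fewer hops and less contention, not from
    // the interface speed itself.
  }
}
```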