Forgot to mention that my entire cluster is in one DC. If it were spread across multiple DCs, then colocating would make sense in theory as well.

On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote:

> Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive
> throughput) on my Spark worker machine when I run `sudo iftop -B`.
>
> The problem with instance stores on AWS is that they are all ephemeral, so
> placing Cassandra on top of them doesn't make a lot of sense. So in short,
> AWS doesn't seem to be the right place for colocating, in theory. I would
> still give you the benefit of the doubt and colocate :) but the numbers are
> not showing significant performance gains on AWS.
>
> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I'm sure he meant that this is a downside to not colocating.
>> You are asking the right question. While networking is traditionally much
>> slower than disk, that changes a bit in the cloud, where attached storage
>> is remote too.
>> The disk throughput here is mostly achievable in normal workloads.
>> However, I think you'll find it's going to be much harder to get 1 Gbps
>> out of network transfers. That's just the speed of the local interface,
>> and of course the transfer speed depends on hops across the network
>> beyond that.
>> Network latency is going to be higher than disk too, though that's not as
>> much of an issue in this context.
>>
>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali <kanth...@gmail.com> wrote:
>>
>>> Wait, how is that a benefit? Isn't that a bad thing if you are saying
>>> colocating leads to more latency and overall execution time is longer?
>>>
>>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>>
>>> You get more latency on reads, so overall execution time is longer.
>>>
>>> On Dec 3, 2016 7:39 AM, "kant kodali" <kanth...@gmail.com> wrote:
>>>
>>> I wonder what benefits I really get if I colocate my Spark worker
>>> process and Cassandra server process on each node?
>>>
>>> I understand the concept of moving compute towards the data instead of
>>> moving data towards computation, but it sounds more like one is trying
>>> to optimize for network latency.
>>>
>>> The majority of my nodes (m4.xlarge) have 1 Gbps = 125 MB/s (megabytes
>>> per second) of network throughput,
>>>
>>> and the disk throughput for m4.xlarge is 93.75 MB/s (link below):
>>>
>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>>
>>> So in this case I don't see how colocation can help, even if there is a
>>> one-to-one mapping from a Spark worker node to a colocated Cassandra
>>> node, where, say, we are doing a table scan of a billion rows?
>>>
>>> Thanks!
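
For anyone who wants to redo the arithmetic from the original message, here is a rough back-of-the-envelope sketch in Python. It only uses the two figures quoted in the thread (1 Gbps network, 93.75 MB/s disk). The 100-byte row size and the helper names are assumptions made up for illustration, and the model (end-to-end throughput equals the slowest stage) ignores latency, hops across the network, and contention, all of which Sean points out matter in practice.

```python
# Back-of-the-envelope throughput model. Figures marked "from the thread"
# are quoted above; everything else is a labeled assumption.

def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


def bottleneck_mb_per_s(*stages_mb_per_s: float) -> float:
    """A simple pipeline model: throughput is capped by the slowest stage."""
    return min(stages_mb_per_s)


def scan_seconds(total_bytes: float, mb_per_s: float) -> float:
    """Time to move total_bytes at a sustained rate of mb_per_s."""
    return total_bytes / (mb_per_s * 1_000_000)


NETWORK_MB_S = gbps_to_mb_per_s(1.0)  # 1 Gbps, from the thread -> 125 MB/s
DISK_MB_S = 93.75                     # m4.xlarge disk figure, from the thread

ROWS = 1_000_000_000                  # "table scan of a billion rows"
ROW_BYTES = 100                       # ASSUMPTION: hypothetical average row size

# Colocated: data only has to come off the disk.
colocated = bottleneck_mb_per_s(DISK_MB_S)
# Not colocated: data comes off the disk, then crosses the network.
remote = bottleneck_mb_per_s(DISK_MB_S, NETWORK_MB_S)

total_bytes = ROWS * ROW_BYTES
print(f"network: {NETWORK_MB_S:.2f} MB/s, disk: {DISK_MB_S:.2f} MB/s")
print(f"colocated scan: {scan_seconds(total_bytes, colocated):.0f} s")
print(f"remote scan:    {scan_seconds(total_bytes, remote):.0f} s")
```

Under this simplified model both cases are capped by the disk figure, which is the point of the original question. If the network sustains less than 93.75 MB/s in practice, as Sean expects, the non-colocated number gets worse and colocation starts to pay off.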