Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help.
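For concreteness, here is a minimal sketch of what that could look like on the Spark 1.0 Streaming API. The ZooKeeper address, consumer group, topic, and partition count are placeholders I made up, not values from this thread:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches, as in the thread

// Placeholder ZooKeeper quorum, consumer group, and topic map.
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "example-group", Map("example-topic" -> 1))

// Spread the received records across the cluster before the heavy
// per-batch work; otherwise processing tends to stay on the few
// receiver nodes. The argument should roughly match your number of
// workers (or twice that).
val repartitioned = stream.repartition(300)

ssc.start()
ssc.awaitTermination()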
On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

> Hi Tobias,
>
> I was using Spark 0.9 before, and the master I used was yarn-standalone.
> In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am
> not sure whether this is the reason why more machines do not provide
> better scalability. What is the difference between these two modes in
> terms of efficiency? Thanks!
>
>
> On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
>> Bill,
>>
>> do the additional 100 nodes receive any tasks at all? (I don't know
>> which cluster you use, but with Mesos you could check client logs in the
>> web interface.) You might want to try something like repartition(N) or
>> repartition(N*2) (with N the number of your nodes) after you receive
>> your data.
>>
>> Tobias
>>
>>
>> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>> wrote:
>>
>>> Hi Tobias,
>>>
>>> Thanks for the suggestion. I have tried adding more nodes, from 300 to
>>> 400. It seems the running time did not improve.
>>>
>>>
>>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp>
>>> wrote:
>>>
>>>> Bill,
>>>>
>>>> can't you just add more nodes in order to speed up the processing?
>>>>
>>>> Tobias
>>>>
>>>>
>>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a problem using Spark Streaming to accept input data and
>>>>> update a result.
>>>>>
>>>>> The input data comes from Kafka, and the output is a map that is
>>>>> updated with historical data every minute. My current method is to
>>>>> set the batch size to 1 minute and use foreachRDD to update this map,
>>>>> outputting the map at the end of the foreachRDD function. However,
>>>>> the processing currently cannot be finished within one minute.
>>>>>
>>>>> I am thinking of updating the map whenever new data arrives instead
>>>>> of doing the update when the whole RDD arrives. Is there any idea on
>>>>> how to achieve this with a better running time? Thanks!
>>>>>
>>>>> Bill
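PS: on the original question at the bottom of the thread (updating the map as records arrive instead of once per whole RDD), updateStateByKey is the Spark Streaming primitive for keeping running state across batches. A minimal sketch, continuing from the stream in the snippet above; the checkpoint path, key extraction, and merge logic are illustrative assumptions, not your actual computation:

import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations

ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // placeholder path; stateful ops need checkpointing

// Fold each batch into per-key running totals instead of rebuilding the
// whole map inside foreachRDD every minute.
val totals = stream
  .map { case (_, value) => (value, 1L) } // assumed: the Kafka message value is the key
  .updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)
  }

// foreachRDD now only has to report the updated state.
totals.foreachRDD(rdd => rdd.take(100).foreach(println))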