Bill,

I haven't worked with YARN, but I would try adding a repartition() call
right after you receive your data from Kafka. I would be surprised if
that didn't help.
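
For concreteness, here is roughly what I have in mind. This is only a
sketch against the Spark 1.0 receiver-based KafkaUtils API; the ZooKeeper
quorum, consumer group, topic map, and node count below are placeholders
for your actual settings:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
  val ssc = new StreamingContext(conf, Seconds(60))

  // Placeholder Kafka settings -- substitute your own.
  val zkQuorum = "zk-host:2181"
  val group = "my-consumer-group"
  val topicMap = Map("my-topic" -> 1)
  val numNodes = 400

  // A single receiver stores all incoming blocks on one node.
  val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)

  // Force a shuffle so each batch is spread across the whole cluster
  // before the expensive processing starts.
  val repartitioned = kafkaStream.repartition(numNodes * 2)

Without the repartition(), everything downstream of a single receiver can
end up running on the one node that received the data.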


On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay <bill.jaypeter...@gmail.com>
wrote:

> Hi Tobias,
>
> I was using Spark 0.9 before, and the master I used was yarn-standalone.
> In Spark 1.0, the master is either yarn-cluster or yarn-client. I am not
> sure whether that is the reason why adding machines does not provide
> better scalability. What is the difference between these two modes in
> terms of efficiency? Thanks!
>
>
> On Tue, Jul 8, 2014 at 5:26 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>
>> Bill,
>>
>> do the additional 100 nodes receive any tasks at all? (I don't know
>> which cluster manager you use, but with Mesos you could check the client
>> logs in the web interface.) You might want to try something like
>> repartition(N) or repartition(N*2) (where N is the number of your nodes)
>> after you receive your data.
>>
>> Tobias
>>
>>
>> On Wed, Jul 9, 2014 at 3:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>> wrote:
>>
>>> Hi Tobias,
>>>
>>> Thanks for the suggestion. I tried adding more nodes, going from 300 to
>>> 400, but the running time did not improve.
>>>
>>>
>>> On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer <t...@preferred.jp>
>>> wrote:
>>>
>>>> Bill,
>>>>
>>>> can't you just add more nodes in order to speed up the processing?
>>>>
>>>> Tobias
>>>>
>>>>
>>>> On Thu, Jul 3, 2014 at 7:09 AM, Bill Jay <bill.jaypeter...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a problem using Spark Streaming to accept input data and
>>>>> update a result.
>>>>>
>>>>> The input data comes from Kafka, and the output is a map that is
>>>>> updated with historical data and reported every minute. My current
>>>>> method is to set the batch size to 1 minute and use foreachRDD to
>>>>> update this map, outputting it at the end of the foreachRDD function.
>>>>> However, the issue is that the processing cannot finish within one
>>>>> minute.
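>>>>>
>>>>> Roughly, my code has the following shape (a simplified sketch; the
>>>>> per-key count stands in for my actual update logic, and the Kafka
>>>>> connection settings are placeholders):
>>>>>
>>>>>   import scala.collection.mutable
>>>>>
>>>>>   // ssc, zkQuorum, group, topicMap as in my actual job (placeholders).
>>>>>   val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
>>>>>
>>>>>   // Historical result, kept on the driver across batches.
>>>>>   val resultMap = mutable.Map[String, Long]()
>>>>>
>>>>>   stream.foreachRDD { rdd =>
>>>>>     // Pull the minute's batch back to the driver and fold it into
>>>>>     // the historical map.
>>>>>     rdd.collect().foreach { case (key, _) =>
>>>>>       resultMap(key) = resultMap.getOrElse(key, 0L) + 1L
>>>>>     }
>>>>>     // Report the updated map at the end of each batch.
>>>>>     println(resultMap)
>>>>>   }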
>>>>>
>>>>> I am thinking of updating the map whenever new data arrive instead of
>>>>> doing the update only when the whole RDD comes. Does anyone have an
>>>>> idea on how to achieve this with a better running time? Thanks!
>>>>>
>>>>> Bill
>>>>>
>>>>
>>>>
>>>
>>
>
