Hi Praveen,

I did not change the total number of executors. I specified 300 as the
number of executors when I submitted the jobs. However, for some stages,
only a small number of executors are used, leading to long computation
times even for a small data set. That means not all executors were used
in those stages.

When I looked at the running times of the individual executors, I found
that most of them had very low running times while a few had very long
ones, leading to a long overall running time. Another point I noticed is
that the number of completed tasks is sometimes larger than the number of
total tasks. That means the job is still running in some stages even
though all the tasks have finished. These are the two behaviors I
observed that may be related to the abnormal running time.

Bill


On Thu, Jul 10, 2014 at 11:26 PM, Praveen Seluka <psel...@qubole.com> wrote:

> If I understand correctly, you cannot change the number of executors at
> runtime (correct me if I am wrong) - it is defined when we start the
> application and stays fixed. Do you mean the number of tasks?
>
>
> On Fri, Jul 11, 2014 at 6:29 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Can you try explicitly setting the number of partitions in all the
>> shuffle-based DStream operations? It may be that the default
>> parallelism (that is, spark.default.parallelism) is not being
>> respected.
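For reference, one way to set the default parallelism when building the streaming context (a sketch only; the app name is hypothetical, and the value 300 matches the executor count mentioned earlier in this thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: set spark.default.parallelism up front so that shuffles
// which do not specify a partition count still get a sensible default.
val conf = new SparkConf()
  .setAppName("kafka-groupby-job") // hypothetical app name
  .set("spark.default.parallelism", "300")

val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches
```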
>>
>> Regarding the unusual delay, I would look at the task details of that
>> stage in the Spark web UI. It will show a breakdown of the time spent
>> in each task, including GC time, etc. That might give some indication.
>>
>> TD
>>
>>
>> On Thu, Jul 10, 2014 at 5:13 PM, Bill Jay <bill.jaypeter...@gmail.com>
>> wrote:
>>
>>> Hi Tathagata,
>>>
>>> I set the default parallelism to 300 in my configuration file.
>>> Sometimes there are more executors in a job. However, it is still slow.
>>> I further observed that most executors take less than 20 seconds, but
>>> two of them take much longer, such as 2 minutes. The data size is very
>>> small (less than 480k lines with only 4 fields). I am not sure why the
>>> group by operation takes more than 3 minutes.  Thanks!
>>>
>>> Bill
>>>
>>>
>>> On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>>> Are you specifying the number of reducers in all the DStream.*ByKey
>>>> operations? If the number of reducers is not set, then the number of
>>>> reducers used in the stages can keep changing across batches.
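A minimal sketch of passing an explicit partition count to the DStream `*ByKey` operations (the DStream name `lines` and the key-extraction logic here are hypothetical; `groupByKey` and `reduceByKey` accept a `numPartitions` argument):

```scala
// Sketch only: assumes a Kafka-backed DStream `lines` of raw
// comma-separated records (a hypothetical name).
val pairs = lines.map(line => (line.split(",")(0), line))

// Passing numPartitions explicitly fixes the number of reducers
// for this shuffle, instead of letting it vary across batches.
val grouped = pairs.groupByKey(numPartitions = 300)

// The same applies to reduceByKey and similar operations:
val counts = pairs
  .map { case (key, _) => (key, 1) }
  .reduceByKey(_ + _, numPartitions = 300)
```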
>>>>
>>>> TD
>>>>
>>>>
>>>> On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay <bill.jaypeter...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a Spark Streaming job running on YARN. It consumes data from
>>>>> Kafka and groups the data by a certain field. The data size is 480k
>>>>> lines per minute, and the batch size is 1 minute.
>>>>>
>>>>> For some batches, the program takes more than 3 minutes to finish
>>>>> the groupBy operation, which seems slow to me. I allocated 300
>>>>> workers and specified 300 as the partition number for groupBy. When I
>>>>> checked the slow stage "combineByKey at ShuffledDStream.scala:42",
>>>>> there were sometimes only 2 executors allocated for this stage.
>>>>> However, during other batches, the executors can number several
>>>>> hundred for the same stage, which means the number of executors for
>>>>> the same operation changes.
>>>>>
>>>>> Does anyone know how Spark allocates the number of executors for
>>>>> different stages and how to increase the efficiency of the tasks?
>>>>> Thanks!
>>>>>
>>>>> Bill
>>>>>
>>>>
>>>>
>>>
>>
>
