It really depends on the code. I would say the easiest way is to restart
the problematic action, find the straggler task, and analyze what's
happening with it using jstack (or take a heap dump and analyze it
locally). For example, it might be the case that your tasks are connecting
to some external resource and this resource is timing out under the
pressure. Also, call toDebugString on the problematic RDD before calling
the action that triggers the computation; this will give you an
understanding of what your execution tasks are really doing.
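For a concrete sketch in spark-shell, where sc is already defined (the
input path and transformations below are hypothetical placeholders for
your real job):

    // Placeholder pipeline standing in for your actual one; swap in your RDD.
    val rdd = sc.textFile("hdfs:///data/input")   // hypothetical path
      .map(line => (line.take(8), 1))
      .reduceByKey(_ + _)

    // Print the lineage BEFORE the action: each indentation level marks a
    // stage boundary (shuffle), so you can map stages to the straggling
    // tasks you see in the UI.
    println(rdd.toDebugString)

    // Trigger the computation and watch which stage the straggler lands in.
    val counts = rdd.count()

While the straggler is running, you can also run jstack <executor-pid> on
that node to see what the task thread is actually blocked on.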

On Fri, Aug 28, 2015 at 7:47 PM, Muler <mulugeta.abe...@gmail.com> wrote:

> I have a 7 node cluster running in standalone mode (1 executor per node,
> 100g/executor, 18 cores/executor)
>
> Attached is the Task status for two of my nodes. I'm not clear why some of
> my tasks are taking too long:
>
>    1. [node sk5, green] task 197 took 35 mins while task 218 took less
>    than 2 mins, yet if you look at their output size/records, they are
>    almost the same. Even stranger, the shuffle spill for both memory and
>    disk is 0 for task 197, and yet it is still taking a long time.
>
> Same issue for my other node (sk3, red).
>
> Can you please explain what is going on?
>
> Thanks,
>
>



-- 
Alexey Grishchenko, http://0x0fff.com
