This did not work for me, that is, rdd.coalesce(200, forceShuffle). Does anyone
have ideas on how to distribute your data evenly and co-locate partitions of
interest?
Thanks - I found the same thing. Calling
boolean forceShuffle = true;
myRDD = myRDD.coalesce(120, forceShuffle);
worked - there were already 120 partitions, but forcing a shuffle distributes the
work evenly. I believe there is a bug in my code causing memory to accumulate as
partitions grow in size.
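For reference, a minimal standalone sketch of the same fix; the data and the
120-partition target here are placeholders, not taken from the original job:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class ForceShuffleSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ForceShuffleSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stand-in data; the real job's RDD would come from its own source.
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100000; i++) data.add(i);
        JavaRDD<Integer> myRDD = sc.parallelize(data, 120);

        // coalesce(n, true) forces a shuffle, so records get redistributed
        // evenly across the 120 partitions instead of staying where they are.
        boolean forceShuffle = true;
        JavaRDD<Integer> balanced = myRDD.coalesce(120, forceShuffle);

        // repartition(n) is shorthand for coalesce(n, true).
        JavaRDD<Integer> alsoBalanced = myRDD.repartition(120);

        System.out.println(balanced.partitions().size());      // 120
        System.out.println(alsoBalanced.partitions().size());  // 120
        sc.stop();
    }
}

Without the shuffle flag, coalesce only merges existing partitions in place,
which is why it cannot rebalance skewed data on its own.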
Good point, Ankit.
Steve - You can click on the link for '27' in the first column to get a
breakdown of how much data is in each of those 116 cached partitions. But
really, you also want to understand how much data is in the 4 non-cached
partitions, as they may be huge. One thing you can try doing ...
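If the links in the UI are not usable, here is a rough sketch of getting a
similar per-partition breakdown programmatically; myRDD stands in for the RDD in
question and the String element type is just an assumption:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import scala.Tuple2;

import java.util.Collections;
import java.util.Iterator;

// Count the records in every partition of an existing JavaRDD<String> myRDD.
Function2<Integer, Iterator<String>, Iterator<Tuple2<Integer, Long>>> countPerPartition =
    (index, records) -> {
        long n = 0;
        while (records.hasNext()) { records.next(); n++; }
        return Collections.singletonList(new Tuple2<>(index, n)).iterator();
    };

for (Tuple2<Integer, Long> entry :
        myRDD.mapPartitionsWithIndex(countPerPartition, true).collect()) {
    System.out.println("partition " + entry._1() + ": " + entry._2() + " records");
}

A record count does not show byte sizes, but it is usually enough to spot
partitions that are empty or wildly oversized.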
I ran into something similar before. 19/20 partitions would complete very
quickly, and 1 would take the bulk of the time and of the shuffle reads & writes.
This was because the majority of partitions were empty and 1 had all the data.
Perhaps something similar is going on here - I would suggest taking a look at how
your data is distributed across the partitions.
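If the imbalance does come from one hot key, here is a minimal sketch of key
salting that can spread it out; the pair RDD name, the String/Long key and value
types, and the salt factor of 32 are all made up for illustration:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

import java.util.concurrent.ThreadLocalRandom;

// Assumes an existing JavaPairRDD<String, Long> pairs where one key holds most of the data.
final int SALT = 32;  // how many ways to split each hot key

// Appending a random suffix lets the hot key's records land in up to SALT different partitions.
JavaPairRDD<String, Long> salted = pairs.mapToPair(kv ->
    new Tuple2<>(kv._1() + "#" + ThreadLocalRandom.current().nextInt(SALT), kv._2()));

// Aggregate on the salted key first, then strip the salt and combine the partial sums.
JavaPairRDD<String, Long> totals = salted
    .reduceByKey((a, b) -> a + b)
    .mapToPair(kv -> new Tuple2<>(kv._1().substring(0, kv._1().lastIndexOf('#')), kv._2()))
    .reduceByKey((a, b) -> a + b);

The second reduceByKey sees at most SALT records per original key, so the heavy
lifting of the first aggregation is spread across many partitions.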
1) I can go there, but none of the links are clickable.
2) When I see something like 116/120 partitions succeeded in the Stages UI, in
the Storage UI I see:
NOTE: RDD 27 has 116 partitions cached - 4 are not, and that is exactly the
number of machines which will not complete.
Also, RDD 27 does not show up in ...
Have you tried taking thread dumps via the UI? There is a link to do so on the
Executors page (typically at http://<driver IP>:4040/executors).
By visualizing the thread call stacks of the executors with slow-running tasks,
you can see exactly what code is executing at an instant in time. If you see the
same stack each time you sample, that is likely where the task is stuck.
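That UI page is essentially a JVM thread dump, so if it is hard to reach, here is
a rough sketch of collecting similar information with plain java.lang.Thread (no
Spark API involved; where you call it from, e.g. a log statement inside a
long-running task, is up to you):

import java.util.Map;

// Print the stack of every live thread in whichever JVM runs this,
// e.g. called from the driver or logged from inside a stuck task.
public static void dumpAllThreads() {
    for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
        System.err.println("Thread: " + entry.getKey().getName()
            + " (" + entry.getKey().getState() + ")");
        for (StackTraceElement frame : entry.getValue()) {
            System.err.println("    at " + frame);
        }
    }
}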
I am working on a problem which will eventually involve many millions of
function calls. I have a small sample with several thousand calls working, but
when I try to scale up the amount of data things stall. I use 120 partitions and
116 finish in very little time. The remaining 4 seem to do all the work.