After slimming down the job quite a bit, it looks like a call to coalesce()
on a larger RDD can cause these Python worker spikes (additional details in
the JIRA comment:
https://issues.apache.org/jira/browse/SPARK-5395?focusedCommentId=14294570&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14294570
).

Any ideas as to what coalesce() is doing that triggers the creation of
additional workers?
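
For reference, the step in question boils down to roughly the following (a
minimal sketch rather than the actual job; the partition counts, the keying
lambda, and the app name are made up for illustration):

    from pyspark import SparkContext

    sc = SparkContext(appName="coalesce-daemon-repro")

    # Start with many partitions and coalesce down to far fewer. Without a
    # shuffle, each coalesced task pulls in several parent partitions, and
    # this is the step that seems to trigger the extra pyspark.daemon
    # processes.
    rdd = sc.parallelize(range(1000 * 1000), 400).map(lambda x: (x % 1000, x))
    print(rdd.coalesce(20).count())

    sc.stop()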

On Sat, Jan 24, 2015 at 12:27 AM, Sven Krasser <kras...@gmail.com> wrote:

> Hey Davies,
>
> Sure thing, it's filed here now:
> https://issues.apache.org/jira/browse/SPARK-5395
>
> As far as a repro goes, what is a normal number of workers I should
> expect? Even shortly after kicking the job off, I see workers in the
> double digits per container. Here's an example using pstree on a worker
> node (the worker node runs 16 containers, i.e. the java───bash branches
> below). The initial Python process in each container is forked between 7
> and 33 times.
>
>      ├─java─┬─bash───java─┬─python───22*[python]
>      │      │             └─89*[{java}]
>      │      ├─bash───java─┬─python───13*[python]
>      │      │             └─80*[{java}]
>      │      ├─bash───java─┬─python───9*[python]
>      │      │             └─78*[{java}]
>      │      ├─bash───java─┬─python───24*[python]
>      │      │             └─91*[{java}]
>      │      ├─3*[bash───java─┬─python───19*[python]]
>      │      │                └─86*[{java}]]
>      │      ├─bash───java─┬─python───25*[python]
>      │      │             └─90*[{java}]
>      │      ├─bash───java─┬─python───33*[python]
>      │      │             └─98*[{java}]
>      │      ├─bash───java─┬─python───7*[python]
>      │      │             └─75*[{java}]
>      │      ├─bash───java─┬─python───15*[python]
>      │      │             └─80*[{java}]
>      │      ├─bash───java─┬─python───18*[python]
>      │      │             └─84*[{java}]
>      │      ├─bash───java─┬─python───10*[python]
>      │      │             └─85*[{java}]
>      │      ├─bash───java─┬─python───26*[python]
>      │      │             └─90*[{java}]
>      │      ├─bash───java─┬─python───17*[python]
>      │      │             └─85*[{java}]
>      │      ├─bash───java─┬─python───24*[python]
>      │      │             └─92*[{java}]
>      │      └─265*[{java}]
>
> Each container has 2 cores, in case that makes a difference.
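>
> For what it's worth, this is roughly how I'm tallying the daemons on a
> worker node without pstree (just a rough sketch over the ps output, nothing
> Spark-specific):
>
>     import subprocess
>
>     # List every process's full command line and count the
>     # "python -m pyspark.daemon" entries (the parent daemons plus forks).
>     out = subprocess.check_output(["ps", "-eo", "args"])
>     daemons = [line for line in out.splitlines() if b"pyspark.daemon" in line]
>     print(len(daemons))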
>
> Thank you!
> -Sven
>
> On Fri, Jan 23, 2015 at 11:52 PM, Davies Liu <dav...@databricks.com>
> wrote:
>
>> This looks like a bug; the Python workers did not exit normally. Could
>> you file a JIRA for this?
>>
>> Also, could you show how to reproduce this behavior?
>>
>> On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser <kras...@gmail.com> wrote:
>> > Hey Adam,
>> >
>> > I'm not sure I understand just yet what you have in mind. My takeaway
>> > from the logs is that the container actually was above its allotment of
>> > about 14G. Since 6G of that is for overhead, I assumed there would be
>> > plenty of space for the Python workers, but there seem to be more of
>> > them than I'd expect.
>> >
>> > Does anyone know if that is actually the intended behavior, i.e. in this
>> > case over 90 Python processes on a 2-core executor?
>> >
>> > Best,
>> > -Sven
>> >
>> >
>> > On Fri, Jan 23, 2015 at 10:04 PM, Adam Diaz <adam.h.d...@gmail.com> wrote:
>> >>
>> >> YARN only has the ability to kill containers, not to checkpoint or
>> >> suspend them. If you use too much memory, it will simply kill tasks
>> >> based on the YARN config.
>> >> https://issues.apache.org/jira/browse/YARN-2172
>> >>
>> >>
>> >> On Friday, January 23, 2015, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> >>>
>> >>> Hi Sven,
>> >>>
>> >>> What version of Spark are you running? Recent versions have a change
>> >>> that allows PySpark to share a pool of processes instead of starting a
>> >>> new one for each task.
>> >>>
>> >>> -Sandy
>> >>>
>> >>> On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser <kras...@gmail.com> wrote:
>> >>>>
>> >>>> Hey all,
>> >>>>
>> >>>> I am running into a problem where YARN kills containers for being over
>> >>>> their memory allocation (which is about 8G for executors plus 6G for
>> >>>> overhead), and I noticed that in those containers there are tons of
>> >>>> pyspark.daemon processes hogging memory. Here's a snippet from a
>> >>>> container with 97 pyspark.daemon processes. The total RSS usage across
>> >>>> all of these is 1,764,956 pages (i.e. 6.7 GB on the system).
>> >>>>
>> >>>> Any ideas what's happening here and how I can get the number of
>> >>>> pyspark.daemon processes back to a more reasonable count?
>> >>>>
>> >>>> 2015-01-23 15:36:53,654 INFO  [Reporter] yarn.YarnAllocationHandler
>> >>>> (Logging.scala:logInfo(59)) - Container marked as failed:
>> >>>> container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics:
>> >>>> Container [pid=35211,containerID=container_1421692415636_0052_01_000030]
>> >>>> is running beyond physical memory limits. Current usage: 14.9 GB of
>> >>>> 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used.
>> >>>> Killing container.
>> >>>> Dump of the process-tree for container_1421692415636_0052_01_000030 :
>> >>>>    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> >>>>       SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> >>>>    |- 54101 36625 36625 35211 (python) 78 1 332730368 16834
>> >>>>       python -m pyspark.daemon
>> >>>>    |- 52140 36625 36625 35211 (python) 58 1 332730368 16837
>> >>>>       python -m pyspark.daemon
>> >>>>    |- 36625 35228 36625 35211 (python) 65 604 331685888 17694
>> >>>>       python -m pyspark.daemon
>> >>>>
>> >>>>    [...]
>> >>>>
>> >>>> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
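>> >>>>
>> >>>> For reference, this is a sketch of how the job's memory settings would
>> >>>> look as a SparkConf. The exact conf keys are my reading of the docs
>> >>>> rather than a known fix, and the worker-reuse flag is listed only
>> >>>> because it governs how many pyspark.daemon processes stick around:
>> >>>>
>> >>>>     from pyspark import SparkConf, SparkContext
>> >>>>
>> >>>>     conf = (SparkConf()
>> >>>>             .set("spark.executor.memory", "8g")
>> >>>>             # Headroom YARN grants beyond the JVM heap (in MB); the
>> >>>>             # Python workers have to fit in here.
>> >>>>             .set("spark.yarn.executor.memoryOverhead", "6144")
>> >>>>             # Spark 1.2+ reuses a pool of Python workers when this is
>> >>>>             # true (the default).
>> >>>>             .set("spark.python.worker.reuse", "true"))
>> >>>>     sc = SparkContext(conf=conf)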
>> >>>>
>> >>>> Thank you!
>> >>>> -Sven
>> >>>>
>> >>>> --
>> >>>> krasser
>> >>>
>> >>>
>> >
>> >
>> >
>> > --
>> > http://sites.google.com/site/krasser/?utm_source=sig
>>
>
>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
>



-- 
http://sites.google.com/site/krasser/?utm_source=sig
