After slimming down the job quite a bit, it looks like a call to coalesce() on a large RDD can cause these Python worker spikes (additional details in the JIRA comment: https://issues.apache.org/jira/browse/SPARK-5395?focusedCommentId=14294570&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14294570).
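For illustration, here is a toy sketch of the kind of call I mean. Everything in it (the data, the partition counts, the expand() function) is made up for the sake of the example; only the coalesce() step itself reflects what the job does:

from pyspark import SparkConf, SparkContext

# Toy stand-in only: data, sizes, and expand() are placeholders; the point
# is a wide map followed by coalesce(), which shrinks many partitions down
# to a few.
sc = SparkContext(conf=SparkConf().setAppName("coalesce-worker-spike-sketch"))

def expand(x):
    # placeholder for the real per-record Python work
    return [(x % 1024, str(x))] * 10

rdd = sc.parallelize(range(1000000), numSlices=2048)
shrunk = rdd.flatMap(expand).coalesce(64)  # 2048 partitions -> 64, no shuffle
print(shrunk.count())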
Any ideas as to what coalesce() is doing that triggers the creation of additional workers?

On Sat, Jan 24, 2015 at 12:27 AM, Sven Krasser <kras...@gmail.com> wrote:
> Hey Davies,
>
> Sure thing, it's filed here now:
> https://issues.apache.org/jira/browse/SPARK-5395
>
> As far as a repro goes, what is a normal number of workers I should
> expect? Even shortly after kicking the job off, I see workers in the
> double digits per container. Here's an example using pstree on a worker
> node (the worker node runs 16 containers, i.e. the java--bash branches
> below). The initial Python process per container is forked between 7 and
> 33 times.
>
> ├─java─┬─bash───java─┬─python───22*[python]
> │      │             └─89*[{java}]
> │      ├─bash───java─┬─python───13*[python]
> │      │             └─80*[{java}]
> │      ├─bash───java─┬─python───9*[python]
> │      │             └─78*[{java}]
> │      ├─bash───java─┬─python───24*[python]
> │      │             └─91*[{java}]
> │      ├─3*[bash───java─┬─python───19*[python]]
> │      │                └─86*[{java}]]
> │      ├─bash───java─┬─python───25*[python]
> │      │             └─90*[{java}]
> │      ├─bash───java─┬─python───33*[python]
> │      │             └─98*[{java}]
> │      ├─bash───java─┬─python───7*[python]
> │      │             └─75*[{java}]
> │      ├─bash───java─┬─python───15*[python]
> │      │             └─80*[{java}]
> │      ├─bash───java─┬─python───18*[python]
> │      │             └─84*[{java}]
> │      ├─bash───java─┬─python───10*[python]
> │      │             └─85*[{java}]
> │      ├─bash───java─┬─python───26*[python]
> │      │             └─90*[{java}]
> │      ├─bash───java─┬─python───17*[python]
> │      │             └─85*[{java}]
> │      ├─bash───java─┬─python───24*[python]
> │      │             └─92*[{java}]
> │      └─265*[{java}]
>
> Each container has 2 cores in case that makes a difference.
>
> Thank you!
> -Sven
>
> On Fri, Jan 23, 2015 at 11:52 PM, Davies Liu <dav...@databricks.com> wrote:
>
>> It should be a bug, the Python worker did not exit normally; could you
>> file a JIRA for this?
>>
>> Also, could you show how to reproduce this behavior?
>>
>> On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser <kras...@gmail.com> wrote:
>> > Hey Adam,
>> >
>> > I'm not sure I understand just yet what you have in mind. My takeaway
>> > from the logs is that the container actually was above its allotment
>> > of about 14G. Since 6G of that are for overhead, I assumed there to be
>> > plenty of space for Python workers, but there seem to be more of those
>> > than I'd expect.
>> >
>> > Does anyone know if that is actually the intended behavior, i.e. in
>> > this case over 90 Python processes on a 2-core executor?
>> >
>> > Best,
>> > -Sven
>> >
>> > On Fri, Jan 23, 2015 at 10:04 PM, Adam Diaz <adam.h.d...@gmail.com> wrote:
>> >>
>> >> YARN only has the ability to kill, not to checkpoint or suspend. If
>> >> you use too much memory, it will simply kill tasks based upon the
>> >> YARN config. https://issues.apache.org/jira/browse/YARN-2172
>> >>
>> >> On Friday, January 23, 2015, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>> >>>
>> >>> Hi Sven,
>> >>>
>> >>> What version of Spark are you running? Recent versions have a change
>> >>> that allows PySpark to share a pool of processes instead of starting
>> >>> a new one for each task.
>> >>>
>> >>> -Sandy
>> >>>
>> >>> On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser <kras...@gmail.com> wrote:
>> >>>>
>> >>>> Hey all,
>> >>>>
>> >>>> I am running into a problem where YARN kills containers for being
>> >>>> over their memory allocation (which is about 8G for executors plus
>> >>>> 6G for overhead), and I noticed that in those containers there are
>> >>>> tons of pyspark.daemon processes hogging memory.
>> >>>> Here's a snippet from a container with 97 pyspark.daemon processes.
>> >>>> The total sum of RSS usage across all of these is 1,764,956 pages
>> >>>> (i.e. 6.7GB on the system).
>> >>>>
>> >>>> Any ideas what's happening here and how I can get the number of
>> >>>> pyspark.daemon processes back to a more reasonable count?
>> >>>>
>> >>>> 2015-01-23 15:36:53,654 INFO [Reporter] yarn.YarnAllocationHandler
>> >>>> (Logging.scala:logInfo(59)) - Container marked as failed:
>> >>>> container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics:
>> >>>> Container [pid=35211,containerID=container_1421692415636_0052_01_000030]
>> >>>> is running beyond physical memory limits. Current usage: 14.9 GB of
>> >>>> 14.5 GB physical memory used; 41.3 GB of 72.5 GB virtual memory used.
>> >>>> Killing container.
>> >>>> Dump of the process-tree for container_1421692415636_0052_01_000030 :
>> >>>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>> >>>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>> >>>> |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m
>> >>>> pyspark.daemon
>> >>>> |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m
>> >>>> pyspark.daemon
>> >>>> |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python -m
>> >>>> pyspark.daemon
>> >>>>
>> >>>> [...]
>> >>>>
>> >>>> Full output here: https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
>> >>>>
>> >>>> Thank you!
>> >>>> -Sven
>> >>>>
>> >>>> --
>> >>>> krasser
>> >>>
>> >
>> > --
>> > http://sites.google.com/site/krasser/?utm_source=sig
>>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig

--
http://sites.google.com/site/krasser/?utm_source=sig
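P.S. For anyone else running into the container kills above: the knobs I'd look at (this assumes a Spark 1.2-era deployment on YARN, so please double-check the exact names against the docs for your release) are spark.yarn.executor.memoryOverhead for the headroom YARN grants beyond the executor heap, spark.python.worker.memory for how much each pyspark.daemon worker buffers before spilling to disk, and spark.python.worker.reuse for sharing a pool of Python workers across tasks. A sketch matching the 8G executor + 6G overhead layout from the thread:

from pyspark import SparkConf, SparkContext

# Sketch only -- the config names and values are assumptions for a
# Spark-on-YARN 1.2-era setup; verify them against your Spark version.
conf = (SparkConf()
        .setAppName("pyspark-memory-headroom")
        .set("spark.executor.memory", "8g")                 # executor JVM heap
        .set("spark.yarn.executor.memoryOverhead", "6144")  # MB of headroom for pyspark.daemon workers etc.
        .set("spark.python.worker.memory", "512m")          # per-worker buffer before spilling to disk
        .set("spark.python.worker.reuse", "true"))          # reuse Python workers across tasks, if supported
sc = SparkContext(conf=conf)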