Nirav,

Sorry to hear about your experience with Spark; however, "sucks" is a very strong word. Many organizations are processing a lot more than 150 GB of data with Spark.
Mohammed
Author: Big Data Analytics with Spark <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Nirav Patel [mailto:npa...@xactlycorp.com]
Sent: Wednesday, February 3, 2016 11:31 AM
To: Stefan Panayotov
Cc: Jim Green; Ted Yu; Jakob Odersky; user@spark.apache.org
Subject: Re: Spark 1.5.2 memory error

Hi Stefan,

Welcome to the OOM (heap space) club. I have been struggling with similar errors (OOM and the YARN executor being killed), with jobs failing or going into retry loops. I bet the same job will run perfectly fine with fewer resources as a Hadoop MapReduce program; I have tested this for my program, and it does work.

Bottom line from my experience: Spark sucks at memory management when a job is processing a large (not huge) amount of data. It's failing for me with 16 GB executors, 10 executors, 6 threads each, and the data it's processing is only 150 GB! That's 1 billion rows for me; the same job works perfectly fine with 1 million rows.

Hope that saves you some trouble.

Nirav

On Wed, Feb 3, 2016 at 11:00 AM, Stefan Panayotov <spanayo...@msn.com> wrote:

I drastically increased the memory:

spark.executor.memory = 50g
spark.driver.memory = 8g
spark.driver.maxResultSize = 8g
spark.yarn.executor.memoryOverhead = 768

I still see executors killed, but this time memory does not seem to be the issue. The error in the Jupyter notebook is:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.io.IOException: Failed to connect to /10.0.0.9:48755

From the nodemanager log corresponding to worker 10.0.0.9:

2016-02-03 17:31:44,917 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeApplication(129)) - Initializing application application_1454509557526_0014
2016-02-03 17:31:44,918 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZING to LOCALIZED
2016-02-03 17:31:44,947 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from LOCALIZED to RUNNING
2016-02-03 17:31:44,951 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(267)) - launchContainer: [bash, /mnt/resource/hadoop/yarn/local/usercache/root/appcache/application_1454509557526_0014/container_1454509557526_0014_01_000093/default_container_executor.sh]
2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(371)) - Starting resource-monitoring for container_1454509557526_0014_01_000093
2016-02-03 17:31:45,686 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000011

Then I can see the memory usage increasing from 230.6 MB to 12.6 GB, which is far below 50g, and then the container suddenly gets killed!?
2016-02-03 17:33:17,350 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 30962 for container-id container_1454509557526_0014_01_000093: 12.6 GB of 51 GB physical memory used; 52.8 GB of 107.1 GB virtual memory used
2016-02-03 17:33:17,613 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from RUNNING to KILLING
2016-02-03 17:33:17,613 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(370)) - Cleaning up container container_1454509557526_0014_01_000093
2016-02-03 17:33:17,629 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_1454509557526_0014_01_000093 is : 143
2016-02-03 17:33:17,667 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2016-02-03 17:33:17,669 INFO nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(89)) - USER=root OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1454509557526_0014 CONTAINERID=container_1454509557526_0014_01_000093
2016-02-03 17:33:17,670 INFO container.ContainerImpl (ContainerImpl.java:handle(1131)) - Container container_1454509557526_0014_01_000093 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2016-02-03 17:33:17,670 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_1454509557526_0014_01_000093 from application application_1454509557526_0014
2016-02-03 17:33:17,671 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(546)) - Considering container container_1454509557526_0014_01_000093 for log-aggregation
2016-02-03 17:33:17,671 INFO containermanager.AuxServices (AuxServices.java:handle(196)) - Got event CONTAINER_STOP for appId application_1454509557526_0014
2016-02-03 17:33:17,671 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(161)) - Stopping container container_1454509557526_0014_01_000093
2016-02-03 17:33:20,351 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(385)) - Stopping resource-monitoring for container_1454509557526_0014_01_000093
2016-02-03 17:33:20,383 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(458)) - Memory usage of ProcessTree 28727 for container-id container_1454509557526_0012_01_000001: 319.8 MB of 1.5 GB physical memory used; 1.7 GB of 3.1 GB virtual memory used
2016-02-03 17:33:22,627 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(529)) - Removed completed containers from NM context: [container_1454509557526_0014_01_000093]

I'll appreciate any suggestions.
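For reference, here is a minimal PySpark sketch (not taken from this thread; the application name is hypothetical) of how configuration values like the ones listed above can be supplied when the context is created:

    # Sketch only: mirrors the configuration values quoted above.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("ml-pipeline")                           # hypothetical name
            .set("spark.executor.memory", "50g")
            .set("spark.driver.maxResultSize", "8g")
            .set("spark.yarn.executor.memoryOverhead", "768"))   # value is in MB
    sc = SparkContext(conf=conf)

Note that spark.driver.memory only takes effect before the driver JVM starts, so it is normally passed on the command line (for example, spark-submit --driver-memory 8g) rather than set on an already-running context.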
Thanks,

Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net

________________________________
Date: Tue, 2 Feb 2016 15:40:10 -0800
Subject: Re: Spark 1.5.2 memory error
From: openkbi...@gmail.com
To: spanayo...@msn.com
CC: yuzhih...@gmail.com; ja...@odersky.com; user@spark.apache.org

Look at part #3 in the blog below:
http://www.openkb.info/2015/06/resource-allocation-configurations-for.html

You may want to increase the executor memory, not just spark.yarn.executor.memoryOverhead.

On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov <spanayo...@msn.com> wrote:

For the memoryOverhead I have the default of 10% of 16g, and the Spark version is 1.5.2.

Stefan Panayotov, PhD
Sent from Outlook Mail for Windows 10 phone

From: Ted Yu
Sent: Tuesday, February 2, 2016 4:52 PM
To: Jakob Odersky
Cc: Stefan Panayotov; user@spark.apache.org
Subject: Re: Spark 1.5.2 memory error

What value do you use for spark.yarn.executor.memoryOverhead? Please see https://spark.apache.org/docs/latest/running-on-yarn.html for a description of the parameter.

Which Spark release are you using?

Cheers

On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky <ja...@odersky.com> wrote:

Can you share some code that produces the error? It is probably not due to Spark itself but rather to the way data is handled in the user code. Does your code call any reduceByKey actions? These are often a source of OOM errors.
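As an illustration of the kind of call Jakob is asking about, here is a minimal, hypothetical PySpark sketch of a key-wise aggregation; the input and output paths, the key extraction, and the SparkContext (sc) are placeholders, not code from this thread:

    # Hypothetical sketch of a shuffle-heavy aggregation built on reduceByKey.
    from operator import add

    lines = sc.textFile("hdfs:///some/input")                # placeholder path
    counts = (lines
              .map(lambda line: (line.split(",")[0], 1))     # key on the first field
              .reduceByKey(add))                             # merges values per key, triggering a shuffle
    counts.saveAsTextFile("hdfs:///some/output")             # placeholder path

Whether a step like this runs out of memory depends largely on the size of the shuffled partitions and any large per-key state, so partition counts and key skew are worth checking alongside the memory settings.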
On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov <spanayo...@msn.com> wrote:

Hi Guys,

I need help with Spark memory errors when executing ML pipelines. The error that I see is:

16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager)
16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.

And also:

16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_000005 (state: COMPLETE, exit status: -104)
16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:16768, vCores:2>)
16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_000037 for on host 10.0.0.8
16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.

I'll really appreciate any help here.

Thank you,

Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net

--
Thanks,
www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)
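To put rough numbers on the YARN warning quoted above ("16.8 GB of 16.5 GB physical memory used"), here is a small sketch of the container-size arithmetic; the boosted overhead value is an assumption, and the exact enforced limit also depends on the YARN scheduler's allocation rounding:

    # Sketch of the arithmetic behind the YARN container kill (exit status -104).
    executor_memory_mb = 16 * 1024    # spark.executor.memory = 16g
    overhead_mb = 384                 # overhead reported in the allocator log
    print(executor_memory_mb + overhead_mb)            # 16768 MB, matching the container request above

    # The process grew past the enforced physical limit, so YARN killed the container.
    # Raising spark.yarn.executor.memoryOverhead raises the requested container size, e.g.:
    boosted_overhead_mb = 1536        # hypothetical value
    print(executor_memory_mb + boosted_overhead_mb)    # 17920 MB per container

The overhead covers off-heap usage (JVM metaspace, thread stacks, native buffers), which is why a heap that fits within spark.executor.memory can still push the whole container over its YARN limit.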