Zeppelin doesn't set the memory. Anyhow, you have my two cents; good luck.
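To make the reduceByKey() suggestion from the reply below concrete, here is a minimal sketch against the same file used throughout this thread. The keying by line length is invented purely for illustration, and sc is the context that pyspark and Zeppelin provide:

    # reduceByKey() combines values per key on the executors, so only one
    # small record per distinct key ever reaches the driver.
    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    pairs = myfile.map(lambda s: (len(s), 1))                  # (line length, count)
    counts = pairs.reduceByKey(lambda a, b: a + b)             # per-key sums, executor-side
    totalLength = counts.map(lambda kv: kv[0] * kv[1]).sum()   # tiny result to the driver
    print(totalLength)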
From: Hoc Phan <quang...@yahoo.com>
Reply: Hoc Phan <quang...@yahoo.com>
Date: December 17, 2015 at 4:03:39 AM
To: Amos B. Elberg <amos.elb...@me.com>, users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>
Subject: Re: Spark Job aborted due to stage failure - error=13 Permission denied

Hi,

I just followed a very simple exercise:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

The file is tiny, a few hundred KB (not even 1 MB). This is out-of-the-box Cloudera 5.5 with Spark 1.5 on YARN, and I am running on machines with 100 GB of RAM. So there must be something wrong in how Zeppelin sets memory, because, as I said earlier in the thread, I can run the exact same lines in the pyspark CLI just fine (on the same machine, as the same user).

On Thursday, December 17, 2015 12:55 AM, Amos B. Elberg <amos.elb...@me.com> wrote:

If it's failing at the reduce step and you're getting an OOM error, then it's probably not the *executor* memory that's the issue. There is also master memory and backend memory.

When you do a reduce(), Spark wants to dump the data into the Spark "backend," which is the JVM process that initiated the Spark job. That process is probably sitting on the same machine as your Zeppelin server. So if there is a lot of data, your algorithm can run fine on the executors but OOM at the reduce step, when all the data hits the backend.

What you want to do is look at your logs closely and figure out whether the stack traces you're seeing are actually coming from executor processes, the backend, or the master.

You can increase the memory for the backend by configuring spark.driver.memory. Another alternative is to change your algorithm, for example with reduceByKey(), so the reduce step happens in chunks on the executors rather than on the backend.

But, just a warning: whenever I've had issues with OOM'ing the backend, trying to fix them by adjusting memory settings always turned out to be a rabbit hole. So you could also read the error as a yellow flag that you should re-engineer your algorithm. That's what I do when I see this error now.

From: Hoc Phan <quang...@yahoo.com>
Reply: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Date: December 17, 2015 at 3:33:15 AM
To: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Subject: Re: Spark Job aborted due to stage failure - error=13 Permission denied

Hi,

Any help on this? I have been stuck for a week. I tried to follow this thread:

http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCABU7W=ZwKPyuPYzTQncg9wCSAs-v=c1c+welsvzx4qj7eg-...@mail.gmail.com%3E

I set these, without luck:

> export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
> export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"

It only fails at the .reduce() step. I think it is trying to do some IO at /usr/bin on one of the Cloudera worker nodes. Why is that? But in the log I also saw:

    java.lang.OutOfMemoryError: GC overhead limit exceeded

So I am quite confused. I set executor memory to 5g and that didn't help.
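Since raising executor memory didn't help, one quick check is to print what the driver JVM actually received, to see whether the "backend" memory Amos describes above was ever raised. A minimal sketch, assuming PySpark's sc.getConf() and the standard Spark property names ("unset" is just an illustrative fallback):

    # Confirm what the running driver actually got. Note spark.driver.memory
    # must be in place before the driver JVM starts (spark-defaults.conf or
    # --driver-memory); setting it on an already-running context has no effect.
    conf = sc.getConf()
    print(conf.get("spark.driver.memory", "unset"))    # the "backend" heap
    print(conf.get("spark.executor.memory", "unset"))  # the 5g mentioned above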
On Tuesday, December 15, 2015 10:34 AM, Hoc Phan <quang...@yahoo.com> wrote:

Hi,

/usr/bin is where pyspark and spark-shell are located, but both have executable permission.

What I don't get is that when I ssh into the machine and log in as the "zeppelin" user, I can run the same script in pyspark without a problem. So my questions are: what is Zeppelin trying to access, and as which user? What is a way to trace and troubleshoot this?

On Tuesday, December 15, 2015 9:46 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

It looks like it doesn't have permission to launch something:

    Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied

Perhaps the file path is incorrect? It looks to point to /usr/bin, which is likely a directory.
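A quick way to test that theory from a pyspark paragraph; a minimal sketch, assuming the worker Python comes from the PYSPARK_PYTHON environment variable (Spark's default is plain "python"):

    import os

    # Spark execs this value to start Python workers; exec'ing a directory
    # such as /usr/bin fails with error=13 (Permission denied).
    exe = os.environ.get("PYSPARK_PYTHON", "python")
    print(exe, os.path.isdir(exe))
    # True here would match the Cannot run program "/usr/bin" failure below.
    # Note this shows the driver's environment; the failing exec happens in
    # the YARN executors, whose environment can differ.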
On Mon, Dec 14, 2015 at 12:25 PM -0800, "Hoc Phan" <quang...@yahoo.com> wrote:

Hi,

When I installed Zeppelin, I created a zeppelin user with the following permissions:

    uid=500(zeppelin) gid=490(hdfs) groups=490(hdfs),492(hadoop),501(supergroup)

I ran this via pyspark just fine as this zeppelin user:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)
    print totalLength

However, when I run the same thing in Zeppelin, I get the error below. Any idea?

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 14, cdhe1worker0.fbdl.local): java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: error=13, Permission denied
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
        at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
        ... 13 more

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:744)
    Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: error=13, Permission denied
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
        at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
        ... 13 more
    (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n', JavaObject id=o87), <traceback object at 0x1d37b00>)
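On the open questions above (which user is Zeppelin running as, and how to trace it), a minimal diagnostic sketch that can be pasted into a Zeppelin pyspark paragraph; sc comes from the notebook context, and the probe function name is invented here:

    import getpass, os

    # Driver side: the user and Python settings the Zeppelin interpreter sees.
    print(getpass.getuser(), os.environ.get("PYSPARK_PYTHON"))

    # Executor side: run the same probe inside the YARN containers. If this
    # paragraph reproduces the error=13 failure, the executors' Python
    # configuration is the culprit rather than the driver's.
    def probe(_):
        import getpass, os
        return [(getpass.getuser(), os.environ.get("PYSPARK_PYTHON"))]

    print(sc.parallelize(range(4), 2).mapPartitions(probe).collect())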