Hi, I just followed a very simple exercise:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

The file is tiny, a few hundred KB (not even 1 MB). This is out-of-the-box Cloudera 5.5 with Spark 1.5 on YARN, and I am running on machines with 100 GB of RAM. So there must be something wrong in how Zeppelin sets memory, because, as I said in the previous thread, I can run the exact same lines in the pyspark CLI just fine (on the same machine, as the same user).
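For reference, this is roughly what I understand the driver-side memory change (the spark.driver.memory setting Amos mentions below) would look like on the Zeppelin side. I have not verified that this Zeppelin build actually reads SPARK_SUBMIT_OPTIONS, and 4g is only a placeholder value:

    # conf/zeppelin-env.sh (assuming this build passes SPARK_SUBMIT_OPTIONS through to spark-submit)
    export SPARK_SUBMIT_OPTIONS="--driver-memory 4g"

    # or in Spark's conf/spark-defaults.conf on the host running the Zeppelin interpreter
    spark.driver.memory 4g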
On Thursday, December 17, 2015 12:55 AM, Amos B. Elberg <amos.elb...@me.com> wrote:

If it's failing at the reduce step and you're getting an OOM error, then it's probably not the *executor* memory that's the issue. There's also master memory and backend memory. When you do a reduce(), Spark wants to dump the data into the Spark "backend", which is the JVM process that initiated the Spark job. That's probably sitting on the same machine as your Zeppelin server. Anyway, if it's a lot of data, your algorithm can run fine on the executors but OOM on the reduce when all the data hits the backend.

What you want to do is look at your logs closely and try to figure out whether the stack traces you're seeing are actually coming from executor processes, the backend, or the master. You can increase the memory for the backend by configuring spark.driver.memory. Other alternatives are to change your algorithm, such as with reduceByKey(), so the reduce step happens in chunks on the executors rather than on the backend.

But, just a warning: whenever I've had issues with OOM'ing the backend, trying to fix it by adjusting memory settings always turned out to be a rabbit hole. So you could also interpret the error as a yellow flag that you should re-engineer your algorithm. That's what I do when I see the error now.

From: Hoc Phan <quang...@yahoo.com>
Reply: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Date: December 17, 2015 at 3:33:15 AM
To: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Subject: Re: Spark Job aborted due to stage failure - error=13 Permission denied

Hi, any help on this? I have been stuck for a week. I have tried to follow this thread:
http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCABU7W=ZwKPyuPYzTQncg9wCSAs-v=c1c+welsvzx4qj7eg-...@mail.gmail.com%3E

I set these without luck:

> export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
> export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"

It only failed at the .reduce() step. I think it is trying to do some IO at /usr/bin on one of the Cloudera worker nodes. Why is that? But in the log, I also saw:

java.lang.OutOfMemoryError: GC overhead limit exceeded

So I am quite confused. I set executor memory to 5g and that didn't help.

On Tuesday, December 15, 2015 10:34 AM, Hoc Phan <quang...@yahoo.com> wrote:

Hi, /usr/bin is where pyspark and spark-shell are located, but all of them have executable permission. What I don't get is that when I ssh into the machine and log in as the "zeppelin" user, I am able to go through the same script in pyspark. So my question is: what is Zeppelin trying to access, and as which user? What is a way to trace and troubleshoot this?

On Tuesday, December 15, 2015 9:46 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

It looks like it doesn't have permission to launch something:

Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied

Perhaps the file path is incorrect? It looks to point to /usr/bin, which is likely a directory.
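Following up on Felix's point: my understanding is that PySpark launches the worker interpreter from the PYSPARK_PYTHON environment variable, so if that variable somehow resolves to the /usr/bin directory rather than an actual interpreter, it would produce exactly this error. This is only a guess at what to check, not a confirmed diagnosis:

    # as the zeppelin user, on the host where the interpreter/executors run
    echo $PYSPARK_PYTHON
    # if it points at the /usr/bin directory, repoint it at a real executable, e.g.:
    export PYSPARK_PYTHON=/usr/bin/python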
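On Amos's reduceByKey() suggestion above, this is roughly how I read it for this job: do the combining per key on the executors and only bring the small aggregated result back, instead of reducing everything through the driver. An untested sketch; the "length" key is just a placeholder:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")

    # reduce() merges the per-partition results back on the driver-side JVM
    totalLength = myfile.map(lambda s: len(s)).reduce(lambda a, b: a + b)

    # reduceByKey() keeps the combining on the executors; only the small
    # (key, sum) pairs come back when collected
    totalByKey = (myfile.map(lambda s: ("length", len(s)))
                        .reduceByKey(lambda a, b: a + b)
                        .collect())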
On Mon, Dec 14, 2015 at 12:25 PM -0800, "Hoc Phan" <quang...@yahoo.com> wrote:

Hi, when I installed Zeppelin, I created a zeppelin user with the permissions below:

uid=500(zeppelin) gid=490(hdfs) groups=490(hdfs),492(hadoop),501(supergroup)

I ran this via pyspark just fine under this zeppelin user:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)
    print totalLength

However, when I run the same thing using Zeppelin, I got the error below. Any idea?

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 14, cdhe1worker0.fbdl.local): java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
    ... 13 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
    ... 13 more

(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n', JavaObject id=o87), <traceback object at 0x1d37b00>)