Zeppelin doesn't set the memory. Anyhow, you have my two cents; good luck.
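To make the reduceByKey() suggestion from the reply below concrete, here is a minimal sketch against the same file used throughout this thread. The keying by line length is invented purely for illustration, and sc is the context that pyspark and Zeppelin provide:

    # reduceByKey() combines values per key on the executors, so only one
    # small record per distinct key ever reaches the driver.
    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    pairs = myfile.map(lambda s: (len(s), 1))                  # (line length, count)
    counts = pairs.reduceByKey(lambda a, b: a + b)             # per-key sums, executor-side
    totalLength = counts.map(lambda kv: kv[0] * kv[1]).sum()   # tiny result to the driver
    print(totalLength)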
From: Hoc Phan <quang...@yahoo.com>
Reply: Hoc Phan <quang...@yahoo.com>
Date: December 17, 2015 at 4:03:39 AM
To: Amos B. Elberg <amos.elb...@me.com>, users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>
Subject: Re: Spark Job aborted due to stage failure - error=13 Permission denied

Hi,

I just followed a very simple exercise:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

The file is tiny, a few hundred KB (not even 1 MB). This is out-of-the-box Cloudera 5.5 with Spark 1.5 on YARN, and I am running on machines with 100 GB of RAM. So there must be something wrong in how Zeppelin sets memory, because, as I said earlier in the thread, I can run the exact same lines in the pyspark CLI just fine (on the same machine, as the same user).

On Thursday, December 17, 2015 12:55 AM, Amos B. Elberg <amos.elb...@me.com> wrote:

If it's failing at the reduce step and you're getting an OOM error, then it's probably not the *executor* memory that's the issue. There is also master memory and backend memory.

When you do a reduce(), Spark wants to dump the data into the Spark "backend," which is the JVM process that initiated the Spark job. That process is probably sitting on the same machine as your Zeppelin server. So if there is a lot of data, your algorithm can run fine on the executors but OOM at the reduce step, when all the data hits the backend.

What you want to do is look at your logs closely and figure out whether the stack traces you're seeing are actually coming from executor processes, the backend, or the master.

You can increase the memory for the backend by configuring spark.driver.memory. Another alternative is to change your algorithm, for example with reduceByKey(), so the reduce step happens in chunks on the executors rather than on the backend.

But, just a warning: whenever I've had issues with OOM'ing the backend, trying to fix them by adjusting memory settings always turned out to be a rabbit hole. So you could also read the error as a yellow flag that you should re-engineer your algorithm. That's what I do when I see this error now.

From: Hoc Phan <quang...@yahoo.com>
Reply: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Date: December 17, 2015 at 3:33:15 AM
To: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Subject: Re: Spark Job aborted due to stage failure - error=13 Permission denied

Hi,

Any help on this? I have been stuck for a week. I tried to follow this thread:

http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCABU7W=ZwKPyuPYzTQncg9wCSAs-v=c1c+welsvzx4qj7eg-...@mail.gmail.com%3E

I set these, without luck:

> export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
> export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"

It only fails at the .reduce() step. I think it is trying to do some IO at /usr/bin on one of the Cloudera worker nodes. Why is that? But in the log I also saw:

    java.lang.OutOfMemoryError: GC overhead limit exceeded

So I am quite confused. I set executor memory to 5g and that didn't help.
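Since raising executor memory didn't help, one quick check is to print what the driver JVM actually received, to see whether the "backend" memory Amos describes above was ever raised. A minimal sketch, assuming PySpark's sc.getConf() and the standard Spark property names ("unset" is just an illustrative fallback):

    # Confirm what the running driver actually got. Note spark.driver.memory
    # must be in place before the driver JVM starts (spark-defaults.conf or
    # --driver-memory); setting it on an already-running context has no effect.
    conf = sc.getConf()
    print(conf.get("spark.driver.memory", "unset"))    # the "backend" heap
    print(conf.get("spark.executor.memory", "unset"))  # the 5g mentioned above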
On Tuesday, December 15, 2015 10:34 AM, Hoc Phan <quang...@yahoo.com> wrote:

Hi,

/usr/bin is where pyspark and spark-shell are located, but both have executable permission.

What I don't get is that when I ssh into the machine and log in as the "zeppelin" user, I can run the same script in pyspark without a problem. So my questions are: what is Zeppelin trying to access, and as which user? What is a way to trace and troubleshoot this?

On Tuesday, December 15, 2015 9:46 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

It looks like it doesn't have permission to launch something:

    Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied

Perhaps the file path is incorrect? It looks to point to /usr/bin, which is likely a directory.
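A quick way to test that theory from a pyspark paragraph; a minimal sketch, assuming the worker Python comes from the PYSPARK_PYTHON environment variable (Spark's default is plain "python"):

    import os

    # Spark execs this value to start Python workers; exec'ing a directory
    # such as /usr/bin fails with error=13 (Permission denied).
    exe = os.environ.get("PYSPARK_PYTHON", "python")
    print(exe, os.path.isdir(exe))
    # True here would match the Cannot run program "/usr/bin" failure below.
    # Note this shows the driver's environment; the failing exec happens in
    # the YARN executors, whose environment can differ.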
On Mon, Dec 14, 2015 at 12:25 PM -0800, "Hoc Phan" <quang...@yahoo.com> wrote:

Hi,

When I installed Zeppelin, I created a zeppelin user with the following permissions:

    uid=500(zeppelin) gid=490(hdfs) groups=490(hdfs),492(hadoop),501(supergroup)

I ran this via pyspark just fine as this zeppelin user:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)
    print totalLength

However, when I run the same thing in Zeppelin, I get the error below. Any idea?

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 14, cdhe1worker0.fbdl.local): java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: error=13, Permission denied
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
        at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
        ... 13 more

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:744)
    Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: error=13, Permission denied
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
        at java.lang.ProcessImpl.start(ProcessImpl.java:130)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
        ... 13 more
    (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n', JavaObject id=o87), <traceback object at 0x1d37b00>)
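On the open questions above (which user is Zeppelin running as, and how to trace it), a minimal diagnostic sketch that can be pasted into a Zeppelin pyspark paragraph; sc comes from the notebook context, and the probe function name is invented here:

    import getpass, os

    # Driver side: the user and Python settings the Zeppelin interpreter sees.
    print(getpass.getuser(), os.environ.get("PYSPARK_PYTHON"))

    # Executor side: run the same probe inside the YARN containers. If this
    # paragraph reproduces the error=13 failure, the executors' Python
    # configuration is the culprit rather than the driver's.
    def probe(_):
        import getpass, os
        return [(getpass.getuser(), os.environ.get("PYSPARK_PYTHON"))]

    print(sc.parallelize(range(4), 2).mapPartitions(probe).collect())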