Hi, I just followed a very simple exercise:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

The file is tiny, a few hundred KB (not even 1 MB). This is out-of-the-box Cloudera 5.5 with Spark 1.5 on YARN, and I am running on machines with 100 GB of RAM. So there must be something wrong in how Zeppelin sets memory, because, as I said in the previous thread, I can run the exact same lines in the pyspark CLI just fine (on the same machine, as the same user).
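For reference, this is roughly what I understand the driver-side memory change (the spark.driver.memory setting Amos mentions below) would look like on the Zeppelin side. I have not verified that this Zeppelin build actually reads SPARK_SUBMIT_OPTIONS, and 4g is only a placeholder value:

    # conf/zeppelin-env.sh (assuming this build passes SPARK_SUBMIT_OPTIONS through to spark-submit)
    export SPARK_SUBMIT_OPTIONS="--driver-memory 4g"

    # or in Spark's conf/spark-defaults.conf on the host running the Zeppelin interpreter
    spark.driver.memory 4g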
On Thursday, December 17, 2015 12:55 AM, Amos B. Elberg <amos.elb...@me.com> wrote:

If it's failing at the reduce step and you're getting an OOM error, then it's probably not the *executor* memory that's the issue. There's also master memory and backend memory. When you do a reduce(), Spark wants to dump the data into the Spark "backend", which is the JVM process that initiated the Spark job. That's probably sitting on the same machine as your Zeppelin server. Anyway, if it's a lot of data, your algorithm can run fine on the executors but OOM on the reduce when all the data hits the backend.

What you want to do is look at your logs closely and try to figure out whether the stack traces you're seeing are actually coming from executor processes, the backend, or the master. You can increase the memory for the backend by configuring spark.driver.memory. Other alternatives are to change your algorithm, such as with reduceByKey(), so the reduce step happens in chunks on the executors rather than on the backend.

But, just a warning: whenever I've had issues with OOM'ing the backend, trying to fix it by adjusting memory settings always turned out to be a rabbit hole. So you could also interpret the error as a yellow flag that you should re-engineer your algorithm. That's what I do when I see the error now.

From: Hoc Phan <quang...@yahoo.com>
Reply: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Date: December 17, 2015 at 3:33:15 AM
To: users@zeppelin.incubator.apache.org <users@zeppelin.incubator.apache.org>, Hoc Phan <quang...@yahoo.com>
Subject: Re: Spark Job aborted due to stage failure - error=13 Permission denied

Hi, any help on this? I have been stuck for a week. I have tried to follow this thread:
http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCABU7W=ZwKPyuPYzTQncg9wCSAs-v=c1c+welsvzx4qj7eg-...@mail.gmail.com%3E

I set these without luck:

> export PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip"
> export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"

It only failed at the .reduce() step. I think it is trying to do some IO at /usr/bin on one of the Cloudera worker nodes. Why is that? But in the log, I also saw:

java.lang.OutOfMemoryError: GC overhead limit exceeded

So I am quite confused. I set executor memory to 5g and that didn't help.

On Tuesday, December 15, 2015 10:34 AM, Hoc Phan <quang...@yahoo.com> wrote:

Hi, /usr/bin is where pyspark and spark-shell are located, but all of them have executable permission. What I don't get is that when I ssh into the machine and log in as the "zeppelin" user, I am able to go through the same script in pyspark. So my question is: what is Zeppelin trying to access, and as which user? What is a way to trace and troubleshoot this?

On Tuesday, December 15, 2015 9:46 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

It looks like it doesn't have permission to launch something:

Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied

Perhaps the file path is incorrect? It looks to point to /usr/bin, which is likely a directory.
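Following up on Felix's point: my understanding is that PySpark launches the worker interpreter from the PYSPARK_PYTHON environment variable, so if that variable somehow resolves to the /usr/bin directory rather than an actual interpreter, it would produce exactly this error. This is only a guess at what to check, not a confirmed diagnosis:

    # as the zeppelin user, on the host where the interpreter/executors run
    echo $PYSPARK_PYTHON
    # if it points at the /usr/bin directory, repoint it at a real executable, e.g.:
    export PYSPARK_PYTHON=/usr/bin/python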
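On Amos's reduceByKey() suggestion above, this is roughly how I read it for this job: do the combining per key on the executors and only bring the small aggregated result back, instead of reducing everything through the driver. An untested sketch; the "length" key is just a placeholder:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")

    # reduce() merges the per-partition results back on the driver-side JVM
    totalLength = myfile.map(lambda s: len(s)).reduce(lambda a, b: a + b)

    # reduceByKey() keeps the combining on the executors; only the small
    # (key, sum) pairs come back when collected
    totalByKey = (myfile.map(lambda s: ("length", len(s)))
                        .reduceByKey(lambda a, b: a + b)
                        .collect())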
On Mon, Dec 14, 2015 at 12:25 PM -0800, "Hoc Phan" <quang...@yahoo.com> wrote:

Hi, when I installed Zeppelin, I created a zeppelin user with the permissions below:

uid=500(zeppelin) gid=490(hdfs) groups=490(hdfs),492(hadoop),501(supergroup)

I ran this via pyspark just fine under this zeppelin user:

    myfile = sc.textFile("hdfs://cdhe1master.fbdl.local:8020/user/zeppelin/testcdh.log")
    lineLengths = myfile.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)
    print totalLength

However, when I run the same thing using Zeppelin, I got the error below. Any idea?

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 14, cdhe1worker0.fbdl.local): java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
    ... 13 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:373)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Cannot run program "/usr/bin": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
    ... 13 more

(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n', JavaObject id=o87), <traceback object at 0x1d37b00>)