Hi Mike,

This looks like an access issue with Python on Windows: the traceback shows Spark trying to launch your Python installation folder rather than the python.exe inside it. Can you try setting the PYSPARK_PYTHON environment variable to the full path of your python.exe, as described in this stackoverflow post?
https://stackoverflow.com/questions/60414394/createprocess-error-5-access-is-denied-pyspark
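For example, a quick way to test this from inside the notebook is something like the sketch below. The path is only a guess based on the one in your traceback, so point it at wherever your python.exe actually lives, and set the variables before the SparkSession/SparkContext is created:

    import os

    # Both variables should point at the python.exe file itself, not at the folder.
    # This path is inferred from the error above -- adjust it for your machine.
    py_exe = r"C:\Users\mikej\AppData\Local\Programs\Python\Python312\python.exe"
    os.environ["PYSPARK_PYTHON"] = py_exe
    os.environ["PYSPARK_DRIVER_PYTHON"] = py_exe

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("squares").getOrCreate()
    rdd = spark.sparkContext.parallelize(range(10))
    squared_rdd = rdd.map(lambda x: x * x)
    print(squared_rdd.collect())  # should run without the CreateProcess error

Setting the same variables system-wide (System Properties > Environment Variables) works as well; the os.environ lines are just the fastest way to verify the fix from Jupyter.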
- Sadha

On Mon, Jul 29, 2024 at 3:39 PM mike Jadoo <mikejad...@gmail.com> wrote:

> Thanks. I just downloaded the Corretto, but I got this error message, which was the same as before. [It was shared with me that this is saying I have limited resources, I think.]
>
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> Cell In[3], line 13
>       8 squared_rdd = rdd.map(lambda x: x * x)
>      10 # Persist the DataFrame in memory
>      11 #squared_rdd.persist(StorageLevel.MEMORY_ONLY)
>      12 # Collect the results into a list
> ---> 13 result = squared_rdd.collect()
>      15 # Print the result
>      16 print(result)
>
> File C:\spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\rdd.py:1833, in RDD.collect(self)
>    1831 with SCCallSiteSync(self.context):
>    1832     assert self.ctx._jvm is not None
> -> 1833 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>    1834 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
>
> File ~\anaconda3\Lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
>    1316 command = proto.CALL_COMMAND_NAME +\
>    1317     self.command_header +\
>    1318     args_command +\
>    1319     proto.END_COMMAND_PART
>    1321 answer = self.gateway_client.send_command(command)
> -> 1322 return_value = get_return_value(
>    1323     answer, self.gateway_client, self.target_id, self.name)
>    1325 for temp_arg in temp_args:
>    1326     if hasattr(temp_arg, "_detach"):
>
> File C:\spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\errors\exceptions\captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
>     177 def deco(*a: Any, **kw: Any) -> Any:
>     178     try:
> --> 179         return f(*a, **kw)
>     180     except Py4JJavaError as e:
>     181         converted = convert_exception(e.java_exception)
>
> File ~\anaconda3\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
>     324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
>     325 if answer[1] == REFERENCE_TYPE:
> --> 326     raise Py4JJavaError(
>     327         "An error occurred while calling {0}{1}{2}.\n".
>     328         format(target_id, ".", name), value)
>     329 else:
>     330     raise Py4JError(
>     331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
>     332         format(target_id, ".", name, value))
>
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 (TID 7) (mjadoo.myfiosgateway.com executor driver): java.io.IOException: Cannot run program "C:\Users\mikej\AppData\Local\Programs\Python\Python312": CreateProcess error=5, Access is denied
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
>     at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:181)
>     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
>     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
>     at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>     at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>     at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.io.IOException: CreateProcess error=5, Access is denied
>     at java.base/java.lang.ProcessImpl.create(Native Method)
>     at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:492)
>     at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:153)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
>     ... 19 more
>
> Driver stacktrace:
>     at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
>     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>     at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
>     at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
>     at scala.Option.foreach(Option.scala:407)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2398)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2419)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2438)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2463)
>     at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1049)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
>     at org.apache.spark.rdd.RDD.collect(RDD.scala:1048)
>     at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:195)
>     at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
>     at py4j.Gateway.invoke(Gateway.java:282)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>     at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.io.IOException: Cannot run program "C:\Users\mikej\AppData\Local\Programs\Python\Python312": CreateProcess error=5, Access is denied
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
>     at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
>     at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:181)
>     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
>     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
>     at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>     at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
>     at org.apache.spark.scheduler.Task.run(Task.scala:141)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>     at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>     at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     ... 1 more
>
>
> On Mon, Jul 29, 2024 at 4:34 PM Sadha Chilukoori <sage.quoti...@gmail.com> wrote:
>
>> Hi Mike,
>>
>> I'm not sure about the minimum requirements of a machine for running Spark, but to run some PySpark scripts (and Jupyter notebooks) on a local machine, I found the following steps the easiest.
>>
>> I installed Amazon Corretto and updated the JAVA_HOME variable as instructed here:
>> https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
>> (Any other Java works too; I'm just used to Corretto from work.)
>>
>> Then I installed the PySpark module using pip, which enabled me to run PySpark on my machine.
>>
>> -Sadha
>>
>> On Mon, Jul 29, 2024, 12:51 PM mike Jadoo <mikejad...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am trying to run PySpark on my computer without success. I followed several different sets of directions from online sources, and it appears that I need to get a faster computer.
>>>
>>> I wanted to ask: what are some recommended computer specifications for running PySpark (Apache Spark)?
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>>
>>> Mike
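A minimal sketch of the pip-based local setup described in the earlier messages (a JDK with JAVA_HOME set, plus pyspark installed via pip); the JDK path is an example only and should match your actual install:

    # Prerequisites (run once in a terminal):
    #   1. Install a JDK such as Amazon Corretto and note its install directory.
    #   2. pip install pyspark
    import os

    # Example path only -- set this to wherever your JDK actually lives,
    # or define JAVA_HOME system-wide instead and drop this line.
    os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Amazon Corretto\jdk11")

    from pyspark.sql import SparkSession

    # local[*] runs Spark inside this process using all available cores,
    # so no cluster (and no especially powerful machine) is required.
    spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
    print(spark.range(5).count())  # prints 5 if the local install is working
    spark.stop()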