Where are you running your Spark cluster? Can you post the command line that you are using to run your application?
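For context, the number of executor processes started for an application is normally driven by the submission settings. A minimal PySpark sketch of the kind of settings I mean (the app name and values are illustrative assumptions, not taken from your setup):

from pyspark.sql import SparkSession

# Illustrative only: these settings (or their spark-submit equivalents such as
# --num-executors and --executor-cores) control how many executor processes
# are launched for the application.
spark = (
    SparkSession.builder
    .appName("example-app")                   # hypothetical app name
    .config("spark.executor.instances", "4")  # number of executors
    .config("spark.executor.cores", "2")      # cores per executor
    .config("spark.executor.memory", "4g")    # memory per executor
    .getOrCreate()
)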
Spark is designed to process a lot of data by distributing work to a cluster of machines. When you submit a job, it starts executor processes on the cluster. So what you are seeing is somewhat expected, although 25 processes on a single node seems too high.

From: Sachit Murarka <connectsac...@gmail.com>
Date: Tuesday, October 13, 2020 at 8:15 AM
To: spark users <user@spark.apache.org>
Subject: RE: [EXTERNAL] Multiple applications being spawned

Adding logs. When it launches the multiple applications, the following logs are generated on the terminal. It also always retries the task:

20/10/13 12:04:30 WARN TaskSetManager: Lost task XX in stage XX (TID XX, executor 5): java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:561)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:346)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:195)

Kind Regards,
Sachit Murarka

On Tue, Oct 13, 2020 at 4:02 PM Sachit Murarka <connectsac...@gmail.com> wrote:

Hi Users,

When an action (I am using count and write) gets executed in my Spark job, it launches many more application instances (around 25 more apps).

In my Spark code, I run the transformations through DataFrames, then convert the DataFrame to an RDD, apply zipWithIndex, convert it back to a DataFrame, and then apply two actions (count & write).

Please note: this was working fine until the previous week; it has started giving this issue since yesterday.

Could you please tell me what could be the reason for this behavior?

Kind Regards,
Sachit Murarka
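For reference, a minimal PySpark sketch of the pipeline described above, i.e. DataFrame -> RDD -> zipWithIndex -> DataFrame followed by the two actions (the sample data, column names, and output path are illustrative assumptions, not taken from this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zipwithindex-example").getOrCreate()

# Hypothetical input; in the real job this would come from whatever source the job reads.
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# DataFrame -> RDD, attach a sequential index, then back to a DataFrame.
indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
indexed_df = spark.createDataFrame(indexed_rdd, df.columns + ["row_idx"])

# The two actions mentioned above.
print(indexed_df.count())
indexed_df.write.mode("overwrite").parquet("/tmp/zipwithindex-output")  # hypothetical output path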