Where are you running your Spark cluster? Can you post the command line that 
you are using to run your application?

Spark is designed to process large amounts of data by distributing work across a cluster of machines. When you submit a job, it starts executor processes on the cluster, so what you are seeing is somewhat expected (although 25 processes on a single node seems too high).
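
For reference, a spark-submit command line usually looks roughly like this (the resource values below are only placeholders, not a recommendation):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-cores 2 \
      --executor-memory 4g \
      your_job.py

The number of executors and the cores per executor largely determine how many worker processes you should expect to see across the cluster.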

From: Sachit Murarka <connectsac...@gmail.com>
Date: Tuesday, October 13, 2020 at 8:15 AM
To: spark users <user@spark.apache.org>
Subject: RE: [EXTERNAL] Multiple applications being spawned


Adding logs.

When the extra applications are launched, the following output appears on the terminal, and the failed task is retried repeatedly:

20/10/13 12:04:30 WARN TaskSetManager: Lost task XX in stage XX (TID XX, executor 5): java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:561)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:346)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:195)

Kind Regards,
Sachit Murarka


On Tue, Oct 13, 2020 at 4:02 PM Sachit Murarka <connectsac...@gmail.com> wrote:
Hi Users,
When an action (I am using count and write) is executed in my Spark job, it launches many more application instances (around 25 additional apps).

In my Spark code, I run the transformations through DataFrames, then convert the DataFrame to an RDD, apply zipWithIndex, convert it back to a DataFrame, and then apply two actions (count and write).
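
Roughly, the flow looks like this (a simplified sketch; the input/output paths and column names are placeholders, not my actual code):

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("my_job").getOrCreate()

    # Transformations expressed on a DataFrame (placeholder input path)
    df = spark.read.parquet("/path/to/input")

    # DataFrame -> RDD -> zipWithIndex -> back to a DataFrame
    indexed = df.rdd.zipWithIndex().map(
        lambda pair: Row(**pair[0].asDict(), row_id=pair[1]))
    indexed_df = spark.createDataFrame(indexed)

    # Two actions: count and write (placeholder output path)
    indexed_df.count()
    indexed_df.write.mode("overwrite").parquet("/path/to/output")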

Please note: this was working fine until last week; the issue started only yesterday.

Could you please tell me what could be causing this behavior?

Kind Regards,
Sachit Murarka
