Here are the executor logs:
```

java.io.IOException: Connection from ip-172-31-16-143.ec2.internal/172.31.16.143:7337 closed
        at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
        at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
        at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
        at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
        at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
2023-08-19 04:33:53,429 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/10) for 757 outstanding blocks after 60000 ms
```
And within the node manager logs from the failing host I see the following:

```

2023-08-19 07:38:38,498 ERROR org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-59): Error sending result ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=279757106070,chunkIndex=642],buffer=FileSegmentManagedBuffer[file=/mnt2/yarn/usercache/zeppelin/appcache/application_1691862880080_0016/blockmgr-36010488-99a9-4780-b65f-40e0f2f8f150/37/shuffle_6_784261_0.data,offset=2856408,length=338]] to /172.31.23.144:35102; closing connection
java.nio.channels.ClosedChannelException
```


Also, here are my configurations:

[image: Screenshot 2023-08-19 at 8.47.08 AM.png]


On Sat, Aug 19, 2023 at 4:36 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> That error message *FetchFailedException: Failed to connect to
> <executor_IP> on port 7337* happens when a task running on one executor
> node tries to fetch data from another executor node but fails to establish
> a connection to the specified port (7337 in this case). In a nutshell, it
> is performing network I/O among your executors.
>
> Check the following:
>
> - Any network issues or connectivity problems among the nodes that your
> executors are running on
> - Any executor failures causing this error; check the executor logs
> - Concurrency and thread issues: too many concurrent connections or
> thread limitations could result in failed connections. Adjust
> *spark.shuffle.io.clientThreads*
> - It might be prudent to do the same for *spark.shuffle.io.serverThreads*
> - Check how stable your environment is and observe any issues reported in
> the Spark UI
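
(As an illustration of the two settings in the list above: below is a minimal
sketch of passing them when building a session. The values and app name are
placeholders rather than recommendations, and my understanding is that with
the external shuffle service on YARN the server-side thread count is read
from the shuffle service's own configuration on the NodeManager, not from the
application.)

```python
# Sketch only: placeholder values for the shuffle I/O thread settings above.
# spark.shuffle.io.clientThreads - Netty client threads used on the fetch side.
# spark.shuffle.io.serverThreads - Netty server threads; when the external
#   shuffle service is used, this is taken from the shuffle service's config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-io-threads-sketch")              # hypothetical app name
    .config("spark.shuffle.io.clientThreads", "64")    # placeholder value
    .config("spark.shuffle.io.serverThreads", "64")    # placeholder value
    .getOrCreate()
)
```
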
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Aug 2023 at 23:30, Nebi Aydin <nayd...@binghamton.edu> wrote:
>
>>
>> Hi, sorry for duplicates. First time user :)
>> I keep getting a FetchFailedException saying port 7337 is closed, which is
>> the external shuffle service port.
>> I was trying to tune these parameters.
>> I have around 1000 executors and 5000 cores.
>> I tried setting spark.shuffle.io.serverThreads to 2000. Should I also set
>> spark.shuffle.io.clientThreads to 2000?
>> Do shuffle client threads allow one executor to fetch from multiple
>> nodes' shuffle services?
>>
>> Thanks
>> On Fri, Aug 18, 2023 at 17:42 Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> These two threads that you sent seem to be duplicates of each other?
>>>
>>> Anyhow, I trust that you are familiar with the concept of shuffle in
>>> Spark. A Spark shuffle is an expensive operation since it involves the
>>> following:
>>>
>>>    - Disk I/O
>>>    - Data serialization and deserialization
>>>    - Network I/O
>>>
>>> Basically these operations are based on the concept of map/reduce in
>>> Spark, and the parameters you posted relate to various aspects of
>>> threading and concurrency.
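
To make this concrete, here is a tiny, made-up example of an operation that
triggers a shuffle (and hence the disk, serialization and network I/O listed
above); the data, column names and app name are purely illustrative:

```python
# Hypothetical example: a groupBy/aggregation forces a shuffle. The reduce
# side then fetches shuffle blocks over the network, which is the traffic on
# port 7337 discussed in this thread.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-example").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("b", 2), ("a", 3)],  # made-up data
    ["key", "value"],
)

# groupBy needs all rows for a key on the same partition, so the data is
# shuffled across executors before the aggregation runs.
result = df.groupBy("key").agg(F.sum("value").alias("total"))
result.show()
```
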
>>>
>>> HTH
>>>
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>>
>>>    view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Aug 2023 at 20:39, Nebi Aydin <nayd...@binghamton.edu.invalid>
>>> wrote:
>>>
>>>>
>>>> I want to learn the differences among the thread configurations below.
>>>>
>>>> spark.shuffle.io.serverThreads
>>>> spark.shuffle.io.clientThreads
>>>> spark.shuffle.io.threads
>>>> spark.rpc.io.serverThreads
>>>> spark.rpc.io.clientThreads
>>>> spark.rpc.io.threads
>>>>
>>>> Thanks.
>>>>
>>>
