Here's the executor log:
```
java.io.IOException: Connection from ip-172-31-16-143.ec2.internal/172.31.16.143:7337 closed
	at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
	at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:117)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:225)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:750)
2023-08-19 04:33:53,429 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/10) for 757 outstanding blocks after 60000 ms
```
And within the node manager logs from the failing host I got these logs below:
```
2023-08-19 07:38:38,498 ERROR org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-59): Error sending result ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=279757106070,chunkIndex=642],buffer=FileSegmentManagedBuffer[file=/mnt2/yarn/usercache/zeppelin/appcache/application_1691862880080_0016/blockmgr-36010488-99a9-4780-b65f-40e0f2f8f150/37/shuffle_6_784261_0.data,offset=2856408,length=338]] to /172.31.23.144:35102; closing connection
java.nio.channels.ClosedChannelException
```
Also, here are my configurations: [image: Screenshot 2023-08-19 at 8.47.08 AM.png]

On Sat, Aug 19, 2023 at 4:36 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> That error message *FetchFailedException: Failed to connect to
> <executor_IP> on port 7337* occurs when a task running on one executor
> node tries to fetch shuffle data from another node but fails to establish
> a connection on the specified port (7337 in this case, the external
> shuffle service port). In a nutshell, it is performing network I/O among
> your executors.
>
> Check the following:
>
>    - Any network issues or connectivity problems among the nodes that
>    your executors are running on
>    - Any executor failures causing this error; check the executor logs
>    - Concurrency and thread issues: too many concurrent connections or
>    thread limitations could result in failed connections. Adjust
>    *spark.shuffle.io.clientThreads*
>    - It might be prudent to do the same for *spark.shuffle.io.serverThreads*
>    - Check how stable your environment is, and observe any issues
>    reported in the Spark UI
>
> HTH
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk.
> Any and all responsibility for any loss, damage or destruction of data or
> any other property which may arise from relying on this email's technical
> content is explicitly disclaimed. The author will in no case be liable for
> any monetary damages arising from such loss, damage or destruction.
>
> On Fri, 18 Aug 2023 at 23:30, Nebi Aydin <nayd...@binghamton.edu> wrote:
>
>> Hi, sorry for the duplicates. First-time user :)
>> I keep getting a FetchFailedException saying port 7337 was closed, which
>> is the external shuffle service port.
>> I was trying to tune these parameters.
>> I have around 1000 executors and 5000 cores.
>> I tried setting spark.shuffle.io.serverThreads to 2000. Should I also
>> set spark.shuffle.io.clientThreads to 2000?
>> Do the shuffle client threads allow one executor to fetch from multiple
>> nodes' shuffle services?
>>
>> Thanks
>>
>> On Fri, Aug 18, 2023 at 17:42 Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> These two threads that you sent seem to be duplicates of each other?
>>>
>>> Anyhow, I trust that you are familiar with the concept of shuffle in
>>> Spark. A Spark shuffle is an expensive operation since it involves the
>>> following:
>>>
>>>    - Disk I/O
>>>    - Data serialization and deserialization
>>>    - Network I/O
>>>
>>> Basically these are based on the map/reduce concept in Spark, and the
>>> parameters you posted relate to various aspects of threading and
>>> concurrency.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect/Engineering Lead
>>> London
>>> United Kingdom
>>>
>>> On Fri, 18 Aug 2023 at 20:39, Nebi Aydin <nayd...@binghamton.edu.invalid>
>>> wrote:
>>>
>>>> I want to learn the differences among the thread configurations below:
>>>>
>>>> spark.shuffle.io.serverThreads
>>>> spark.shuffle.io.clientThreads
>>>> spark.shuffle.io.threads
>>>> spark.rpc.io.serverThreads
>>>> spark.rpc.io.clientThreads
>>>> spark.rpc.io.threads
>>>>
>>>> Thanks.
>>>
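To tie the thread configurations being discussed together, here is an illustrative spark-defaults.conf sketch. The values are examples only, not recommendations, and the fallback behavior in the comments is my reading of Spark's transport configuration: the per-module serverThreads/clientThreads settings take precedence over the module-wide io.threads setting, which, when left unset, falls back to a default derived from the number of available cores.

```
# Illustrative values only, not tuning recommendations.

# Shuffle module: serverThreads serve block-fetch requests (e.g. the external
# shuffle service side); clientThreads issue fetches from executors.
spark.shuffle.io.serverThreads   2000
spark.shuffle.io.clientThreads   2000
# Module-wide fallback used when the two settings above are unset:
# spark.shuffle.io.threads       128

# RPC module: the same server/client/threads pattern, but for control-plane
# messages between driver and executors rather than shuffle data transfer.
# spark.rpc.io.serverThreads     64
# spark.rpc.io.clientThreads     64
# spark.rpc.io.threads           64

# Fetch-retry behavior; the executor log earlier in the thread ("Retrying
# fetch (1/10) ... after 60000 ms") looks consistent with settings like
# these already being in effect:
spark.shuffle.io.maxRetries      10
spark.shuffle.io.retryWait       60s
```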