Hi, any input on the following situation would be appreciated.

We are running Spark 2.2.0 with dynamic allocation, i.e. with the external shuffle service, on a Mesos 1.1.0 cluster. Sometimes, due to network failures and/or the order in which offers are accepted by different frameworks, the application framework starts before the external shuffle service on some node. In that case the framework (Spark) can't connect to the external shuffle service, which causes it to abort itself (see stderr below). However, the driver continues to run and the Spark context is still alive.
I0412 07:31:25.827283 274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to xxx-yyy00105.mycompany.com/10.106.14.61:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
    at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx-yyy00105.mycompany.com/10.106.14.61:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925 277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035 277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
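
For anyone trying to reproduce this: the "Connection refused" above means nothing was listening on port 7337 of that agent when the driver tried to register. A minimal sketch of a reachability probe that could be run from the driver host is shown here. It is not part of Spark or Mesos; the helper names `port_is_open` and `wait_for_port`, and the retry counts and delays, are assumptions made purely for illustration.

```python
import socket
import time


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def wait_for_port(host, port, attempts=5, delay=1.0, timeout=2.0):
    """Probe host:port up to `attempts` times, sleeping `delay` seconds between tries.

    Hypothetical helper: returns True as soon as one probe succeeds,
    False if every attempt fails.
    """
    for attempt in range(attempts):
        if port_is_open(host, port, timeout):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call (host name taken from the log above, port 7337 as in the trace):
#   wait_for_port("xxx-yyy00105.mycompany.com", 7337, attempts=10, delay=3.0)
```

This only tells you whether the port accepts TCP connections at the time of the probe; it does not change how Spark registers with the shuffle service.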