Hi,
Any input on the following situation would be appreciated.
We are running Spark 2.2.0 with dynamic allocation, i.e. with the external
shuffle service, on a Mesos 1.1.0 cluster.
Sometimes, due to network failures and/or the order in which offers are
accepted by different frameworks, the application framework starts on a node
before the external shuffle service does.
In this case the framework (Spark) can't connect to the external shuffle
service, which causes it to abort itself (see stderr below).
However, the driver continues to run and the Spark context is still alive.


I0412 07:31:25.827283   274 sched.cpp:759] Framework registered with 15d9838f-b266-413b-842d-f7c3567bd04a-0051
Exception in thread "Thread-295" java.io.IOException: Failed to connect to xxx-yyy00105.mycompany.com/10.106.14.61:7337
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
        at org.apache.spark.network.shuffle.mesos.MesosExternalShuffleClient.registerDriverWithShuffleService(MesosExternalShuffleClient.java:75)
        at org.apache.spark.scheduler.cluster.mesos.MesosCoarseGrainedSchedulerBackend.statusUpdate(MesosCoarseGrainedSchedulerBackend.scala:537)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx-yyy00105.mycompany.com/10.106.14.61:7337
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
        at java.lang.Thread.run(Thread.java:748)
I0412 07:35:12.032925   277 sched.cpp:2055] Asked to abort the driver
I0412 07:35:12.033035   277 sched.cpp:1233] Aborting framework 15d9838f-b266-413b-842d-f7c3567bd04a-0051
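
The "Connection refused" in the stderr above means nothing was listening on
port 7337 of that node when the driver tried to register. To confirm that from
outside Spark, something like the following could be used. It is only a
minimal sketch: shuffle_service_reachable is a hypothetical helper, not part
of Spark or Mesos, and it only checks that a TCP connection can be opened, not
that the shuffle service behind the port is healthy.

```python
import socket


def shuffle_service_reachable(host: str, port: int = 7337, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout.

    Hypothetical helper for illustration only. 7337 is the port seen in the
    stderr above; pass a different port if the service is configured otherwise.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


# Example call (hostname taken from the stderr above):
#   shuffle_service_reachable("xxx-yyy00105.mycompany.com")
```

Running such a probe against each agent before submitting the application
would at least show which nodes do not have the shuffle service up yet.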





