Hi,

I am using Spark 1.2 and facing network related issues while performing
simple computations.

This is a custom cluster set up using ec2 machines and spark prebuilt
binary from apache site. The problem is only when we have workers on other
machines(networking involved). Having a single node for the master and the
slave works correctly.

The error log from slave node is attached below. It is reading textFile
from local FS(copied each node) and counting it. The first 30 tasks get
completed within 5 seconds. Then, it takes several minutes to complete
another 10 tasks and eventually dies.

Sometimes, one of the workers completes all the tasks assigned to it.
Different workers have different behavior at different
times(non-deterministic).

Is it related to something specific to EC2?



15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID 117)

15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast variable
1

15/09/24 13:04:41 INFO SendingConnection: Initiating connection to
[master_ip:56305]

15/09/24 13:04:41 INFO SendingConnection: Connected to
[master_ip/master_ip_address:56305], 1 messages pending

15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast variable
1

15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0 (TID
77)

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)

        at scala.Option.foreach(Option.scala:236)

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)

        at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)

        at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)

        at
io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)

        at java.lang.Thread.run(Thread.java:745)

15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task 122

15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)

15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0 (TID
113)

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)

        at scala.Option.foreach(Option.scala:236)

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)

        at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)

        at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)

        at
io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)

        at java.lang.Thread.run(Thread.java:745)

15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast variable
1

15/09/24 13:06:41 INFO SendingConnection: Initiating connection to
[master_ip/master_ip_address:44427]

15/09/24 13:06:41 INFO SendingConnection: Connected to
[master_ip/master_ip_address:44427], 1 messages pending

15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0 (TID
37)

java.io.IOException: sendMessageReliably failed because ack was not
received within 60 sec

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)

        at scala.Option.foreach(Option.scala:236)

        at
org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)

        at
io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)

        at
io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)

        at
io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)

        at java.lang.Thread.run(Thread.java:745)





I checked the network speed between the master and the slave and it is able
to scp large files at a speed of 60 MB/s.

Any leads on how this can be fixed?



Thanks and Regards,

Suraj Sheth

Reply via email to