Hey everyone,

We are running into an issue where Spark jobs sometimes hang
indefinitely. We are on Spark 1.3.1 (working on upgrading soon), Java 8,
and Mesos with spark.mesos.coarse=false (fine-grained mode). I'm fairly
certain the hang happens during shuffle operations.
My pipeline reads data from HBase and then runs LogisticRegression on it,
using grid search to find the optimal parameters. At each iteration I use
BinaryClassificationMetrics to compute areaUnderROC and areaUnderPR.
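Roughly, the evaluation loop looks like the sketch below (simplified; the
names training, validation, and regParams are placeholders, and
LogisticRegressionWithLBFGS stands in for our actual setup):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD

  def gridSearch(training: RDD[LabeledPoint],
                 validation: RDD[LabeledPoint],
                 regParams: Seq[Double]): Unit = {
    training.cache()
    for (regParam <- regParams) {
      val lr = new LogisticRegressionWithLBFGS()
      lr.optimizer.setRegParam(regParam)
      val model = lr.run(training)
      model.clearThreshold() // keep raw scores so the metrics see a continuous score

      // BinaryClassificationMetrics sorts/bins the scores internally, which
      // shuffles -- presumably where fetches like the one in the dump below come from.
      val scoreAndLabels = validation.map(p => (model.predict(p.features), p.label))
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      println(s"regParam=$regParam areaUnderROC=${metrics.areaUnderROC()} " +
        s"areaUnderPR=${metrics.areaUnderPR()}")
    }
  }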

We suspect some kind of bug is causing
java.net.Inet6AddressImpl.lookupAllHostAddr to hang.
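If it really is the IPv6 resolver path that gets stuck, one experiment we
are considering (untested on our side, purely speculative) is forcing IPv4
name resolution on the executors:

  import org.apache.spark.SparkConf

  // speculative workaround, not verified: steer the JVM away from the
  // Inet6AddressImpl.lookupAllHostAddr path seen in the thread dump
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions", "-Djava.net.preferIPv4Stack=true")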

Any ideas?

Thread dump:
Thread 1086: Executor task launch worker-62 (RUNNABLE)

java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907)
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302)
java.net.InetAddress.getAllByName0(InetAddress.java:1255)
java.net.InetAddress.getAllByName(InetAddress.java:1171)
java.net.InetAddress.getAllByName(InetAddress.java:1105)
java.net.InetAddress.getByName(InetAddress.java:1055)
java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:126)
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:149)
org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:262)
org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:115)
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:76)
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
org.apache.spark.scheduler.Task.run(Task.scala:64)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Thanks,
Asher Krim
