Hey everyone,

We are running into an issue where Spark jobs sometimes hang indefinitely. We are on Spark 1.3.1 (an upgrade is in the works), Java 8, and Mesos with spark.mesos.coarse=false. I'm fairly certain the issue comes up during shuffle operations. Our pipeline reads data from HBase and then runs LogisticRegression on it, using grid search to find the optimal parameters. At each iteration, we use BinaryClassificationMetrics to compute the areaUnderROC and areaUnderPR.
We suspect that this is some kind of bug which is causing java.net.Inet6AddressImpl.lookupAllHostAddr to hang. Any ideas?

Thread dump:

Thread 1086: Executor task launch worker-62 (RUNNABLE)
  java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
  java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907)
  java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302)
  java.net.InetAddress.getAllByName0(InetAddress.java:1255)
  java.net.InetAddress.getAllByName(InetAddress.java:1171)
  java.net.InetAddress.getAllByName(InetAddress.java:1105)
  java.net.InetAddress.getByName(InetAddress.java:1055)
  java.net.InetSocketAddress.<init>(InetSocketAddress.java:220)
  org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:126)
  org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
  org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
  org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
  org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
  org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:149)
  org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:262)
  org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:115)
  org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:76)
  org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
  org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  org.apache.spark.scheduler.Task.run(Task.scala:64)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  java.lang.Thread.run(Thread.java:745)

Thanks,
Asher Krim
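P.S. For anyone who wants to poke at this outside of Spark: per the dump, TransportClientFactory.createClient constructs an InetSocketAddress for the remote block server, and building that address triggers a name-service lookup, which is exactly where our threads are stuck. A minimal standalone sketch of that call (the hostname and port below are illustrative, not what Spark actually uses on our cluster):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class LookupDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative values; on our cluster the host would be a
        // Mesos slave hostname handed to the shuffle block fetcher.
        String host = "localhost";
        int port = 7337;

        // InetAddress.getByName is the same path the dump shows
        // (getByName -> getAllByName -> lookupAllHostAddr). If the
        // resolver is healthy this returns immediately; if it hangs,
        // the thread blocks here just like our executor threads.
        InetSocketAddress addr =
                new InetSocketAddress(InetAddress.getByName(host), port);

        System.out.println(addr.getAddress().isLoopbackAddress());
    }
}
```

Running this against the actual hostnames from a stuck executor's dump should show quickly whether plain DNS resolution is the bottleneck on that box.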