Hi,

I have several issues related to HDFS that may have different root causes. I'm posting as much information as I can, in the hope of getting your opinion on at least some of them. The cases are basically:
- HDFS classes not found
- Connections with some datanodes seem to be slow or to close unexpectedly
- Executors become lost (and cannot be relaunched, due to an out-of-memory error)

*What I'm looking for:*
- HDFS misconfiguration / tuning advice
- Global setup flaws (the impact of VMs and NUMA mismatch, for example)
- For the last category of issues, I'd like to understand why, when an executor dies, the JVM's memory is not freed, which prevents a new executor from being launched.

My setup is the following: 1 hypervisor with 32 cores and 50 GB of RAM, running 5 VMs. Each VM has 5 cores and 7 GB of RAM. Each node runs 1 worker configured with 4 cores and 6 GB (the remaining resources are intended for HDFS and the OS). I run a WordCount workload on a 4 GB dataset, on a Spark 1.4.0 / HDFS 2.5.2 setup (a minimal sketch of the job is at the end of this message). I got the binaries from the official websites (no local compilation).

(The messages in 1) and 2) below are logged on the worker, in the work/app-id/exec-id/stderr file.)

*1) Hadoop class related issues*

    15:34:32: DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes are unavailable. Using the older Hadoop location info code.
    java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplitWithLocationInfo

    15:40:46: DEBUG SparkHadoopUtil: Couldn't find method for retrieving thread-level FileSystem input data
    java.lang.NoSuchMethodException: org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
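For what it's worth, here is the quick probe I can run from spark-shell on a worker to check whether these classes/methods are visible on that node's Spark classpath (the class and method names are copied from the DEBUG messages above; the probe itself is plain reflection, nothing Spark-specific). Given the ClassNotFoundException/NoSuchMethodException above, I expect both checks to return false on the workers, which would explain the fallback to the older location-info code path:

    // Probe the classpath for the Hadoop classes/methods named in the DEBUG messages above.
    def hasClass(name: String): Boolean =
      try { Class.forName(name); true } catch { case _: Throwable => false }

    def hasMethod(className: String, method: String): Boolean =
      try { Class.forName(className).getMethod(method); true } catch { case _: Throwable => false }

    println(hasClass("org.apache.hadoop.mapred.InputSplitWithLocationInfo"))
    println(hasMethod("org.apache.hadoop.fs.FileSystem$Statistics", "getThreadStatistics"))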
*2) HDFS performance related issues*

The following errors arise:

    15:43:16: ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013, chunkIndex=2}, buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/2f/shuffle_0_14_0.data, offset=15464702, length=998530}} to /192.168.122.168:59299; closing connection
    java.io.IOException: Broken pipe

    15:43:16 ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013, chunkIndex=0}, buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/31/shuffle_0_12_0.data, offset=15238441, length=980944}} to /192.168.122.168:59299; closing connection
    java.io.IOException: Broken pipe

And, on another executor:

    15:44:28: WARN TransportChannelHandler: Exception in connection from /192.168.122.15:50995
    java.io.IOException: Connection reset by peer

Some time later:

    15:44:52 DEBUG DFSClient: DFSClient seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0
    15:44:52 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758
    java.io.IOException: Bad response ERROR for block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758 from datanode x.x.x.x:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)

The following two errors appear several times:

    15:51:05 ERROR Executor: Exception in task 19.0 in stage 1.0 (TID 51)
    java.nio.channels.ClosedChannelException
        at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1528)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81)
        at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102)
        at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1110)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

    15:51:19 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AssociationError [akka.tcp://sparkExecutor@192.168.122.142:38277] -> [akka.tcp://sparkDriver@x.x.x.x:34732]: Error [Invalid address: akka.tcp://sparkDriver@x.x.x.x:34732] [
    akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkDriver@x.x.x.x:34732
    Caused by: akka.remote.transport.Transport$InvalidAssociationException: Connection refused: /x.x.x.x:34732
    ] from Actor[akka://sparkExecutor/deadLetters]

In the datanode's logs:

    ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: localhost.localdomain:50010:DataXceiver error processing WRITE_BLOCK operation src: /192.168.122.15:56468 dst: /192.168.122.229:50010
    java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.122.229:50010 remote=/192.168.122.15:56468]

I can also find the following warnings:

    2015-07-13 15:46:57,927 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:718ms (threshold=300ms)
    2015-07-13 15:46:59,933 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1298ms (threshold=300ms)
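On the timeout side, one thing I'm considering is raising the HDFS client socket timeouts for this job through the Hadoop configuration that Spark hands to the DFSClient. The property names below come from hdfs-default.xml (dfs.client.socket-timeout defaults to 60000 ms, which matches the SocketTimeoutException above); the values are placeholders I have not validated, and I assume the datanodes would need matching settings in their own hdfs-site.xml. Does this sound like a reasonable knob, or would it just hide the real problem?

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: raise the HDFS client-side socket timeouts for this job.
    // Values below are placeholders, not settings I have validated.
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.hadoopConfiguration.set("dfs.client.socket-timeout", "120000")         // read timeout, default 60000 ms
    sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "960000") // write timeout, default 480000 ms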
*3) Executor losses*

Early in the job, the master's logs display the following messages:

    15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5 because it is EXITED
    15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9 on worker worker-20150713153302-192.168.122.229-59013
    15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms) ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited with code 1),Some(1)) from Actor[akka.tcp://sparkWorker@192.168.122.229:59013/user/Worker#-83763597]

This does not stop until the job either completes or fails (depending on how many executors actually fail). Here are the Java logs produced on each attempted executor launch (in work/app-id/exec-id on the worker): http://pastebin.com/B4FbXvHR
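Finally, as mentioned in the setup description, the WordCount job itself is nothing special. Here is a minimal sketch of what I run (the paths, app name, and namenode address are illustrative, not my exact code); the output goes through saveAsHadoopDataset / TextOutputFormat, as in the stack trace above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal WordCount: read text from HDFS, count words, write the result back to HDFS.
    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        sc.textFile("hdfs://namenode:8020/user/spark/input")        // illustrative input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs://namenode:8020/user/spark/output") // illustrative output path
        sc.stop()
      }
    }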