Hi,

I have several issues related to HDFS that may have different root causes. I'm
posting as much information as I can, in the hope of getting your opinion on
at least some of them. Basically the cases are:

- HDFS classes not found
- Connections to some datanodes seem to be slow or close unexpectedly.
- Executors become lost (and cannot be relaunched due to an out-of-memory
error)

What I'm looking for:
- HDFS misconfiguration / tuning advice
- Global setup flaws (the impact of VMs and NUMA mismatch, for example)
- For the last category of issues, I'd like to know why, when an executor
dies, the JVM's memory is not freed, which prevents a new executor from being
launched.

My setup is the following:
1 hypervisor with 32 cores and 50 GB of RAM, with 5 VMs running on it. Each
VM has 5 cores and 7 GB of RAM.
Each node runs 1 worker configured with 4 cores and 6 GB (the remaining
resources are intended to be used by HDFS and the OS).

I run a WordCount workload on a 4 GB dataset, on a Spark 1.4.0 / HDFS 2.5.2
setup. I got the binaries from the official websites (no local compilation).
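
For reference, the job is essentially the textbook WordCount; a minimal
sketch of what I run (the HDFS paths below are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        // Read the ~4 GB dataset from HDFS, count words, write the result back.
        sc.textFile("hdfs:///user/test/input")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///user/test/output")

        sc.stop()
      }
    }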

(Issues 1) and 2) below are logged on the worker, in the
work/app-id/exec-id/stderr file.)

*1) Hadoop class-related issues*

15:34:32: DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop classes
are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException:
org.apache.hadoop.mapred.InputSplitWithLocationInfo

15:40:46: DEBUG SparkHadoopUtil: Couldn't find method for retrieving
thread-level FileSystem input data
java.lang.NoSuchMethodException:
org.apache.hadoop.fs.FileSystem$Statistics.getThreadStatistics()
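
I wonder whether these two messages simply mean that the Spark binaries were
built against an older Hadoop than my 2.5.2 cluster. To check which Hadoop
version each JVM actually sees, I was thinking of something like the
following from the spark-shell (sc being the shell's SparkContext, just a
quick sketch):

    import org.apache.hadoop.util.VersionInfo

    // Hadoop version on the driver's classpath
    println(s"Driver sees Hadoop ${VersionInfo.getVersion}")

    // Hadoop version on the executors' classpath
    sc.parallelize(1 to 100)
      .map(_ => org.apache.hadoop.util.VersionInfo.getVersion)
      .distinct()
      .collect()
      .foreach(v => println(s"An executor sees Hadoop $v"))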


*2) HDFS performance-related issues*

The following errors arise:

15:43:16: ERROR TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013,
chunkIndex=2},
buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/2f/shuffle_0_14_0.data,
offset=15464702, length=998530}} to /192.168.122.168:59299; closing
connection
java.io.IOException: Broken pipe

15:43:16 ERROR TransportRequestHandler: Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=284992323013,
chunkIndex=0},
buffer=FileSegmentManagedBuffer{file=/tmp/spark-b17f3299-99f3-4147-929f-1f236c812d0e/executor-d4ceae23-b9d9-4562-91c2-2855baeb8664/blockmgr-10da9c53-c20a-45f7-a430-2e36d799c7e1/31/shuffle_0_12_0.data,
offset=15238441, length=980944}} to /192.168.122.168:59299; closing
connection
java.io.IOException: Broken pipe


15:44:28 : WARN TransportChannelHandler: Exception in connection from
/192.168.122.15:50995
java.io.IOException: Connection reset by peer (note that it's on another
executor)

Some time later:

15:44:52 DEBUG DFSClient: DFSClient seqno: -2 status: SUCCESS status: ERROR
downstreamAckTimeNanos: 0
15:44:52 WARN DFSClient: DFSOutputStream ResponseProcessor exception  for
block BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758
java.io.IOException: Bad response ERROR for block
BP-845049430-155.99.144.31-1435598542277:blk_1073742427_1758 from datanode
x.x.x.x:50010
        at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819)

The following two errors appear several times:

15:51:05 ERROR Executor: Exception in task 19.0 in stage 1.0 (TID 51)
java.nio.channels.ClosedChannelException
        at
org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1528)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:98)
        at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at
org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:81)
        at
org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:102)
        at
org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:95)
        at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1110)
        at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
        at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
        at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
        at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
        at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

15:51:19 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor]
received message AssociationError
[akka.tcp://sparkExecutor@192.168.122.142:38277] ->
[akka.tcp://sparkDriver@x.x.x.x:34732]: Error [Invalid address:
akka.tcp://sparkDriver@x.x.x.x:34732] [
akka.remote.InvalidAssociation: Invalid address:
akka.tcp://sparkDriver@x.x.x.x:34732
Caused by: akka.remote.transport.Transport$InvalidAssociationException:
Connection refused: /x.x.x.x:34732
] from Actor[akka://sparkExecutor/deadLetters]
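
In case these broken pipes / connection resets are just timeouts under load
rather than a real network problem, the Spark knobs I plan to experiment with
are the following (the values are guesses, not recommendations):

    import org.apache.spark.SparkConf

    // Give the shuffle layer more slack before a fetch is declared failed.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.network.timeout", "300s")     // default 120s, general network timeout
      .set("spark.shuffle.io.maxRetries", "6")  // default 3
      .set("spark.shuffle.io.retryWait", "15s") // default 5s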


In the datanode's logs:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
localhost.localdomain:50010:DataXceiver error processing WRITE_BLOCK
operation  src: /192.168.122.15:56468 dst: /192.168.122.229:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/192.168.122.229:50010 remote=/192.168.122.15:56468]

I can also find the following warnings:

2015-07-13 15:46:57,927 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
data to disk cost:718ms (threshold=300ms)
2015-07-13 15:46:59,933 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write
packet to mirror took 1298ms (threshold=300ms)
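
Given the 60000 ms DataXceiver timeout and the slow-write warnings above, one
thing I'm considering is raising the HDFS socket timeouts (on the client side
for the job, plus the matching properties in hdfs-site.xml on the datanodes).
A sketch of the client-side part, with values picked arbitrarily:

    // Override the DFSClient socket timeouts (milliseconds) for this job only;
    // these take precedence over hdfs-site.xml on the client side.
    sc.hadoopConfiguration.set("dfs.client.socket-timeout", "120000")          // read timeout, default 60000
    sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "600000")  // write timeout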

*3) Executor losses*

Early in the job, the master's logs display the following messages:

15/07/13 13:46:50 INFO Master: Removing executor app-20150713133347-0000/5
because it is EXITED
15/07/13 13:46:50 INFO Master: Launching executor app-20150713133347-0000/9
on worker worker-20150713153302-192.168.122.229-59013
15/07/13 13:46:50 DEBUG Master: [actor] handled message (2.247517 ms)
ExecutorStateChanged(app-20150713133347-0000,5,EXITED,Some(Command exited
with code 1),Some(1)) from
Actor[akka.tcp://sparkWorker@192.168.122.229:59013/user/Worker#-83763597]

This does not stop until the job either completes or ends up failing
(depending on the number of executors that actually fail).

Here are the Java logs from each attempted executor launch (in
work/app-id/exec-id on the worker):
http://pastebin.com/B4FbXvHR
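
On the sizing side, 6 GB of executor heap on a 7 GB VM that also hosts a
datanode leaves very little headroom, which might be part of why a new
executor cannot be launched while the old JVM is still shutting down. The
more conservative configuration I intend to try (numbers are guesses for my
7 GB VMs):

    import org.apache.spark.SparkConf

    // Leave ~3 GB per VM to the datanode, the OS page cache and JVM overhead.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.executor.memory", "4g") // instead of 6g
      .set("spark.cores.max", "20")       // 4 cores x 5 workers, unchanged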