It seems like an issue with Hadoop/HDFS rather than Spark itself. What do you get when you run hdfs dfsadmin -report?
Anecdotally (and without specifics, since it has been a while), I've generally used Parquet instead of ORC because I ran into a bunch of random problems reading and writing ORC with Spark. Given that ORC performs a lot better with Hive, giving it up can be a pain, but it might be worth trying Parquet for this write just to see whether the ORC writer is part of the problem. Roughly what I mean is sketched below.
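A minimal sketch against the Spark 1.6 Scala API. The paths, app name, and the dropDuplicates() call are placeholders for however your job actually builds the dataset:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object DedupToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-to-parquet"))
    val sqlContext = new HiveContext(sc) // ORC support in 1.6 lives in the Hive module

    // Placeholder paths, substitute your real input/output locations
    val deduped = sqlContext.read.orc("hdfs:///data/input_orc").dropDuplicates()

    // The only change from the failing job: write Parquet instead of ORC
    deduped.write.mode(SaveMode.Overwrite).parquet("hdfs:///data/output_parquet")
  }
}

If the Parquet write goes through cleanly, that points at the ORC writer; if it trips the datanode the same way, it's more likely HDFS itself.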
On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

> Hi all,
>
> I'm having trouble with a relatively simple Spark SQL job. I'm using Spark
> 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record).
> Its current compressed size is around 13 GB, but my problem started when
> it was much smaller, maybe 5 GB. This dataset is generated by performing a
> query on an existing ORC dataset in HDFS, selecting a subset of the
> existing data (i.e. removing duplicates). When I write this dataset to HDFS
> using ORC I get the following exceptions in the driver:
>
> org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.lang.RuntimeException: Failed to commit task
> Suppressed: java.lang.IllegalArgumentException: Column has wrong number
> of index entries found: 0 expected: 32
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
> Aborting...
>
> This happens multiple times. The executors tell me the following a few
> times before the same exceptions as above:
>
> 2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output
> committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
> java.io.EOFException: Premature EOF: no length prefix available
>     at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
>
> My HDFS datanode says:
>
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE,
> cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0,
> srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a,
> blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> duration: 93026972
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation
> src: /127.0.0.1:57790 dst: /127.0.0.1:50010
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
> to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/127.0.0.1:50010 remote=/127.0.0.1:57790]
>
> It looks like the datanode is receiving the block on multiple ports
> (threads?) and one of the sending connections terminates early.
>
> I was originally running 6 executors with 6 cores and 24 GB RAM each
> (Total: 36 cores, 144 GB) and experienced many of these issues, where
> occasionally my job would fail altogether. Lowering the number of cores
> appears to reduce the frequency of these errors, however I'm now down to
> 4 executors with 2 cores each (Total: 8 cores), which is significantly
> fewer, and I still see approximately 1-3 task failures.
>
> Details:
> - Spark 1.6.3 - Standalone
> - RDD compression enabled
> - HDFS replication disabled
> - Everything running on the same host
> - Otherwise vanilla configs for Hadoop and Spark
>
> Does anybody have any ideas or hints? I can't imagine the problem is
> solely related to the number of executor cores.
>
> Thanks,
> Joe Naegele
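For reference, on a standalone cluster the "4 executors with 2 cores each" layout you describe comes from settings roughly like the following (the same SparkConf as in the sketch above, spelled out). The memory figure is a guess based on your earlier 24 GB number, so substitute whatever you actually give each executor:

import org.apache.spark.SparkConf

// Sketch of the resource settings behind a 4 x 2-core layout on standalone.
// Standalone derives the executor count from cores.max / executor.cores.
val conf = new SparkConf()
  .setAppName("dedup-to-parquet")
  .set("spark.executor.cores", "2")    // cores per executor
  .set("spark.cores.max", "8")         // total cores for the app: 8 / 2 = 4 executors
  .set("spark.executor.memory", "24g") // per-executor heap (guess)

Since everything is on one host with replication disabled, that core count is effectively the number of concurrent write streams hitting the single datanode, which would be consistent with lowering it reducing the frequency of the errors.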