Thanks Michael, hdfs dfsadmin -report tells me:
Configured Capacity: 7999424823296 (7.28 TB)
Present Capacity: 7997657774971 (7.27 TB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used: 38566006784 (35.92 GB)
DFS Used%: 0.48%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: XXX.XXX.XXX
Decommission Status : Normal
Configured Capacity: 7999424823296 (7.28 TB)
DFS Used: 38566006784 (35.92 GB)
Non DFS Used: 1767048325 (1.65 GB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used%: 0.48%
DFS Remaining%: 99.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 17
Last contact: Mon Dec 19 13:00:06 EST 2016

The Hadoop exception occurs because it times out after 60 seconds in a “select” call on a java.nio.channels.SocketChannel while waiting to read from the socket. This implies the client writer isn’t writing to the socket as expected, but shouldn’t this all be handled by the Hadoop library within Spark? A few similar, but rare, cases have been reported before, e.g. https://issues.apache.org/jira/browse/HDFS-770, which is *very* old.

If you’re pretty sure Spark couldn’t be responsible for issues at this level, I’ll stick to the Hadoop mailing list.

Thanks
---
Joe Naegele
Grier Forensics

From: Michael Stratton [mailto:michael.strat...@komodohealth.com]
Sent: Monday, December 19, 2016 10:00 AM
To: Joseph Naegele <jnaeg...@grierforensics.com>
Cc: user <user@spark.apache.org>
Subject: Re: [Spark SQL] Task failed while writing rows

It seems like an issue w/ Hadoop. What do you get when you run hdfs dfsadmin -report?

Anecdotally (and w/o specifics, as it has been a while), I've generally used Parquet instead of ORC, as I've run into a bunch of random problems reading and writing ORC w/ Spark... but given ORC performs a lot better w/ Hive, that can be a pain.

On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). Its current compressed size is around 13 GB, but my problem started when it was much smaller, maybe 5 GB. This dataset is generated by performing a query on an existing ORC dataset in HDFS, selecting a subset of the existing data (i.e. removing duplicates).

When I write this dataset to HDFS using ORC, I get the following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32
Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...

This happens multiple times.
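For concreteness, the failing step is roughly of the following shape. This is only a minimal sketch: the paths are placeholders, a generic dropDuplicates() stands in for the actual subsetting query, and it assumes Spark 1.6's HiveContext-backed ORC support rather than the exact code in use.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Sketch of the dedup-and-rewrite job described above.
    // Paths and the dedup criterion are placeholders, not the real query.
    val sc = new SparkContext(new SparkConf().setAppName("orc-dedup-write"))
    val sqlContext = new HiveContext(sc)

    // Read the existing ORC dataset from HDFS.
    val existing = sqlContext.read.orc("hdfs:///data/records_orc")

    // Select the subset to keep (placeholder: plain duplicate removal).
    val deduped = existing.dropDuplicates()

    // Write back to HDFS as ORC; this is the step that fails with
    // "Task failed while writing rows".
    deduped.write.orc("hdfs:///data/records_orc_dedup")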
The executors log the following a few times before hitting the same exceptions as above:

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)

My HDFS datanode says:

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:57790 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]

It looks like the datanode is receiving the block on multiple ports (threads?) and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (total: 36 cores, 144 GB) and experienced many of these issues, where occasionally my job would fail altogether. Lowering the number of cores appears to reduce the frequency of these errors; however, I'm now down to 4 executors with 2 cores each (total: 8 cores), which is significantly less, and I still see approximately 1-3 task failures.

Details:
- Spark 1.6.3
- Standalone
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark

Does anybody have any ideas or hints? I can't imagine the problem is solely related to the number of executor cores.

Thanks,
Joe Naegele
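One experiment worth noting: the 60000 ms in the DataXceiver error above matches the default HDFS read timeout (dfs.client.socket-timeout). Raising the HDFS socket timeouts may only mask a stalled writer rather than fix it, but it can help distinguish a slow pipeline from a dead one. A minimal sketch of doing so from the Spark side via the spark.hadoop.* passthrough, with purely illustrative values (the datanode reads the same keys from its own hdfs-site.xml and would need the change, plus a restart, separately):

    import org.apache.spark.{SparkConf, SparkContext}

    // Experiment only: raise the HDFS socket timeouts seen by the DFSClient
    // inside the Spark driver/executors. Values are illustrative, not tuned.
    // The datanode side of the pipeline is configured in hdfs-site.xml.
    val conf = new SparkConf()
      .setAppName("orc-dedup-write")
      .set("spark.hadoop.dfs.client.socket-timeout", "180000")          // default 60000 ms
      .set("spark.hadoop.dfs.datanode.socket.write.timeout", "960000")  // default 480000 ms
    val sc = new SparkContext(conf)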