I don't think the issue is an empty partition, but given the premature EOF exception it may not hurt to repartition before writing, just to rule that out.
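Rough, untested sketch of what I mean, in case it helps. "dedupedDf" and "outputPath" are just placeholders for whatever your dedup query produces and wherever you write it, and the partition count is an arbitrary guess:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Placeholder names: dedupedDf is the de-duplicated DataFrame from your query,
    // outputPath is the HDFS target directory.
    def writeDeduped(dedupedDf: DataFrame, outputPath: String): Unit = {
      dedupedDf
        .repartition(64)   // forces a shuffle and evens out partition sizes, so no empty partitions reach the ORC writer
        .write
        .format("orc")
        .mode(SaveMode.Overwrite)
        .save(outputPath)
    }

If the premature-EOF/timeout errors go away after that, it at least tells us something about how the partitions are being written.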
On Mon, Dec 19, 2016 at 1:53 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

> Thanks Michael, hdfs dfsadmin -report tells me:
>
> Configured Capacity: 7999424823296 (7.28 TB)
> Present Capacity: 7997657774971 (7.27 TB)
> DFS Remaining: 7959091768187 (7.24 TB)
> DFS Used: 38566006784 (35.92 GB)
> DFS Used%: 0.48%
> Under replicated blocks: 0
> Blocks with corrupt replicas: 0
> Missing blocks: 0
> Missing blocks (with replication factor 1): 0
>
> -------------------------------------------------
> Live datanodes (1):
>
> Name: 127.0.0.1:50010 (localhost)
> Hostname: XXX.XXX.XXX
> Decommission Status : Normal
> Configured Capacity: 7999424823296 (7.28 TB)
> DFS Used: 38566006784 (35.92 GB)
> Non DFS Used: 1767048325 (1.65 GB)
> DFS Remaining: 7959091768187 (7.24 TB)
> DFS Used%: 0.48%
> DFS Remaining%: 99.50%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 17
> Last contact: Mon Dec 19 13:00:06 EST 2016
>
> The Hadoop exception occurs because it times out after 60 seconds in a
> "select" call on a java.nio.channels.SocketChannel while waiting to read
> from the socket. This implies the client writer isn't writing to the socket
> as expected, but shouldn't this all be handled by the Hadoop library within
> Spark?
>
> It looks like a few similar, but rare, cases have been reported before,
> e.g. https://issues.apache.org/jira/browse/HDFS-770, which is *very* old.
>
> If you're pretty sure Spark couldn't be responsible for issues at this
> level, I'll stick to the Hadoop mailing list.
>
> Thanks
> ---
> Joe Naegele
> Grier Forensics
>
> *From:* Michael Stratton [mailto:michael.strat...@komodohealth.com]
> *Sent:* Monday, December 19, 2016 10:00 AM
> *To:* Joseph Naegele <jnaeg...@grierforensics.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: [Spark SQL] Task failed while writing rows
>
> It seems like an issue w/ Hadoop. What do you get when you run hdfs
> dfsadmin -report?
>
> Anecdotally (and w/o specifics, as it has been a while), I've generally used
> Parquet instead of ORC, as I've run into a bunch of random problems reading
> and writing ORC w/ Spark... but given that ORC performs a lot better w/ Hive,
> it can be a pain.
>
> On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:
>
> Hi all,
>
> I'm having trouble with a relatively simple Spark SQL job. I'm using Spark
> 1.6.3. I have a dataset of around 500M rows (average 128 bytes per record).
> Its current compressed size is around 13 GB, but my problem started when
> it was much smaller, maybe 5 GB. This dataset is generated by performing a
> query on an existing ORC dataset in HDFS, selecting a subset of the
> existing data (i.e. removing duplicates). When I write this dataset to HDFS
> using ORC, I get the following exceptions in the driver:
>
> org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.lang.RuntimeException: Failed to commit task
> Suppressed: java.lang.IllegalArgumentException: Column has wrong number
> of index entries found: 0 expected: 32
>
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
> Aborting...
>
> This happens multiple times.
> The executors tell me the following a few times before the same exceptions
> as above:
>
> 2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer
> class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
> java.io.EOFException: Premature EOF: no length prefix available
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
>
> My HDFS datanode says:
>
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace:
> src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE,
> cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0,
> srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a,
> blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> duration: 93026972
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637,
> type=LAST_IN_PIPELINE, downstreams=0:[] terminating
> 2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation
> src: /127.0.0.1:57790 dst: /127.0.0.1:50010
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel
> to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/127.0.0.1:50010 remote=/127.0.0.1:57790]
>
> It looks like the datanode is receiving the block on multiple ports
> (threads?) and one of the sending connections terminates early.
>
> I was originally running 6 executors with 6 cores and 24 GB RAM each
> (Total: 36 cores, 144 GB) and experienced many of these issues, where
> occasionally my job would fail altogether. Lowering the number of cores
> appears to reduce the frequency of these errors, however I'm now down to 4
> executors with 2 cores each (Total: 8 cores), which is significantly less,
> and still see approximately 1-3 task failures.
>
> Details:
> - Spark 1.6.3 - Standalone
> - RDD compression enabled
> - HDFS replication disabled
> - Everything running on the same host
> - Otherwise vanilla configs for Hadoop and Spark
>
> Does anybody have any ideas or hints? I can't imagine the problem is
> solely related to the number of executor cores.
>
> Thanks,
> Joe Naegele
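Regarding the 60-second SocketTimeoutException in the datanode log: 60000 ms is the stock HDFS socket timeout, so one more purely diagnostic thing you could try (just a guess on my part, not a fix) is raising the client-side timeouts for the job and seeing whether the failures merely take longer to appear. A minimal sketch, assuming a SparkContext named sc; both keys are standard hdfs-site.xml settings and the 120s values are arbitrary:

    // Diagnostic only: raise the HDFS client socket timeouts above the 60s default.
    sc.hadoopConfiguration.set("dfs.client.socket-timeout", "120000")         // client read timeout (ms)
    sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "120000") // client-to-datanode write timeout (ms)

The datanode-side equivalents would have to go in the datanode's hdfs-site.xml rather than here. If the errors still show up, just later, that points at the write pipeline stalling rather than at a too-tight timeout.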
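One more quick experiment, following Michael's point above: writing the same DataFrame as Parquet would tell you whether this is something ORC-specific or purely an HDFS write-pipeline problem. Same placeholder names as in the earlier sketch, and the "_parquet_test" suffix is made up:

    // Same placeholder dedupedDf/outputPath as above; Parquet instead of ORC,
    // to separate an ORC writer problem from an HDFS pipeline problem.
    dedupedDf
      .repartition(64)
      .write
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .save(outputPath + "_parquet_test")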