I don't think the issue is an empty partition, but given the premature EOF
exception it may not hurt to try a repartition prior to writing, just to rule
it out.
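
Something along these lines is all I mean -- a minimal sketch in Spark 1.6 Scala,
where the paths, the partition count, and the use of dropDuplicates() are placeholders
for however the dataset is actually produced (a HiveContext is assumed, since the ORC
source in 1.6 needs Hive support):

    // Minimal sketch only -- paths and partition count are hypothetical.
    val deduped = sqlContext.read.orc("hdfs:///data/existing")   // placeholder input path
      .dropDuplicates()                                          // stand-in for the real subset query

    deduped
      .repartition(200)               // force a shuffle into evenly sized, non-empty partitions
      .write
      .orc("hdfs:///data/deduped")    // placeholder output path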

On Mon, Dec 19, 2016 at 1:53 PM, Joseph Naegele <jnaeg...@grierforensics.com
> wrote:

> Thanks Michael, hdfs dfsadmin -report tells me:
>
>
>
> Configured Capacity: 7999424823296 (7.28 TB)
> Present Capacity: 7997657774971 (7.27 TB)
> DFS Remaining: 7959091768187 (7.24 TB)
> DFS Used: 38566006784 (35.92 GB)
> DFS Used%: 0.48%
> Under replicated blocks: 0
> Blocks with corrupt replicas: 0
> Missing blocks: 0
> Missing blocks (with replication factor 1): 0
>
> -------------------------------------------------
> Live datanodes (1):
>
> Name: 127.0.0.1:50010 (localhost)
> Hostname: XXX.XXX.XXX
> Decommission Status : Normal
> Configured Capacity: 7999424823296 (7.28 TB)
> DFS Used: 38566006784 (35.92 GB)
> Non DFS Used: 1767048325 (1.65 GB)
> DFS Remaining: 7959091768187 (7.24 TB)
> DFS Used%: 0.48%
> DFS Remaining%: 99.50%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 17
> Last contact: Mon Dec 19 13:00:06 EST 2016
>
>
>
> The Hadoop exception occurs because the datanode times out after 60 seconds in a
> “select” call on a java.nio.channels.SocketChannel while waiting to read from the
> socket. This implies the client writer isn’t writing to the socket as expected, but
> shouldn’t all of this be handled by the Hadoop library within Spark?
>
>
>
> It looks like a few similar, but rare, cases have been reported before,
> e.g. https://issues.apache.org/jira/browse/HDFS-770 which is *very* old.
>
>
>
> If you’re pretty sure Spark couldn’t be responsible for issues at this
> level I’ll stick to the Hadoop mailing list.
>
>
>
> Thanks
>
> ---
>
> Joe Naegele
>
> Grier Forensics
>
>
>
> *From:* Michael Stratton [mailto:michael.strat...@komodohealth.com]
> *Sent:* Monday, December 19, 2016 10:00 AM
> *To:* Joseph Naegele <jnaeg...@grierforensics.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: [Spark SQL] Task failed while writing rows
>
>
>
> It seems like an issue w/ Hadoop. What do you get when you run hdfs
> dfsadmin -report?
>
>
>
> Anecdotally (and w/o specifics, as it has been a while), I've generally used
> Parquet instead of ORC because I've hit a bunch of random problems reading
> and writing ORC w/ Spark... but given that ORC performs a lot better w/ Hive,
> switching can be a pain.
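>
> On the write side it's only a one-line change, though -- a minimal sketch, with a
> placeholder path and df standing for whatever DataFrame is being written:
>
>     // Same DataFrame, different sink format; path is hypothetical.
>     df.write.parquet("hdfs:///data/output_parquet")   // instead of df.write.orc(...)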
>
>
>
> On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <
> jnaeg...@grierforensics.com> wrote:
>
> Hi all,
>
> I'm having trouble with a relatively simple Spark SQL job on Spark 1.6.3. I
> have a dataset of around 500M rows (average 128 bytes per record). Its
> current compressed size is around 13 GB, but my problem started when it was
> much smaller, maybe 5 GB. The dataset is generated by querying an existing
> ORC dataset in HDFS and selecting a subset of the existing data (i.e.
> removing duplicates). When I write this dataset back to HDFS as ORC I get
> the following exceptions in the driver:
>
> org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.lang.RuntimeException: Failed to commit task
> Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...
>
> This happens multiple times. The executors report the following a few times
> before hitting the same exceptions as above:
>
>
>
> 2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
>
> 2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
>
> java.io.EOFException: Premature EOF: no length prefix available
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
>
>
> My HDFS datanode says:
>
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972
>
> 2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
>
> 2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation src: /127.0.0.1:57790 dst: /127.0.0.1:50010
>
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]
>
>
> It looks like the datanode is receiving the block on multiple ports
> (threads?) and one of the sending connections terminates early.
>
> I was originally running 6 executors with 6 cores and 24 GB RAM each (36 cores,
> 144 GB total) and experienced many of these issues; occasionally the job failed
> altogether. Lowering the number of cores appears to reduce the frequency of the
> errors, but I'm now down to 4 executors with 2 cores each (8 cores total), which
> is significantly fewer, and I still see approximately 1-3 task failures.
>
> Details:
> - Spark 1.6.3 - Standalone
> - RDD compression enabled
> - HDFS replication disabled
> - Everything running on the same host
> - Otherwise vanilla configs for Hadoop and Spark
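>
> In SparkConf terms, a rough reconstruction of that setup (standard Spark 1.6 keys;
> the master URL and app name are placeholders):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     // Hedged sketch of the environment described above, not the exact configs used.
>     val conf = new SparkConf()
>       .setAppName("dedup-orc")               // placeholder
>       .setMaster("spark://localhost:7077")   // standalone master on the same host
>       .set("spark.executor.cores", "2")      // 2 cores per executor
>       .set("spark.cores.max", "8")           // caps the app at 4 executors x 2 cores
>       .set("spark.rdd.compress", "true")     // "RDD compression enabled"
>     val sc = new SparkContext(conf)
>     sc.hadoopConfiguration.setInt("dfs.replication", 1)   // "HDFS replication disabled"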
>
> Does anybody have any ideas or hints? I can't imagine the problem is
> solely related to the number of executor cores.
>
> Thanks,
> Joe Naegele
>
>
>
