Thanks Michael, hdfs dfsadmin -report tells me:

 

Configured Capacity: 7999424823296 (7.28 TB)
Present Capacity: 7997657774971 (7.27 TB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used: 38566006784 (35.92 GB)
DFS Used%: 0.48%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: XXX.XXX.XXX
Decommission Status : Normal
Configured Capacity: 7999424823296 (7.28 TB)
DFS Used: 38566006784 (35.92 GB)
Non DFS Used: 1767048325 (1.65 GB)
DFS Remaining: 7959091768187 (7.24 TB)
DFS Used%: 0.48%
DFS Remaining%: 99.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 17
Last contact: Mon Dec 19 13:00:06 EST 2016

 

The Hadoop exception occurs because the datanode times out after 60 seconds in a “select” 
call on a java.nio.channels.SocketChannel while waiting to read from the socket. This 
implies the client writer isn’t writing to the socket as expected, but shouldn’t this 
all be handled by the Hadoop library within Spark?
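
If I understand correctly, that 60-second window is the default HDFS socket read timeout, 
so one thing I can try is raising the timeouts on both sides. A rough sketch of the 
client-side settings (values are illustrative; the datanode would need the same keys 
raised in its own hdfs-site.xml, since it doesn’t see the job’s Configuration):

// Sketch only: raise the HDFS socket timeouts used by Spark's DFSClient
// before the ORC write. sc is the job's existing SparkContext.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("dfs.client.socket-timeout", "300000")          // read timeout in ms (default 60000)
hadoopConf.set("dfs.datanode.socket.write.timeout", "600000")  // write timeout in ms (default 480000)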

 

It looks like a few similar, but rare, cases have been reported before, e.g. 
https://issues.apache.org/jira/browse/HDFS-770, which is *very* old.

 

If you’re pretty sure Spark couldn’t be responsible for issues at this level, 
I’ll stick to the Hadoop mailing list.

 

Thanks

---

Joe Naegele

Grier Forensics

 

From: Michael Stratton [mailto:michael.strat...@komodohealth.com] 
Sent: Monday, December 19, 2016 10:00 AM
To: Joseph Naegele <jnaeg...@grierforensics.com>
Cc: user <user@spark.apache.org>
Subject: Re: [Spark SQL] Task failed while writing rows

 

It seems like an issue w/ Hadoop. What do you get when you run hdfs dfsadmin 
-report?

 

Anecdotally (and without specifics, as it has been a while), I've generally used 
Parquet instead of ORC because I've hit a bunch of random problems reading and 
writing ORC w/ Spark... but given that ORC performs a lot better w/ Hive, that can 
be a pain.
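
If it's easy to test on your end, switching the write format is close to a one-line 
change, something like the following (df being whatever DataFrame you're writing; 
the path is just a placeholder):

// Hypothetical test: write the same DataFrame out as Parquet instead of ORC.
df.write.mode("overwrite").parquet("hdfs:///tmp/dedup_parquet_test")
// vs. the current ORC write, e.g. df.write.format("orc").save(...)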

 

On Sun, Dec 18, 2016 at 5:49 PM, Joseph Naegele <jnaeg...@grierforensics.com> wrote:

Hi all,

I'm having trouble with a relatively simple Spark SQL job. I'm using Spark 
1.6.3. I have a dataset of around 500M rows (average 128 bytes per record). 
Its current compressed size is around 13 GB, but my problem started when it 
was much smaller, maybe 5 GB. This dataset is generated by performing a query 
on an existing ORC dataset in HDFS, selecting a subset of the existing data 
(i.e. removing duplicates); a stripped-down sketch of the job is near the end 
of this message. When I write this dataset to HDFS using ORC I get the 
following exceptions in the driver:

org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.lang.RuntimeException: Failed to commit task
Suppressed: java.lang.IllegalArgumentException: Column has wrong number of index entries found: 0 expected: 32

Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad. Aborting...

This happens multiple times. The executors log the following a few times 
before hitting the same exceptions as above:

 

2016-12-09 02:38:12.193 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

2016-12-09 02:41:04.679 WARN DFSClient: DFSOutputStream ResponseProcessor exception for block BP-1695049761-192.168.2.211-1479228275669:blk_1073862425_121642
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)


My HDFS datanode says:

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:57836, dest: /127.0.0.1:50010, bytes: 14808395, op: HDFS_WRITE, cliID: DFSClient_attempt_201612090102_0000_m_000025_0_956624542_193, offset: 0, srvID: 1003b822-200c-4b93-9f88-f474c0b6ce4a, blockid: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, duration: 93026972

2016-12-09 02:39:24,783 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1695049761-192.168.2.211-1479228275669:blk_1073862420_121637, type=LAST_IN_PIPELINE, downstreams=0:[] terminating

2016-12-09 02:39:49,262 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXX.XXX.XXX.XXX:50010:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:57790 dst: /127.0.0.1:50010
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:57790]


It looks like the datanode is receiving the block on multiple ports (threads?) 
and one of the sending connections terminates early.

I was originally running 6 executors with 6 cores and 24 GB RAM each (Total: 36 
cores, 144 GB) and experienced many of these issues, where occasionally my job 
would fail altogether. Lowering the number of cores appears to reduce the 
frequency of these errors; however, I'm now down to 4 executors with 2 cores 
each (Total: 8 cores), which is significantly less, and I still see approximately 
1-3 task failures.
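
For concreteness, in standalone mode the original sizing corresponds to roughly 
these properties (values mirror the numbers above; the later runs only lowered the 
core counts):

// Rough sketch of the original standalone sizing (6 executors x 6 cores x 24 GB each).
// The current runs use spark.executor.cores=2 and spark.cores.max=8 instead.
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.cores", "6")     // cores per executor
  .set("spark.cores.max", "36")         // total cores -> 6 executors on this worker
  .set("spark.executor.memory", "24g")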

Details:
- Spark 1.6.3 - Standalone
- RDD compression enabled
- HDFS replication disabled
- Everything running on the same host
- Otherwise vanilla configs for Hadoop and Spark
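
For reference, the job is essentially of this shape (paths, key columns, and the 
exact dedup condition are simplified placeholders):

// Stripped-down sketch of the job (Spark 1.6, so ORC goes through HiveContext).
// Paths and key columns are placeholders, not the real ones.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
val df = sqlContext.read.format("orc").load("hdfs:///data/records_orc")

// Keep the subset of interest (dedup on a few key columns), then write back out as ORC.
val deduped = df.dropDuplicates(Seq("id", "timestamp"))
deduped.write.format("orc").save("hdfs:///data/records_orc_dedup")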

Does anybody have any ideas or hints? I can't imagine the problem is solely 
related to the number of executor cores.

Thanks,
Joe Naegele

 
