Hi!
I'm bulkloading from Hadoop to Cassandra. I'm currently in the process of
moving both Hadoop and Cassandra to new hardware, and while test-running a
bulkload, I see the following error:
Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.RuntimeException: java.io.EOFException
    at com.google.common.base.Throwables.propagate(Throwables.java:155)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193)
    at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
    at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    ... 3 more
I see no exceptions related to this on the destination node
(2001:4c28:1:413:0:1:1:12:1).
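For what it's worth, the EOFException comes from FileStreamTask waiting for a
reply from the target: readInt() blocks for 4 bytes, and if the target closes
the connection (or never sends the reply) before those bytes arrive, the read
hits end-of-stream. A minimal demonstration of that behaviour (the truncated
byte array stands in for a socket the peer closed mid-reply):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class EofDemo {
    public static void main(String[] args) throws IOException {
        // FileStreamTask.receiveReply() does a blocking readInt() on the
        // socket's input stream. If the peer closes the connection before
        // all 4 reply bytes arrive, readInt() hits end-of-stream and
        // throws EOFException -- the same exception as in the trace above.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(new byte[2])); // only 2 of 4 bytes
        try {
            in.readInt();
            System.out.println("read succeeded");
        } catch (EOFException e) {
            System.out.println("EOFException: stream ended mid-int");
        }
    }
}
```

So the interesting question is why the target dropped the connection without
logging anything.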
This makes the whole map task fail with:
2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:forsberg (auth:SIMPLE) cause:java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.Child.main(Child.java:260)
2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
The failed task was on Hadoop worker node hdp01-12-4.
However, Hadoop later retries this map task on a different worker node
(hdp01-10-2), and that retry succeeds.
So that's weird, but I could live with it. The real trouble, however, is that
the Hadoop job never finishes, because one task running on hdp01-12-4 is stuck
with this:
Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.IllegalStateException: target reports current file is /opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db but is /opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000000_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
    at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154)
    at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45)
    at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199)
    at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
    at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
This just sits there forever, or at least until the Hadoop task timeout kicks
in.
So, two questions:
1) Any clues on what might cause the first EOFException? It appears for *some*
of my bulkloads - not all, but frequently enough to be a problem: roughly every
tenth bulkload hits it.
2) I have a feeling the second problem could be related to
https://issues.apache.org/jira/browse/CASSANDRA-4223, with the extra quirk
that in the bulkload case we have *multiple Java processes* creating streaming
sessions on the same host, so streaming session IDs are not unique.
I'm thinking 2) happens because the EOFException from 1) left the streaming
session sitting around on the target node without being closed.
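If streaming sessions in 1.2 really are keyed by (source host, per-JVM session
counter), as CASSANDRA-4223 suggests, then the collision I'm suspecting can be
sketched like this (class and method names below are made up purely for
illustration, they are not Cassandra's):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class SessionIdCollision {
    // Hypothetical stand-in for a per-JVM session counter: each bulkload
    // process starts its own counter at zero, so the counter is only
    // unique within one JVM, not across JVMs on the same host.
    static class BulkLoaderProcess {
        private final AtomicLong counter = new AtomicLong();
        long newSessionId() { return counter.incrementAndGet(); }
    }

    public static void main(String[] args) {
        // Two map tasks (separate JVMs) on the same Hadoop worker node,
        // both streaming to the same Cassandra target node.
        BulkLoaderProcess task1 = new BulkLoaderProcess();
        BulkLoaderProcess task2 = new BulkLoaderProcess();

        // The target keys sessions by (source host, session id); both
        // tasks share the worker's source address, so their keys collide.
        String sourceHost = "/2001:4c28:1:413:0:1:1:12";
        String key1 = sourceHost + ":" + task1.newSessionId();
        String key2 = sourceHost + ":" + task2.newSessionId();

        Set<String> sessionsOnTarget = new HashSet<>();
        sessionsOnTarget.add(key1);
        boolean unique = sessionsOnTarget.add(key2); // false: duplicate key
        System.out.println("second session unique? " + unique);
    }
}
```

A stale session left behind by 1) plus a colliding key from a second map task
would explain the target reporting a file path from a *different* task attempt.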
This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid
upgrading until I have completed this migration from old to new hardware.
Upgrading to 1.2.13 might be an option.
Any hints welcome.
Thanks,
\EF