Hi!

I'm bulkloading from Hadoop to Cassandra. I'm currently in the process of moving to new hardware for both Hadoop and Cassandra, and while test-running a bulkload I see the following error:

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.RuntimeException: java.io.EOFException
        at com.google.common.base.Throwables.propagate(Throwables.java:155)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193)
        at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
        at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        ... 3 more

I see no exceptions related to this on the destination node (2001:4c28:1:413:0:1:1:12:1).

This makes the whole map task fail with:

2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:forsberg (auth:SIMPLE) cause:java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
        at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244)
        at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
        at org.apache.hadoop.mapred.Child.main(Child.java:260)
2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

The failed task was on Hadoop worker node hdp01-12-4.

However, Hadoop later retries this map task on a different worker node 
(hdp01-10-2), and that retry succeeds.

So that's weird, but I could live with it. The real trouble, however, is that 
the Hadoop job does not finish, because one task running on hdp01-12-4 is 
stuck with this:

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.IllegalStateException: target reports current file is /opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db but is /opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000000_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
        at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154)
        at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45)
        at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199)
        at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
        at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

This just sits there forever, or at least until the Hadoop task timeout kicks 
in.

So two questions here:

1) Any clues as to what might cause the first EOFException? It appears for 
*some* of my bulkloads: not all, but often enough to be a problem. Roughly 
every tenth bulkload hits it.

2) I have a feeling the second problem could be related to 
https://issues.apache.org/jira/browse/CASSANDRA-4223, but with the extra quirk 
that in the bulkload case we have *multiple Java processes* creating 
streaming sessions on the same host, so streaming session IDs are not unique.
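To illustrate the quirk: assuming each loader JVM draws its session IDs from a counter private to that process (nothing coordinates counters across JVMs), two map tasks on the same worker node hand the target identical (source host, session id) pairs, so the target can mix up their transfers. A toy sketch; LoaderProcess and the key format are my inventions, not Cassandra classes:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class SessionIdCollision {
    // Hypothetical stand-in for one bulkloading JVM: session IDs come from a
    // per-process counter, so every JVM produces the same sequence 1, 2, 3, ...
    static class LoaderProcess {
        private final AtomicLong counter = new AtomicLong();
        long nextSessionId() { return counter.incrementAndGet(); }
    }

    public static void main(String[] args) {
        LoaderProcess mapTaskA = new LoaderProcess(); // two map tasks running
        LoaderProcess mapTaskB = new LoaderProcess(); // on the same worker node

        // If the target keys incoming streams by (source host, session id),
        // both tasks' first sessions map to the same key and collide.
        Set<String> sessionsSeenByTarget = new HashSet<>();
        System.out.println(sessionsSeenByTarget.add("hdp01-12-4:" + mapTaskA.nextSessionId())); // true
        System.out.println(sessionsSeenByTarget.add("hdp01-12-4:" + mapTaskB.nextSessionId())); // false: duplicate
    }
}
```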

My theory is that 2) happens because the EOFException in 1) left its streaming 
session sitting on the target node without ever being closed.

This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid 
upgrading until this migration from old to new hardware is done. Upgrading 
to 1.2.13 might be an option.

Any hints welcome.

Thanks,
\EF