Hi,

During an attempt to bootstrap a new node into a 1.2.16 ring, the new node
saw one of the streaming nodes periodically disappear:

 INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823)
InetAddress /10.156.1.2 is now DOWN
ERROR [GossipTasks:1] 2014-06-10 00:28:52,574 AbstractStreamSession.java
(line 108) Stream failed because /10.156.1.2 died or was restarted/removed
(streams may still be active in background, but further streams won't be
started)
 WARN [GossipTasks:1] 2014-06-10 00:28:52,574 RangeStreamer.java (line 246)
Streaming from /10.156.1.2 failed
 INFO [HANDSHAKE-/10.156.1.2] 2014-06-10 00:28:57,922
OutboundTcpConnection.java (line 418) Handshaking version with /10.156.1.2
 INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809)
InetAddress /10.156.1.2 is now UP

This brief interruption was enough to kill the streaming from node
10.156.1.2. Node 10.156.1.2 saw a similar "broken pipe" exception from the
bootstrapping node:

ERROR [Streaming to /10.156.193.1.3] 2014-06-10 01:22:02,345
CassandraDaemon.java (line 191) Exception in thread Thread[Streaming to /
10.156.1.3:1,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:420)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:552)
        at
org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93)
        at
org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)


During bootstrapping we noticed a significant spike in CPU and latency
across the board on the ring (CPU 50% -> 85% and write latencies 60ms ->
250ms). It seems likely that this persistent high load led to the hiccup
that caused the gossiper to see the streaming node as briefly down.
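
To put a number on "briefly", the DOWN and UP timestamps in the Gossiper
lines from the first excerpt can be diffed. The following is only a rough
sketch: the two log lines are copied from that excerpt and re-joined onto
single lines, while the regex and the helper name down_intervals are
illustrative assumptions, not anything shipped with Cassandra.

```python
import re
from datetime import datetime

# Gossiper lines from the first excerpt, re-joined onto single lines.
LOG_LINES = [
    " INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823) "
    "InetAddress /10.156.1.2 is now DOWN",
    " INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809) "
    "InetAddress /10.156.1.2 is now UP",
]

# Assumed line shape: "<date> <time>,<millis> ... InetAddress /<ip> is now UP|DOWN"
PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"InetAddress (/\S+) is now (UP|DOWN)"
)


def down_intervals(lines):
    """Return (address, seconds_down) for each DOWN that is followed by an UP."""
    down_since = {}  # address -> timestamp of the most recent DOWN
    intervals = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a Gossiper UP/DOWN line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        address, state = match.group(2), match.group(3)
        if state == "DOWN":
            down_since[address] = stamp
        elif address in down_since:
            gap = stamp - down_since.pop(address)
            intervals.append((address, gap.total_seconds()))
    return intervals


for address, seconds in down_intervals(LOG_LINES):
    print(f"{address} was marked down for {seconds:.3f} s")
# prints: /10.156.1.2 was marked down for 5.371 s
```

So in this instance the streaming node was considered down for a little
over five seconds before gossip marked it UP again.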

What is the proper way to recover from this? The original estimate was
almost 24 hours to stream all the data required to bootstrap this single
node (streaming set to unlimited) and this occurred 6 hours into the
bootstrap. With such high load from streaming it seems that simply
restarting will inevitably hit this problem again.


Cheers,

Mike

-- 

  Mike Heffner <m...@librato.com>
  Librato, Inc.
