Hi,

During an attempt to bootstrap a new node into a 1.2.16 ring, the new node saw one of the streaming nodes periodically disappear:
 INFO [GossipTasks:1] 2014-06-10 00:28:52,572 Gossiper.java (line 823) InetAddress /10.156.1.2 is now DOWN
ERROR [GossipTasks:1] 2014-06-10 00:28:52,574 AbstractStreamSession.java (line 108) Stream failed because /10.156.1.2 died or was restarted/removed (streams may still be active in background, but further streams won't be started)
 WARN [GossipTasks:1] 2014-06-10 00:28:52,574 RangeStreamer.java (line 246) Streaming from /10.156.1.2 failed
 INFO [HANDSHAKE-/10.156.1.2] 2014-06-10 00:28:57,922 OutboundTcpConnection.java (line 418) Handshaking version with /10.156.1.2
 INFO [GossipStage:1] 2014-06-10 00:28:57,943 Gossiper.java (line 809) InetAddress /10.156.1.2 is now UP

This brief interruption was enough to kill the streaming from node 10.156.1.2. Node 10.156.1.2 saw a similar "broken pipe" exception from the bootstrapping node:

ERROR [Streaming to /10.156.1.3] 2014-06-10 01:22:02,345 CassandraDaemon.java (line 191) Exception in thread Thread[Streaming to /10.156.1.3:1,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:420)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:552)
        at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:93)
        at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

During bootstrapping we noticed a significant spike in CPU and latency across the board on the ring (CPU 50% -> 85%, write latencies 60ms -> 250ms). It seems likely that this persistent high load led to the hiccup that caused the gossiper to see the streaming node as briefly down.

What is the proper way to recover from this? The original estimate was almost 24 hours to stream all the data required to bootstrap this single node (streaming throughput set to unlimited), and the failure occurred 6 hours into the bootstrap. With such high load from streaming, it seems that simply restarting the bootstrap will inevitably hit this problem again.

Cheers,

Mike

--
Mike Heffner <m...@librato.com>
Librato, Inc.
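P.S. For context, "streaming throughput set to unlimited" above means the stream throttle was disabled rather than left at the default. Roughly speaking it was something like the following (the yaml key is from cassandra.yaml, where 0 disables throttling; whether the nodetool subcommand is usable on our exact 1.2.16 build is an assumption on my part):

    # cassandra.yaml on the source nodes -- 0 means no throttle
    stream_throughput_outbound_megabits_per_sec: 0

    # or adjusted at runtime:
    nodetool setstreamthroughput 0

Throttling this back down before retrying is an obvious knob, but it would stretch the bootstrap well past the original 24-hour estimate.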