Hi, just a follow-up. We have now seen this behavior multiple times. It seems that the receiving node loses connectivity to the cluster right after the streaming finishes: it then considers itself the only online node, while the rest of the cluster sees it as the only offline node. I am not sure what causes this, but it is reproducible. Restarting the affected node helps.
We have 3 datacenters (RF=1 for each datacenter) where we are moving the tokens. This happens only in one of them.

Regards
Jiri Horky

On 12/19/2014 08:20 PM, Jiri Horky wrote:
> Hi list,
>
> we added a new node to an existing 8-node cluster with C* 1.2.9 without
> vnodes and because we are almost totally out of space, we are shuffling
> the tokens one node after another (not in parallel). During one of these
> move operations, the receiving node died and thus the streaming failed:
>
> WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
> StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
> INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
> ColumnFamilyStore.java (line 629) Enqueuing flush of
> Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
> INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
> 461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
> ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
> CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
> /X.Y.Z.18:2,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Broken pipe
>     at com.google.common.base.Throwables.propagate(Throwables.java:160)
>     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Broken pipe
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>
> After restart of the receiving node, we tried to perform the move again,
> but it failed with:
>
> Exception in thread "main" java.io.IOException: target token
> 113427455640312821154458202477256070486 is already owned by another node.
>     at org.apache.cassandra.service.StorageService.move(StorageService.java:2930)
>
> So we tried to move it with a token just 1 higher, to trigger the
> movement. This didn't move anything, but finished successfully:
>
> INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
> 199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
> from /X.Y.Z.18
>
> Now, it is quite improbable that the first streaming was done and it
> died just after copying everything, as the ERROR was the last message
> about streaming in the logs. Is there any way to make sure the data
> are really moved and thus running nodetool cleanup is safe?
>
> Thank you.
> Jiri Horky
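
One possible reading of why the second move (target token + 1) finished without streaming anything can be sketched as below. This is only an illustration under assumptions that the thread does not confirm: that the cluster uses RandomPartitioner (token space of 2**127), and that the ring already listed the node at the original target token when the second move was issued. The function name is hypothetical. If those assumptions hold, the second move covers a single token, so a "successful" session with nothing transferred would be expected and would not by itself show that the data from the interrupted move arrived.

```python
# Sketch only. Assumptions (not confirmed in the thread): RandomPartitioner
# with a token space of 2**127, and the ring already listing the node at
# the original target token when the second move was issued.
RING_SIZE = 2**127


def tokens_gained_by_move(old_token, new_token):
    """Count the tokens in (old_token, new_token] on the wrapping ring.

    This is the slice a node takes over from its successor when its token
    moves forward without passing any other node's token.
    """
    return (new_token - old_token) % RING_SIZE


# Target token quoted in the original message.
target = 113427455640312821154458202477256070486

# The retry used a token just 1 higher than the target.
print(tokens_gained_by_move(target, target + 1))  # prints 1
```

In other words, under these assumptions the retry only asks for one token's worth of data, which says nothing about the much larger range that the failed session was supposed to deliver.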