Hi all,
We have a 6-node Cassandra cluster that has worked fine for a long
time through upgrades from 0.8.x to 1.1.x. Recently we upgraded to
1.2.2, and since then streaming repair no longer works (everything
else works: gossip, serving Thrift queries, etc.). We then upgraded to
1.2.3 and updated the JDK to the latest version (1.7u17), but neither
helped. The only error in the logs is the following:
INFO [AntiEntropyStage:1] 2013-03-25 09:30:33,493 StreamOutSession.java (line 162) Streaming to /xxx.xxx.xxx.xxx
INFO [Streaming to /10.181.129.193:1] 2013-03-25 09:30:33,859 StreamReplyVerbHandler.java (line 50) Need to re-stream file /var/lib/cassandra/data/....db to /xxx.xxx.xxx.xxx
INFO [Streaming to /10.181.129.193:1] 2013-03-25 09:30:33,994 StreamReplyVerbHandler.java (line 50) Need to re-stream file /var/lib/cassandra/data/....db to /xxx.xxx.xxx.xxx
INFO [Streaming to /10.181.129.193:1] 2013-03-25 09:30:34,190 StreamReplyVerbHandler.java (line 50) Need to re-stream file /var/lib/cassandra/data/.....db to /xxx.xxx.xxx.xxx
ERROR [Streaming to /10.181.129.193:1] 2013-03-25 09:30:34,474 CassandraDaemon.java (line 164) Exception in thread Thread[Streaming to /xxx.xxx.xxx.xxx:1,5,main]
java.lang.RuntimeException: java.io.EOFException
    at com.google.common.base.Throwables.propagate(Throwables.java:160)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193)
    at org.apache.cassandra.streaming.compress.CompressedFileStreamTask.stream(CompressedFileStreamTask.java:114)
    at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    ... 3 more
After this the repair command hangs, and after a few cycles the nodes
start running out of memory, with the heap full of Merkle tree
related data structures.
We have now found that streaming works again when we turn internode
encryption off. Is there anything that could explain why regular
internode traffic works (otherwise Thrift queries would fail as well)
but streaming does not?
Our internode encryption settings were:
server_encryption_options:
    internode_encryption: all
    keystore: conf/.keystore
    keystore_password: xxxxxxxx
    truststore: conf/.truststore
    truststore_password: xxxxxxxx
    protocol: TLS
    algorithm: SunX509
    store_type: JKS
    cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]
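For comparison, here is a sketch of the variant with which streaming works. It assumes that encryption was turned off by setting internode_encryption to none and that every other option was left as shown above; the exact file was not pasted, so treat it as illustrative only:

```yaml
server_encryption_options:
    # Assumed workaround: "none" disables internode encryption
    # (values documented in cassandra.yaml: all, none, dc, rack).
    internode_encryption: none
    # Remaining options assumed unchanged from the settings above.
    keystore: conf/.keystore
    keystore_password: xxxxxxxx
    truststore: conf/.truststore
    truststore_password: xxxxxxxx
    protocol: TLS
    algorithm: SunX509
    store_type: JKS
    cipher_suites: [TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA]
```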
Best regards,
Mathijs