I originally posted this question on Stack Overflow:
https://stackoverflow.com/questions/57516660/cassandra-node-to-node-encryption-throws-unable-to-gossip-with-peers-exception

However, I haven't received any responses in over a week, so I'm posting it
here in the hope that someone has an idea of where I can look.

We currently run a multi-region Cassandra cluster in AWS, spanning four
regions with 12 nodes per region. It runs without node-to-node encryption
or client encryption. We are trying to enable inter-datacenter node-to-node
encryption. However, when we switch encryption on, the node throws an
exception saying it is unable to gossip with any peers.

It is possible that we didn't build our JKS keystore/truststore files
correctly (more on how we built them below). However, intra-datacenter
communication does not work either, even though it should remain
unencrypted with this setting. In addition, cqlsh cannot connect to the
node, even though require_client_auth is left at its default of false.

ERROR [main] 2019-08-15 18:46:32,241 CassandraDaemon.java:749 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1435) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:566) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:823) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:683) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:632) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:388) [apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:620) [apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:732) [apache-cassandra-3.11.4.jar:3.11.4]
INFO  [main] 2019-08-15 18:47:07,384 YamlConfigurationLoader.java:89 - Configuration location: file:/etc/cassandra/cassandra.yaml


Note that this error appears only after the node has been up for a few
minutes; that is, there is a delay between startup and the exception being
thrown.

*Information about our Cassandra setup*

Cassandra version: 3.11.4
JDK version: openjdk-8
Linux: Ubuntu 18.04 (bionic)

*cassandra.yaml*

endpoint_snitch: Ec2MultiRegionSnitch

server_encryption_options:
  internode_encryption: dc
  keystore: <omitted>
  keystore_password: <omitted>
  truststore: <omitted>
  truststore_password: <omitted>

client_encryption_options:
  enabled: false

*cassandra-rackdc.properties*

prefer_local=true

*No obvious errors in SSL output*

When starting Cassandra with JVM_OPTS="$JVM_OPTS -Djavax.net.debug=ssl" added
to cassandra-env.sh, we see SSL logs printed to stdout (*Note: Subject and
Issuer were omitted on purpose*).

found key for : cassy-us-west-2
adding as trusted cert:
  Subject: ...
  Issuer:  ...
  Algorithm: RSA; Serial number: 0xdad28d843fc73325d4c1a75207d4e74
  Valid from Fri May 27 00:00:00 UTC 2016 until Tue May 26 23:59:59 UTC 2026

...

trigger seeding of SecureRandom
done seeding SecureRandom

Compared against the Java SE SSL/TLS connection debugging guide
<https://docs.oracle.com/javase/7/docs/technotes/guides/security/jsse/ReadDebug.html>,
this looks correct. Note, however, that we see this series of messages
(along with the RSA key signature output) repeated several times in rapid
succession. We never observe any messages about the trust store being
added; however, that might be something that occurs only on client
initiation (?)

Additionally, we do see cassandra report that the Encrypted Messaging
service has been started.

INFO  [main] 2019-08-15 18:45:31,022 MessagingService.java:704 - Starting Encrypted Messaging Service on SSL port 7001
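
Since the log reports the encrypted listener on port 7001, one check that can be run from another node is whether that port is reachable and completes a TLS handshake. The following is only a minimal sketch and is not part of the original report: the helper name check_tls_port and the example address are illustrative assumptions, and openssl s_client is used simply because it is commonly available.

```shell
# Hypothetical helper: attempt a TLS handshake against a node's SSL storage port.
# Arguments: host, optional port (defaults to 7001, the port shown in the log).
check_tls_port() {
    host="$1"
    port="${2:-7001}"
    # Closing stdin makes s_client exit once the handshake has been attempted.
    openssl s_client -connect "${host}:${port}" </dev/null 2>&1
}

# Example (the address is illustrative):
#   check_tls_port 10.0.2.7 | grep -E 'Verify return code|subject=|issuer='
```

If the connection times out or is refused instead of failing during the handshake, that would suggest a reachability problem for that port (for example firewall or security group rules) rather than a certificate problem.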

*Doesn't appear to be a cassandra.yaml configuration problem*

We can bring the node back online simply by setting internode_encryption:
none. This seems to rule out a broadcast_address or rpc_address
configuration problem.

*How we built our keystore/truststores*

We followed the basic template in the DataStax docs for preparing SSL certificates
<https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/configuration/secureSSLCertWithCA.html>.
The one minor difference is that our private keys and CSRs were generated
using openssl, one per region (we plan to share the key and signed cert
across the nodes in each region). They were created using a command
template like this:

openssl req -new -newkey rsa:2048 -out cassy-<region>.csr \
    -keyout cassy-<region>.key -config cassy-<region>.conf -subj "..." \
    -nodes -sha256

The generated CSR was then signed by an internal root CA. Because we
generated our files using openssl, we had to build our jks files by
importing our certs into them.

*Commands to generate truststore*

We distribute this one file to all nodes.

keytool -importcert \
    -keystore generic-server-truststore.jks \
    -alias rootCa \
    -file rootCa.crt \
    -noprompt \
    -keypass omitted \
    -storepass omitted

*Commands to generate keystore*

This was done once per region. Essentially, we created a keystore with
keytool, deleted the generated key entry, and then imported our own key
entry from a PKCS12 file using keytool.

keytool -genkeypair -keyalg RSA -alias cassy-${region} \
    -keystore cassy-${region}.jks -storepass omitted -keypass omitted \
    -validity 365 -keysize 2048 -dname "..."

keytool -delete -alias cassy-${region} -keystore cassy-${region}.jks \
    -storepass omitted

openssl pkcs12 -export -in signed_certs/${region}.pem \
    -inkey keys/cassandra.${region}.key -name cassy-${region} \
    -out ${region}.p12

keytool -importkeystore -deststorepass omitted \
    -destkeystore cassy-${region}.jks -srckeystore ${region}.p12 \
    -srcstoretype PKCS12

keytool -importcert -keystore cassy-${region}.jks -alias rootCa \
    -file ca.crt -noprompt -keypass omitted -storepass omitted

Looking back at this, I don't remember why we used keytool to generate a
keypair/keystore, then deleted and imported. I think it was because the
keytool importkeystore command refused to run if the keystore didn't
already exist.
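
Because the keystore was assembled in several steps, it may be worth confirming what actually ended up in it. The following is a minimal sketch, not something taken from the steps above: the helper name show_key_entry is hypothetical, and it only filters the output of keytool -list -v down to a few summary lines.

```shell
# Hypothetical helper: summarise one entry of a JKS keystore.
# Arguments: keystore path, alias, store password.
# Note: a password passed on the command line is visible to other local users.
show_key_entry() {
    keytool -list -v -keystore "$1" -alias "$2" -storepass "$3" |
        grep -E '^(Alias name|Entry type|Certificate chain length|Owner|Issuer):'
}

# Example (the password variable is illustrative):
#   show_key_entry cassy-us-west-2.jks cassy-us-west-2 "$STOREPASS"
```

For the imported key one would expect an entry type of PrivateKeyEntry with a certificate chain attached, whereas the rootCa alias in either store should show up as trustedCertEntry.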

*ca.crt and pem file*

The ca.crt file contains the root certificate and the intermediate
certificate that was used to sign the CSR. The pem file contains the signed
certificate returned to us, the intermediate cert, and the root CA (in that
order).

*openssl verify ca.crt and pem*

openssl verify -CAfile ca.crt signed_certs/us-west-2.pem
signed_certs/us-west-2.pem: OK
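
To my understanding, openssl verify checks only the first certificate in the file it is given, so the OK above does not by itself confirm the contents or order of the rest of the bundle. The following is a minimal sketch for inspecting the bundles directly; the helper names count_pem_certs and list_pem_chain are hypothetical.

```shell
# Hypothetical helper: count the certificates contained in a PEM bundle.
count_pem_certs() {
    grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Hypothetical helper: print subject and issuer of every certificate in a bundle.
list_pem_chain() {
    openssl crl2pkcs7 -nocrl -certfile "$1" |
        openssl pkcs7 -print_certs -noout
}

# Example:
#   count_pem_certs signed_certs/us-west-2.pem
#   list_pem_chain signed_certs/us-west-2.pem
```

Going by the description above, the pem file should contain three certificates and ca.crt two, with each certificate's issuer matching the subject of the one that follows it in the pem file.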

*Command output after enabling encryption*

*nodetool status (output truncated)*

Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID  Rack
?N  52.44.11.221    ?           256     25.4%             null     1c
...
?N  52.204.232.195  ?           256     23.2%             null     1d
Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load        Tokens  Owns (effective)  Host ID  Rack
?N  34.209.2.144    ?           256     26.5%             null     2c
UN  52.40.32.177    105.99 GiB  256     23.7%             null     2c
?N  34.210.109.203  ?           256     24.7%             null     2a
...

The only node shown as up (UN) is the node with encryption enabled.

*cqlsh to localhost*

cassy-node6:~$ cqlsh
Connection error: ('Unable to connect to any servers', {'127.0.0.1':
error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error:
Connection refused")})

*cqlsh to remote node*

The remote node is a node with encryption enabled.

cassy-node6:~$ cqlsh 10.0.2.7
Connection error: ('Unable to connect to any servers', {'10.0.2.7':
error(111, "Tried connecting to [('10.0.2.7', 9042)]. Last error:
Connection refused")})
