Hi,

I have a strong suspicion that you're encountering a resource leak bug in the Erlang SSL libraries that Riak uses. By odd coincidence, I ran into a very similar issue at my last job while working on a completely different project, and I helped develop a fix a couple of years ago. The patch was accepted somewhere in the Erlang R17 timeframe, but Riak doesn't support R17 yet, so you probably don't have a version of Erlang with this particular fix in place.

To check whether you are hitting this resource leak, attach an Erlang shell to a node using the "riak attach" command and copy/paste this one line of code:

    lists:keyfind(size, 1, ets:info(element(2, lists:keyfind(ssl_otp_cacertificate_db, 1, [{ets:info(T, name), T} || T <- ets:all()])))).

Running the above command should give you something that looks like this:

    {size, 0}

You will likely see a number larger than 0, but the general format should be the same. In normal usage, this number should be fairly small, but in your case you may see it grow steadily over time (specifically, you may see it increment by one for each new incoming SSL connection and never decrement, even after connections are closed). Once this value gets up into the tens or hundreds of thousands or more, establishing new SSL connections will get slower and slower, much like you're seeing.

If you can verify that this size value continues to grow over time, we can take a look at backporting the relevant fix to our custom Basho fork of Erlang. Let me know what you find, and we can take it from there.

Thanks!
Nick

On Mon, Jul 27, 2015 at 5:59 AM, ジョハンガル <gall.jo...@linecorp.com> wrote:

> Hello,
>
> I would really appreciate help with our authentication problem.
>
> We have developed an orchestration platform for internal cloud needs using
> RIAK KV 2.1.1 (our first deployment of the technology).
> Up to now we were only using raw HTTP, but for security purposes we have
> been switching to TLS v1.2 with Protocol Buffer clients (standard riak java
> client).
>
> At first everything works smoothly.
>
> Then eventually (after a few hours), with a load of about 10-15
> authentications per second, our cluster CPU usage starts to slowly ramp up
> to saturation until the data store turns unresponsive. (Since we cannot
> authenticate, there are no other requests.)
>
> Our cluster is currently composed of two 24-core Xeon machines with 64GB of
> RAM each and bonded 2b1Gbps NICs, running on standard updated CentOS 6.6.
> RAM consumption doesn't go over 10%.
> We are currently storing a few megabytes of data at most.
>
> At first:
>   curl -vvv -u **:** https://****
> will do the first 2 steps of SSL authentication, CLIENT HELLO and SERVER
> HELLO, and the 3rd message (client receiving CERTIFICATE from server) will
> get slower and slower.
> Then the SSL handshake in the riak java client will simply time out.
>
> Reading the logs and combining with:
> https://github.com/basho/riak_api/blob/develop/src/riak_api_pb_server.erl
> tells me that ssl:ssl_accept never returns (well, it eventually returns
> with Reason as the atom closed, seemingly because the client times out and
> closes the connection).
>
> supervisor:which_children(whereis(riak_api_pb_sup)) gives me a count of
> ~7000 processes.
> etop refuses to start.
>
> Have you experienced anything similar?
>
> As for our riak.conf configuration:
>
> ## Acceptable values:
> ## - an integer
> erlang.async_threads = 64
> ring_size = 32
> .. ssl things setup ..
> storage_backend = multi
> ###
> multi_backend.default = bitcask_99
> ### 1h - ephemeral - no safety
> multi_backend.bitcask_1h.storage_backend = bitcask
> multi_backend.bitcask_1h.bitcask.expiry = 1h
> multi_backend.bitcask_1h.bitcask.expiry.grace_time = 1h
> multi_backend.bitcask_1h.bitcask.data_root = $(platform_data_dir)/bitcask_1h
> multi_backend.bitcask_1h.bitcask.max_file_size = 2GB
> multi_backend.bitcask_1h.bitcask.merge.thresholds.fragmentation = 99
> ### 3d - ephemeral - a weekend to restart data generator before auto expiry
> multi_backend.bitcask_3d.storage_backend = bitcask
> multi_backend.bitcask_3d.bitcask.expiry = 3d
> multi_backend.bitcask_3d.bitcask.expiry.grace_time = 1h
> multi_backend.bitcask_3d.bitcask.data_root = $(platform_data_dir)/bitcask_3d
> multi_backend.bitcask_3d.bitcask.max_file_size = 2GB
> multi_backend.bitcask_3d.bitcask.merge.thresholds.fragmentation = 99
> ### 3m - long term logs
> multi_backend.bitcask_3m.storage_backend = bitcask
> multi_backend.bitcask_3m.bitcask.expiry = 3m
> multi_backend.bitcask_3m.bitcask.expiry.grace_time = 1h
> multi_backend.bitcask_3m.bitcask.data_root = $(platform_data_dir)/bitcask_3m
> multi_backend.bitcask_3m.bitcask.max_file_size = 2GB
> multi_backend.bitcask_3m.bitcask.merge.thresholds.fragmentation = 45
> ### persistent - low amount of data
> multi_backend.bitcask_99.storage_backend = bitcask
> multi_backend.bitcask_99.bitcask.expiry = off
> multi_backend.bitcask_99.bitcask.data_root = $(platform_data_dir)/bitcask_99
> multi_backend.bitcask_99.bitcask.max_file_size = 128MB
> multi_backend.bitcask_99.bitcask.merge.thresholds.fragmentation = 20
>
> ### SECURITY RELATED CUSTOM CONF ###
>
> tls_protocols.sslv3 = off
> tls_protocols.tlsv1 = off
> tls_protocols.tlsv1.1 = on
> tls_protocols.tlsv1.2 = on
>
> secure_referer_check = on
>
> honor_cipher_order = on
> -----------------------------------------
> riak-admin security ciphers
> ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA256:ECDH-RSA-AES256-SHA384:ECDH-ECDSA-AES256-SHA384:AES256-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:DHE-RSA-AES128-SHA256:DHE-DSS-AES128-SHA256:ECDH-RSA-AES128-SHA256:ECDH-ECDSA-AES128-SHA256:AES128-SHA256
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com