Great! Thanks for your help in verifying this issue. I'll look into getting a patch backported to Basho's Erlang fork.
Just to verify, what is the exact version of Erlang you are running? If it looks like it will take a while to get a patched version of Erlang vetted and built, I could build a quick temporary patch for you to try installing on your system, which may fix the problem in the short term.

Thanks again,
Nick

On Tue, Jul 28, 2015 at 12:53 AM, ジョハンガル <gall.jo...@linecorp.com> wrote:

> Hello!
>
> lists:keyfind(size, 1, ets:info(element(2, lists:keyfind(ssl_otp_cacertificate_db, 1, [{ets:info(T, name), T} || T <- ets:all()])))).
>
> returns:
>
> {size,996816}
>
> and it keeps growing and growing.
>
> So I think we hit our culprit!
>
> -----Original Message-----
> From: "Nick Marino" <nmar...@basho.com>
> To: "ジョハンガル" <gall.jo...@linecorp.com>
> Cc: "riak-users" <riak-users@lists.basho.com>
> Sent: 2015-07-28 (Tue) 02:37:26
> Subject: Re: fresh riak 2.1.1 install systematically slowing down and
> crashing in ~1day
>
> Hi,
>
> I have a strong suspicion that you're encountering a resource leak bug in
> the Erlang SSL libraries that Riak uses. By odd coincidence, I ran into a
> very similar issue at my last job working on a completely different
> project, and I helped develop a fix a couple of years ago. The patch was
> accepted somewhere in the Erlang R17 timeframe, but Riak doesn't support
> R17 yet, so you probably don't have a version of Erlang with this
> particular fix in place.
>
> To check whether this resource leak is being hit, you can attach an Erlang
> shell to a node using the "riak attach" command and copy/paste this one
> line of code:
>
> lists:keyfind(size, 1, ets:info(element(2, lists:keyfind(ssl_otp_cacertificate_db, 1, [{ets:info(T, name), T} || T <- ets:all()])))).
>
> Running the above command should give you something that looks like this:
>
> {size, 0}
>
> You will likely see a number larger than 0, but the general format should
> be the same.
> In normal usage, this number should be fairly small, but in your case you
> may see it continue to grow larger and larger over time (specifically, you
> may see it incrementing by one for each new incoming SSL connection, and
> never decrementing, even after connections are closed). If this value
> starts to get up into the tens or hundreds of thousands or more,
> establishing new SSL connections will start to get slower and slower, much
> like you're seeing.
>
> If you can verify that this size value continues to grow over time, we can
> take a look at backporting the relevant fix to our custom Basho fork of
> Erlang. Let me know what you find, and we can take it from there.
>
> Thanks!
> Nick
>
> On Mon, Jul 27, 2015 at 5:59 AM, ジョハンガル <gall.jo...@linecorp.com> wrote:
>
> Hello,
>
> I would really appreciate help with our authentication problem.
>
> We have developed an orchestration platform for internal cloud needs using
> Riak KV 2.1.1 (our first deployment of the technology).
> Up to now we were only using raw HTTP, but for security purposes we have
> been switching to TLS v1.2 with Protocol Buffers clients (the standard Riak
> Java client).
>
> At first everything works smoothly.
>
> Then eventually (after a few hours), with a load of about 10-15
> authentications per second, our cluster CPU usage starts to slowly ramp up
> to saturation until the data store turns unresponsive. (Since we cannot
> authenticate, there are no other requests.)
>
> Our cluster is currently composed of 2 24-core Xeon machines with 64GB of
> RAM each and bonded 2b1Gbps NICs, running on standard updated CentOS 6.6.
> RAM consumption doesn't go over 10%.
> We are currently storing a few megabytes of data at most.
>
> At first:
> curl -vvv -u **:** https://****
> will do the first 2 steps of the SSL handshake, CLIENT HELLO and SERVER
> HELLO, and the 3rd message (client receiving CERTIFICATE from server) will
> get slower and slower and slower.
> Then the SSL handshake in the Riak Java client will simply time out.
>
> Reading the logs and combining with:
> https://github.com/basho/riak_api/blob/develop/src/riak_api_pb_server.erl
> tells me that ssl:ssl_accept never returns (well, it eventually returns
> with Reason as the atom closed, seemingly because the client times out and
> closes the connection).
>
> supervisor:which_children(whereis(riak_api_pb_sup)) gives me a count of
> ~7000 processes.
> etop refuses to start.
>
> Have you experienced anything similar?
>
> As for our riak.conf configuration:
>
> ## Acceptable values:
> ##   - an integer
> erlang.async_threads = 64
> ring_size = 32
> .. ssl things setup ..
> storage_backend = multi
> ###
> multi_backend.default = bitcask_99
> ### 1h - ephemeral - no safety
> multi_backend.bitcask_1h.storage_backend = bitcask
> multi_backend.bitcask_1h.bitcask.expiry = 1h
> multi_backend.bitcask_1h.bitcask.expiry.grace_time = 1h
> multi_backend.bitcask_1h.bitcask.data_root = $(platform_data_dir)/bitcask_1h
> multi_backend.bitcask_1h.bitcask.max_file_size = 2GB
> multi_backend.bitcask_1h.bitcask.merge.thresholds.fragmentation = 99
> ### 3d - ephemeral - a weekend to restart data generator before auto expiry
> multi_backend.bitcask_3d.storage_backend = bitcask
> multi_backend.bitcask_3d.bitcask.expiry = 3d
> multi_backend.bitcask_3d.bitcask.expiry.grace_time = 1h
> multi_backend.bitcask_3d.bitcask.data_root = $(platform_data_dir)/bitcask_3d
> multi_backend.bitcask_3d.bitcask.max_file_size = 2GB
> multi_backend.bitcask_3d.bitcask.merge.thresholds.fragmentation = 99
> ### 3m - long term logs
> multi_backend.bitcask_3m.storage_backend = bitcask
> multi_backend.bitcask_3m.bitcask.expiry = 3m
> multi_backend.bitcask_3m.bitcask.expiry.grace_time = 1h
> multi_backend.bitcask_3m.bitcask.data_root = $(platform_data_dir)/bitcask_3m
> multi_backend.bitcask_3m.bitcask.max_file_size = 2GB
> multi_backend.bitcask_3m.bitcask.merge.thresholds.fragmentation = 45
> ### persistent - low amount of data
> multi_backend.bitcask_99.storage_backend = bitcask
> multi_backend.bitcask_99.bitcask.expiry = off
> multi_backend.bitcask_99.bitcask.data_root = $(platform_data_dir)/bitcask_99
> multi_backend.bitcask_99.bitcask.max_file_size = 128MB
> multi_backend.bitcask_99.bitcask.merge.thresholds.fragmentation = 20
>
> ### SECURITY RELATED CUSTOM CONF ###
>
> tls_protocols.sslv3 = off
> tls_protocols.tlsv1 = off
> tls_protocols.tlsv1.1 = on
> tls_protocols.tlsv1.2 = on
>
> secure_referer_check = on
>
> honor_cipher_order = on
> -----------------------------------------
> riak-admin security ciphers
> ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA256:ECDH-RSA-AES256-SHA384:ECDH-ECDSA-AES256-SHA384:AES256-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:DHE-RSA-AES128-SHA256:DHE-DSS-AES128-SHA256:ECDH-RSA-AES128-SHA256:ECDH-ECDSA-AES128-SHA256:AES128-SHA256
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
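
The size check quoted above can also be run as a small sampling loop, which makes the growth easier to see than repeating the one-liner by hand. This is only a sketch to paste into the "riak attach" shell: the names CaCertSize and Sample are made up for illustration, and it assumes the same ssl_otp_cacertificate_db table name used in the one-liner.

```erlang
%% Find the SSL CA certificate table by name and return its size.
%% Returns undefined if no table with that name exists on this node.
CaCertSize = fun() ->
    Tabs = [T || T <- ets:all(),
                 ets:info(T, name) =:= ssl_otp_cacertificate_db],
    case Tabs of
        [Tab | _] -> ets:info(Tab, size);
        []        -> undefined
    end
end.

%% Take N samples, sleeping IntervalMs before each one, and return
%% them as a list in the order they were taken.
Sample = fun(N, IntervalMs) ->
    [begin timer:sleep(IntervalMs), CaCertSize() end
     || _ <- lists:seq(1, N)]
end.
```

For example, Sample(6, 10000). collects about a minute of samples at ten-second intervals. A list that only ever increases while clients connect and disconnect would be consistent with the leak described above.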