We have a 5-node Riak cluster and we're having problems keeping the HTTPS listener running properly. The problem typically manifests itself a few hours after Riak is started. When it happens, the HTTPS listener on a Riak node will accept new connections but will never respond to them. Connections made via curl or OpenSSL's s_client show the client sending the SSL hello but never getting a response. When this happens, the OS does show pending data for the socket that isn't being processed (trimmed output):

# ss -lt
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
LISTEN     129    128               1.2.3.4:8098                     *:*

One of the times the Erlang VM was in this state I grabbed a crash dump via SIGUSR1. The Mochiweb process shows up in a "Waiting" state:

=proc:<0.190.0>
State: Waiting
Name: 'https_1.2.3.4:8098_mochiweb'
Spawned as: proc_lib:init_p/5
Spawned by: <0.150.0>
Started: Thu Apr 17 21:01:12 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [#Port<0.3873>, <0.4203.0>, <0.150.0>, <0.5788.48>, <0.12559.49>, <0.4819.45>, <0.17031.51>, <0.19186.51>, <0.18428.51>, <0.25106.51>, <0.20568.51>, <0.16399.51>, <0.25307.51>, <0.25382.51>, <0.31884.51>, <0.30289.51>, <0.29247.51>, <0.25168.51>]
Reductions: 50203
Stack+heap: 1597
OldHeap: 0
Heap unused: 495
OldHeap unused: 0
Program counter: 0x00007fb03fea4de8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0

All the processes linked from the main Mochiweb process are also in a "Waiting" state. If I connect to the riak console and manually kill the mochiweb process (via exit(pid(...), kill)), its supervisor restarts it and the node starts servicing HTTPS requests again.

We do have the Erlang cluster behind haproxy, but the SSL connections hang even if you try to connect locally from the machine running the Riak service. We're using a lightly modified version of the config suggested in the docs (http://docs.basho.com/riak/1.3.1/cookbooks/Load-Balancing-and-Proxy-Configuration/) with a much lower max connections setting. When the hangs happen, netstat only shows a handful of open connections to the haproxy front end.

It's also worth pointing out that when the hangs happen, no messages show up in the log files that indicate any errors. The rest of the services on the Riak node don't appear to be affected either - we still get periodic anti-entropy exchange log messages and all the usual suspects in riak-admin status check out.

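For anyone who wants to spot this state automatically rather than by eye, here is a minimal sketch that scans `ss -lt` output for listening sockets in the condition shown above. It assumes the column order from the trimmed output (State, Recv-Q, Send-Q, local address, peer address) and relies on the Linux convention that, for a LISTEN socket, Recv-Q is the current accept queue length and Send-Q is the configured backlog. The function name `full_listen_queues` and the "Recv-Q greater than Send-Q" threshold are illustrative choices, not anything Riak or ss provides.

```python
def full_listen_queues(ss_output):
    """Return (local_address, recv_q, send_q) for LISTEN sockets whose
    accept queue has grown past the configured backlog.

    Assumes the `ss -lt` column order:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    stuck = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line, blank lines, and non-listening sockets.
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # unexpected layout; ignore the line
        # Assumed threshold: more pending connections than the backlog
        # allows, as in the 129 vs. 128 case reported above.
        if recv_q > send_q:
            stuck.append((fields[3], recv_q, send_q))
    return stuck


if __name__ == "__main__":
    import subprocess

    # Requires the `ss` utility (iproute2) on the local machine.
    output = subprocess.check_output(["ss", "-lt"]).decode()
    for address, recv_q, send_q in full_listen_queues(output):
        print("accept queue full on %s (Recv-Q %d, Send-Q %d)"
              % (address, recv_q, send_q))
```

A check like this could feed a monitoring alert, which would at least make the hang visible as soon as it starts instead of waiting for clients to time out.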
We are using a pretty standard OS configuration - Ubuntu 12.04 LTS with the Basho apt repo, riak 1.4.8-1, and the erts-5.9.1 that comes bundled with the Riak packages.

Are there any known issues with accessing Riak over its HTTPS interface, or any known problems with erts' SSL implementation? As of now we're forced to use periodic rolling restarts on the nodes in our production cluster to keep the HTTPS listeners functional, which is a pretty disgusting workaround.

Thanks for taking the time to read this. I'd appreciate any insight or guidance on how to address/track down this problem.

-Adam Leko