We have a 5-node Riak cluster and we're having problems keeping the HTTPS
listener running properly. The problem typically manifests itself a few
hours after Riak is started. When it happens, the HTTPS listener on a Riak
node will accept new connections but will never respond to them.
Connections made via curl or OpenSSL's s_client show the client sending the
SSL hello but never getting a response. When this happens, the OS does show
pending data for the socket that isn't being processed (trimmed output):

# ss -lt
State      Recv-Q Send-Q    Local Address:Port    Peer Address:Port
LISTEN     129    128        1.2.3.4:8098         *:*
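
For anyone who wants to catch this state automatically, here is a minimal Python sketch that scans `ss -lt` output for listeners whose accept queue is full. It assumes the usual Linux behaviour where, for LISTEN sockets, Recv-Q is the number of connections waiting to be accepted and Send-Q is the backlog limit. The function name is hypothetical and the sample text is just the trimmed output above.

```python
def find_saturated_listeners(ss_output):
    """Return (local_address, recv_q, send_q) for LISTEN sockets
    whose accept queue has reached the backlog limit.

    Assumption: for LISTEN sockets, Recv-Q is the current accept
    queue length and Send-Q is the configured backlog (Linux ss).
    """
    saturated = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # not a data row
        if send_q > 0 and recv_q >= send_q:
            saturated.append((fields[3], recv_q, send_q))
    return saturated


# Sample taken from the trimmed output in this post.
SAMPLE = """State      Recv-Q Send-Q    Local Address:Port    Peer Address:Port
LISTEN     129    128        1.2.3.4:8098         *:*
"""

print(find_saturated_listeners(SAMPLE))
# [('1.2.3.4:8098', 129, 128)]
```

On a live node the input would come from running `ss -lt` instead of the sample string.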

One of the times the Erlang VM was in this state, I grabbed a crash dump via
SIGUSR1. The Mochiweb process shows up in a "Waiting" state:

=proc:<0.190.0>
State: Waiting
Name: 'https_1.2.3.4:8098_mochiweb'
Spawned as: proc_lib:init_p/5
Spawned by: <0.150.0>
Started: Thu Apr 17 21:01:12 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [#Port<0.3873>, <0.4203.0>, <0.150.0>, <0.5788.48>,
<0.12559.49>, <0.4819.45>, <0.17031.51>, <0.19186.51>, <0.18428.51>,
<0.25106.51>, <0.20568.51>, <0.16399.51>, <0.25307.51>, <0.25382.51>,
<0.31884.51>, <0.30289.51>, <0.29247.51>, <0.25168.51>]
Reductions: 50203
Stack+heap: 1597
OldHeap: 0
Heap unused: 495
OldHeap unused: 0
Program counter: 0x00007fb03fea4de8 (gen_server:loop/6 + 264)
CP: 0x0000000000000000 (invalid)
arity = 0

All the processes linked from the main Mochiweb process are also in a
"Waiting" state. If I connect to the riak console and manually kill the
mochiweb process (via exit(pid(...), kill)), its supervisor restarts it and
the node starts servicing HTTPS requests again.
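
The symptom described here (connection accepted, SSL hello sent, no reply) can be probed from outside the VM. Below is a rough Python sketch of such a probe, not something we run in production: the function name, the state labels and the default timeout are all hypothetical, and certificate verification is deliberately disabled because the only question is whether the handshake completes.

```python
import socket
import ssl


def https_listener_state(host, port, timeout=5.0):
    """Classify a TLS listener as 'ok', 'hung', 'refused' or 'error'.

    'hung' means the TCP connect or the TLS handshake did not
    finish within `timeout` seconds, which matches the symptom of
    a listener that accepts connections but never answers.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "hung"      # no answer at the TCP level
    except OSError:
        return "refused"   # nothing listening, or unreachable

    # Assumption: liveness check only, so skip certificate checks.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        # The handshake runs inside wrap_socket and inherits the
        # socket timeout set by create_connection.
        with ctx.wrap_socket(sock):
            return "ok"
    except socket.timeout:
        return "hung"      # hello sent, no reply
    except (ssl.SSLError, OSError):
        return "error"     # handshake failed for another reason
    finally:
        sock.close()
```

Calling `https_listener_state("1.2.3.4", 8098)` on the affected node should return "hung" while the listener is stuck and "ok" after the mochiweb process has been restarted.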

We do have the Erlang cluster behind haproxy, but the SSL connections hang
even if you try to connect locally from the machine running the Riak
service. We're using a lightly modified version of the config suggested in
the docs (
http://docs.basho.com/riak/1.3.1/cookbooks/Load-Balancing-and-Proxy-Configuration/)
with a much lower max connections setting. When the hangs happen, netstat
only shows a handful of open connections to the haproxy front end.

It's also worth pointing out that when the hangs happen, no messages
indicating any errors show up in the log files. The rest of the services
on the Riak node don't appear to be affected either - we still get periodic
anti-entropy exchange log messages and all the usual suspects in
riak-admin status check out.

We are using a pretty standard OS configuration - Ubuntu 12.04 LTS with the
Basho apt repo, riak 1.4.8-1, and the erts-5.9.1 that comes bundled with the
Riak packages.

Are there any known issues with accessing Riak over its HTTPS interface or
any known problems with erts' SSL implementation? As of now we're forced to
use periodic rolling restarts on the nodes in our production cluster to
keep the HTTPS listeners functional, which is a pretty disgusting
workaround.

Thanks for taking the time to read this. I'd appreciate any insight or
guidance on how to address/track down this problem.

-Adam Leko
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
