Hi,
We've been seeing some issues with Riak CS for a while in a specific
situation. Maybe you can advise if we're doing something wrong?

Our setup has redundant haproxy instances in front of a cluster of Riak
nodes, for both HTTP and PBC. The haproxy instances share a floating IP
address.
Only one haproxy instance holds the IP at a time; if it goes down, another
takes it over.

Our Riak CS nodes are configured to talk to the haproxy on that floating IP.

The problem occurs if the floating IP moves from one haproxy to another.

Suddenly we see a flurry of errors in the Riak CS log files.

This is presumably because Riak CS was holding open TCP connections, and the
new haproxy instance doesn't know anything about them, so they get reset
(TCP RST) and shut down.

The problem is that Riak CS doesn't try to reconnect and retry immediately;
instead it just throws a 503 error back to the client. The client then
retries, but Riak CS has a pool of a couple of hundred connections to cycle
through, all of which throw the same error!
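
To illustrate the behaviour we were hoping for, here is a minimal sketch in
Python. It is not how Riak CS is implemented (Riak CS is written in Erlang),
and every name in it (FakeConnection, RetryingPool, connect) is hypothetical.
It assumes that one reset connection after a failover means the rest of the
pool is dead too, so it drops all idle connections and retries once on a
fresh one. A blind retry like this is only safe for requests that can be
repeated without side effects.

```python
class FakeConnection:
    """Hypothetical stand-in for one pooled PBC connection."""

    def __init__(self, alive=True):
        self.alive = alive

    def request(self, payload):
        # A connection that predates the failover gets reset by the new proxy.
        if not self.alive:
            raise ConnectionResetError("connection reset by peer")
        return f"ok:{payload}"


class RetryingPool:
    """Illustrative pool: flush stale connections and retry once."""

    def __init__(self, connections, connect):
        self._idle = list(connections)  # connections opened earlier
        self._connect = connect         # factory that opens a fresh connection

    @property
    def idle_count(self):
        return len(self._idle)

    def request(self, payload):
        conn = self._idle.pop() if self._idle else self._connect()
        try:
            result = conn.request(payload)
        except (ConnectionResetError, BrokenPipeError):
            # Assumption: one reset means the whole pool is stale, so drop
            # every idle connection instead of failing once per connection.
            self._idle.clear()
            conn = self._connect()
            # A second failure propagates to the caller (e.g. as a 503).
            result = conn.request(payload)
        self._idle.append(conn)
        return result


# Simulate a failover: three pooled connections that are all dead.
pool = RetryingPool(
    [FakeConnection(alive=False) for _ in range(3)],
    connect=lambda: FakeConnection(alive=True),
)
first = pool.request("get")
second = pool.request("get")
```

With this approach the first request after a failover pays for one
reconnect, rather than the client seeing one error per pooled connection.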

Does this sound like a plausible explanation of the fault?
Do you have any ways to mitigate this issue in Riak CS when using TCP load
balancing above Riak PBC?

Toby
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com