Hi guys, I've switched our configuration around so that Riak CS now talks directly to Riak on 127.0.0.1:8087 instead of going via haproxy.
We have immediately re-encountered the problems that caused us to move to haproxy in the first place. On start-up, Riak takes slightly longer than Riak CS to get ready, and so riak-cs logs the following and then exits. Restarting riak-cs again (so now 15 seconds after Riak started) results in a successful start-up, but obviously this is really annoying for our ops guys to have to remember to do after restarting Riak or rebooting a machine. How do other people avoid this issue in production? (One wait-for-Riak approach is sketched below, after the quoted reply.)

```
2017-01-20 12:23:12.937 [warning] <0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.users bucket settings (disconnected).
2017-01-20 12:23:12.937 [warning] <0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.access bucket settings (disconnected).
2017-01-20 12:23:12.937 [warning] <0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.storage bucket settings (disconnected).
2017-01-20 12:23:12.937 [warning] <0.150.0>@riak_cs_app:check_bucket_props:187 Unable to verify moss.buckets bucket settings (disconnected).
2017-01-20 12:23:12.937 [error] <0.150.0>@riak_cs_app:sanity_check:125 Could not verify bucket properties. Error was disconnected.
2017-01-20 12:23:12.938 [error] <0.149.0> CRASH REPORT Process <0.149.0> with 0 neighbours exited with reason: {error_verifying_props,{riak_cs_app,start,[normal,[]]}} in application_master:init/4 line 133
2017-01-20 12:23:12.938 [info] <0.7.0> Application riak_cs exited with reason: {error_verifying_props,{riak_cs_app,start,[normal,[]]}}
```

On Fri, 6 Jan 2017 at 12:33 Toby Corkindale <t...@dryft.net> wrote:

> Hi Shaun,
> We've been running Riak CS since its early days, so it's possible best practice has changed, but I'm sure at some point it was suggested to put haproxy between CS and KV to guard against the start-up race condition, individual KV losses, and brief KV restarts.
> I'm sure we used to have continual issues with CS being dead on nodes before we moved to the haproxy solution. That was probably on Debian Squeeze, though; these days we're on Ubuntu LTS, so if CS is launched from Upstart it can at least retry starting, whereas on old-school init systems it just gets one attempt and then dies.
>
> Moving on, though:
>
> Even if it's not the majority practice, shouldn't CS still be able to withstand dropping and reconnecting its protocol buffer TCP connections?
>
> CS still has a problem with not handling the case when its long-standing idle PBC connections to KV get reset, regardless of whether that's because the local KV process is restarted or because we've failed over to a new haproxy.
> The errors get pushed back to the S3 client software, but even if they retry, they get repeated errors because, I think, CS has such a large pool of PBC connections. You have to work through a large portion of this pool before you finally get to one that's reconnected and is good.
>
> In our case, the pool size is multiplied by the number of CS instances, so it's quite a large number.
> Most client software has retry limits built in, at much lower values.
>
> While it will come good eventually, there's a significant period of time where everything fails, all our monitoring goes red, etc., which we'd like to avoid!
>
> I'm surprised this problem doesn't come up more for other users; I don't feel like we're running at a large scale, but maybe we're using a more dynamic architecture than major users.
>
> Toby
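For the start-up ordering question above, one workaround is to make whatever launches riak-cs wait until Riak's protocol buffers port is actually accepting connections before starting CS. Below is a minimal sketch of that idea in Python; the host, port, timeout, and `riak-cs start` command are assumptions based on the configuration described in this thread, so treat it as an illustration rather than a tested tool.

```python
#!/usr/bin/env python3
"""Start riak-cs only once Riak's protocol buffers port accepts connections.

Sketch only: the host/port come from the 127.0.0.1:8087 setup described
above, and "riak-cs start" is assumed to be the normal start command.
"""
import socket
import subprocess
import sys
import time

RIAK_PB_HOST = "127.0.0.1"  # assumed: local Riak KV node, as in the config above
RIAK_PB_PORT = 8087         # assumed: Riak protocol buffers port
WAIT_LIMIT_SECONDS = 120    # how long to wait for Riak before giving up
POLL_INTERVAL_SECONDS = 2


def riak_pb_reachable(host, port):
    """Return True if a TCP connection to Riak's PB port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False


def main():
    deadline = time.time() + WAIT_LIMIT_SECONDS
    while time.time() < deadline:
        if riak_pb_reachable(RIAK_PB_HOST, RIAK_PB_PORT):
            # Riak is listening; hand over to the normal riak-cs start-up.
            sys.exit(subprocess.call(["riak-cs", "start"]))
        time.sleep(POLL_INTERVAL_SECONDS)
    sys.stderr.write("Riak PB port never became reachable; not starting riak-cs\n")
    sys.exit(1)


if __name__ == "__main__":
    main()
```

The same check translates naturally into an Upstart pre-start script or a systemd ExecStartPre= step, so ops don't have to remember a second manual restart; running `riak-admin wait-for-service riak_kv <nodename>` against the local node before starting riak-cs should achieve much the same thing.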
> On Thu, 5 Jan 2017 at 20:04 Shaun McVey <smc...@basho.com> wrote:
>
> Hi Toby,
>
> > I thought that it was recommended AGAINST talking to a co-located Riak on the same host?
>
> I'm not sure where you heard that from, but that's not the case. We do discourage running other parts of an application on the same hosts, such as client front-ends, for example. Off the top of my head (which means there's probably an exception to the rule), all our customers have nodes set up in the way Magnus described: one CS instance talking locally to its KV instance directly on the same node. The load balancing comes between the CS node and the client.
>
> Riak shouldn't take a particularly long time to start at all. We have customers that have terabytes of data per node, and a KV node can be restarted in just a minute or two. As long as you have valid bitcask hint files in place (which requires a proper shutdown beforehand), a node should come up quickly. If you have nodes that you feel are taking a particularly long time to start up, that may be a symptom of another issue unrelated to this discussion.
>
> If, for any reason, you need to shut down KV, you would then just remove the CS node from the HAProxy configuration so it doesn't deal with any requests. The other CS nodes then take the additional load. There shouldn't be a need to restart CS if you remove it from the load balancer. Having said that, you shouldn't have to worry about restarting CS as far as I'm aware. You might see failures if KV is down, but once it's up and running again, CS will continue to deal with new requests without problems. Any failures to connect to its KV node should be passed to the client/front-end, which should have all the proper logic for re-attempts or error reporting.
>
> > I'm surprised more people with highly-available Riak CS installations haven't hit the same issues.
>
> As I mentioned, our customers go with the setup Magnus described. I can't speak for setups like yours as I've not seen them in the wild.
>
> Kind Regards,
> Shaun
>
> On Wed, Jan 4, 2017 at 11:26 PM, Toby Corkindale <t...@dryft.net> wrote:
>
> Hi Magnus,
> I thought that it was recommended AGAINST talking to a co-located Riak on the same host?
> The reason being, the local Riak will take longer to start up than Riak CS once you have a sizeable amount of data. This means Riak CS starts up, fails to connect to Riak, and exits.
> You also end up in a situation where you must always restart Riak CS if you restart the co-located Riak. (Otherwise the Riak CS PBC connections suffer the same problem as I described in my earlier email, where Riak CS doesn't realise it needs to reconnect them and returns errors.)
>
> Putting haproxy between Riak CS and Riak solved the problem of needing the local Riak to be started first.
> But it seems we were just putting off the core problem rather than solving it, i.e. that Riak CS doesn't understand it needs to re-connect and retry.
>
> I'm surprised more people with highly-available Riak CS installations haven't hit the same issues.
>
> Toby
>
> On Wed, 4 Jan 2017 at 21:42 Magnus Kessler <mkess...@basho.com> wrote:
>
> Hi Toby,
>
> As far as I know, Riak CS has none of the more advanced retry capabilities that Riak KV has. However, in the design of CS there seems to be an assumption that a CS instance will talk to a co-located KV node on the same host.
> To achieve high availability, HAProxy is often deployed in front of the CS nodes in CS deployments. Could you please let me know if this is an option for your setup?
>
> Kind Regards,
>
> Magnus
>
> On 4 January 2017 at 01:04, Toby Corkindale <t...@dryft.net> wrote:
>
> Hello all,
> Now that we're all back from the end-of-year holidays, I'd like to bump this question up.
> I feel like this has been a long-standing problem with Riak CS not handling dropped TCP connections.
> Last time the cause was haproxy dropping idle TCP connections after too long, but we solved that at the haproxy end.
>
> This time it's harder: we're failing over to a different Riak backend, so the TCP connections between Riak CS and Riak PBC *have* to go down, but Riak CS just doesn't handle it well at all.
>
> Is there a trick to configuring it better?
>
> Thanks
> Toby
>
> On Thu, 22 Dec 2016 at 16:48 Toby Corkindale <t...@dryft.net> wrote:
>
> Hi,
> We've been seeing some issues with Riak CS for a while in a specific situation. Maybe you can advise if we're doing something wrong?
>
> Our setup has redundant haproxy instances in front of a cluster of Riak nodes, for both HTTP and PBC. The haproxy instances share a floating IP address.
> Only one node holds the IP at a time, but if it goes down, another takes it up.
>
> Our Riak CS nodes are configured to talk to the haproxy on that floating IP.
>
> The problem occurs if the floating IP moves from one haproxy to another. Suddenly we see a flurry of errors in the riak-cs log files. This is presumably because riak-cs was holding open TCP connections, and the new haproxy instance doesn't know anything about them, so they get a TCP reset and are shut down.
>
> The problem is that riak-cs doesn't try to reconnect and retry immediately; instead it just throws a 503 error back to the client, who then retries, but Riak CS has a pool of a couple of hundred connections to cycle through, all of which throw the error!
>
> Does this sound like a likely description of the fault?
> Do you have any ways to mitigate this issue in Riak CS when using TCP load balancing above Riak PBC?
>
> Toby
>
> --
> Magnus Kessler
> Client Services Engineer
> Basho Technologies Limited
>
> Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431
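On the client-side retry point raised earlier in the thread (most S3 clients give up after far fewer retries than the number of stale connections in the CS pool), one stop-gap is to raise the client's retry budget and keep backoff enabled, so requests outlast the window in which CS is still cycling through dead PBC connections. Here is a rough sketch using boto3; the endpoint, credentials, and retry numbers are placeholders rather than values from this thread, and this only masks the symptom rather than fixing CS's reconnection behaviour.

```python
# Stop-gap on the S3 client side: enough retries, with backoff, to outlast
# the period in which Riak CS is still returning 503s from stale PBC
# connections. All endpoint/credential values below are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="http://riak-cs.example.internal:8080",  # hypothetical CS endpoint
    aws_access_key_id="CS-ACCESS-KEY",                    # placeholder credentials
    aws_secret_access_key="CS-SECRET-KEY",
    config=Config(
        retries={"max_attempts": 20, "mode": "standard"},  # default is much lower
        s3={"addressing_style": "path"},                   # Riak CS is typically addressed path-style
    ),
)

# With "standard" retry mode, 5xx responses are retried with exponential
# backoff before an error is surfaced to the application.
print(s3.list_buckets())
```

The same idea applies to whichever client library is actually in use; the retry budget needs to be large enough to out-wait the stale portion of the connection pool, which is why a large pool multiplied across CS instances makes the client defaults insufficient.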
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com