Ah hah!
Coming back to this problem after a month seems to have jogged my
thoughts. I've figured it out.

The Basho guide to haproxy for Riak CS specifies the following in its
example configuration:
    timeout client    5000
    timeout server    5000

Those values are in milliseconds, so if the client or server does not send
any data over the connection for 5000 milliseconds (5 seconds), haproxy
considers it dead and closes the connection.

On a moderately well loaded production system, you probably have enough
data coming and going to keep those connections alive, but on a testing
instance the connections sit idle for longer than five seconds and get
closed down all the time.

My fix is to add these lines:
    timeout tunnel 7d
    timeout client-fin 30s
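
For context, here is a rough sketch of where those two lines could sit.
The defaults section and the "mode tcp" line are my assumptions about the
layout, not copied from the Basho example, so adjust them to match your own
haproxy.cfg. As far as I know, both new directives need haproxy 1.5 or
later.

    defaults
        mode tcp                    # assumption: Riak PB traffic is plain TCP
        timeout client      5000    # from the example config, in milliseconds
        timeout server      5000    # from the example config, in milliseconds
        timeout tunnel      7d      # added: idle limit once the tunnel is up
        timeout client-fin  30s     # added: limit for half-closed client side

As I read the haproxy documentation, timeout tunnel takes over from the
client and server timeouts once a TCP connection is established in both
directions, which is why idle pooled connections should then survive. The
file can be syntax-checked with "haproxy -c -f haproxy.cfg" before
reloading.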

But I'd appreciate your thoughts too.

-Toby



On Mon, 6 Jul 2015 at 13:13 Toby Corkindale <t...@dryft.net> wrote:

> Hi Matt,
> I've tested against both haproxy 1.4 and haproxy 1.5, both Riak 2.0.5 and
> 2.1.1, and Riak CS 2.0.1 and Stanchion 2.0.0.
> I've tested both KVM and LXC virtualisation.
>
> In all combinations I tested, the problem persists. If Riak CS connects to
> Riak via haproxy, then Riak CS frequently loses the connection and returns
> a failure to the s3 client.
>
> The failures show up very quickly, if you want to try and replicate this.
> I'll upload the configurations for you. They're for a three-node cluster,
> which is obviously too small for production, but should be the minimum to
> demonstrate this bug, right? I'm sure it'd manifest on a four or five node
> cluster too.
>
> I run this command:
>     perl mkfiles.pl; s3cmd -c s3cfg sync *.txt s3://test/
> (mkfiles.pl just creates 100 small unique files by writing a bit of junk
> and an index value to them.)
>
> If Riak-CS is talking to the haproxy port instead of directly to Riak,
> it'll fail partway through the sync. Every time. The number of files it
> gets through varies between 1 and 40ish, I'd say, but I haven't actually
> kept track.
>
> Configuration files and test script at:
> https://www.dropbox.com/s/cnt9y0fegrb42fb/haproxy-riak-cs-issue.zip?dl=0
>
> Toby
>
> On Sat, 4 Jul 2015 at 01:30 Matthew Brender <mbren...@basho.com> wrote:
>
>> Hey Toby,
>>
>> Did you find anything further during this testing? I'd love to make sure
>> others on riak-user know how to configure local testing to prevent this
>> situation.
>>
>> Cheers,
>> Matt
>>
>>
>> *Matt Brender | Developer Advocacy Lead*
>> Basho Technologies
>> t: @mjbrender <https://twitter.com/mjbrender>
>> c: +1 617.817.3195
>>
>> On Fri, Jun 5, 2015 at 2:16 AM, Toby Corkindale <t...@dryft.net> wrote:
>>
>>> Hi Kota,
>>> Our production nodes are Riak CS 1.5 and Riak 1.4.x -- they're running
>>> haproxy 1.4.x, and it's all been happy for some time now.
>>>
>>> Testing the new nodes, still same haproxy versions, but Riak CS 2.0.1
>>> and Riak 1.0.5.
>>> Very confused as to why the connections are being dropped when going
>>> through haproxy. The problem persists even after restarting CS.
>>> I tried staggering the restarts.. increasing the PB request pool.. etc..
>>>  no change.
>>>
>>> But it works fine if CS connects directly to the localhost riak pb.
>>> (Which isn't a great idea.. big Riak instances sometimes take too long
>>> to start, and CS falls over because it started too fast and couldn't
>>> connect, if you're going to localhost)
>>>
>>> Confusing! I'm wondering if it's because the testing machines are in
>>> virtual machines, compared to production which is real hardware.
>>> But.. normally haproxy still works fine on VMs.
>>>
>>> I'll continue to play around.. Must be something that's botched on the
>>> testing setup... but don't want to replicate that into production!
>>>
>>>
>>> On Fri, 5 Jun 2015 at 13:59 Kota Uenishi <k...@basho.com> wrote:
>>>
>>>> Toby,
>>>>
>>>> As PB connection management hasn't changed between CS 1.5 and
>>>> 2.0, I think it should work. Which version was the load balancing
>>>> working stably with? It depends on the reason why the connection was
>>>> cut, but I would recommend you restart just the CS node and recreate
>>>> the connection pool.
>>>>
>>>> On Thu, Jun 4, 2015 at 2:33 PM, Toby Corkindale <t...@dryft.net> wrote:
>>>> > Hi,
>>>> > I've been happily using haproxy in front of Riak and Riak CS 1.x in
>>>> > production for quite a while.
>>>> >
>>>> > I've been trying to bring up a new cluster based on riak/cs 2.0.x
>>>> > recently, as you've probably noticed from the flurry of emails to
>>>> > this list :)
>>>> >
>>>> > I'm discovering that if I have haproxy sitting between riak-cs and
>>>> > riak, then I get a lot of errors about disconnections. Initially I
>>>> > thought this must be related to pb backlogs or pool sizes or file
>>>> > handle limits -- but I've played with all those things to no avail.
>>>> >
>>>> > I *have* noticed that if I get riak-cs to connect directly to a riak
>>>> > (bypassing haproxy) then everything is fine, including with the
>>>> > original default request pool and backlog sizes.
>>>> >
>>>> > I am essentially using the recommended haproxy.cfg, which has worked
>>>> > fine in production elsewhere.
>>>> >
>>>> > Any suggestions?
>>>> > Error message sample follows:
>>>> >
>>>> > 2015-06-04 15:26:16.447 [warning]
>>>> > <0.283.0>@riak_cs_riak_client:get_user_with_pbc:293 Fetching user
>>>> > record with strong option failed: disconnected
>>>> > 2015-06-04 15:26:16.447 [warning]
>>>> > <0.2095.0>@riak_cs_pbc:check_connection_status:97 Connection status
>>>> > of <0.287.0> at maybe_create_user: {false,[]}
>>>> >
>>>> >
>>>
>>>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
