Hi Emeric,

Thanks for your response. My replies are inline.

> 
> On Mar 29, 2017, at 04:31, Emeric Brun <[email protected]> wrote:
> 
> Hi Aaron,
> 
> On 03/28/2017 10:03 PM, Aaron van Meerten wrote:
>> Hi HAProxy List,
>> [snip]
>>> 
>>> I’ve got a fleet of 10 proxy servers peering with each other, fronting 
>>> several backend servers.  I have a very simple stick table setup which I’ve 
>>> pasted examples of below.  Basically I use a URL parameter to control 
>>> server stickiness.
>>> 
>>> This works great, and is an amazing solution to a sticky problem for our 
>>> BOSH-based XMPP messaging, as long as the stick table entries stay in sync. 
>>>  However, sometimes one HAProxy instance will lose one or more entries 
>>> which are still present on the others.
> 
> Is it still the same instance?

No, it ends up being different instances across the fleet, although a few are 
the culprit more often than the rest.

> 
>>> This state persists between minutes and hours, in which the out-of-sync 
>>> server continues to receive updates on some entries but is missing others.
> 
> In the peers protocol, a peer is responsible for pushing its local updates to 
> the other peers.  But a peer won't 'forward' updates coming from another peer 
> (except for a startup resync request).
> 
> So we could reach your case if communication failed between 2 peers (the peer 
> learns the updates from all the peers except one).

That seems to fit the behavior I’ve seen.  Interestingly enough, other entries 
from the same peer DO get pushed to the errant instance while the missing 
entries stay missing.
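
To check my understanding of the push-only behavior you describe, here is a 
toy model in Python.  It is only a sketch and not how HAProxy implements 
peers: the Peer class, the link flags and the assumption that an update lost 
while a link is down is never resent are all mine, for illustration.

```python
class Peer:
    """Toy peer: pushes its own local updates, never forwards learned ones."""

    def __init__(self, name):
        self.name = name
        self.table = {}   # stick table: key -> server id
        self.links = {}   # peer name -> [Peer, link_is_up]

    def connect(self, other):
        self.links[other.name] = [other, True]

    def set_link(self, other_name, up):
        self.links[other_name][1] = up

    def local_update(self, key, value):
        # A local update is stored, then pushed to every reachable peer.
        self.table[key] = value
        for other, up in self.links.values():
            if up:
                other.receive(key, value)

    def receive(self, key, value):
        # An update learned from a peer is stored but NOT forwarded.
        self.table[key] = value


def build_mesh(names):
    """Connect every peer to every other peer (full mesh)."""
    peers = {n: Peer(n) for n in names}
    for a in peers.values():
        for b in peers.values():
            if a is not b:
                a.connect(b)
    return peers


def demo():
    peers = build_mesh(["A", "B", "C"])
    a, c = peers["A"], peers["C"]
    a.set_link("B", False)             # A -> B is briefly down (assumption)
    a.local_update("room1", "srv1")    # reaches C only
    a.set_link("B", True)              # link is back
    a.local_update("room2", "srv2")    # reaches B and C
    c.local_update("room3", "srv3")    # reaches A and B
    return peers
```

In this model B ends up with room2 and room3 but never learns room1, because 
C holds it but does not forward it.  That would match what I see, if an 
update can really be dropped between two peers without being resent.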


> 
>>> A restart of the server can resolve the issue by causing the table to 
>>> refresh, but this is less than ideal.
> 
> At restart, the node will ask for a re-sync to any available peer.

That accounts for a restart resolving the issue, especially if the peer 
connections are re-established between all peers at that point and updates 
are flowing as expected.

>>> 
>>> When it occurs, it appears that all the other servers continue to update 
>>> the “TTL” on the entry, but the errant server slowly allows the entry to 
>>> expire and be removed.
>>> I have developed a tool which pulls the stick table from each proxy and 
>>> compares the entries.  There’s obviously some room for expiry times to be 
>>> different on each proxy, but I’d expect that entries which are regularly 
>>> refreshed on all other peers should be propagated everywhere.
>>> 
>>> I suspect somehow either ephemeral network connectivity between the peers 
>>> or some other error, but I haven’t seen anything in the logs that seems 
>>> relevant.  
>>> 
>>> lsof analysis of open TCP sockets shows all peers connected on 1024 as 
>>> expected.
>>> 
> 
> When you are facing the issue, could you launch a tcpdump between this 
> instance and ALL the other peers, to check if they exchange some data.


There is definitely some data flowing.  I am even able to add new entries on 
each of the other peers and see them appear on the “errant” peer, but the 
particular entries that are missing stay missing until a full restart (or 
until they expire from the other peers).
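
For clarity, this is the kind of check I mean by "missing": a key that at 
least one peer holds but another does not.  A minimal Python sketch, not my 
actual tool; the function name and the already-parsed input layout (peer name 
mapped to a dict of key and server id) are assumptions for illustration.

```python
def find_missing(tables):
    """Return {peer: sorted keys held by some other peer but absent here}.

    `tables` maps a peer name to a dict of stick table key -> server id,
    assumed to be already parsed from each proxy's table dump.
    """
    all_keys = set()
    for entries in tables.values():
        all_keys.update(entries)

    missing = {}
    for peer, entries in tables.items():
        absent = all_keys - set(entries)
        if absent:
            missing[peer] = sorted(absent)
    return missing


if __name__ == "__main__":
    # Hypothetical sample data: proxy2 has lost roomB.
    sample = {
        "proxy1": {"roomA": 1, "roomB": 2},
        "proxy2": {"roomA": 1},
        "proxy3": {"roomA": 1, "roomB": 2},
    }
    print(find_missing(sample))
```

A real comparison also has to tolerate entries that are simply about to 
expire, so the remaining expiry time on the other peers needs to be taken 
into account before calling an entry lost.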

I will grab a better tcpdump of this situation and try to provide it for 
analysis.

> 
>>> I wondered if this list would have any ideas on further avenues for 
>>> analysis on this particular problem.  I’ve seen this happen consistently on 
>>> HAProxy 1.6 and 1.7 through several point releases of each.  If anything it 
>>> seems more frequent in 1.7.
>>> 
>>> Please let me know if you have any good ideas or if anyone has seen 
>>> behavior like this before. 
>>> 
>>> Thanks,
>>> 
>>> -Aaron van Meerten
> 
> R,
> Emeric

Cheers,

-Aaron

