Hi Daniel,

On Tue, Aug 28, 2018 at 1:46 AM Daniel Schneller <[email protected]> wrote:

> Hi!
>
> There seems to be some kind of problem when backend servers (in this case
> ELBs) change their IP addresses.
>
> At some point, apparently, the ELB behind the DNS name in my config
> changed its address(es). Lots of haproxys we use as sidecars on our
> application servers failed their health checks afterwards with L4
> timeouts. For testing, I reloaded haproxy on one of them, and the error
> went away.
>
> The resolvers section has two servers: a local dnsmasq and the AWS VPC DNS
> server at the "magic" address 169.254.169.253.
>
> On a different instance I captured some traffic. The pcap shows the DNS
> queries and responses for the backend server name going to both
> 127.0.0.1:53 and 169.254.169.253:53. Both servers reply with the same
> answers, carrying the current IPs. Those are the same as shown by dig in
> a shell (10.205.100.120 and 10.205.100.61).
>
> However, haproxy apparently still uses an old address, 10.205.100.53, that
> the ELB probably had at some point -- hard to tell after the fact. In the
> pcap I can see "ICMP Host Unreachable" responses for attempts to connect
> to 10.205.100.53 on all the ports my backends specify.
>
> At first I suspected length issues, but the responses are just 174 bytes
> long. If needed, I can provide the pcap privately.
>
> This can be a real fun-killer when all the sidecars suddenly lose
> connection across tens of VMs...
>
> Am I missing something in my config, or is this an actual (maybe known?)
> bug?
>
> Configuration, dig output, and version info below.
>
> Kind regards,
>
> Daniel
>
>
> dig output (actual name is a little longer, I cut off the name for
> brevity and privacy).
> ---------
> [aws:staging-staging] root:~# dig @169.254.169.253 loadbalancer-internal.private
>
> ; <<>> DiG 9.9.5-3ubuntu0.17-Ubuntu <<>> @169.254.169.253 loadbalancer-internal.private
> ; (1 server found)
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 501
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
>
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 4096
> ;; QUESTION SECTION:
> ;loadbalancer-internal.private. IN A
>
> ;; ANSWER SECTION:
> loadbalancer-internal.private. 11 IN A 10.205.100.120
> loadbalancer-internal.private. 11 IN A 10.205.100.61
>
> ;; Query time: 0 msec
> ;; SERVER: 169.254.169.253#53(169.254.169.253)
> ;; WHEN: Mon Aug 27 15:51:48 CEST 2018
> ;; MSG SIZE rcvd: 141
>
>
> [aws:staging-staging] root:~# dig @127.0.0.1 loadbalancer-internal.private
>
> ; <<>> DiG 9.9.5-3ubuntu0.17-Ubuntu <<>> @127.0.0.1 loadbalancer-internal.private
> ; (1 server found)
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20706
> ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
>
> ;; QUESTION SECTION:
> ;loadbalancer-internal.private. IN A
>
> ;; ANSWER SECTION:
> loadbalancer-internal.private. 5 IN A 10.205.100.61
> loadbalancer-internal.private. 5 IN A 10.205.100.120
>
> ;; Query time: 0 msec
> ;; SERVER: 127.0.0.1#53(127.0.0.1)
> ;; WHEN: Mon Aug 27 15:51:54 CEST 2018
> ;; MSG SIZE rcvd: 130
> ---------
>
>
> haproxy.cfg (there are more proxies in the real thing, but they are all
> the same, just for different ports):
> ---------------------
> global
>   log /dev/log len 350 local1 info
>   log-tag haproxy
>   stats socket /var/run/haproxy.stat user haproxy group haproxy mode 600 level admin
>   chroot /var/lib/haproxy
>   user haproxy
>   group haproxy
>   hard-stop-after 30s
>
>
> defaults
>   mode tcp
>   log global
>   option tcplog
>   option dontlognull
>   option http-keep-alive
>   timeout http-request 10s
>   timeout queue 1m
>   timeout connect 5s
>   timeout client 2m
>   timeout server 2m
>   timeout http-keep-alive 10s
>   timeout check 5s
>   retries 3
>   maxconn 2000
>
> resolvers default
>   nameserver local 127.0.0.1:53
>   nameserver aws 169.254.169.253:53
>

Maybe try tuning the "hold valid" parameter, see
https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#5.3.2
According to that section the default value is 10s for "valid", so setting
it to 1s would make more sense when the backend IPs change often.

>
> listen rabbitmq
>   bind 127.0.0.1:5671
>   option dontlog-normal
>   server lb-internal loadbalancer-internal.private:5671 resolvers default check addr loadbalancer-internal.private port 5671
> ---------------------
>
>
> Log:
> --------------------
> ...
> Aug 27 16:49:09 xxx haproxy[2090]: 127.0.0.1:35891 [27/Aug/2018:16:49:09.031] rabbitmq rabbitmq/<NOSRV> -1/-1/0 0 SC 0/0/0/0/0 0/0
> ...
> --------------------
>
>
> Version info:
> ---------------------
> [aws:staging-staging] root:~# haproxy -vvv
> HA-Proxy version 1.7.11-1ppa1~trusty 2018/04/30
> Copyright 2000-2018 Willy Tarreau <[email protected]>
>
> Build options :
>   TARGET = linux2628
>   CPU = generic
>   CC = gcc
>   CFLAGS = -g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2
>   OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 USE_NS=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>
> Encrypted password support via crypt(3): yes
> Built with zlib version : 1.2.8
> Running on zlib version : 1.2.8
> Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
> Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
> Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports prefer-server-ciphers : yes
> Built with PCRE version : 8.31 2012-07-06
> Running on PCRE version : 8.31 2012-07-06
> PCRE library supports JIT : no (libpcre build without JIT?)
> Built with Lua version : Lua 5.3.1
> Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
> Built with network namespace support
>
> Available polling systems :
>   epoll : pref=300, test result OK
>   poll : pref=200, test result OK
>   select : pref=150, test result OK
> Total: 3 (3 usable), will use epoll.
>
> Available filters :
>   [COMP] compression
>   [TRACE] trace
>   [SPOE] spoe
> ---------------------
>
>
> --
> Daniel Schneller
> Principal Cloud Engineer
>
> CenterDevice GmbH
> Rheinwerkallee 3
> 53227 Bonn
> www.centerdevice.com
>
> __________________________________________
> Management: Dr. Patrick Peschlow, Dr. Lukas Pustina, Michael Rosbach,
> Commercial register no.: HRB 18655, Register court: Bonn,
> VAT ID: DE-815299431
>
> This e-mail, including any attached files, contains confidential and/or
> legally protected information. If you are not the intended recipient or
> have received this e-mail in error, please notify the sender immediately
> and delete this e-mail and any attached files at once. Unauthorized
> copying, use, or opening of any attached files, as well as unauthorized
> forwarding of this e-mail, is not permitted.
>

--
Igor Cicimov | DevOps

p. +61 (0) 433 078 728
e. [email protected] <http://encompasscorporation.com/>
w. www.encompasscorporation.com
a. Level 4, 65 York Street, Sydney 2000
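
PS: In case it helps, here is a minimal sketch of what the "hold valid" suggestion above could look like in the quoted resolvers section. The nameserver lines are copied from your config. The "resolve_retries", "timeout retry" and "hold valid" values are assumptions chosen for illustration, and I have not tested them against your setup.

```
resolvers default
    nameserver local 127.0.0.1:53
    nameserver aws   169.254.169.253:53

    # Illustrative values only (assumptions, not tested on this setup)
    resolve_retries 3     # queries to send before giving up on a resolution
    timeout retry   1s    # time to wait between two DNS queries without an answer
    hold valid      1s    # how long a valid answer is kept before resolving again
```

As far as I understand the 1.7 docs, name resolution is triggered by the health checks, so the effective refresh interval probably also depends on the "inter" setting of the check on the server line.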

