** Description changed:

+ [Impact]
+ 
+ * using the do-resolve action causes frequent crashes (>15 times/day on
+   a given machine) and sometimes puts haproxy into a bogus state where
+   DNS resolution stops working forever without crashing (DoS)
+ 
+ * to address the problem, 3 upstream fixes released in a later micro
+   release (2.0.16) are backported. Those fixes address problems
+   affecting the do-resolve action, which was introduced in the 2.0
+   branch (hence only Focal is affected)
+ 
+ [Test case]
+ 
+ These crashes are due to the do-resolve action not being thread-safe.
+ As such, they are hard to trigger on demand, except with production
+ load.
+ 
+ The proposed fix should be tested for any regression, ideally on an
+ SMP-enabled (multiple CPUs/cores) machine, as the crashes happen more
+ easily on those.
+ 
+ haproxy should be configured as in your production environment and put
+ under test to verify that it still behaves as expected. The more
+ concurrent traffic, the better, especially if you make use of the
+ do-resolve action.
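+ 
+ If reproducing the full production setup is not practical, a minimal
+ configuration exercising the same code paths can serve as a starting
+ point. The following is only a sketch derived from the configuration in
+ the original description below; the names, addresses, ports and thread
+ count are placeholders:
+ 
+  global
+      # multiple threads make the do-resolve path more likely to be
+      # exercised concurrently
+      nbthread 4
+ 
+  defaults
+      mode tcp
+      timeout connect 5s
+      timeout client  15s
+      timeout server  15s
+ 
+  resolvers mydns
+      nameserver local 127.0.0.1:53
+ 
+  frontend fe_test
+      bind :8443
+      tcp-request inspect-delay 5s
+      # placeholder hostname; in production this comes from the traffic
+      tcp-request content set-var(txn.remote_host) str(localhost)
+      tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
+      default_backend be_allowed
+ 
+  backend be_allowed
+      server s1 127.0.0.1:8080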
+ 
+ Any crash should be visible in the journal, so you can keep an eye on
+ "journalctl -fu haproxy.service" during the tests. Abnormal exits can
+ be listed with:
+ 
+  journalctl -u haproxy --grep ALERT | grep -F 'exited with code '
+ 
+ With the proposed fix, there should be none, even under heavy
+ concurrent traffic.
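+ 
+ To generate concurrent traffic against such a test frontend, a simple
+ shell loop is enough (again only a sketch; the address, port and
+ iteration counts are placeholders to be adapted to your setup):
+ 
+  # open many short-lived TCP connections in parallel; each one goes
+  # through the tcp-request content rules and thus the do-resolve action
+  for i in $(seq 1 50); do
+      ( for j in $(seq 1 200); do
+            printf 'ping\n' | nc -w 1 127.0.0.1 8443 > /dev/null 2>&1
+        done ) &
+  done
+  wait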
+ 
+ [Regression Potential]
+ 
+ The backported fixes address locking and memory management issues. It's
+ possible they could introduce new locking problems or memory leaks.
+ That said, they only touch code related to the DNS/do-resolve action,
+ which is already known to trigger frequent crashes.
+ 
+ The backported fixes come from a later micro version (2.0.16, released
+ on 2020-07-17), so they have already seen real-world production usage.
+ Upstream has released another micro version since then and did not
+ revert or rework the DNS/do-resolve action.
+ 
+ Furthermore, the patch was tested on a production machine that used to
+ crash >15 times per day. We are now 48 hours after deploying the patch
+ and have seen no crash and no sign of regression.
+ 
+ [Original description]
+ 
  haproxy 2.0.13-2 keeps crashing for us:
  
  # journalctl --since yesterday -u haproxy --grep ALERT | grep -F 'exited with code '
  Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
  Sep 07 19:45:23 foo haproxy[16876]: [ALERT] 250/194523 (16876) : Current worker #1 (16877) exited with code 139 (Segmentation fault)
  Sep 07 19:49:01 foo haproxy[16916]: [ALERT] 250/194901 (16916) : Current worker #1 (16919) exited with code 139 (Segmentation fault)
  Sep 07 19:49:02 foo haproxy[16939]: [ALERT] 250/194902 (16939) : Current worker #1 (16942) exited with code 139 (Segmentation fault)
  Sep 07 19:49:03 foo haproxy[16953]: [ALERT] 250/194903 (16953) : Current worker #1 (16955) exited with code 139 (Segmentation fault)
  Sep 07 19:49:37 foo haproxy[16964]: [ALERT] 250/194937 (16964) : Current worker #1 (16965) exited with code 139 (Segmentation fault)
  Sep 07 23:41:13 foo haproxy[16982]: [ALERT] 250/234113 (16982) : Current worker #1 (16984) exited with code 139 (Segmentation fault)
  Sep 07 23:41:14 foo haproxy[17076]: [ALERT] 250/234114 (17076) : Current worker #1 (17077) exited with code 139 (Segmentation fault)
  Sep 07 23:43:20 foo haproxy[17090]: [ALERT] 250/234320 (17090) : Current worker #1 (17093) exited with code 139 (Segmentation fault)
  Sep 07 23:43:50 foo haproxy[17113]: [ALERT] 250/234350 (17113) : Current worker #1 (17116) exited with code 139 (Segmentation fault)
  Sep 07 23:43:51 foo haproxy[17134]: [ALERT] 250/234351 (17134) : Current worker #1 (17135) exited with code 139 (Segmentation fault)
  Sep 07 23:44:44 foo haproxy[17146]: [ALERT] 250/234444 (17146) : Current worker #1 (17147) exited with code 139 (Segmentation fault)
  Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
  Sep 08 00:27:51 foo haproxy[17263]: [ALERT] 251/002751 (17263) : Current worker #1 (17266) exited with code 139 (Segmentation fault)
  Sep 08 00:30:36 foo haproxy[17286]: [ALERT] 251/003036 (17286) : Current worker #1 (17289) exited with code 134 (Aborted)
  Sep 08 00:37:01 foo haproxy[17307]: [ALERT] 251/003701 (17307) : Current worker #1 (17310) exited with code 139 (Segmentation fault)
  Sep 08 00:40:31 foo haproxy[17331]: [ALERT] 251/004031 (17331) : Current worker #1 (17334) exited with code 139 (Segmentation fault)
  Sep 08 00:41:14 foo haproxy[17650]: [ALERT] 251/004114 (17650) : Current worker #1 (17651) exited with code 139 (Segmentation fault)
  Sep 08 00:41:59 foo haproxy[17669]: [ALERT] 251/004159 (17669) : Current worker #1 (17672) exited with code 139 (Segmentation fault)
  
  The server in question uses a config that looks like this:
  
  global
-   maxconn 50000
-   log /dev/log    local0
-   log /dev/log    local1 notice
-   chroot /var/lib/haproxy
-   stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
-   stats timeout 30s
-   user haproxy
-   group haproxy
-   daemon
+   maxconn 50000
+   log /dev/log    local0
+   log /dev/log    local1 notice
+   chroot /var/lib/haproxy
+   stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
+   stats timeout 30s
+   user haproxy
+   group haproxy
+   daemon
  
  defaults
-   maxconn 50000
-   log     global
-   mode    tcp
-   option  tcplog
-   # dontlognull won't log sessions where the DNS resolution failed
-   #option    dontlognull
-   timeout connect 5s
-   timeout client  15s
-   timeout server  15s
+   maxconn 50000
+   log     global
+   mode    tcp
+   option  tcplog
+   # dontlognull won't log sessions where the DNS resolution failed
+   #option    dontlognull
+   timeout connect 5s
+   timeout client  15s
+   timeout server  15s
  
  resolvers mydns
-   nameserver local 127.0.0.1:53
-   accepted_payload_size 1400
-   timeout resolve       1s
-   timeout retry         1s
-   hold other           30s
-   hold refused         30s
-   hold nx              30s
-   hold timeout         30s
-   hold valid           10s
-   hold obsolete        30s
+   nameserver local 127.0.0.1:53
+   accepted_payload_size 1400
+   timeout resolve       1s
+   timeout retry         1s
+   hold other           30s
+   hold refused         30s
+   hold nx              30s
+   hold timeout         30s
+   hold valid           10s
+   hold obsolete        30s
  
  frontend foo
-   ...
+   ...
  
-   # dns lookup
-   tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
-   tcp-request content capture var(txn.remote_ip) len 40
+   # dns lookup
+   tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
+   tcp-request content capture var(txn.remote_ip) len 40
  
-   # XXX: the remaining rejections happen in the backend to provide better logging
-   # reject connection on DNS resolution error
-   use_backend be_dns_failed unless { var(txn.remote_ip) -m found }
+   # XXX: the remaining rejections happen in the backend to provide better logging
+   # reject connection on DNS resolution error
+   use_backend be_dns_failed unless { var(txn.remote_ip) -m found }
  
-   ...
-   # at this point, we should let the connection through
-   default_backend be_allowed
+   ...
+   # at this point, we should let the connection through
+   default_backend be_allowed
  
- 
- When reaching out to haproxy folks in #haproxy, https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was mentioned as a potential fix.
+ When reaching out to haproxy folks in #haproxy, https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was mentioned as a potential fix.
  
  https://www.haproxy.org/bugs/bugs-2.0.13.html shows 3 commits with
  "dns":
  
  * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae
  * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=39eb766
  * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=74d704f
  
  It would be great to have those (at least ef131ae) SRU'ed to Focal
  (Bionic isn't affected as it runs on 1.8).
  
  Additional information:
  
  # lsb_release -rd
  Description:  Ubuntu 20.04.1 LTS
  Release:      20.04
  
  # apt-cache policy haproxy
  haproxy:
-   Installed: 2.0.13-2
-   Candidate: 2.0.13-2
-   Version table:
-  *** 2.0.13-2 500
-         500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
-         100 /var/lib/dpkg/status
+   Installed: 2.0.13-2
+   Candidate: 2.0.13-2
+   Version table:
+  *** 2.0.13-2 500
+         500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
+         100 /var/lib/dpkg/status

** Patch added: "haproxy-lp1894879.debdiff"
   
https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+attachment/5409451/+files/haproxy-lp1894879.debdiff

https://bugs.launchpad.net/bugs/1894879

Title:
  frequent crashes when using do-resolve()
