** Description changed:

[Impact]

 * Using the do-resolve action causes frequent crashes (>15 times/day on
a given machine) and sometimes puts haproxy into a bogus state where DNS
resolution stops working forever without crashing (DoS).

 * To address the problem, 3 upstream fixes that were released in a later
micro release are backported. Those fixes address problems affecting the
do-resolve action, which was introduced in the 2.0 branch (which is why
only Focal is affected).

[Test case]

The crashes are due to the do-resolve action not being thread-safe. As
such, they are hard to trigger on demand, except with production load.

The proposed fix should be tested for any regression, ideally on an
SMP-enabled (multiple CPUs/cores) machine, as the crashes happen more
easily on those.

haproxy should be configured like in your production environment and put
under test to see if it still behaves as it should. The more concurrent
traffic the better, especially if you make use of the do-resolve action
(a load-generator sketch is given below, after the [Regression Potential]
section).

Any crash should be visible in the journal, so you can keep an eye on
"journalctl -fu haproxy.service" during the tests. Abnormal exits should
be visible with:

journalctl -u haproxy --grep ALERT | grep -F 'exited with code '

With the proposed fix, there should be none, even under heavy concurrent
traffic.

[Regression Potential]

The backported fixes address locking and memory management issues. It's
possible they could introduce more locking problems or memory leaks. That
said, they only change code related to the DNS/do-resolve action, which
is already known to trigger frequent crashes.

The backported fixes come from a later micro version (2.0.16, released on
2020-07-17), so they have already seen real-world production usage.
Upstream has released another micro version since then and did not revert
or rework the DNS/do-resolve action.

Furthermore, the patch was tested on a production machine that used to
crash >15 times per day. We are now 48 hours after the patch deployment
and we have seen no crash and no sign of regression either.
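As a rough illustration of "heavy concurrent traffic" for the test case
above, something like the following could be run against a TCP frontend
that performs do-resolve. This is a sketch only: the host, port, payload,
and connection counts are assumptions, not taken from this bug report.

    #!/bin/sh
    # Hypothetical load generator: open many short-lived TCP connections
    # in parallel so the do-resolve action runs concurrently on several
    # threads. HOST and PORT are placeholders for your haproxy frontend.
    HOST=127.0.0.1
    PORT=8443
    # 10000 connections, 200 in parallel; each sends one line and waits
    # up to 2 seconds for the peer before giving up.
    seq 1 10000 | xargs -P 200 -I{} \
        sh -c 'printf "ping\r\n" | nc -w 2 "$0" "$1" >/dev/null' "$HOST" "$PORT"

While it runs, keep "journalctl -fu haproxy.service" open in another
terminal and watch for ALERT lines.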
[Original description]

haproxy 2.0.13-2 keeps crashing for us:

# journalctl --since yesterday -u haproxy --grep ALERT | grep -F 'exited with code '
Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
Sep 07 19:45:23 foo haproxy[16876]: [ALERT] 250/194523 (16876) : Current worker #1 (16877) exited with code 139 (Segmentation fault)
Sep 07 19:49:01 foo haproxy[16916]: [ALERT] 250/194901 (16916) : Current worker #1 (16919) exited with code 139 (Segmentation fault)
Sep 07 19:49:02 foo haproxy[16939]: [ALERT] 250/194902 (16939) : Current worker #1 (16942) exited with code 139 (Segmentation fault)
Sep 07 19:49:03 foo haproxy[16953]: [ALERT] 250/194903 (16953) : Current worker #1 (16955) exited with code 139 (Segmentation fault)
Sep 07 19:49:37 foo haproxy[16964]: [ALERT] 250/194937 (16964) : Current worker #1 (16965) exited with code 139 (Segmentation fault)
Sep 07 23:41:13 foo haproxy[16982]: [ALERT] 250/234113 (16982) : Current worker #1 (16984) exited with code 139 (Segmentation fault)
Sep 07 23:41:14 foo haproxy[17076]: [ALERT] 250/234114 (17076) : Current worker #1 (17077) exited with code 139 (Segmentation fault)
Sep 07 23:43:20 foo haproxy[17090]: [ALERT] 250/234320 (17090) : Current worker #1 (17093) exited with code 139 (Segmentation fault)
Sep 07 23:43:50 foo haproxy[17113]: [ALERT] 250/234350 (17113) : Current worker #1 (17116) exited with code 139 (Segmentation fault)
Sep 07 23:43:51 foo haproxy[17134]: [ALERT] 250/234351 (17134) : Current worker #1 (17135) exited with code 139 (Segmentation fault)
Sep 07 23:44:44 foo haproxy[17146]: [ALERT] 250/234444 (17146) : Current worker #1 (17147) exited with code 139 (Segmentation fault)
Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
Sep 08 00:27:51 foo haproxy[17263]: [ALERT] 251/002751 (17263) : Current worker #1 (17266) exited with code 139 (Segmentation fault)
Sep 08 00:30:36 foo haproxy[17286]: [ALERT] 251/003036 (17286) : Current worker #1 (17289) exited with code 134 (Aborted)
Sep 08 00:37:01 foo haproxy[17307]: [ALERT] 251/003701 (17307) : Current worker #1 (17310) exited with code 139 (Segmentation fault)
Sep 08 00:40:31 foo haproxy[17331]: [ALERT] 251/004031 (17331) : Current worker #1 (17334) exited with code 139 (Segmentation fault)
Sep 08 00:41:14 foo haproxy[17650]: [ALERT] 251/004114 (17650) : Current worker #1 (17651) exited with code 139 (Segmentation fault)
Sep 08 00:41:59 foo haproxy[17669]: [ALERT] 251/004159 (17669) : Current worker #1 (17672) exited with code 139 (Segmentation fault)

The server in question uses a config that looks like this:

global
    maxconn 50000
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

defaults
    maxconn 50000
    log global
    mode tcp
    option tcplog
    # dontlognull won't log sessions where the DNS resolution failed
    #option dontlognull
    timeout connect 5s
    timeout client 15s
    timeout server 15s

resolvers mydns
    nameserver local 127.0.0.1:53
    accepted_payload_size 1400
    timeout resolve 1s
    timeout retry 1s
    hold other 30s
    hold refused 30s
    hold nx 30s
    hold timeout 30s
    hold valid 10s
    hold obsolete 30s

frontend foo
    ...

    # dns lookup
    tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
    tcp-request content capture var(txn.remote_ip) len 40

    # XXX: the remaining rejections happen in the backend to provide better logging
    # reject connection on DNS resolution error
    use_backend be_dns_failed unless { var(txn.remote_ip) -m found }

    ...
    # at this point, we should let the connection through
    default_backend be_allowed

When reaching out to haproxy folks in #haproxy,
https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was
mentioned as a potential fix.

https://www.haproxy.org/bugs/bugs-2.0.13.html shows 3 commits with "dns":

 * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae
 * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=39eb766
 * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=74d704f

It would be great to have those (at least ef131ae) SRU'ed to Focal
(Bionic isn't affected as it runs 1.8).

Additional information:

# lsb_release -rd
Description:    Ubuntu 20.04.1 LTS
Release:        20.04

# apt-cache policy haproxy
haproxy:
  Installed: 2.0.13-2
  Candidate: 2.0.13-2
  Version table:
 *** 2.0.13-2 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
        100 /var/lib/dpkg/status
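Until an updated package lands, one rough way to try those commits
locally is sketched below. It assumes gitweb's commitdiff_plain view for
the URLs above, and that the patches (written against later 2.0.x code)
still apply to 2.0.13; manual rebasing may be needed.

    # Sketch only: fetch the three upstream commits and rebuild the
    # Focal source package with them applied.
    apt-get source haproxy
    cd haproxy-2.0.13
    for c in ef131ae 39eb766 74d704f; do
        # commitdiff_plain serves the raw patch for each commit
        wget -O "../$c.patch" \
            "https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff_plain;h=$c"
        patch -p1 < "../$c.patch"
    done
    dpkg-buildpackage -us -uc -b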
** Patch added: "haproxy-lp1894879.debdiff"
   https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+attachment/5409451/+files/haproxy-lp1894879.debdiff

--
https://bugs.launchpad.net/bugs/1894879

Title:
  frequent crashes when using do-resolve()

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+subscriptions