Hi Lukas, and first, many thanks for sharing your thoughts and opinions on this.
[ responding to both of your messages at once ]

On Wed, Jul 10, 2024 at 09:30:55PM +0200, Lukas Tribus wrote:
> On Wed, 10 Jul 2024 at 16:39, Willy Tarreau <w...@1wt.eu> wrote:
> >
> > Another change that will need to be backported after some time concerns
> > the handling of default FD limits. (...)
>
> I wholeheartedly hate default implicit limits and I also pretty much
> disagree with fd-hard-limit in general, but allow me to quote your own
> post here from github issue #2043 comment
> https://github.com/haproxy/haproxy/issues/2043#issuecomment-1433593837

I don't like having had to deal with such limits for ~23 years now, but the
fact is that it's one of the strict, non-bypassable, system-imposed limits.
The problem is that while the vast majority of users don't care about the
number of FDs, this value cannot be changed at runtime and does have serious
implications on RAM usage, and even on our ability to cleanly accept the
connections we're engaging in processing.

So in any case we need to respect a limit, and for this we have to compose
with what operating systems are doing. For decades they would present 1024
soft and more hard, but not that much (e.g. 4k), and it was necessary to
start as root to go beyond. Then some OSes started to expose much higher
hard values by default (256k to 1M) so that it was no longer required to be
root to start a service. During this time, such limits were more or less
tailored around RAM sizing. Now it seems we're reaching a limit, with
extreme values being advertised without any relation to the allocated RAM.
I think that containers are part of the cause of this.

> > we used to have a 2k maxconn limit for a very long time and it was
> > causing much more harm than such an error: the process used to start
> > well and was working perfectly fine until the day there was a big rush
> > on the site and it wouldn't accept more connections than the default
> > limit. I'm not that much tempted by setting new high default limits.
> > We do have some users running with 2+ million concurrent connections,
> > or roughly 5M FDs. That's already way above what most users would
> > consider an acceptable default limit, and anything below this could
> > mean that such users wouldn't know about the setting and could get
> > trapped.
>
> I disagree that we need to heuristically guess those values like I
> believe I said in the past.

My problem is that such a limit *does* exist (and if you look at "ulimit -a",
it's one of the rare ones that's never unlimited), so we have to apply a
value: with too low a value we reject traffic at the worst possible moment
(when there are the most possible witnesses of your site falling down), and
with too high a value we cannot start anymore. Limits are imposed on the
process and it needs to work within them.

> "But containers ..." should not be an argument to forgo the principle
> of least surprise.

I agree, even though they're part of the problem (but no longer the only
one).

> There are ways to push defaults like this out if really needed: with
> default configuration files, like we have in examples/ and like
> distributions provide in their repositories. This default the users
> will then find in the configuration file and can look it up in the
> documentation if they want.

I'm not against encouraging users to find sane limits in the config files
they copy-paste all over the place. Just like I think that if systemd
starts to advertise very large values, we should probably encourage
shipping unit files that set the hard limit to 1M or so (i.e. the
previously implicit hard value presented to the daemon).
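Purely for illustration (the drop-in path and the exact value are just my
assumptions here, nothing we ship), such a cap could be expressed in a
systemd drop-in using the LimitNOFILE directive:

    # /etc/systemd/system/haproxy.service.d/fd-limit.conf  (hypothetical drop-in)
    [Service]
    # Present the daemon with a ~1M hard FD limit, i.e. roughly the value
    # that used to be the implicit default, instead of whatever much larger
    # number the system may advertise.
    LimitNOFILE=1048576

followed by "systemctl daemon-reload" and a restart of the service.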
> At the very least we need a *stern* configuration warning that we now
> default to 1M fd, although I would personally consider this (lack of
> all fd-hard-limit, ulimit and global maxconn) leading to heuristic
> fd-hard-limit a critical error.

When Valentine, who worked on the patch, talked to me about the problem,
these were among the possibilities we thought about. I initially disagreed
with the error because I considered that having to set yet another limit to
keep your config running is quite a pain (a long time ago users were really
bothered by the relation between ulimit-n and maxconn). But I was wrong on
one point: I forgot that fd-hard-limit only applies when maxconn/ulimit-n
etc. are not set, so that wouldn't affect users who already set their
values correctly.

> I also consider backporting this change - even with a configuration
> warning - dangerous.

I know, but we don't decide which distro users start their stable version
on :-/

> So here a few proposals:
>
> Proposal 1:
>
> - remove fd-hard-limit as it was a confusing mistake in the first place

No, I disagree with this one. fd-hard-limit *is* useful. It says "if you
don't know what a good value is, stay within system limits and in no case
go beyond this". I consider that it adds reliability to configs and will
stop the mess of users forcing absurd maxconn values whose impact they
don't necessarily understand. I'm even using it myself, not due to
resources, but just because it allows haproxy to start faster by not having
to initialize all 1M FD entries that I know I'm not going to use.

> - exit with a configuration error when global maxconn is not set

That goes with the above, I disagree with this. This was our mistake in the
old days, that maxconn needed to be edited in configs. It was also Apache's
problem in the 1.3 era, where we started to deploy haproxy everywhere in
front of it because MaxClients was impossible to tune. The default of 150
was way too low even for moderate sites, which learned it the hard way by
having the site fail to respond, and restarting with a larger value the
next day would cause swap and OOM, making the situation worse. System
limits are present, and whenever we can we should follow them because
they're generally adjusted in central places where users expect to find
them. The maxconn is service-specific within system-imposed sizing
constraints. It makes sense that some just want to take whatever the OS
offers.

> - put global maxconn in all example configurations, encourage
> Debian/RH to do the same

What I would really like is to no longer see any maxconn in a regular
configuration, because there's no good value and we've seen them copied
over and over. How many times have we asked "are you sure you really need
that high a maxconn?" in bug reports (even when it was unrelated to the
problem)?

> - document accordingly
>
> Proposal 2:
>
> - keep fd-hard-limit
> - exit with a configuration error when fd-hard-limit needs to guess 1M

That's an option I think I can live with, even if by default it will really
annoy all users by mandating yet another obscure setting, especially
developers starting haproxy in the foreground on the command line. In this
case we might want to provide a command-line equivalent argument and
suggest it in the error message.
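Just to make the relation concrete, here's a minimal sketch of what setting
it explicitly looks like in the global section (the value is only an
example, and as said above the keyword only matters when maxconn/ulimit-n
are left unset):

    global
        # Never use more than ~1M FDs, whatever the system advertises;
        # maxconn and ulimit-n are intentionally left unset here so that
        # they are derived automatically within this bound.
        fd-hard-limit 1048576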
> - put fd-hard-limit in all example configurations, encourage Debian/RH
> to do the same

Instead I'd really encourage them to put the limit into the systemd unit
file, I guess, since that's where the change happens in the first place.
But that's something that we need to discuss here with other users and
distro maintainers as well.

> - document accordingly

Agreed on this. Valentine is currently working on explaining the relation
between all of these settings, to be put in the management doc, so that one
does not need to already know a keyword to figure out how it relates to the
others. I think this will help quite a bit.

> Otherwise the next bug report will be that haproxy OOM's (in
> production and only when encountering load) by default with systems
> with less than 16 GB of RAM. The same bug reporter just needs a VM
> with 8 GB RAM or less.

I'm confused now, I don't see how, given that the change only *lowers* an
existing limit; it never raises it. It's precisely because of the risk of
OOM with OSes switching the default from one million FDs to one billion
that we're proposing to keep the previous limit of 1 million as a sane
upper bound. The only risk I'm seeing would be users discovering that they
cannot accept more than ~500k concurrent connections on a large system.
But I claim that those dealing with such loads *do* carefully size and
configure their systems and services (RAM, fd, conntrack, monitoring tools
etc). Thus I'm not sure which scenario you have in mind in which this
change could result in a report like the one above.

Thanks!
Willy