On Sat, Nov 4, 2017 at 5:39 PM, Rob Landley <r...@landley.net> wrote:
> On 11/03/2017 08:07 PM, Linus Torvalds wrote:
>>>
>>> But it boils down to "got the limit wrong, the exec failed after the
>>> fork(), dynamic recovery from which is awkward so I'm trying to figure
>>> out the right limit".
>
> Sounds later like dynamic recovery is what you recommend. (Awkward
> doesn't mean I can't do it.)
So my preferred solution would certainly be for xargs to realize that
_SC_ARG_MAX is at most an "informed guess" rather than some absolute
value.

Or, alternatively, simply use a _SC_ARG_MAX that is small enough that
we can say "yes, that's what we've always supported". That, in
practice, is the 128kB value.

It's probably also big enough that nobody cares any more. Sure, you
*could* feed bigger arguments, and we do end up basically supporting
it, because people who don't write proper scripts will simply do

    grep some-value $(find . -name '*.c')

and that works pretty well if your code-base isn't humongous. It
still works for the kernel, for example, but there's no question that
it's not something we *guarantee* works. It doesn't work if you don't
limit it to *.c files:

  [torvalds@i7 linux]$ grep some-value $(find .)
  -bash: /usr/bin/grep: Argument list too long

so when the kernel expanded the argument list from the traditional
128kB, it was for _convenience_; it was not meant for people who do
proper scripting and use xargs.

See the difference?

>> What _is_ the stack limit when using toybox? Is it just entirely
>> unlimited?
>
> Answer to second question on ubuntu 14.04:
>
> landley@driftwood:~/linux/linux/fs$ ulimit -s 999999999
> landley@driftwood:~/linux/linux/fs$ ulimit -s
> 999999999

Oh, I can do that on Fedora too. But the thing is, nobody actually
seems to do that.

And our regression rule has never been "behavior doesn't change".
That would mean that we could never make any changes at all.

For example, we do things like add new error handling etc all the
time, which we then sometimes even add tests for in our kselftest
directory.

So clearly behavior changes all the time, and we don't consider that
a regression per se.

The rule for a regression for the kernel is that some real user
workflow breaks. Not some test. Not a "look, I used to be able to do
X, now I can't".

So that's why this whole "is this a test or a real workload" question
is important to me, along with what odd RLIMIT_STACK settings people
actually use.

I'm not interested in "can you reproduce it", because that's simply
not the issue for us from a regression standpoint.

That said, I tried this under Fedora:

    ulimit -s unlimited
    find / -print0 2> /dev/null | xargs -0 /usr/bin/echo | wc

and it reported 991 lines. That would match just using 128kB as the
breaking point for xargs.

So apparently at least Fedora doesn't seem to use that "stack limit /
4" thing, but the traditional 128kB value. I don't know if that is
because of 'xargs' or because of library differences - I didn't
check.

And I have a hard time judging whether the regression report is a
"look, this behavior changed" or a "we actually have a real
regression visible to actual users".

See above why that matters to me.

> It sounds like "probe and recover" is my best option.

I suspect "just use 128kB" is actually the best option in practice.

Sure, "probe and recover" is the best option in theory, but it's a
lot more complex, and there doesn't actually seem to be a lot of
upside.

So to take my silly example of piping "find /" to xargs: I can't
imagine that anybody sane really should care whether it results in
900+ execve's (for that 128kB limit) or just 20 (for some "optimal"
6MB limit).

And there is actually a somewhat real fear with the
"probe-and-recover" model: the already mentioned nasty case of
"sometimes we don't return E2BIG, we might just succeed the execve,
but then kill the process very early due to some _other_ limit being
triggered".
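For what it's worth, here is a minimal sketch of what that probe
could look like. This is not toybox's actual code, just the usual
close-on-exec-pipe trick for telling "the exec itself failed, and
with which errno" apart from "the command ran and then failed on its
own":

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

// Run one batch of arguments.  Returns 0 if the exec itself succeeded,
// or the child's exec errno (E2BIG being the interesting one) so the
// caller can shrink the batch and try again.
static int try_exec(char **argv)
{
  int pipefd[2], err = 0;
  pid_t pid;

  // O_CLOEXEC: the write end goes away on a successful exec, so the
  // parent reading EOF means "the exec worked".
  if (pipe2(pipefd, O_CLOEXEC)) return errno;

  pid = fork();
  if (pid < 0) {
    err = errno;
    close(pipefd[0]);
    close(pipefd[1]);
    return err;
  }
  if (!pid) {
    close(pipefd[0]);
    execvp(argv[0], argv);
    err = errno;                          // only reached if the exec failed
    write(pipefd[1], &err, sizeof(err));
    _exit(127);
  }
  close(pipefd[1]);
  if (read(pipefd[0], &err, sizeof(err)) <= 0) err = 0;  // EOF: exec succeeded
  close(pipefd[0]);
  waitpid(pid, NULL, 0);
  return err;
}

An xargs-style caller would shrink its batch and retry whenever this
returns E2BIG. The catch is exactly the nasty case above: the probe
can appear to succeed, and the child can still get killed a moment
later by some other limit.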
That nasty case really shouldn't happen, but it was the case we
considered (and rejected) for suid binaries.

So that's the real reason the kernel behavior changed: we had a
security issue with the "we allow basically unlimited stack growth"
model, where a huge argv/envp could be used to grow the stack into
some other segment.

Realistically, it was only a security issue with suid binaries, but
as mentioned in this thread, the problem is that we do the argument
copying even before we've decided whether the execve is going to be a
suid execve.

So then, exactly so that we do *not* have that nasty case of "execve
succeeds, but we kill the process immediately if it turns into a
potential security issue", we limit the stack growth for everybody.

            Linus
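And for completeness, a hypothetical sketch of the conservative "just
use 128kB" sizing discussed above. Again, this is not toybox's actual
code, and the slop reserved for the environment is a made-up number:

#include <string.h>

#define TRADITIONAL_ARG_MAX (128*1024)  // the old fixed kernel limit
#define ENV_SLOP (4*1024)               // rough allowance for envp etc (assumption)

// How many strings from list[] fit into one exec batch while staying
// safely under the traditional 128kB limit.
static int batch_size(char **list, int count)
{
  long bytes = 0;
  int i;

  for (i = 0; i < count; i++) {
    // each argument costs its bytes, a NUL terminator, and a char * slot
    long cost = strlen(list[i]) + 1 + sizeof(char *);

    if (bytes + cost > TRADITIONAL_ARG_MAX - ENV_SLOP) break;
    bytes += cost;
  }
  return i;
}

Sizing against the fixed 128kB value sidesteps both the _SC_ARG_MAX
guesswork and the probe-and-recover machinery, at the cost of a few
hundred extra (cheap) execs on big inputs like the "find /" example.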