On Thu, 21 Nov 2024 at 19:11, Josh Poimboeuf <jpoim...@kernel.org> wrote: > > On Thu, Nov 21, 2024 at 05:02:06PM -0800, Linus Torvalds wrote: > > [ Time passes ] > > > > Ugh. I tried it. It looks like this: > > > > #define inlined_get_user(res, ptr) ({ \ > > __label__ fail2, fail1; \ > > __auto_type __up = (ptr); \ > > int __ret = 0; \ > > if (can_do_masked_user_access()) \ > > __up = masked_user_access_begin(__up); \ > > else if (!user_read_access_begin(__up, sizeof(*__up))) \ > > goto fail1; \ > > unsafe_get_user(res, ptr, fail2); \
That 'ptr' needs to be '__up' of course. Other than that it actually seems to work. And by "seems to work" I mean "I replaced get_user() with this macro instead, and my default kernel continued to build fine". I didn't actually *test* the end result, although i did look at a few more cases of code generation, and visually it looks sane enough. > That actually doesn't seem so bad, it's easy enough to follow the logic. I'm not loving the "if (0)" with the labels inside of it. But it's the only way I can see to make a statement expression like this work, since you can't have a "return" inside of it. > We could easily use that macro in size-specific inline functions > selected by a macro with a sizeof(type) switch statement -- not so bad > IMO if they improve code usage and generation. The point of it being a macro is that then it doesn't need the size-specific games at all, because it "just works" since the types end up percolating inside the logic. Of course, that depends on unsafe_get_user() itself having the size-specific games in it, so that's a bit of a lie, but at least it needs the games in just one place. And yes, having inline wrappers anyway could then allow for the compiler making the "inline or not" choice, but the compiler really doesn't end up having any idea of whether something is more important from a performance angle, so that wouldn't actually be a real improvement. > Then all the arches have to do is implement unsafe_*_user_{1,2,4,8} and > the "one good implementation" idea comes together? Yeah, except honestly, basically *none* of them do. The list or architectures that actually implement the unsafe accessors is sad: it's x86, powerpc, and arm64. That's it. Which is - along with my bloat worry - what makes me go "it's not worth fighting". Also, honestly, it is *very* rare that "get_user()" and "put_user()" is performance-critical or even noticeable. We have lots of important user copies, and the path and argument copy (aka "strncpy_from_user()") is very important, but very few other places tend to be. So I see "copy_from_user()" in profiles, and I see "strncpy_from_user()", and I've seen a few special cases that I've converted (look at get_sigset_argpack(), for example - it shows up on some select-heavy loads). And now you found another one in that futex code, and that is indeed special. But honestly, most get_user() cases are really in things like random ioctls etc. Which is why I suspect we'd be better off just doing the important ones one by one. And doing the important ones individually may sound really nasty, but if they are important, it does kind of make sense. For example, one place I suspect *is* worth looking at is the execve() argument handling. But to optimize that isn't about inlining get_user(). It's about doing more than one word for each user access, particularly with CLAC/STAC being as slow as they often are. So what you actually would want to do is to combine these two ret = -EFAULT; str = get_user_arg_ptr(argv, argc); if (IS_ERR(str)) goto out; len = strnlen_user(str, MAX_ARG_STRLEN); if (!len) goto out; into one thing, if you really cared enough. I've looked at it, and always gone "I don't _quite_ care that much", even though argument handling definitely shows up on some benchmarks (partly because it mostly shows up on fairly artificial ones - the rest of execve is so expensive that even when we waste some time on argument handling, it's not noticeable enough to be worth spending huge amount of effort on). But you could also look at the pure argv counting code (aka "count()" in fs/exec.c). That *also* shows up quite a bit, and there batching things migth be a much easier thing. But again, it's not about inlining get_user(), it's about doing multiple accesses without turning user accesses on and off all the time. Anyway, that was a long way of saying: I really think we should just special-case the (few) important cases that get reported. Because any *big* improvements will come not from just inlining. Linus