On Sun, 13 Oct 2024 at 20:54, Josh Poimboeuf <jpoim...@kernel.org> wrote:
>
> If I understand correctly, LAM bits are for the benefit of SW and are
> ignored by HW? Can we just mask those bits off?
Yes. But then you waste time on the masking - particularly if it then
causes silly extra overhead just to get the mask.

That was why the whole original LAM patches were dismissed - not because
an "and" instruction to remove the LAM bits would be expensive, but
because when you do static branching depending on it and then load the
mask from memory, the code goes from nice simple straight-line efficient
code to a horrible mess.

The early LAM patches admittedly made it even worse because they did a
per-thread mask etc, so it really got quite nasty and wasn't just *one*
static jump to a constant. But still..

> > So maybe the "what is the actual cycle latency of detecting the
> > faulting instruction" really is the core question here.
>
> I think I remember reading that STAC/CLAC can't necessarily be relied on
> as a speculation barrier, depending on microarchitectural details. It
> might be safest to assume we can't rely on that. Masking is relatively
> cheap anyway.

The WHOLE BUG is all about "microarchitectural details". So arguing that
STAC is a microarchitectural detail and not architecturally guaranteed
isn't an argument. Because architecturally, this bug simply does not
exist, so some architectural definition simply doesn't matter.

So all that matters is whether the broken microarchitectures that used
the wrong bit for testing also have a STAC that basically hides the bug.

Christ. The hardware should have used a bit that is *MEANINGFUL*, namely
bit #63, not some random meaningless bit that just happens to be one of
the bits that is then checked for canonicality.

And it's so annoying, because from a *hardware* perspective, bit #63 vs
bit #48 is completely irrelevant. It's literally just a wire choice.

But from an architectural perspective, bit #63 is not only the *actual*
bit that is the real difference ("kernel is at the top of the address
space"), but for software, bit #48 is fundamentally harder to test.

IOW, it's completely the wrong effing bit to test.

Honestly, the Intel meltdown thing I at least understood *why* it
happened. This AMD "use meaningless bit #48" bug just strikes me as
active malice.

The paper I found for this bug doesn't go into any details of what the
cycle count issues are until the full address is verified, and barely
mentions which micro-architectures are affected - it says "Zen+" and
"Zen 2", from a quick read. The AMD note doesn't say even that.

Is this *all* Zen cores? If not, when did it get fixed?

> So far I have something like the below which is completely untested and
> probably actively wrong in some places.

This is worse than actively wrong. It's just adding insult to injury.

> +static inline bool __access_ok(const void __user *ptr, unsigned long size)
> +{
> +	unsigned long addr = (__force unsigned long)untagged_addr(ptr);
> +	unsigned long limit = TASK_SIZE_MAX;

We're not encouraging this complete microarchitectural incompetence by
using something expensive like TASK_SIZE_MAX, which actually expands to
be conditional on pgtable_l5_enabled() and a shift (maybe the compiler
ends up moving the shift into both sides of the conditional?).

Which isn't even the right thing anyway, because presumably as far as
the HARDWARE is concerned, what gets tested is the actual width of the
TLB, which doesn't depend on whether 5-level paging is actually
*enabled*, but on whether the hardware *supports* 5-level paging.
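Just to spell out where that cost comes from: roughly paraphrasing the
x86 headers (the stub below stands in for the kernel's runtime 5-level
check, and the real macro bodies may differ in detail), the limit ends
up looking something like this, so every comparison against
TASK_SIZE_MAX drags a conditional plus a shift into what should be
dead-simple straight-line code:

/* Throwaway user-space sketch, not kernel code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool pgtable_l5_enabled(void) { return false; }	/* stub for the runtime CPU check */

#define PAGE_SIZE		4096ULL
#define __VIRTUAL_MASK_SHIFT	(pgtable_l5_enabled() ? 56 : 47)		/* paraphrased */
#define TASK_SIZE_MAX		((1ULL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)	/* paraphrased */

/* the "limit" style check: the limit is computed at run time, not an immediate */
static bool my_access_ok(uint64_t addr, uint64_t size)
{
	return size <= TASK_SIZE_MAX && addr <= TASK_SIZE_MAX - size;
}

int main(void)
{
	printf("%d\n", my_access_ok(0x00007f0000000000ULL, 64));	/* 1: user range */
	printf("%d\n", my_access_ok(0xffff888000000000ULL, 64));	/* 0: kernel range */
	return 0;
}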
I guess as long as we aren't using the high bits, we can just clear all
of them (so if we just always clear bit #57 when we also clear #48, we
are ok), but it still seems wrong in addition to being pointlessly
expensive.

> +#define mask_user_address(x) \
> +	((typeof(x))((__force unsigned long)(x) & ((1UL << __VIRTUAL_MASK_SHIFT) - 1)))

No. Again, that code makes it look like it's some nice simple constant.

It's not. __VIRTUAL_MASK_SHIFT is that conditional thing based on
pgtable_l5_enabled(). So now you have static branches etc craziness
instead of having a nice constant.

We can fix this, but no, we're not going to go "hardware people did
something incredibly stupid, so now to balance things out, *we* will do
something equally stupid". We are going to be better than that.

So the way to fix this properly - if it even needs fixing - is to just
have an actual assembler alternative that does something like this in
getuser.S, and does the same for the C code for mask_user_address().

And it needs a *big* comment about the stupidity of hardware checking
the wrong bit - a bit that has no semantic meaning, except for it (too
late!) being tested for canonicality against the actual bit that
matters (ie bit #63).

And again, the whole "if it even needs fixing" is a big thing. Maybe
STAC is just taking long enough that the canonicality check *has* been
done. We know STAC isn't a serializing instruction, but we also *do*
know that STAC sure as hell will synchronize at least to some degree
with memory access permission testing, because that's literally the
whole and only point of it.

(Of course, if the AC bit was just passed along from the front end and
tagged all the instructions, the actual CLAC/STAC instructions wouldn't
need to serialize with actual instruction execution, but if that was
the case they wouldn't be as expensive as I see them being in profiles,
so we know the uarch isn't doing something that clever).

Christ. I thought I was over being annoyed by hardware bugs. But this
hardware bug is just annoying in how *stupid* it is.

Anyway, the attached patch

 (a) only fixes get_user() - I assume put_user() doesn't even need
     this, because the store will go into the store buffer, and
     certainly be killed before it gets anywhere else?

 (b) only does the asm side; mask_user_address() would need to do the
     same thing on the C side, using alternative()

 (c) is entirely untested, and I might have gotten the constants wrong
     or some other logic wrong

but at least the code it generates doesn't look actively like garbage.
It generates something like this:

	mov    %rax,%rdx
	sar    $0x3f,%rdx
	or     %rdx,%rax
	movabs $0x80007fffffffffff,%rdx
	and    %rdx,%rax

which looks about as good as it gets (assuming I didn't screw up the
constant).

The "or" still sets all bits when it's one of the real kernel
addresses, but the "and" will now guarantee that the end result is
canonical for positive addresses, and guaranteed non-canonical - and
for this AMD bug, a zero bit 48/57 - for the invalid kernel range.

Yes, it basically means that as far as the kernel is concerned,
"everything is LAM". The kernel would actually accept all positive
addresses, and only fault (now with a non-canonical GP fault) when
users try to use negative addresses.

Which is arguably also quite horrendous, but it's effectively what
having LAM enabled would do in hardware anyway, so hey, it's "forward
looking".

Bah.

And we could make the masking constants be 0xfeff7fffffffffff and
0xfeffffffffffffff, and *only* mask off bit 48/57.
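To make the effect concrete, here is a throwaway user-space sketch of
what that five-instruction sequence computes - not kernel code, just
the same arithmetic with the 4-level mask constant hard-coded, and it
relies on arithmetic right shifts of signed values the way gcc/clang
do them:

/*
 * Standalone illustration (not kernel code) of the generated sequence:
 *   sar $63  ->  0 for user pointers, all-ones for kernel pointers
 *   or       ->  kernel pointers become all-ones
 *   and      ->  clear everything a user address must not have set,
 *                leaving bit 63, so the kernel range ends up guaranteed
 *                non-canonical with the contested bit (48/57 in the
 *                numbering above) clear.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t mask_user_address(uint64_t addr)
{
	uint64_t sign = (uint64_t)((int64_t)addr >> 63);	/* 0 or ~0 */
	return (addr | sign) & 0x80007fffffffffffULL;		/* 4-level mask */
}

int main(void)
{
	/* valid user pointer: passes through unchanged */
	printf("%016llx\n", (unsigned long long)mask_user_address(0x00007f1234567890ULL));
	/* kernel pointer: forced to 0x80007fffffffffff, non-canonical */
	printf("%016llx\n", (unsigned long long)mask_user_address(0xffff888000000000ULL));
	return 0;
}

Either set of constants plugs into the same sequence - only the "and"
mask changes.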
And we could even make the whole "and" be conditional on the AMD bug
with a new X86_FEATURE_CANONICAL_LEAK thing.

So there are certainly options in this area.

              Linus
 arch/x86/lib/getuser.S | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index d066aecf8aeb..7d5730aa18b8 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,11 +37,17 @@
 
 #define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
 
+#define X86_CANONICAL_MASK ALTERNATIVE \
+	"movq $0x80007fffffffffff,%rdx", \
+	"movq $0x80ffffffffffffff,%rdx", X86_FEATURE_LA57
+
 .macro check_range size:req
 .if IS_ENABLED(CONFIG_X86_64)
 	mov %rax, %rdx
 	sar $63, %rdx
 	or %rdx, %rax
+	X86_CANONICAL_MASK
+	and %rdx,%rax
 .else
 	cmp $TASK_SIZE_MAX-\size+1, %eax
 	jae .Lbad_get_user
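For (b) above, the C side of mask_user_address() could plausibly be the
same sign-extend plus mask, with the constant patched in by an
alternative on X86_FEATURE_LA57 instead of any runtime conditional -
something like the sketch below. This is only a guess at the shape (the
helper name and the ALTERNATIVE-in-inline-asm use are mine), not actual
kernel code, and it is just as untested as the asm above:

/* Hypothetical sketch only - not the real implementation. */
#include <asm/alternative.h>
#include <asm/cpufeatures.h>

/* pick the canonical-mask constant at patch time: no load, no conditional */
#define x86_canonical_mask()						\
({									\
	unsigned long __mask;						\
	asm(ALTERNATIVE("movq $0x80007fffffffffff, %0",			\
			"movq $0x80ffffffffffffff, %0",			\
			X86_FEATURE_LA57)				\
	    : "=r" (__mask));						\
	__mask;								\
})

/* same transformation as check_range: sign-extend bit 63, or it in, mask */
#define mask_user_address(x)						\
({									\
	unsigned long __addr = (__force unsigned long)(x);		\
	__addr |= (unsigned long)((long)__addr >> 63);			\
	(__force typeof(x))(__addr & x86_canonical_mask());		\
})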