On Mon, Sep 16, 2019 at 07:25:48AM -0700, Dave Hansen wrote:
> On 9/16/19 2:00 AM, Kirill A. Shutemov wrote:
> >>> I think we also need to make it clear that this is workaround for a broken
> >>> hardware: speculative execution must not trigger a halt.
> >> I think the word broken is a bit loaded here. According to the UEFI
> >> spec (version 2.8, page 167), "Regions that are backed by the physical
> >> hardware, but are not supposed to be accessed by the OS, must be
> >> returned as EfiReservedMemoryType." Our interpretation is that
> >> includes speculative accesses.
> > +Dave.
> >
> > I don't think it is. Speculative access is done by hardware, not OS.
> >
> > BTW, isn't it a BIOS issue?
> >
> > I believe it should have a way to hide a range of physical address space
> > from OS or force a caching mode on to exclude it from speculative
> > execution. Like setup MTRRs or something.
>
> Ugh. I bet that was a fun one to chase down. Have the hardware
> engineers learned a lesson or are they hiding behind the EFI spec in an
> act of pure cowardice? ;)

Yes, it was fun. My main BIOS contact has explained to me how they are
stuck between a rock and a hard place on any other options for this.

> The patch is small and fixes a real problem. The changelog is OK,
> although I'd prefer some differentiation between "occupied by the
> kernel" and the kernel *image*.

OK, is the phrase "kernel image" generally understood to cover
everything from _text to _end, including the bss? As long as that's
true, I will adopt this phrase.

> The code is also gloriously free of any comments about what it's
> doing or why.

I'm intending to add something like this in the next version:

	/*
	 * Only the region occupied by the kernel image has so far been checked against
	 * the table of usable memory regions provided by the firmware, so
	 * invalidate pages outside that region. A page table entry that maps to
	 * a reserved area of memory would allow processor speculation into that
	 * area, and on some hardware (particularly the UV platform) speculation
	 * into reserved areas can cause a system halt.
	 */

> But, I'm left with lots of questions:
>
> Why do PMD-level changes fix this? Is it because we 2MB pad the kernel
> image? Why can't we still get within 2MB of the memory address in
> question?

This fix works for our hardware because the problematic reserved
regions are 64M aligned, and going up to the next 2MB boundary from
_end is not going to cross the next 64M boundary.

One could argue the next step would be going into
boot/compressed/{kaslr.c, misc.c} and rounding the size of the kernel
up to the next 2MB boundary to ensure the chosen random location is
covered by usable RAM up to the next PMD-level boundary. I did not go
there because for us it is not necessary.

> Is it in the lower 1MB, by chance?

No, this is a reserved range at the top physical addresses for each
NUMA node in a collection of them.

> If this is all about avoiding EFI reserved ranges, why doesn't the
> patch *LOOK* At EFI reserved ranges?

Because the range the kernel image is located in is already checked
against them in boot/compressed/kaslr.c. This will now be explained in
the comment I mention above, which you had not yet seen.

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise
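
For readers who want to check the alignment argument in the reply above (reserved regions are 64M aligned, so rounding _end up to the next 2MB boundary cannot cross the next 64M boundary), here is a minimal user-space sketch. It is not code from the patch: the names round_up_pow2, PMD_SZ, RESERVED_ALIGN and pmd_round_stays_below_reserved are hypothetical, and it assumes a 2MB PMD mapping size as on x86-64 with 4K pages.

```c
#include <stdint.h>

/* Assumption: one PMD entry maps 2MB (x86-64, 4K base pages). */
#define PMD_SZ         (2ULL << 20)
/* Assumption taken from the mail: reserved regions start on 64M boundaries. */
#define RESERVED_ALIGN (64ULL << 20)

/* Round x up to the next multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: given the physical address just past the kernel
 * image (what the mail calls _end), return 1 if the PMD-rounded end does
 * not go past the next 64M boundary, i.e. the earliest place a 64M-aligned
 * reserved region could begin above the image.
 */
static int pmd_round_stays_below_reserved(uint64_t end)
{
	uint64_t pmd_end = round_up_pow2(end, PMD_SZ);
	uint64_t next_reserved = round_up_pow2(end, RESERVED_ALIGN);

	return pmd_end <= next_reserved;
}
```

Because 64M is a multiple of 2MB, every 64M boundary is also a 2MB boundary, so the helper returns 1 for any input. It would stop holding for reserved regions aligned to less than 2MB, which is the case the mail says was not pursued.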