>>> On Apr 4, 2019, at 4:55 PM, Khalid Aziz <khalid.a...@oracle.com> wrote:
>>>
>>> On 4/3/19 10:10 PM, Andy Lutomirski wrote:
>>> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz <khalid.a...@oracle.com> wrote:
>>>
>>> XPFO flushes kernel space TLB entries for pages that are now mapped
>>> in userspace, not only on the current CPU but also on all other CPUs,
>>> synchronously. Processes on each core allocating pages cause a
>>> flood of IPI messages to all other cores to flush TLB entries.
>>> Many of these messages flush the entire TLB on a core because
>>> the number of entries being flushed from the local core exceeds
>>> tlb_single_page_flush_ceiling. The cost of the TLB flushes caused by
>>> unmapping pages from the physmap goes up dramatically on machines with
>>> high core counts.
>>>
>>> This patch flushes the relevant TLB entries for the current process,
>>> or the entire TLB, depending upon the number of entries for the
>>> current CPU, and posts a pending TLB flush on all other CPUs when a
>>> page is unmapped from kernel space and mapped in userspace. Each core
>>> checks its own pending TLB flush flag on every context switch,
>>> flushes its TLB if the flag is set, and clears it.
>>> This patch potentially aggregates multiple TLB flushes into one.
>>> This has a very significant impact, especially on machines with large
>>> core counts.
>>
>> Why is this a reasonable strategy?
>
> Ideally, when pages are unmapped from the physmap, all CPUs would be
> sent an IPI synchronously to flush the TLB entries for those pages
> immediately. That may be ideal from a correctness and consistency point
> of view, but it also results in an IPI storm and repeated TLB flushes
> on all processors. Any time a page is allocated to userspace, we are
> going to go through this, and it is very expensive. On a 96-core
> server, the performance degradation is 26x!!
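[Editor's note: to make the deferred scheme concrete, here is a minimal
userspace C model of the pending-flush idea quoted above. All names in it
(xpfo_unmap_page, xpfo_context_switch, NR_CPUS_MODEL) are hypothetical and
chosen for illustration; the actual patch manipulates per-CPU TLB state
inside the kernel rather than printing messages.]

	/*
	 * Minimal userspace model of the deferred-flush scheme quoted
	 * above. All names here are hypothetical.
	 */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_CPUS_MODEL 4

	/* One pending-flush flag per CPU: set remotely, consumed locally. */
	static atomic_bool pending_tlb_flush[NR_CPUS_MODEL];

	/* Runs on the CPU that just unmapped a page from the physmap. */
	static void xpfo_unmap_page(int this_cpu)
	{
		/*
		 * Flush locally right away (single entries, or the whole
		 * TLB if over a tlb_single_page_flush_ceiling-style
		 * threshold).
		 */
		printf("cpu%d: immediate local TLB flush\n", this_cpu);

		/* Instead of IPIing every other CPU, just post a flag. */
		for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
			if (cpu != this_cpu)
				atomic_store(&pending_tlb_flush[cpu], true);
	}

	/* Runs on every CPU at each context switch. */
	static void xpfo_context_switch(int this_cpu)
	{
		/* Flush lazily if a flag was posted, clearing it atomically. */
		if (atomic_exchange(&pending_tlb_flush[this_cpu], false))
			printf("cpu%d: deferred TLB flush at context switch\n",
			       this_cpu);
	}

	int main(void)
	{
		xpfo_unmap_page(0);     /* cpu0 unmaps; cpus 1-3 are flagged   */
		xpfo_context_switch(1); /* cpu1 flushes lazily                 */
		xpfo_context_switch(1); /* flag already cleared: nothing to do */
		return 0;
	}

[Note that between xpfo_unmap_page() on one CPU and the next
xpfo_context_switch() on each remote CPU, the stale kernel-side TLB entry
remains usable there; that window is exactly what the reply below objects
to.]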
Indeed. XPFO is expensive.

> When xpfo unmaps a page from the physmap only (after mapping the page
> in userspace in response to an allocation request from userspace) on
> one processor, there is a small window of opportunity for a ret2dir
> attack on the other cpus until the TLB entries in the physmap for the
> unmapped page on those cpus are cleared.

Why do you think this window is small? Intervals of seconds to months
between context switches aren't unheard of.

And why is a small window like this even helpful? For a ret2dir attack,
you just need to get CPU A to allocate a page and write the ret2dir
payload, and then get CPU B to return to it before context switching.
This should be doable quite reliably.

So I don't really have a suggestion, but I think that a 44% regression
to get a weak defense like this doesn't seem worthwhile. I bet that any
of a number of CFI techniques (RAP-like or otherwise) will be cheaper
and will protect against ret2dir better. And they'll also protect
against using other kernel memory as a stack buffer. There are plenty of
those: think pipe buffers, network buffers, any page cache not covered
by XPFO, XMM/YMM saved state, etc.