On Jun 6, 2019, at 10:28 PM, Nadav Amit <na...@vmware.com> wrote:
>> On May 31, 2019, at 3:07 PM, Nadav Amit <na...@vmware.com> wrote:
>>
>>> On May 31, 2019, at 2:47 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>>
>>>
>>> On May 31, 2019, at 2:33 PM, Nadav Amit <na...@vmware.com> wrote:
>>>
>>>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <l...@kernel.org> wrote:
>>>>>
>>>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <na...@vmware.com> wrote:
>>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long as
>>>>>> the following conditions are met:
>>>>>>
>>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>>>    cause machine-checks.
>>>>>>
>>>>>> 2. No one would access userspace before the flush takes place. Specifically,
>>>>>>    NMI handlers and kprobes would avoid accessing userspace.
>>>>>
>>>>> I think I need to ask the big-picture question. When someone calls
>>>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>>>> were freed, they want the guarantee that future accesses (initiated
>>>>> observably after the flush returns) will not use paging entries that
>>>>> were replaced by stores ordered before flush_tlb_mm_range(). We also
>>>>> need the guarantee that any effects from any memory access using the
>>>>> old paging entries will become globally visible before
>>>>> flush_tlb_mm_range() returns.
>>>>>
>>>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>>>> dirty bit? An interrupt entry today is fully serializing by the time
>>>>> it finishes, but interrupt entries are epically slow, and I don't know
>>>>> if the APIC waits long enough. Heck, what if IRQs are off on the
>>>>> remote CPU? There are a handful of places where we touch user memory
>>>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>>>> IRQs with iopl().
>>>>>
>>>>> I *think* that Intel has stated recently that SMT siblings are
>>>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>>>> them, but SMT is very special.
>>>>>
>>>>> My general conclusion is that I think the code needs to document what
>>>>> is guaranteed and why.
>>>>
>>>> I think I might have managed to confuse you with a bug I made (a
>>>> last-minute bug introduced while I was doing some cleanup). The bug does
>>>> not affect performance much, but it might have led you to think that I
>>>> use the APIC send as synchronization.
>>>>
>>>> The idea is not for us to rely on the write to the ICR as something
>>>> serializing. The flow should be as follows:
>>>>
>>>>
>>>>    CPU0                                    CPU1
>>>>
>>>>    flush_tlb_mm_range()
>>>>    __smp_call_function_many()
>>>>      [ prepare call_single_data (csd) ]
>>>>      [ lock csd ]
>>>>      [ send IPI ]
>>>>                          (*)
>>>>      [ wait for csd to be unlocked ]
>>>>                                            [ interrupt ]
>>>>                                            [ copy csd info to stack ]
>>>>                                            [ csd unlock ]
>>>>      [ find csd is unlocked ]
>>>>      [ continue (**) ]
>>>>                                            [ flush TLB ]
>>>>
>>>>
>>>> At (**) the pages might be recycled, written back to disk, etc. Note that
>>>> during (*), CPU0 might do some local TLB flushes, making it very likely
>>>> that the csd will be unlocked by the time it gets there.
>>>>
>>>> As you can see, I don't rely on any special micro-architectural behavior.
>>>> The synchronization is done purely in software.
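>>>>
>>>> To spell the same flow out in code, here is a rough sketch of the two
>>>> sides (illustrative only: it borrows the csd_lock()/csd_unlock()/
>>>> csd_lock_wait() helpers from kernel/smp.c and the flush_tlb_func_*()
>>>> naming from arch/x86/mm/tlb.c, the per-CPU flush_csd variable is made
>>>> up, and the signatures are simplified; this is not the actual patch):
>>>>
>>>>   /*
>>>>    * CPU0: initiator (sketch).  The csd is locked before the IPI is
>>>>    * sent and is only unlocked by the remote CPU once it has copied
>>>>    * the flush arguments, so waiting on the csd is the only
>>>>    * synchronization -- nothing relies on the APIC write serializing.
>>>>    * Simplified: a real version needs one csd per target CPU.
>>>>    */
>>>>   static void flush_tlb_remote_sketch(const struct cpumask *cpus,
>>>>                                       const struct flush_tlb_info *info)
>>>>   {
>>>>           call_single_data_t *csd = this_cpu_ptr(&flush_csd);
>>>>
>>>>           csd->func = flush_tlb_func_remote;
>>>>           csd->info = (void *)info;
>>>>
>>>>           csd_lock(csd);                          /* [ lock csd ] */
>>>>           arch_send_call_function_ipi_mask(cpus); /* [ send IPI ] */
>>>>
>>>>           /* (*) the local flush overlaps with the remote interrupt */
>>>>           flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
>>>>
>>>>           csd_lock_wait(csd);     /* [ wait for csd to be unlocked ] */
>>>>           /* (**) pages may now be recycled, written back, etc. */
>>>>   }
>>>>
>>>>   /*
>>>>    * CPU1: IPI handler (sketch).  The arguments are copied to the
>>>>    * stack and the csd is unlocked *before* the TLB flush, which is
>>>>    * what lets CPU0 continue while the flush is still pending.
>>>>    */
>>>>   static void flush_tlb_ipi_sketch(call_single_data_t *csd)
>>>>   {
>>>>           /* [ copy csd info to stack ] */
>>>>           struct flush_tlb_info info = *(struct flush_tlb_info *)csd->info;
>>>>
>>>>           csd_unlock(csd);                        /* [ csd unlock ] */
>>>>           flush_tlb_func_remote(&info);           /* [ flush TLB ] */
>>>>   }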
>>>>
>>>> Does it make more sense now?
>>>
>>> Yes. Have you benchmarked this?
>>
>> Partially. Numbers are indeed worse. Here are preliminary results, comparing
>> to v1 (concurrent):
>>
>>   n_threads   before   concurrent    +async
>>   ---------   ------   ----------    ------
>>   1              661    663           663
>>   2             1436   1225 (-14%)   1115 (-22%)
>>   4             1571   1421 (-10%)   1289 (-18%)
>>
>> Note that the benefit of "async" would be greater if the initiator does not
>> flush the TLB at all. This might happen in the case of kswapd, for example.
>> Let me try some micro-optimizations first, run more benchmarks, and get back
>> to you.
>
> So I ran some more benchmarking (my benchmark is not very suitable), and tried
> more things that did not help (checking for more work before returning from
> the IPI handler, and avoiding redundant IPIs in that case).
>
> Anyhow, with a fixed version, I ran a more standard benchmark on DAX:
>
>   $ mkfs.ext4 /dev/pmem0
>   $ mount -o dax /dev/pmem0 /mnt/mem
>   $ cd /mnt/mem
>   $ bash -c 'echo 0 > /sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/block/pmem0/dax/write_cache'
>   $ sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
>       --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync prepare
>   $ sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
>       --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
>
> (As you can see, I disabled the write-cache, since my machine does not have
> clwb/clflushopt, and clflush appears to become a bottleneck otherwise.)
>
> The results are:
>
>                          events (avg/stddev)
>                          -------------------
>   base                   1263689.0000/11481.10
>   concurrent             1310123.5000/19205.79  (+3.6%)
>   concurrent + async     1326750.2500/24563.61  (+4.9%)
>
> So which version do you want me to submit? With or without the async part?

I think it would be best to submit it without the async part. You can always
submit that later.