On 22.08.19 00:31, Richard Henderson wrote:
> On 8/21/19 2:33 PM, David Hildenbrand wrote:
>>> NOTDIRTY cannot fault at all. The associated rcu critical section is ugly
>>> enough to make me not want to do anything except continue to go through the
>>> regular MMIO path.
>>>
>>> In any case, so long as we eliminate *access* faults by probing the page
>>> table, then falling back to the byte-by-byte loop is, AFAICS, sufficient
>>> to implement the instructions correctly.
>>
>> "In any case, so long as we eliminate *access* faults by probing the
>> page table" - that's what I'm doing in this patch (and even more correct
>> in the prototype patch I shared), no? (besides the watchpoint madness below)
>
> Correct.
>
> My main objection to your current patch is that you perform the access checks
> within MVC, and then do some more tlb lookups in fast_memmove.
Right, however I think splitting up both steps is nicer (especially if we
mix and match memmove and memset in MVCLE later). Maybe we can combine a
tuned-up probe_write() with something similar to (but different from)
access_prepare().

General changes:

1. Check watchpoints in probe_write().
2. Make probe_write() bail out if the access would cross pages - the caller
   has to ensure it falls into a single page (TARGET_PAGE_SIZE?).
3. Return a pointer from probe_write() -> if NULL is returned, reads/writes
   have to go via memops, e.g., single-byte accesses.
4. Make probe_write() also return the host address for TLB entries that are
   immediately invalid again after tlb_fill() (-> LAP).

What I propose for s390x (a rough sketch follows further down):

1. access_prepare(): Collect the pointers using probe_write() - the maximum
   is two pages. If we get NULL, there is no fault, but we have to fall back
   to reads/writes using memops.
2. access_memset() / access_set() / access_move() ...: pass the access
   information we collected. The prototype I shared can be heavily
   simplified - don't fall back to paddrs but instead do single-byte
   accesses, don't allow a flexible array of pages, etc.
3. First convert MVC to the new access mechanism, later fix + convert the
   others.

> I think that fast_memmove is where the access checks should live. That allows
> incremental improvement to combine access checks + host address lookup, which
> cannot currently be done in one step with existing interfaces.
>
> I guess you would still want access checks within MVC for the case in which
> you must fall back to byte-by-byte because of destructive overlap.

That's why I think introducing a new set of helpers makes sense. Once all
users of fast_memmove() etc. are gone, we can throw them away.

>> "falling back to the byte-by-byte loop is, AFAICS, sufficient"
>>
>> I don't think this is sufficient. E.g., LAP protected pages
>> (PAGE_WRITE_INV which immediately requires a new MMU walk on the next
>> access) will trigger a new MMU walk on every byte access (that's why I
>> chose to pre-translate in my prototype).
>
> LAP protected pages is exactly why probe_write should return the host address,
> so that we can do the access check + host address lookup in one step.

Yes, this should also be done in tlb_vaddr_to_host() - but as we'll be
converting to a probe_write() thingy, it might not be required to handle the
special "TLB_INVALID_MASK set after tlb_fill()" case there.

> But in the meantime...
>
>> If another CPU modified the
>> page tables in between, we could suddenly get a fault - although we
>> checked early. What am I missing?
>
> You're concerned with a bare write to the page table by cpu B, while cpu A is
> executing, and before cpu B issues the cross-cpu tlb flush?

Not bare writes, those are invalid. Take IPTE for example: set the invalid
bit, then do a synchronous CPU flush. If we re-translate a vaddr after
probe_write() - which would currently happen in case of LAP - then we could
suddenly run into the invalid bit, triggering a fault. The synchronous CPU
flush still wasn't processed, but we see the "invalid" bit early because we
decided to re-translate after already modifying memory. This re-translation
really has to be avoided.

> The tlb victim cache should prevent having to re-read a tlb entry from memory,
> at least for MVC. The unlimited size we currently support for MVCL and MVCLE
> could act weird, but would be fixed by limiting the execution as discussed.

Yes, MVCL/MVCLE need a major overhaul.
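To make the proposal above a bit more concrete, here is a rough, untested
sketch of the direction I have in mind. Nothing of this exists yet: the
S390Access layout, the access_prepare()/access_memset() names, and the
probe_write() signature (returning the host address or NULL) are all just
the proposal; wrap_address() is the existing helper from mem_helper.c.

/*
 * Rough sketch only. Assumes probe_write() is changed to return the host
 * address of the probed range, or NULL if the access has to go via memops.
 */
typedef struct S390Access {
    target_ulong vaddr1, vaddr2;   /* vaddr2 only used if crossing a page */
    char *haddr1, *haddr2;         /* NULL -> fall back to memops */
    uint16_t size1, size2;
    int mmu_idx;
} S390Access;

static S390Access access_prepare(CPUS390XState *env, target_ulong vaddr,
                                 int size, int mmu_idx, uintptr_t ra)
{
    S390Access access = {
        .vaddr1 = vaddr,
        .size1 = MIN(size, -(vaddr | TARGET_PAGE_MASK)),
        .mmu_idx = mmu_idx,
    };

    /* probe_write() checks watchpoints and never crosses a page here */
    access.haddr1 = probe_write(env, vaddr, access.size1, mmu_idx, ra);
    if (access.size1 < size) {
        /* the access crosses into a second (and last) page */
        access.vaddr2 = wrap_address(env, vaddr + access.size1);
        access.size2 = size - access.size1;
        access.haddr2 = probe_write(env, access.vaddr2, access.size2,
                                    mmu_idx, ra);
    }
    return access;
}

static void access_memset(CPUS390XState *env, S390Access *desta,
                          uint8_t byte, uintptr_t ra)
{
    int i;

    if (likely(desta->haddr1)) {
        memset(desta->haddr1, byte, desta->size1);
    } else {
        /* no host address (e.g., MMIO/NOTDIRTY) -> single-byte memops */
        for (i = 0; i < desta->size1; i++) {
            cpu_stb_data_ra(env, desta->vaddr1 + i, byte, ra);
        }
    }
    /* ... same game for the second page, if any ... */
}

MVC would then do one access_prepare() per operand up front (no fault can
happen afterwards) and call access_move()/access_memset() on the collected
access information.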
But good to know that a second tlb_fill() will not result in the other TLB
entry suddenly getting invalid and requiring a new MMU translation
(-> victim cache). That's another concern I had.

> Honestly, the os has to make sure that the page remains valid until after the
> flush completes, otherwise it's an os bug. The cross-cpu tlb flush happens via
> async_run_on_cpu, and of course never occurs while we are executing a TB. Yet
> another reason to limit the amount of work any one instruction does. ;-)

In my example, the page would still be valid, however the invalid bit has
already been set. The next MMU translation would choke on that and trigger a
fault. I can only see this (re-translation) happening for LAP right now. And
LAP pages (lowcore) should always be mapped, so I don't think this is a very
important case to handle.

>> I see that we use BP_STOP_BEFORE_ACCESS for PER (Program Event
>> Recording) on s390x. I don't think that's correct. We want to get
>> notified after the values were changed.
>>
>> "A storage-alteration event occurs whenever a CPU,
>> by using a logical or virtual address, makes a store
>> access without an access exception to the storage
>> area designated by control registers 10 and 11. ..."
>>
>> "For a PER instruction-fetching nullification event, the
>> unit of operation is nullified. For other PER events,
>> the unit of operation is completed"
>>
>> Oh man, why is everything I take a look at broken.
>
> Heh.
>
>>> In the latter case, if the instruction has had any side effects prior to the
>>> longjmp, they will be re-done when we re-start the current instruction.
>>>
>>> To me this seems like a rather large bug in our implementation of watchpoints,
>>> as it only really works properly for simple load/store/load-op-store type
>>> instructions. Anything that works on many addresses and doesn't delay side
>>> effects until all accesses are complete will Do The Wrong Thing.
>>>
>>> The fix, AFAICS, is for probe_write to call check_watchpoint(), so that we
>>> take the debug exit early.
>>
>> Indeed. I see what you mean now. (I was ignoring the "before access"
>> because I was assuming we don't need it on s390x)
>>
>> probe_write() would have to check for all BP_STOP_BEFORE_ACCESS watchpoints.
>
> !BP_STOP_BEFORE_ACCESS watchpoints exit to the main loop as well, so that it
> can restart and then single-step the current instruction.
>
> We need the check in probe_write for all cases.

Thanks for the clarification!

>> Yes, that's what I mean, TARGET_PAGE_SIZE, but eventually crossing a
>> page boundary. The longer I stare at the MVCL code, the more broken it
>> is. There are more nice things buried in the PoP. MVCL does not detect
>> access exceptions beyond the next 2k. So we have to limit it there
>> differently.
>
> That language is indeed odd.
>
> The only reading of that paragraph that makes sense to me is that the hardware
> *must* interrupt MVCL after every 2k bytes processed. The idea that the user
> can magically write to a read-only page simply by providing length = 2MB and a
> page that is initially writable is dumb. I cannot imagine that is a correct
> reading.
>
> Getting clarification from an IBM engineer on that would be good; otherwise I
> would just ignore that and proceed as if all access checks are performed.
>
>> So what I understand is that
>>
>> - we should handle watchpoints in probe_write()
>> - not bypass IO memory (especially NOTDIRTY). We cannot always rely on
>>   getting access to a host page.
>
> Correct.
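Just so I have it written down, the watchpoint part of probe_write() is what
I picture roughly as below - purely illustrative, and whether it reuses
check_watchpoint() or exits via cpu_loop_exit_restore() as shown here is an
assumption on my side:

/*
 * Illustrative sketch: walk the CPU's watchpoints before the instruction
 * has modified anything, so that we take the debug exit early for both
 * BP_STOP_BEFORE_ACCESS and ordinary watchpoints.
 */
void *probe_write(CPUArchState *env, target_ulong addr, int size,
                  int mmu_idx, uintptr_t retaddr)
{
    CPUState *cs = env_cpu(env);
    CPUWatchpoint *wp;

    /* ... tlb_fill() / access-exception handling as today ... */

    QTAILQ_FOREACH(wp, &cs->watchpoints, entry) {
        if ((wp->flags & BP_MEM_WRITE) &&
            addr < wp->vaddr + wp->len && wp->vaddr < addr + size) {
            /*
             * Leave to the main loop here, before any byte was written,
             * so the instruction can be restarted / single-stepped.
             */
            cpu_loop_exit_restore(cs, retaddr);
        }
    }

    /* ... return the host address, or NULL if we must go via memops ... */
    return NULL;
}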
>
> r~

-- 
Thanks,

David / dhildenb