On Fri, Sep 10, 2021, Paolo Bonzini wrote:
> On 10/09/21 17:34, Sean Christopherson wrote:
> > > Yang explained to me (offlist) that this is needed because Windows fails
> > > to reboot without it.  We would need a way to ask Linux to reinitialize
> > > the vEPC, that doesn't involve munmap/mmap; this could be for example
> > > fallocate(FALLOC_FL_ZERO_RANGE).
> > >
> > > What do you all think?
> >
> > Mechanically, FALLOC_FL_ZERO_RANGE could work, but it's definitely a
> > stretch of the semantics as the underlying memory would not actually be
> > zeroed.
>
> The contents are not visible to anyone, so they might well be zero
> (not entirely serious, but also not entirely unserious).

Yeah, it wouldn't be a sticking point, just odd.
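For what it's worth, the userspace half of the fallocate() approach would be
close to a one-liner.  A minimal sketch, assuming the vEPC driver actually
grows a .fallocate hook that EREMOVEs the backing pages (vepc_reset() and its
parameters are made-up names, purely for illustration):

	#define _GNU_SOURCE
	#include <fcntl.h>	/* fallocate(), FALLOC_FL_ZERO_RANGE */
	#include <stdio.h>

	/* Reset one vEPC section by "zeroing" its entire backing range. */
	static int vepc_reset(int vepc_fd, off_t vepc_size)
	{
		if (fallocate(vepc_fd, FALLOC_FL_ZERO_RANGE, 0, vepc_size)) {
			perror("fallocate(FALLOC_FL_ZERO_RANGE)");
			return -1;
		}
		return 0;
	}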
> > The only other option that comes to mind is a dedicated ioctl().
>
> If it is not too restrictive to do it for the whole mapping at once,
> that would be fine.

Oooh, rats.  That reminds me of a complication.  If QEMU creates multiple EPC
sections, e.g. for a vNUMA setup, resetting each section individually will
fail if the guest did an unclean RESET and a given enclave has EPC pages from
multiple sections.  E.g. an SECS in vEPC[X] can have children in vEPC[0..N-1],
and all those children need to be removed before the SECS can be removed.
Yay SGX!

> If it makes sense to do it for a range, however,
> the effort of defining an ioctl() API is probably not worth it when
> fallocate() is available.
>
> Either way, I suppose it would be just something like
>
> 	/* for fallocate; otherwise just use xa_for_each */
> 	size = min_t(unsigned long, size, -start);
> 	end = (start + size - 1) >> PAGE_SHIFT;
> 	start >>= PAGE_SHIFT;
>
> 	/*
> 	 * Removing in two passes lets us remove SECS pages as well,
> 	 * since they can only be EREMOVE'd after all their child pages.
> 	 * An EREMOVE failure on the second pass means that the SECS
> 	 * page still has children on another instance.  Since it is
> 	 * still in the xarray, bury our head in the sand and ignore
> 	 * it; sgx_vepc_release() will deal with it.
> 	 */
> 	LIST_HEAD(secs_pages);
> 	xa_for_each_range(&vepc->page_array, index, epc_page, start, end) {
> 		if (sgx_vepc_free_page(epc_page))
> 			list_add_tail(&epc_page->list, &secs_pages);
> 	}
>
> 	list_for_each_entry_safe(epc_page, tmp, &secs_pages, list) {
> 		list_del(&epc_page->list);
> 		sgx_vepc_free_page(epc_page);
> 	}
>
> So another advantage of the ioctl would be e.g. being able to return the
> number of successfully EREMOVEd pages.  But since QEMU would not do
> anything with the return value and _also_ bury its head in the sand,
> that would only be useful if you guys have other uses in mind.

There are two options: 1) QEMU has to handle "failure", or 2) the kernel
provides an ioctl() that takes multiple vEPC fds and handles the SECS
dependencies.

#1 is probably the least awful option.  For #2, in addition to the kernel
having to deal with multiple fds, it would also need a second list_head object
in each page so that it could track which pages failed to be removed.  Using
the existing list_head would work for now, but it won't work if/when an EPC
cgroup is added.

Note, for #1, QEMU would have to potentially do three passes.

  1. Remove child pages for a given vEPC.
  2. Remove SECS pages for a given vEPC that were pinned by children in the
     same vEPC.
  3. Remove SECS pages for all vEPC that were pinned by children in a
     different vEPC.
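In case a sketch helps, the dumbest possible QEMU-side loop covering all three
passes might look like the below.  Untested, the names (vepc_reset_all, etc.)
are invented here, and it assumes the per-section reset, whether fallocate()
or an ioctl(), keeps returning an error as long as a SECS still has children
in another section:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdbool.h>

	/*
	 * Sweep every vEPC section until one full sweep succeeds.  The per-fd
	 * reset already handles child/SECS ordering within a section (passes
	 * 1 and 2); the second outer sweep mops up SECS pages whose children
	 * lived in a different section (pass 3).  Two sweeps should suffice
	 * since non-SECS pages have no children of their own.
	 */
	static int vepc_reset_all(const int *fds, const off_t *sizes, int nr)
	{
		int pass, i;

		for (pass = 0; pass < 2; pass++) {
			bool failed = false;

			for (i = 0; i < nr; i++) {
				if (fallocate(fds[i], FALLOC_FL_ZERO_RANGE,
					      0, sizes[i]))
					failed = true;
			}
			if (!failed)
				return 0;
		}
		return -1;
	}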