From: Christoph Hellwig <h...@infradead.org>
Sent: Monday, May 16, 2022 12:35 AM
>
> I don't really understand how 'childs' fit in here.  The code also
> doesn't seem to be usable without patch 2 and a caller of the
> new functions added in patch 2, so it is rather impossible to review.
>
> Also:
>
> 1) why is SEV/TDX so different from other cases that need bounce
>    buffering to treat it different and we can't work on a general
>    scalability improvement
> 2) per previous discussions at how swiotlb itself works, it is
>    clear that another option is to just make pages we DMA to
>    shared with the hypervisor.  Why don't we try that at least
>    for larger I/O?
Tianyu already responded, but let me offer an expanded view. I have better
knowledge of AMD's SEV-SNP than of Intel's TDX, so my details might be off
for TDX.

Taking your question (2) first, two things must be done when guest memory
pages transition between the "shared with the hypervisor" and the "private
to the guest" states:

A) Some bookkeeping between the guest and host, which requires a hypercall.
   Doing a hypercall isn't super-fast, but for large I/Os it could be a
   reasonable tradeoff if we could avoid bounce buffer copying.

B) The contents of the memory buffer must transition between encrypted and
   not encrypted. The hardware doesn't provide any mechanism to do such a
   transition "in place". The only way to transition is for the CPU to copy
   the contents between an encrypted area and an unencrypted area of memory.

Because of (B), we're stuck needing bounce buffers. There's no way to avoid
them with the current hardware. Tianyu also pointed out not wanting to
expose uninitialized guest memory to the host, so, for example, sharing a
read buffer with the host requires that it first be initialized to zero.
(A rough sketch illustrating this copy-and-zero step is appended at the end
of this mail.)

For your question (1), I think we all would agree that SEV-SNP and TDX usage
of bounce buffers isn't fundamentally different from other uses -- they just
put a lot more load on the bounce buffering mechanism. If done well, general
swiotlb scalability improvements should be sufficient and are much preferred.

You made a recent comment about almost being done removing all knowledge of
swiotlb from drivers [1]. I agree with that goal. However, Tianyu's recent
patches for improving swiotlb scalability don't align with that goal. A
while back, you suggested using per-device swiotlb regions [2], and I think
Tianyu's patch sets have taken that approach. But getting scalability with
multi-channel devices requires going beyond the existing per-device swiotlb
regions, and that's leading us in the wrong direction.

We should reset and make sure we agree on the top-level approach:

1) We want general scalability improvements to swiotlb. These improvements
   should scale to high CPU counts (> 100) and across multiple NUMA nodes.

2) Drivers should not require any special knowledge of swiotlb to benefit
   from the improvements. No new swiotlb APIs should need to be used by
   drivers -- the swiotlb scalability improvements should be transparent.

3) The scalability improvements should not be based on device boundaries,
   since a single device may have multiple channels doing bounce buffering
   on multiple CPUs in parallel. (A rough sketch of what such a CPU-keyed
   internal layout might look like is also appended below.)

Anything else?

The patch from Andi Kleen [3] (not submitted upstream, but referenced by
Tianyu as the basis for his patches) seems like a good starting point for
meeting the top-level approach. Andi and Robin had some back-and-forth
about Andi's patch that I haven't delved into yet, but getting that worked
out seems like a better overall approach. I had an offline chat with
Tianyu, and he agrees as well.

Agree?  Disagree?

Michael

[1] https://lore.kernel.org/lkml/ymqonhkbt8fty...@infradead.org/
[2] https://lore.kernel.org/lkml/20220222080543.ga5...@lst.de/
[3] https://github.com/intel/tdx/commit/4529b578
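As a concrete illustration of point (B) and the zeroing requirement, here is
a rough sketch of the per-I/O copy. This is not the real swiotlb code path;
bounce_map() and shared_pool_alloc() are hypothetical names, and the pool is
assumed to consist of pages that were already transitioned to the shared
(decrypted) state up front, so no per-I/O hypercall (point A) is needed:

#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical: allocate a slot from the pre-shared (decrypted) pool. */
void *shared_pool_alloc(size_t len);

void *bounce_map(void *private_buf, size_t len, bool to_device)
{
	void *shared_slot = shared_pool_alloc(len);

	if (!shared_slot)
		return NULL;

	if (to_device) {
		/*
		 * Point (B): the CPU must copy the data; the hardware
		 * can't flip a buffer between encrypted and unencrypted
		 * contents "in place".
		 */
		memcpy(shared_slot, private_buf, len);
	} else {
		/*
		 * DMA read: zero the shared slot first so uninitialized
		 * guest memory is never exposed to the host.
		 */
		memset(shared_slot, 0, len);
	}
	return shared_slot;
}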
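And here is a rough sketch of the kind of internal layout that would satisfy
points 1-3 above: swiotlb picks a bounce-buffer area keyed on the executing
CPU (grouping areas by NUMA node would be a further refinement) rather than
on the device, so drivers keep using the ordinary DMA API unchanged. The
structure and names below are made up for illustration and are not Andi's
actual patch:

#include <linux/percpu.h>
#include <linux/spinlock.h>

struct swiotlb_area {
	spinlock_t lock;		/* protects slot allocation */
	unsigned long *slot_bitmap;	/* free/used bounce slots */
};

/* One area per possible CPU, allocated from that CPU's NUMA node at init. */
static struct swiotlb_area __percpu *swiotlb_areas;

static struct swiotlb_area *swiotlb_pick_area(void)
{
	/*
	 * Keyed on the current CPU rather than on the device, so multiple
	 * channels of one device doing bounce buffering on different CPUs
	 * in parallel land in different areas instead of contending on a
	 * single per-device lock.
	 */
	return this_cpu_ptr(swiotlb_areas);
}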