On 12/10/20 8:56 AM, xuxiaoyang (C) wrote:


On 2020/12/9 22:42, Eric Farman wrote:


On 12/9/20 6:54 AM, Cornelia Huck wrote:
On Tue, 8 Dec 2020 21:55:53 +0800
"xuxiaoyang (C)" <xuxiaoya...@huawei.com> wrote:

On 2020/11/21 15:58, xuxiaoyang (C) wrote:
vfio_pin_pages() accepts an array of unrelated iova pfns and processes
each to return the physical pfn.  When dealing with large arrays of
contiguous iovas, vfio_iommu_type1_pin_pages is very inefficient because
it processes them page by page.  In this case, we can divide the iova
pfn array into multiple contiguous ranges and optimize each range as a
unit.  For example, when the iova pfn array is {1,5,6,7,9}, it is
divided into three groups {1}, {5,6,7}, {9} for processing.  When
processing {5,6,7}, the number of calls to pin_user_pages_remote() is
reduced from three to one.  For a single page or a large array of
discontiguous iovas, we still use vfio_pin_page_external() so that
those cases do not regress from the refactoring.

Signed-off-by: Xiaoyang Xu <xuxiaoya...@huawei.com>

(...)
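
To illustrate, here is a minimal user-space sketch of the range-splitting
idea (not the patch itself; split_into_ranges() is just an illustrative
name):

    #include <stdio.h>

    /*
     * Sketch of the optimization described above: walk the iova pfn
     * array and carve it into runs of consecutive pfns, so that each
     * run could be pinned with one call to pin_user_pages_remote()
     * instead of one call per page.
     */
    static void split_into_ranges(const unsigned long *pfn, int npage)
    {
            int start, i;

            for (start = 0; start < npage; start = i) {
                    /* Grow the run while the next pfn is consecutive. */
                    for (i = start + 1; i < npage; i++)
                            if (pfn[i] != pfn[i - 1] + 1)
                                    break;
                    printf("range: start pfn %lu, %d page(s)\n",
                           pfn[start], i - start);
            }
    }

    int main(void)
    {
            /* The example from the commit message: {1,5,6,7,9}. */
            unsigned long pfn[] = { 1, 5, 6, 7, 9 };

            split_into_ranges(pfn, 5);      /* -> {1}, {5,6,7}, {9} */
            return 0;
    }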


hi Cornelia Huck, Eric Farman, Zhenyu Wang, Zhi Wang

vfio_pin_pages() accepts an array of unrelated iova pfns and processes
each to return the physical pfn.  When dealing with large arrays of
contiguous iovas, vfio_iommu_type1_pin_pages is very inefficient because
it processes them page by page.  In this case, we can divide the iova
pfn array into multiple contiguous ranges and optimize them.  I have a
set of performance test data for reference.

Without the patch:
                   1 page           512 pages
no huge pages:     1638ns           223651ns
THP:               1668ns           222330ns
HugeTLB:           1526ns           208151ns

With the patch:
                   1 page           512 pages
no huge pages:     1735ns           167286ns
THP:               1934ns           126900ns
HugeTLB:           1713ns           102188ns

As Alex Williamson said, this patch still lacks proof that it helps in
the real world.  I think you will have some valuable opinions on this.

Looking at this from the vfio-ccw angle, I'm not sure how much this
would buy us, as we deal with IDAWs, which are designed so that they
can be non-contiguous. I guess this depends a lot on what the guest
does.

This would be my concern too, but I don't have data off the top of my head to 
say one way or another...


Eric, any opinion? Do you maybe also happen to have a test setup that
mimics workloads actually seen in the real world?


...I do have some test setups, which I will try to get some data from in a
couple of days. At the moment I've broken most of those setups trying to
implement some other stuff, and can't easily revert back. Will get back to
this.

Eric

Thank you for your reply. Looking forward to your test data.

Xu,

The scenario I ran was a host kernel 5.10.0-rc7 with qemu 5.2.0, with a Fedora 32 guest with 4 VCPUs and 4GB of memory. I tried this a handful of times across a couple of different hosts, so the likelihood that these numbers are outliers is pretty low.

The histograms below come from a simple bpftrace script, recording the number of pages asked to be pinned, and the length of time (in nanoseconds) it took to pin all those pages. I separated out the length of time for a request of one page versus a request of multiple pages, because, as you will see, the former far outnumbers the latter.
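
Roughly, the instrumentation looks like this (a minimal sketch, not the
exact script; it assumes npage is the fourth argument to
vfio_iommu_type1_pin_pages() and that the symbol is visible to kprobes):

    // Time each vfio_iommu_type1_pin_pages() call and bucket the
    // durations by whether the request was for one page or several.
    kprobe:vfio_iommu_type1_pin_pages
    {
        @npage = hist(arg3);        // histogram of requested page counts
        @start[tid] = nsecs;
        @n[tid] = arg3;
    }

    kretprobe:vfio_iommu_type1_pin_pages
    /@start[tid]/
    {
        $ns = nsecs - @start[tid];
        if (@n[tid] == 1) {
            @1_page_ns = hist($ns);  // single-page requests
        } else {
            @n_pages_ns = hist($ns); // multi-page requests
        }
        delete(@start[tid]);
        delete(@n[tid]);
    }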

The first thing I tried was simply to boot the guest via vfio-ccw, to see how the patch itself behaved:

@1_page_ns      BASE            +PATCH
256, 512        12531   42.50%  12744   42.26%
512, 1K         5660    19.20%  5611    18.61%
1K, 2K          8416    28.54%  8947    29.67%
2K, 4K          2694    9.14%   2669    8.85%
4K, 8K          164     0.56%   169     0.56%
8K, 16K         14      0.05%   14      0.05%
16K, 32K        2       0.01%   3       0.01%
32K, 64K        0       0.00%   0       0.00%
64K, 128K       0       0.00%   0       0.00%

@n_pages_ns     BASE            +PATCH
256, 512        0       0.00%   0       0.00%
512, 1K         67      0.97%   48      0.68%
1K, 2K          1598    23.13%  1036    14.71%
2K, 4K          2784    40.30%  3112    44.17%
4K, 8K          1288    18.64%  1579    22.41%
8K, 16K         1011    14.63%  1032    14.65%
16K, 32K        159     2.30%   234     3.32%
32K, 64K        1       0.01%   2       0.03%
64K, 128K       0       0.00%   2       0.03%

@npage          BASE            +PATCH
1               29484   81.02%  30157   81.06%
2, 4            3298    9.06%   3385    9.10%
4, 8            1011    2.78%   1029    2.77%
8, 16           2600    7.14%   2631    7.07%


The second thing I tried was simply fio, running it for about 10 minutes with a few minutes each for sequential read, sequential write, random read, and random write. (I tried this with both the guest booted off vfio-ccw and virtio-blk, but the difference was negligible.) The results in this space are similar as well:

@1_page_ns      BASE            +PATCH
256, 512        5648104 66.79%  6615878 66.75%
512, 1K         1784047 21.10%  2082852 21.01%
1K, 2K          648877  7.67%   771964  7.79%
2K, 4K          339551  4.01%   396381  4.00%
4K, 8K          32513   0.38%   40359   0.41%
8K, 16K         2602    0.03%   2884    0.03%
16K, 32K        758     0.01%   762     0.01%
32K, 64K        434     0.01%   352     0.00%

@n_pages_ns     BASE            +PATCH
256, 512        0       0.00%   0       0.00%
512, 1K         470803  12.18%  360524  7.95%
1K, 2K          1305166 33.75%  1739183 38.37%
2K, 4K          1338277 34.61%  1471161 32.46%
4K, 8K          733480  18.97%  937341  20.68%
8K, 16K         16954   0.44%   20708   0.46%
16K, 32K        1278    0.03%   2197    0.05%
32K, 64K        707     0.02%   703     0.02%

@npage          BASE            +PATCH
1               8457107 68.62%  9911624 68.62%
2, 4            2066957 16.77%  2446462 16.94%
4, 8            359989  2.92%   417188  2.89%
8, 16           1440006 11.68%  1668482 11.55%


I tried a smattering of other tests that might be more realistic, but the results were all pretty similar so there's no point in appending them here. Across the board, the amount of time spent on a multi-page request grows with the supplied patch. It doesn't get me very excited.

If you are wondering why this might be, Conny's initial take about IDAWs being non-contiguous by design is spot on. Let's observe the page counts given to vfio_iommu_type1_pin_contiguous_pages() in addition to the counts in vfio_iommu_type1_pin_pages(). The following is an example of one guest boot PLUS an fio run:

vfio_iommu_type1_pin_pages npage:
1       9890332         68.64%
2, 4    2438213         16.92%
4, 8    416278          2.89%
8, 16   1663201         11.54%
Total   14408024        

vfio_iommu_type1_pin_contiguous_pages npage:
1       16384925        86.89%
2, 4    1327548         7.04%
4, 8    727564          3.86%
8, 16   417182          2.21%
Total   18857219        

Yup... 87% of the calls to vfio_iommu_type1_pin_contiguous_pages() are made for just a single page.

Happy to provide more data if desired, but it doesn't look like a benefit to vfio-ccw's use.

Thanks,
Eric



Regards,
Xu
