On 21/04/17 13:42, Alexey Kardashevskiy wrote:
> On 21/04/17 05:16, gowrishankar muthukrishnan wrote:
>> On Thursday 20 April 2017 07:52 PM, Alexey Kardashevskiy wrote:
>>> On 20/04/17 23:25, Alexey Kardashevskiy wrote:
>>>> On 20/04/17 19:04, Jonas Pfefferle1 wrote:
>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote on 20/04/2017 09:24:02:
>>>>>
>>>>>> From: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>> To: dev@dpdk.org
>>>>>> Cc: Alexey Kardashevskiy <a...@ozlabs.ru>, j...@zurich.ibm.com,
>>>>>> Gowrishankar Muthukrishnan <gowrishanka...@in.ibm.com>
>>>>>> Date: 20/04/2017 09:24
>>>>>> Subject: [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus
>>>>>> addresses for DMA map
>>>>>>
>>>>>> The VFIO_IOMMU_SPAPR_TCE_CREATE ioctl() returns the actual bus address
>>>>>> of the just-created DMA window. It happens to start from zero because
>>>>>> the default window is removed (leaving no windows) and the new window
>>>>>> starts from zero. However, this is not guaranteed and the new window
>>>>>> may start at another address, so this adds an error check.
>>>>>>
>>>>>> Another issue is that the IOVA passed to VFIO_IOMMU_MAP_DMA should be
>>>>>> a PCI bus address, while in this case the physical address of a user
>>>>>> page is used. This changes the IOVA to start from zero, in the hope
>>>>>> that the rest of DPDK expects this.
>>>>>
>>>>> This is not the case. DPDK expects a 1:1 mapping PA==IOVA. It will use
>>>>> the phys_addr of the memory segment it got from /proc/self/pagemap,
>>>>> cf. librte_eal/linuxapp/eal/eal_memory.c. We could try setting it here
>>>>> to the actual iova, which basically makes the whole virtual-to-physical
>>>>> mapping via pagemap unnecessary - I believe that should be the case for
>>>>> VFIO anyway. Pagemap should only be needed when using pci_uio.
>>>>
>>>> Ah, ok, makes sense now. But it sure needs a big fat comment there, as
>>>> it is not obvious why a host RAM address is used when the DMA window
>>>> start is not guaranteed.
>>>
>>> Well, either way there is some bug - ms[i].phys_addr and ms[i].addr_64
>>> both have the exact same value; in my setup it is 3fffb33c0000, which is
>>> a userspace address - at least ms[i].phys_addr must be a physical
>>> address.
>>
>> This patch breaks i40e_dev_init() on my server.
>>
>> EAL: PCI device 0004:01:00.0 on NUMA socket 1
>> EAL: probe driver: 8086:1583 net_i40e
>> EAL: using IOMMU type 7 (sPAPR)
>> eth_i40e_dev_init(): Failed to init adminq: -32
>> EAL: Releasing pci mapped resource for 0004:01:00.0
>> EAL: Calling pci_unmap_resource for 0004:01:00.0 at 0x3fff82aa0000
>> EAL: Requested device 0004:01:00.0 cannot be used
>> EAL: PCI device 0004:01:00.1 on NUMA socket 1
>> EAL: probe driver: 8086:1583 net_i40e
>> EAL: using IOMMU type 7 (sPAPR)
>> eth_i40e_dev_init(): Failed to init adminq: -32
>> EAL: Releasing pci mapped resource for 0004:01:00.1
>> EAL: Calling pci_unmap_resource for 0004:01:00.1 at 0x3fff82aa0000
>> EAL: Requested device 0004:01:00.1 cannot be used
>> EAL: No probed ethernet devices
>>
>> I have two memsegs, each of 1G size. Their mapped PA and VA are also
>> different.
>>
>> (gdb) p /x ms[0]
>> $3 = {phys_addr = 0x1e0b000000, {addr = 0x3effaf000000,
>>   addr_64 = 0x3effaf000000}, len = 0x40000000, hugepage_sz = 0x1000000,
>>   socket_id = 0x1, nchannel = 0x0, nrank = 0x0}
>> (gdb) p /x ms[1]
>> $4 = {phys_addr = 0xf6d000000, {addr = 0x3efbaf000000,
>>   addr_64 = 0x3efbaf000000}, len = 0x40000000, hugepage_sz = 0x1000000,
>>   socket_id = 0x0, nchannel = 0x0, nrank = 0x0}
>>
>> Could you please recheck this.
>> Maybe reset dma_map.iova by this offset only if the new DMA window does
>> not start from bus address 0?
>
> As we figured out, it is the effect of --no-huge.
>
> Another thing - as I read the code, the window size comes from
> rte_eal_get_physmem_size(). On my 512GB machine, DPDK allocates only a 16GB
> window, so it is far away from the 1:1 mapping which is believed to be DPDK's
> expectation. Looking now for a better version of rte_eal_get_physmem_size()...

I have not found any helper to get the total RAM size, or one to round up to
a power of two. I could look through the memory segments, find the one with
the highest ending physical address, round that up to a power of two (a
requirement for a DMA window size on the POWER8 platform) and use it as the
DMA window size - is there an analog of the kernel's order_base_2()?
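
Something like the sketch below is what I have in mind - untested, and
assuming rte_align64pow2() from rte_common.h is an acceptable userspace
stand-in for order_base_2(); the helper name is made up:

#include <stdint.h>
#include <rte_common.h>
#include <rte_memory.h>

/* Sketch: derive the sPAPR DMA window size from the memseg layout by
 * taking the highest ending physical address and rounding it up to a
 * power of two, as the POWER8 IOMMU requires for window sizes.
 */
static uint64_t
spapr_dma_win_size(void)
{
	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
	uint64_t high = 0;
	int i;

	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
		if (ms[i].addr == NULL)
			break;
		if (ms[i].phys_addr + ms[i].len > high)
			high = ms[i].phys_addr + ms[i].len;
	}

	/* round up to the next power of two (no-op if already one) */
	return rte_align64pow2(high);
}

The result could then presumably be used for the window size in
vfio_spapr_dma_map() instead of the rte_eal_get_physmem_size() value.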

>
> And another problem - after a few unsuccessful starts of app/testpmd, all
> huge pages are gone:
>
> aik@stratton2:~$ cat /proc/meminfo
> MemTotal: 535527296 kB
> MemFree: 516662272 kB
> MemAvailable: 515501696 kB
> ...
> HugePages_Total: 1024
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 16384 kB
>
> How is that possible? What is pinning these pages so that testpmd process
> exit does not clean them up?

Still not clear - any ideas on what might be causing this?

Btw, what is the correct way of running DPDK with hugepages? I basically
create a folder in ~aik/hugepages and do:

sudo mount -t hugetlbfs hugetlbfs ~aik/hugepages
sudo sysctl vm.nr_hugepages=4096

This creates a bunch of pages:

aik@stratton2:~$ cat /proc/meminfo | grep HugePage
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 4096
HugePages_Free: 4096
HugePages_Rsvd: 0
HugePages_Surp: 0

Then I watch testpmd detect the hugepages (it does see 4096 16MB pages) and
allocate them: rte_eal_hugepage_init() calls map_all_hugepages(... orig=1) -
here all 4096 pages get allocated - and then it calls
map_all_hugepages(... orig=0), where I get lots of "EAL: Cannot get a
virtual area: Cannot allocate memory", for the obvious reason that all pages
are already allocated. Since you folks have this tested somehow - what am I
doing wrong? :) This is all very confusing - what is this orig=0/1 business
all about?

>
>>
>> Thanks,
>> Gowrishankar
>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>> ---
>>>>>>  lib/librte_eal/linuxapp/eal/eal_vfio.c | 12 ++++++++++--
>>>>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> index 46f951f4d..8b8e75c4f 100644
>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> @@ -658,7 +658,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>  {
>>>>>>  	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
>>>>>>  	int i, ret;
>>>>>> -
>>>>>> +	phys_addr_t io_offset;
>>>>>>  	struct vfio_iommu_spapr_register_memory reg = {
>>>>>>  		.argsz = sizeof(reg),
>>>>>>  		.flags = 0
>>>>>> @@ -702,6 +702,13 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>  		return -1;
>>>>>>  	}
>>>>>>
>>>>>> +	io_offset = create.start_addr;
>>>>>> +	if (io_offset) {
>>>>>> +		RTE_LOG(ERR, EAL, "  DMA offsets other than zero is not supported, "
>>>>>> +				"new window is created at %lx\n", io_offset);
>>>>>> +		return -1;
>>>>>> +	}
>>>>>> +
>>>>>>  	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
>>>>>>  	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
>>>>>>  		struct vfio_iommu_type1_dma_map dma_map;
>>>>>> @@ -723,7 +730,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>  		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>  		dma_map.vaddr = ms[i].addr_64;
>>>>>>  		dma_map.size = ms[i].len;
>>>>>> -		dma_map.iova = ms[i].phys_addr;
>>>>>> +		dma_map.iova = io_offset;
>>>>>>  		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
>>>>>>  				VFIO_DMA_MAP_FLAG_WRITE;
>>>>>>
>>>>>> @@ -735,6 +742,7 @@ vfio_spapr_dma_map(int vfio_container_fd)
>>>>>>  			return -1;
>>>>>>  		}
>>>>>> +		io_offset += dma_map.size;
>>>>>>  	}
>>>>>>
>>>>>>  	return 0;
>>>>>> --
>>>>>> 2.11.0
>>>>>>
>>>>
>>>
>>
>

-- 
Alexey