Re: [dpdk-dev] [PATCH] vfio: fix sPAPR IOMMU DMA window size
Alexey Kardashevskiy wrote on 08/08/2017 09:38:00 AM: > From: Alexey Kardashevskiy > To: Jonas Pfefferle , anatoly.bura...@intel.com > Cc: dev@dpdk.org > Date: 08/08/2017 09:38 AM > Subject: Re: [PATCH] vfio: fix sPAPR IOMMU DMA window size > > On 08/08/17 01:11, Jonas Pfefferle wrote: > > DMA window size needs to be big enough to span all memory segment's > > physical addresses. We do not need multiple levels of IOMMU tables > > as we already span ~70TB of physical memory with 16MB hugepages. > > > > Signed-off-by: Jonas Pfefferle > > --- > > lib/librte_eal/linuxapp/eal/eal_vfio.c | 25 ++--- > > 1 file changed, 22 insertions(+), 3 deletions(-) > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_vfio.c > > index 946df7e..8502216 100644 > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > @@ -722,6 +722,18 @@ vfio_type1_dma_map(int vfio_container_fd) > > return 0; > > } > > > > +static uint64_t > > +roundup_next_pow2(uint64_t n) > > +{ > > + uint32_t i; > > + > > + n--; > > + for (i = 1; i < sizeof(n) * CHAR_BIT; i += i) > > + n |= n >> i; > > + > > + return ++n; > > +} > > + > > wow :) > > QEMU does it using __builtin_ctzll() (used below for the page_shift) > without a loop: > > https://git.qemu.org/gitweb.cgi?p=qemu.git;a=blob;f=include/qemu/ > host- > utils.h;h=95cf4f4163e50457cdf808263065ca5ef3f935da;hb=f22ab6cb0c47bd2a2785b7d58130949bd7d8d9af#l382 > > > Anyway, seems working. Ok let me fix that :) > > > Reviewed-by: Alexey Kardashevskiy > > > > > > static int > > vfio_spapr_dma_map(int vfio_container_fd) > > { > > @@ -759,10 +771,12 @@ vfio_spapr_dma_map(int vfio_container_fd) > >return -1; > > } > > > > - /* calculate window size based on number of hugepages configured */ > > - create.window_size = rte_eal_get_physmem_size(); > > + /* physicaly pages are sorted descending i.e. ms[0].phys_addr is max */ > > + /* create DMA window from 0 to max(phys_addr + len) */ > > + /* sPAPR requires window size to be a power of 2 */ > > + create.window_size = roundup_next_pow2(ms[0].phys_addr + ms [0].len); > > create.page_shift = __builtin_ctzll(ms->hugepage_sz); > > - create.levels = 2; > > + create.levels = 1; > > > > ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); > > if (ret) { > > @@ -771,6 +785,11 @@ vfio_spapr_dma_map(int vfio_container_fd) > >return -1; > > } > > > > + if (create.start_addr != 0) { > > + RTE_LOG(ERR, EAL, " DMA window start address != 0\n"); > > + return -1; > > + } > > + > > /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ > > for (i = 0; i < RTE_MAX_MEMSEG; i++) { > >struct vfio_iommu_type1_dma_map dma_map; > > > > > -- > Alexey >
Re: [dpdk-dev] [PATCH] vfio: fix sPAPR IOMMU DMA window size
"Ananyev, Konstantin" wrote on 08/08/2017 10:27:28 AM: > From: "Ananyev, Konstantin" > To: Alexey Kardashevskiy , Jonas Pfefferle > , "Burakov, Anatoly" > Cc: "dev@dpdk.org" > Date: 08/08/2017 10:27 AM > Subject: RE: [dpdk-dev] [PATCH] vfio: fix sPAPR IOMMU DMA window size > > > > > -Original Message- > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Alexey Kardashevskiy > > Sent: Tuesday, August 8, 2017 10:38 AM > > To: Jonas Pfefferle ; Burakov, Anatoly > > > Cc: dev@dpdk.org > > Subject: Re: [dpdk-dev] [PATCH] vfio: fix sPAPR IOMMU DMA window size > > > > On 08/08/17 01:11, Jonas Pfefferle wrote: > > > DMA window size needs to be big enough to span all memory segment's > > > physical addresses. We do not need multiple levels of IOMMU tables > > > as we already span ~70TB of physical memory with 16MB hugepages. > > > > > > Signed-off-by: Jonas Pfefferle > > > --- > > > lib/librte_eal/linuxapp/eal/eal_vfio.c | 25 + +--- > > > 1 file changed, 22 insertions(+), 3 deletions(-) > > > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_vfio.c > > > index 946df7e..8502216 100644 > > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > > @@ -722,6 +722,18 @@ vfio_type1_dma_map(int vfio_container_fd) > > > return 0; > > > } > > > > > > +static uint64_t > > > +roundup_next_pow2(uint64_t n) > > > +{ > > > + uint32_t i; > > > + > > > + n--; > > > + for (i = 1; i < sizeof(n) * CHAR_BIT; i += i) > > > + n |= n >> i; > > > + > > > + return ++n; > > > +} > > > + > > > > wow :) > > > > QEMU does it using __builtin_ctzll() (used below for the page_shift) > > without a loop: > > > > https://git.qemu.org/gitweb.cgi?p=qemu.git;a=blob;f=include/qemu/host- > > > utils.h;h=95cf4f4163e50457cdf808263065ca5ef3f935da;hb=f22ab6cb0c47bd2a2785b7d58130949bd7d8d9af#l382 > > > > > > Anyway, seems working. > > As I remember, there already exists rte_align64pow2(). > Konstantin Thanks. Fixed it. > > > > > > > Reviewed-by: Alexey Kardashevskiy > > > > > > > > > > > static int > > > vfio_spapr_dma_map(int vfio_container_fd) > > > { > > > @@ -759,10 +771,12 @@ vfio_spapr_dma_map(int vfio_container_fd) > > >return -1; > > > } > > > > > > - /* calculate window size based on number of hugepages configured */ > > > - create.window_size = rte_eal_get_physmem_size(); > > > + /* physicaly pages are sorted descending i.e. ms > [0].phys_addr is max */ > > > + /* create DMA window from 0 to max(phys_addr + len) */ > > > + /* sPAPR requires window size to be a power of 2 */ > > > + create.window_size = roundup_next_pow2(ms[0].phys_addr + ms [0].len); > > > create.page_shift = __builtin_ctzll(ms->hugepage_sz); > > > - create.levels = 2; > > > + create.levels = 1; > > > > > > ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); > > > if (ret) { > > > @@ -771,6 +785,11 @@ vfio_spapr_dma_map(int vfio_container_fd) > > >return -1; > > > } > > > > > > + if (create.start_addr != 0) { > > > + RTE_LOG(ERR, EAL, " DMA window start address != 0\n"); > > > + return -1; > > > + } > > > + > > > /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ > > > for (i = 0; i < RTE_MAX_MEMSEG; i++) { > > >struct vfio_iommu_type1_dma_map dma_map; > > > > > > > > > -- > > Alexey
Re: [dpdk-dev] [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
"Burakov, Anatoly" wrote on 08/08/2017 11:15:24 AM: > From: "Burakov, Anatoly" > To: Jonas Pfefferle > Cc: "dev@dpdk.org" , "a...@ozlabs.ru" > Date: 08/08/2017 11:18 AM > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size > > From: Jonas Pfefferle [mailto:j...@zurich.ibm.com] > > Sent: Tuesday, August 8, 2017 9:41 AM > > To: Burakov, Anatoly > > Cc: dev@dpdk.org; a...@ozlabs.ru; Jonas Pfefferle > > Subject: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size > > > > DMA window size needs to be big enough to span all memory segment's > > physical addresses. We do not need multiple levels of IOMMU tables > > as we already span ~70TB of physical memory with 16MB hugepages. > > > > Signed-off-by: Jonas Pfefferle > > --- > > v2: > > * roundup to next power 2 function without loop. > > > > v3: > > * Replace roundup_next_pow2 with rte_align64pow2 > > > > lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++--- > > 1 file changed, 10 insertions(+), 3 deletions(-) > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > index 946df7e..550c41c 100644 > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > @@ -759,10 +759,12 @@ vfio_spapr_dma_map(int vfio_container_fd) > >return -1; > > } > > > > - /* calculate window size based on number of hugepages configured > > */ > > - create.window_size = rte_eal_get_physmem_size(); > > + /* physicaly pages are sorted descending i.e. ms[0].phys_addr is max > > */ > > Do we always expect that to be the case in the future? Maybe it > would be safer to walk the memsegs list. > > Thanks, > Anatoly I had this loop in before but removed it in favor of simplicity. If we believe that the ordering is going to change in the future I'm happy to bring back the loop. Is there other code which is relying on the fact that the memsegs are sorted by their physical addresses? > > > + /* create DMA window from 0 to max(phys_addr + len) */ > > + /* sPAPR requires window size to be a power of 2 */ > > + create.window_size = rte_align64pow2(ms[0].phys_addr + > > ms[0].len); > > create.page_shift = __builtin_ctzll(ms->hugepage_sz); > > - create.levels = 2; > > + create.levels = 1; > > > > ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, > > &create); > > if (ret) { > > @@ -771,6 +773,11 @@ vfio_spapr_dma_map(int vfio_container_fd) > >return -1; > > } > > > > + if (create.start_addr != 0) { > > + RTE_LOG(ERR, EAL, " DMA window start address != 0\n"); > > + return -1; > > + } > > + > > /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ > > for (i = 0; i < RTE_MAX_MEMSEG; i++) { > >struct vfio_iommu_type1_dma_map dma_map; > > -- > > 2.7.4 >
Re: [dpdk-dev] [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
"Burakov, Anatoly" wrote on 08/08/2017 11:43:43 AM: > From: "Burakov, Anatoly" > To: Jonas Pfefferle1 > Cc: "a...@ozlabs.ru" , "dev@dpdk.org" > Date: 08/08/2017 11:43 AM > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size > > > From: Jonas Pfefferle1 [mailto:j...@zurich.ibm.com] > > Sent: Tuesday, August 8, 2017 10:30 AM > > To: Burakov, Anatoly > > Cc: a...@ozlabs.ru; dev@dpdk.org > > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size > > > > "Burakov, Anatoly" wrote on 08/08/2017 > > 11:15:24 AM: > > > > > From: "Burakov, Anatoly" > > > To: Jonas Pfefferle > > > Cc: "dev@dpdk.org" , "a...@ozlabs.ru" > > > Date: 08/08/2017 11:18 AM > > > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size > > > > > > From: Jonas Pfefferle [mailto:j...@zurich.ibm.com] > > > > Sent: Tuesday, August 8, 2017 9:41 AM > > > > To: Burakov, Anatoly > > > > Cc: dev@dpdk.org; a...@ozlabs.ru; Jonas Pfefferle > > > > Subject: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size > > > > > > > > DMA window size needs to be big enough to span all memory segment's > > > > physical addresses. We do not need multiple levels of IOMMU tables > > > > as we already span ~70TB of physical memory with 16MB hugepages. > > > > > > > > Signed-off-by: Jonas Pfefferle > > > > --- > > > > v2: > > > > * roundup to next power 2 function without loop. > > > > > > > > v3: > > > > * Replace roundup_next_pow2 with rte_align64pow2 > > > > > > > > lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++--- > > > > 1 file changed, 10 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > > > b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > > > index 946df7e..550c41c 100644 > > > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > > > @@ -759,10 +759,12 @@ vfio_spapr_dma_map(int vfio_container_fd) > > > > return -1; > > > > } > > > > > > > > - /* calculate window size based on number of hugepages configured > > > > */ > > > > - create.window_size = rte_eal_get_physmem_size(); > > > > + /* physicaly pages are sorted descending i.e. ms[0].phys_addr is max > > > > */ > > > > > > Do we always expect that to be the case in the future? Maybe it > > > would be safer to walk the memsegs list. > > > > > > Thanks, > > > Anatoly > > > > I had this loop in before but removed it in favor of simplicity. > > If we believe that the ordering is going to change in the future > > I'm happy to bring back the loop. Is there other code which is > > relying on the fact that the memsegs are sorted by their physical > > addresses? > > I don't think there is. In any case, I think making assumptions > about particulars of memseg organization is not a very good practice. > > I seem to recall us doing similar things in other places, so maybe > down the line we could introduce a new API (or internal-only) > function to get a memseg with min/max address. For now I think a > loop will do. Ok. Makes sense to me. Let me resubmit a new version with the loop. > > > > > > > > > > + /* create DMA window from 0 to max(phys_addr + len) */ > > > > + /* sPAPR requires window size to be a power of 2 */ > > > > + create.window_size = rte_align64pow2(ms[0].phys_addr + > > > > ms[0].len); > > > > create.page_shift = __builtin_ctzll(ms->hugepage_sz); > > > > - create.levels = 2; > > > > + create.levels = 1; > > > > > > > > ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, > > > > &create); > > > > if (ret) { > > > > @@ -771,6 +773,11 @@ vfio_spapr_dma_map(int vfio_container_fd) > > > > return -1; > > > > } > > > > > > > > + if (create.start_addr != 0) { > > > > + RTE_LOG(ERR, EAL, " DMA window start address != 0\n"); > > > > + return -1; > > > > + } > > > > + > > > > /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ > > > > for (i = 0; i < RTE_MAX_MEMSEG; i++) { > > > > struct vfio_iommu_type1_dma_map dma_map; > > > > -- > > > > 2.7.4 > > > >
Re: [dpdk-dev] [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus addresses for DMA map
Alexey Kardashevskiy wrote on 20/04/2017 09:24:02: > From: Alexey Kardashevskiy > To: dev@dpdk.org > Cc: Alexey Kardashevskiy , j...@zurich.ibm.com, > Gowrishankar Muthukrishnan > Date: 20/04/2017 09:24 > Subject: [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus > addresses for DMA map > > VFIO_IOMMU_SPAPR_TCE_CREATE ioctl() returns the actual bus address for > just created DMA window. It happens to start from zero because the default > window is removed (leaving no windows) and new window starts from zero. > However this is not guaranteed and the new window may start from another > address, this adds an error check. > > Another issue is that IOVA passed to VFIO_IOMMU_MAP_DMA should be a PCI > bus address while in this case a physical address of a user page is used. > This changes IOVA to start from zero in a hope that the rest of DPDK > expects this. This is not the case. DPDK expects a 1:1 mapping PA==IOVA. It will use the phys_addr of the memory segment it got from /proc/self/pagemap cf. librte_eal/linuxapp/eal/eal_memory.c. We could try setting it here to the actual iova which basically makes the whole virtual to phyiscal mapping with pagemap unnecessary which I believe should be the case for VFIO anyway. Pagemap should only be needed when using pci_uio. > > Signed-off-by: Alexey Kardashevskiy > --- > lib/librte_eal/linuxapp/eal/eal_vfio.c | 12 ++-- > 1 file changed, 10 insertions(+), 2 deletions(-) > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_vfio.c > index 46f951f4d..8b8e75c4f 100644 > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > @@ -658,7 +658,7 @@ vfio_spapr_dma_map(int vfio_container_fd) > { > const struct rte_memseg *ms = rte_eal_get_physmem_layout(); > int i, ret; > - > + phys_addr_t io_offset; > struct vfio_iommu_spapr_register_memory reg = { >.argsz = sizeof(reg), >.flags = 0 > @@ -702,6 +702,13 @@ vfio_spapr_dma_map(int vfio_container_fd) >return -1; > } > > + io_offset = create.start_addr; > + if (io_offset) { > + RTE_LOG(ERR, EAL, " DMA offsets other than zero is not supported, " > +"new window is created at %lx\n", io_offset); > + return -1; > + } > + > /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ > for (i = 0; i < RTE_MAX_MEMSEG; i++) { >struct vfio_iommu_type1_dma_map dma_map; > @@ -723,7 +730,7 @@ vfio_spapr_dma_map(int vfio_container_fd) >dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); >dma_map.vaddr = ms[i].addr_64; >dma_map.size = ms[i].len; > - dma_map.iova = ms[i].phys_addr; > + dma_map.iova = io_offset; >dma_map.flags = VFIO_DMA_MAP_FLAG_READ | > VFIO_DMA_MAP_FLAG_WRITE; > > @@ -735,6 +742,7 @@ vfio_spapr_dma_map(int vfio_container_fd) > return -1; >} > > + io_offset += dma_map.size; > } > > return 0; > -- > 2.11.0 >
Re: [dpdk-dev] [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus addresses for DMA map
Alexey Kardashevskiy wrote on 20/04/2017 16:22:01: > From: Alexey Kardashevskiy > To: Jonas Pfefferle1 > Cc: dev@dpdk.org, Gowrishankar Muthukrishnan > , Adrian Schuepbach > Date: 20/04/2017 16:22 > Subject: Re: [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus > addresses for DMA map > > On 20/04/17 23:25, Alexey Kardashevskiy wrote: > > On 20/04/17 19:04, Jonas Pfefferle1 wrote: > >> Alexey Kardashevskiy wrote on 20/04/2017 09:24:02: > >> > >>> From: Alexey Kardashevskiy > >>> To: dev@dpdk.org > >>> Cc: Alexey Kardashevskiy , j...@zurich.ibm.com, > >>> Gowrishankar Muthukrishnan > >>> Date: 20/04/2017 09:24 > >>> Subject: [PATCH dpdk 5/5] RFC: vfio/ppc64/spapr: Use correct bus > >>> addresses for DMA map > >>> > >>> VFIO_IOMMU_SPAPR_TCE_CREATE ioctl() returns the actual bus address for > >>> just created DMA window. It happens to start from zero because the default > >>> window is removed (leaving no windows) and new window starts from zero. > >>> However this is not guaranteed and the new window may start from another > >>> address, this adds an error check. > >>> > >>> Another issue is that IOVA passed to VFIO_IOMMU_MAP_DMA should be a PCI > >>> bus address while in this case a physical address of a user page is used. > >>> This changes IOVA to start from zero in a hope that the rest of DPDK > >>> expects this. > >> > >> This is not the case. DPDK expects a 1:1 mapping PA==IOVA. It will use the > >> phys_addr of the memory segment it got from /proc/self/pagemap cf. > >> librte_eal/linuxapp/eal/eal_memory.c. We could try setting it here to the > >> actual iova which basically makes the whole virtual to phyiscal mapping > >> with pagemap unnecessary which I believe should be the case for VFIO > >> anyway. Pagemap should only be needed when using pci_uio. > > > > > > Ah, ok, makes sense now. But it sure needs a big fat comment there as it is > > not obvious why host RAM address is used there as DMA window start is not > > guaranteed. > > Well, either way there is some bug - ms[i].phys_addr and ms[i].addr_64 both > have exact same value, in my setup it is 3fffb33c which is a userspace > address - at least ms[i].phys_addr must be physical address. This might be the case if you are not using hugetlbfs i.e. passing "--no-huge" cf. eal_memory.c:980 /* hugetlbfs can be disabled */ if (internal_config.no_hugetlbfs) { addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (addr == MAP_FAILED) { RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__, strerror(errno)); return -1; } mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr; mcfg->memseg[0].addr = addr; mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K; mcfg->memseg[0].len = internal_config.memory; mcfg->memseg[0].socket_id = 0; return 0; } If it fails to get the virt2phys mapping it actually assigns iovas starting from 0 to the memory segments, cf. set_physaddrs eal_memory.c:263 > > > > > > > >> > >>> > >>> Signed-off-by: Alexey Kardashevskiy > >>> --- > >>> lib/librte_eal/linuxapp/eal/eal_vfio.c | 12 ++-- > >>> 1 file changed, 10 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/ > >>> librte_eal/linuxapp/eal/eal_vfio.c > >>> index 46f951f4d..8b8e75c4f 100644 > >>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > >>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > >>> @@ -658,7 +658,7 @@ vfio_spapr_dma_map(int vfio_container_fd) > >>> { > >>> const struct rte_memseg *ms = rte_eal_get_physmem_layout(); > >>> int i, ret; > >>> - > >>> + phys_addr_t io_offset; > >>> struct vfio_iommu_spapr_register_memory reg = { > >>>.argsz = sizeof(reg), > >>>.flags = 0 > >>> @@ -702,6 +702,13 @@ vfio_spapr_dma_map(int vfio_container_fd) > >>>return -1; > >>> } > >>> > >>> + io_offset = create.start_addr; > >>> + if (io_offset) { > >>> + RTE_LOG(ERR, EAL, " DMA o
Re: [dpdk-dev] [PATCH] vfio: refactor PCI BAR mapping
Hi Anatoly, "Burakov, Anatoly" wrote on 09/19/2017 01:40:51 PM: > From: "Burakov, Anatoly" > To: Jonas Pfefferle , dev@dpdk.org > Date: 09/19/2017 01:41 PM > Subject: Re: [dpdk-dev] [PATCH] vfio: refactor PCI BAR mapping > > Hi Jonas, > > On 17-Aug-17 12:35 PM, Jonas Pfefferle wrote: > > Split pci_vfio_map_resource for primary and secondary processes. > > Save all relevant mapping data in primary process to allow > > the secondary process to perform mappings. > > > > Signed-off-by: Jonas Pfefferle > > --- > > lib/librte_eal/common/include/rte_pci.h| 7 + > > lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 447 +++ > ++ > > 2 files changed, 271 insertions(+), 183 deletions(-) > > > > diff --git a/lib/librte_eal/common/include/rte_pci.h b/lib/ > librte_eal/common/include/rte_pci.h > > index 8b12339..0821af9 100644 > > --- a/lib/librte_eal/common/include/rte_pci.h > > +++ b/lib/librte_eal/common/include/rte_pci.h > > @@ -214,6 +214,12 @@ struct pci_map { > > uint64_t phaddr; > > }; > > > > +struct pci_msix_table { > > + int bar_index; > > + uint32_t offset; > > + uint32_t size; > > +}; > > + > > /** > >* A structure describing a mapped PCI resource. > >* For multi-process we need to reproduce all PCI mappings in secondary > > @@ -226,6 +232,7 @@ struct mapped_pci_resource { > > char path[PATH_MAX]; > > int nb_maps; > > struct pci_map maps[PCI_MAX_RESOURCE]; > > + struct pci_msix_table msix_table; > > }; > > > > /** mapped pci device list */ > > diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_pci_vfio.c > > index aa9d96e..f37552a 100644 > > --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > > +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > > @@ -88,8 +88,7 @@ pci_vfio_write_config(const struct > rte_intr_handle *intr_handle, > > > > /* get PCI BAR number where MSI-X interrupts are */ > > static int > > -pci_vfio_get_msix_bar(int fd, int *msix_bar, uint32_t *msix_table_offset, > > -uint32_t *msix_table_size) > > +pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table) > > { > > int ret; > > uint32_t reg; > > @@ -161,9 +160,10 @@ pci_vfio_get_msix_bar(int fd, int *msix_bar, > uint32_t *msix_table_offset, > > return -1; > >} > > > > - *msix_bar = reg & RTE_PCI_MSIX_TABLE_BIR; > > - *msix_table_offset = reg & RTE_PCI_MSIX_TABLE_OFFSET; > > - *msix_table_size = 16 * (1 + (flags & RTE_PCI_MSIX_FLAGS_QSIZE)); > > + msix_table->bar_index = reg & RTE_PCI_MSIX_TABLE_BIR; > > + msix_table->offset = reg & RTE_PCI_MSIX_TABLE_OFFSET; > > + msix_table->size = > > +16 * (1 + (flags & RTE_PCI_MSIX_FLAGS_QSIZE)); > > > >return 0; > > } > > @@ -300,25 +300,152 @@ pci_vfio_setup_interrupts(struct > rte_pci_device *dev, int vfio_dev_fd) > > return -1; > > } > > > > -/* > > - * map the PCI resources of a PCI device in virtual memory (VFIO version). > > - * primary and secondary processes follow almost exactly the same path > > - */ > > -int > > -pci_vfio_map_resource(struct rte_pci_device *dev) > > +static int > > +pci_vfio_is_ioport_bar(int vfio_dev_fd, int bar_index) > > +{ > > + uint32_t ioport_bar; > > + int ret; > > + > > + ret = pread64(vfio_dev_fd, &ioport_bar, sizeof(ioport_bar), > > + VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) > > + + PCI_BASE_ADDRESS_0 + bar_index*4); > > + if (ret != sizeof(ioport_bar)) { > > + RTE_LOG(ERR, EAL, "Cannot read command (%x) from config space!\n", > > + PCI_BASE_ADDRESS_0 + bar_index*4); > > + return -1; > > + } > > + > > + if (ioport_bar & PCI_BASE_ADDRESS_SPACE_IO) { > > + RTE_LOG(INFO, EAL, "Ignore mapping IO port bar(%d) addr: %x\n", > > + bar_index, ioport_bar); > > This log message should probably go to the "continue" portion of the > calling code, it looks out of place here. Agree. I will move it. > > > + return 1; > > + } > > + return 0; > > +} > > + > > +static int > > +pci_vfio_setup_device(struct rte_pci_device *dev, int vfio_dev_fd) > > +{ > > + if (pci_vfio_setup_interrupts(dev, vfio_dev_fd) != 0) { > > + RTE_LOG(ERR, EAL, "Error setting up interrupts!\n"); > > + return -1; > > + } > > + > > + /* set bus mastering for the device */ > > + if (pci_vfio_set_bus_master(vfio_dev_fd, true)) { > > + RTE_LOG(ERR, EAL, "Cannot set up bus mastering!\n"); > > + return -1; > > + } > > + > > + /* Reset the device */ > > + ioctl(vfio_dev_fd, VFIO_DEVICE_RESET); > > + > > + return 0; > > +} > > + > > +static int > > +pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res, > > + int bar_index, int additional_flags) > > +{ > > + struct memreg { > > + unsigned long offset, size; > > + } memreg[2] = {}; > > + void *bar_addr; > > + struct pci_msix_table *msix_table = &vfio_
Re: [dpdk-dev] [PATCH v2] vfio: refactor PCI BAR mapping
Hi Anatoly, "Burakov, Anatoly" wrote on 10/04/2017 03:13:22 PM: > From: "Burakov, Anatoly" > To: Jonas Pfefferle , dev@dpdk.org > Date: 10/04/2017 03:14 PM > Subject: Re: [PATCH v2] vfio: refactor PCI BAR mapping > > On 25-Sep-17 4:04 PM, Jonas Pfefferle wrote: > > Split pci_vfio_map_resource for primary and secondary processes. > > Save all relevant mapping data in primary process to allow > > the secondary process to perform mappings. > > > > Signed-off-by: Jonas Pfefferle > > --- > > v2: > > * fix zero size and offset when trying to mmap non msix bar > > > > lib/librte_eal/common/include/rte_pci.h| 7 + > > lib/librte_eal/linuxapp/eal/eal_pci_vfio.c | 446 +++ > ++ > > 2 files changed, 271 insertions(+), 182 deletions(-) > > > > diff --git a/lib/librte_eal/common/include/rte_pci.h b/lib/ > librte_eal/common/include/rte_pci.h > > index 8b12339..0821af9 100644 > > --- a/lib/librte_eal/common/include/rte_pci.h > > +++ b/lib/librte_eal/common/include/rte_pci.h > > @@ -214,6 +214,12 @@ struct pci_map { > > uint64_t phaddr; > > }; > > > > +struct pci_msix_table { > > + int bar_index; > > + uint32_t offset; > > + uint32_t size; > > +}; > > + > > /** > >* A structure describing a mapped PCI resource. > >* For multi-process we need to reproduce all PCI mappings in secondary > > @@ -226,6 +232,7 @@ struct mapped_pci_resource { > > char path[PATH_MAX]; > > int nb_maps; > > struct pci_map maps[PCI_MAX_RESOURCE]; > > + struct pci_msix_table msix_table; > > }; > > > > /** mapped pci device list */ > > diff --git a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_pci_vfio.c > > index aa9d96e..6443bd5 100644 > > --- a/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > > +++ b/lib/librte_eal/linuxapp/eal/eal_pci_vfio.c > > @@ -88,8 +88,7 @@ pci_vfio_write_config(const struct > rte_intr_handle *intr_handle, > > > > /* get PCI BAR number where MSI-X interrupts are */ > > static int > > -pci_vfio_get_msix_bar(int fd, int *msix_bar, uint32_t *msix_table_offset, > > -uint32_t *msix_table_size) > > +pci_vfio_get_msix_bar(int fd, struct pci_msix_table *msix_table) > > { > > int ret; > > uint32_t reg; > > @@ -161,9 +160,10 @@ pci_vfio_get_msix_bar(int fd, int *msix_bar, > uint32_t *msix_table_offset, > > return -1; > >} > > > > - *msix_bar = reg & RTE_PCI_MSIX_TABLE_BIR; > > - *msix_table_offset = reg & RTE_PCI_MSIX_TABLE_OFFSET; > > - *msix_table_size = 16 * (1 + (flags & RTE_PCI_MSIX_FLAGS_QSIZE)); > > + msix_table->bar_index = reg & RTE_PCI_MSIX_TABLE_BIR; > > + msix_table->offset = reg & RTE_PCI_MSIX_TABLE_OFFSET; > > + msix_table->size = > > +16 * (1 + (flags & RTE_PCI_MSIX_FLAGS_QSIZE)); > > > >return 0; > > } > > @@ -300,25 +300,150 @@ pci_vfio_setup_interrupts(struct > rte_pci_device *dev, int vfio_dev_fd) > > return -1; > > } > > > > -/* > > - * map the PCI resources of a PCI device in virtual memory (VFIO version). > > - * primary and secondary processes follow almost exactly the same path > > - */ > > -int > > -pci_vfio_map_resource(struct rte_pci_device *dev) > > +static int > > +pci_vfio_is_ioport_bar(int vfio_dev_fd, int bar_index) > > +{ > > + uint32_t ioport_bar; > > + int ret; > > + > > + ret = pread64(vfio_dev_fd, &ioport_bar, sizeof(ioport_bar), > > + VFIO_GET_REGION_ADDR(VFIO_PCI_CONFIG_REGION_INDEX) > > + + PCI_BASE_ADDRESS_0 + bar_index*4); > > + if (ret != sizeof(ioport_bar)) { > > + RTE_LOG(ERR, EAL, "Cannot read command (%x) from config space!\n", > > + PCI_BASE_ADDRESS_0 + bar_index*4); > > + return -1; > > + } > > + > > + return ioport_bar & PCI_BASE_ADDRESS_SPACE_IO; > > Not sure i like this. I think it's better to be explicit, e.g. > return ioport_bar & PCI_BASE_ADDRESS_SPACE_IO != 0; > > Makes no difference (both because return value is non-zero and because > PCI_BASE_ADDRESS_SPACE_IO is 0x01), but still, better to make intentions > clear i think. I agree. I wanted to change this anyway but forgot. V3 is going out in a sec. Thanks for the ack ;) > > Otherwise, did a quick smoke-test and it works, so > > Acked-by: Anatoly Burakov > > Keep the ack if you decide to submit a v3 :) >
Re: [dpdk-dev] [PATCH v4] vfio: fix sPAPR IOMMU DMA window size
Hi Thomas, Can you please apply this patch? Thanks, Jonas Alexey Kardashevskiy wrote on 08/10/2017 08:03:17 AM: > From: Alexey Kardashevskiy > To: Jonas Pfefferle , anatoly.bura...@intel.com > Cc: dev@dpdk.org > Date: 08/10/2017 08:03 AM > Subject: Re: [PATCH v4] vfio: fix sPAPR IOMMU DMA window size > > On 08/08/17 21:16, Jonas Pfefferle wrote: > > DMA window size needs to be big enough to span all memory segment's > > physical addresses. We do not need multiple levels of IOMMU tables > > as we already span ~70TB of physical memory with 16MB hugepages. > > > > Signed-off-by: Jonas Pfefferle > > > Reviewed-by: Alexey Kardashevskiy > > > > > --- > > v2: > > * roundup to next power 2 function without loop. > > > > v3: > > * Replace roundup_next_pow2 with rte_align64pow2 > > > > v4: > > * do not assume ordering of physical addresses of memsegs > > > > lib/librte_eal/linuxapp/eal/eal_vfio.c | 20 +--- > > 1 file changed, 17 insertions(+), 3 deletions(-) > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_vfio.c > > index 946df7e..7d5d61d 100644 > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > > @@ -759,10 +759,19 @@ vfio_spapr_dma_map(int vfio_container_fd) > >return -1; > > } > > > > - /* calculate window size based on number of hugepages configured */ > > - create.window_size = rte_eal_get_physmem_size(); > > + /* create DMA window from 0 to max(phys_addr + len) */ > > + for (i = 0; i < RTE_MAX_MEMSEG; i++) { > > + if (ms[i].addr == NULL) > > + break; > > + > > + create.window_size = RTE_MAX(create.window_size, > > +ms[i].phys_addr + ms[i].len); > > + } > > + > > + /* sPAPR requires window size to be a power of 2 */ > > + create.window_size = rte_align64pow2(create.window_size); > > create.page_shift = __builtin_ctzll(ms->hugepage_sz); > > - create.levels = 2; > > + create.levels = 1; > > > > ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); > > if (ret) { > > @@ -771,6 +780,11 @@ vfio_spapr_dma_map(int vfio_container_fd) > >return -1; > > } > > > > + if (create.start_addr != 0) { > > + RTE_LOG(ERR, EAL, " DMA window start address != 0\n"); > > + return -1; > > + } > > + > > /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */ > > for (i = 0; i < RTE_MAX_MEMSEG; i++) { > >struct vfio_iommu_type1_dma_map dma_map; > > > > > -- > Alexey >
Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova mode before mapping
Hi @all I just stumbled upon this patch while testing on POWER. RTE_IOVA_VA will not work for the sPAPR code since the dma window size is currently determined by the physical address only. I'm preparing a patch to address this. Thanks, Jonas "dev" wrote on 08/31/2017 05:26:16 AM: > From: Santosh Shukla > To: dev@dpdk.org > Cc: tho...@monjalon.net, jerin.ja...@caviumnetworks.com, > hemant.agra...@nxp.com, olivier.m...@6wind.com, > maxime.coque...@redhat.com, sergio.gonzalez.mon...@intel.com, > bruce.richard...@intel.com, shreyansh.j...@nxp.com, > gaetan.ri...@6wind.com, anatoly.bura...@intel.com, > step...@networkplumber.org, acon...@redhat.com, Santosh Shukla > > Date: 08/31/2017 05:28 AM > Subject: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova > mode before mapping > Sent by: "dev" > > Check iova mode and accordingly map iova to pa or va. > > Signed-off-by: Santosh Shukla > Signed-off-by: Jerin Jacob > Reviewed-by: Maxime Coquelin > --- > lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/ > librte_eal/linuxapp/eal/eal_vfio.c > index c8a97b7e7..b32cd09a2 100644 > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd) >dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); >dma_map.vaddr = ms[i].addr_64; >dma_map.size = ms[i].len; > - dma_map.iova = ms[i].phys_addr; > + if (rte_eal_iova_mode() == RTE_IOVA_VA) > + dma_map.iova = dma_map.vaddr; > + else > + dma_map.iova = ms[i].phys_addr; >dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; > >ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map); > @@ -792,7 +795,10 @@ vfio_spapr_dma_map(int vfio_container_fd) >dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); >dma_map.vaddr = ms[i].addr_64; >dma_map.size = ms[i].len; > - dma_map.iova = ms[i].phys_addr; > + if (rte_eal_iova_mode() == RTE_IOVA_VA) > + dma_map.iova = dma_map.vaddr; > + else > + dma_map.iova = ms[i].phys_addr; >dma_map.flags = VFIO_DMA_MAP_FLAG_READ | > VFIO_DMA_MAP_FLAG_WRITE; > > -- > 2.13.0 >
[dpdk-dev] Huge mapping secondary process linux
Hi @all, I'm trying to make sense of the hugepage memory mappings in librte_eal/linuxapp/eal/eal_memory.c: * In rte_eal_hugepage_attach (line 1347) when we try to do a private mapping on /dev/zero (line 1393) why do we not use MAP_FIXED if we need the addresses to be identical with the primary process? * On POWER we have this weird business going on where we use MAP_HUGETLB because according to this commit: commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 Author: Chao Zhu Date: Thu Apr 6 15:36:09 2017 +0530 eal/ppc: fix mmap for memory initialization On IBM POWER platform, when mapping /dev/zero file to hugepage memory space, mmap will not respect the requested address hint. This will cause the memory initialization for the second process fails. This patch adds the required mmap flags to make it work. Beside this, users need to set the nr_overcommit_hugepages to expand the VA range. When doing the initialization, users need to set both nr_hugepages and nr_overcommit_hugepages to the same value, like 64, 128, etc. mmap address hints are not respected. Looking at the mmap code in the kernel this is not true entirely however under some circumstances the hint can be ignored ( http://elixir.free-electrons.com/linux/latest/source/arch/powerpc/mm/mmap.c#L103 ). However I believe we can remove the extra case for PPC if we use MAP_FIXED when doing the secondary process mappings because we need them to be identical anyway. We could also use MAP_FIXED when doing the primary process mappings resp. get_virtual_area if we want to have any guarantees when specifying a base address. Any thoughts? Thanks, Jonas
Re: [dpdk-dev] Huge mapping secondary process linux
"Burakov, Anatoly" wrote on 10/27/2017 04:06:44 PM: > From: "Burakov, Anatoly" > To: Jonas Pfefferle1 , dev@dpdk.org > Cc: chao...@linux.vnet.ibm.com, bruce.richard...@intel.com > Date: 10/27/2017 04:06 PM > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote: > > > > > > Hi @all, > > > > I'm trying to make sense of the hugepage memory mappings in > > librte_eal/linuxapp/eal/eal_memory.c: > > * In rte_eal_hugepage_attach (line 1347) when we try to do a private > > mapping on /dev/zero (line 1393) why do we not use MAP_FIXED if we need the > > addresses to be identical with the primary process? > > * On POWER we have this weird business going on where we use MAP_HUGETLB > > because according to this commit: > > > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 > > Author: Chao Zhu > > Date: Thu Apr 6 15:36:09 2017 +0530 > > > > eal/ppc: fix mmap for memory initialization > > > > On IBM POWER platform, when mapping /dev/zero file to hugepage memory > > space, mmap will not respect the requested address hint. This will > > cause > > the memory initialization for the second process fails. This patch adds > > the required mmap flags to make it work. Beside this, users need to set > > the nr_overcommit_hugepages to expand the VA range. When > > doing the initialization, users need to set both nr_hugepages and > > nr_overcommit_hugepages to the same value, like 64, 128, etc. > > > > mmap address hints are not respected. Looking at the mmap code in the > > kernel this is not true entirely however under some circumstances the hint > > can be ignored ( > > https://urldefense.proofpoint.com/v2/url? > u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_powerpc_mm_mmap.c-23L103&d=DwICaQ&c=jf_iaSHvJObTbx- > siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN- > pXjigIjRW0&m=cttQcHlAYixhsYS3lz- > BAdEeg4dpbwGdPnj2R3I8Do0&s=Gp0TIjUtIed05Jgb7XnlocpCYZdFXZXiH0LqIWiNMhA&e= > > ). However I believe we can remove the extra case for PPC if we use > > MAP_FIXED when doing the secondary process mappings because we need them to > > be identical anyway. We could also use MAP_FIXED when doing the primary > > process mappings resp. get_virtual_area if we want to have any guarantees > > when specifying a base address. Any thoughts? > > > > Thanks, > > Jonas > > > hi Jonas, > > MAP_FIXED is not used because it's dangerous, it unmaps anything that is > already mapped into that space. We would rather know that we can't map > something than unwittingly unmap something that was mapped before. Ok, I see. Maybe we can add a check to the primary process's memory mappings whether the hint has been respected or not? At least warn if it hasn't. > > -- > Thanks, > Anatoly >
Re: [dpdk-dev] Huge mapping secondary process linux
"Burakov, Anatoly" wrote on 10/27/2017 04:44:52 PM: > From: "Burakov, Anatoly" > To: Jonas Pfefferle1 > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > Date: 10/27/2017 04:45 PM > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote: > > "Burakov, Anatoly" wrote on 10/27/2017 > > 04:06:44 PM: > > > > > From: "Burakov, Anatoly" > > > To: Jonas Pfefferle1 , dev@dpdk.org > > > Cc: chao...@linux.vnet.ibm.com, bruce.richard...@intel.com > > > Date: 10/27/2017 04:06 PM > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote: > > > > > > > > > > > > Hi @all, > > > > > > > > I'm trying to make sense of the hugepage memory mappings in > > > > librte_eal/linuxapp/eal/eal_memory.c: > > > > * In rte_eal_hugepage_attach (line 1347) when we try to do a private > > > > mapping on /dev/zero (line 1393) why do we not use MAP_FIXED if we > > need the > > > > addresses to be identical with the primary process? > > > > * On POWER we have this weird business going on where we use > > MAP_HUGETLB > > > > because according to this commit: > > > > > > > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 > > > > Author: Chao Zhu > > > > Date: Thu Apr 6 15:36:09 2017 +0530 > > > > > > > > eal/ppc: fix mmap for memory initialization > > > > > > > > On IBM POWER platform, when mapping /dev/zero file to hugepage > > memory > > > > space, mmap will not respect the requested address hint.This will > > > > cause > > > > the memory initialization for the second process fails. This > > patch adds > > > > the required mmap flags to make it work. Beside this, users > > need to set > > > > the nr_overcommit_hugepages to expand the VA range. When > > > > doing the initialization, users need to set both nr_hugepages and > > > > nr_overcommit_hugepages to the same value, like 64, 128, etc. > > > > > > > > mmap address hints are not respected. Looking at the mmap code in the > > > > kernel this is not true entirely however under some circumstances > > the hint > > > > can be ignored ( > > > > https://urldefense.proofpoint.com/v2/url? > > > > > > u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_powerpc_mm_mmap.c-23L103&d=DwICaQ&c=jf_iaSHvJObTbx- > > > siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN- > > > pXjigIjRW0&m=cttQcHlAYixhsYS3lz- > > > BAdEeg4dpbwGdPnj2R3I8Do0&s=Gp0TIjUtIed05Jgb7XnlocpCYZdFXZXiH0LqIWiNMhA&e= > > > > ). However I believe we can remove the extra case for PPC if we use > > > > MAP_FIXED when doing the secondary process mappings because we need > > them to > > > > be identical anyway. We could also use MAP_FIXED when doing the primary > > > > process mappings resp. get_virtual_area if we want to have any > > guarantees > > > > when specifying a base address. Any thoughts? > > > > > > > > Thanks, > > > > Jonas > > > > > > > hi Jonas, > > > > > > MAP_FIXED is not used because it's dangerous, it unmaps anything that is > > > already mapped into that space. We would rather know that we can't map > > > something than unwittingly unmap something that was mapped before. > > > > Ok, I see. Maybe we can add a check to the primary process's memory > > mappings whether the hint has been respected or not? At least warn if it > > hasn't. > > Hi Jonas, > > I'm unfamiliar with POWER platform, so i'm afraid you'd have to explain > a bit more what you mean by "hint has been respected" :) Hi Anatoly, What I meant was the mmap address hint: "If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at a nearby page boundary." This is actually not true on POWER. It can happen that the address hint is ignored and you get any address back that fits your mapping. Thanks, Jonas > > > -- > Thanks, > Anatoly >
Re: [dpdk-dev] Huge mapping secondary process linux
"dev" wrote on 10/27/2017 04:58:01 PM: > From: "Jonas Pfefferle1" > To: "Burakov, Anatoly" > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > Date: 10/27/2017 04:58 PM > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > Sent by: "dev" > > > "Burakov, Anatoly" wrote on 10/27/2017 04:44:52 > PM: > > > From: "Burakov, Anatoly" > > To: Jonas Pfefferle1 > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > > Date: 10/27/2017 04:45 PM > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote: > > > "Burakov, Anatoly" wrote on 10/27/2017 > > > 04:06:44 PM: > > > > > > > From: "Burakov, Anatoly" > > > > To: Jonas Pfefferle1 , dev@dpdk.org > > > > Cc: chao...@linux.vnet.ibm.com, bruce.richard...@intel.com > > > > Date: 10/27/2017 04:06 PM > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > > > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote: > > > > > > > > > > > > > > > Hi @all, > > > > > > > > > > I'm trying to make sense of the hugepage memory mappings in > > > > > librte_eal/linuxapp/eal/eal_memory.c: > > > > > * In rte_eal_hugepage_attach (line 1347) when we try to do a > private > > > > > mapping on /dev/zero (line 1393) why do we not use MAP_FIXED if we > > > > need the > > > > > addresses to be identical with the primary process? > > > > > * On POWER we have this weird business going on where we use > > > MAP_HUGETLB > > > > > because according to this commit: > > > > > > > > > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 > > > > > Author: Chao Zhu > > > > > Date: Thu Apr 6 15:36:09 2017 +0530 > > > > > > > > > > eal/ppc: fix mmap for memory initialization > > > > > > > > > > On IBM POWER platform, when mapping /dev/zero file to > hugepage > > > memory > > > > > space, mmap will not respect the requested address hint.This > will > > > > > cause > > > > > the memory initialization for the second process fails. This > > > patch adds > > > > > the required mmap flags to make it work. Beside this, users > > > need to set > > > > > the nr_overcommit_hugepages to expand the VA range. When > > > > > doing the initialization, users need to set both nr_hugepages > and > > > > > nr_overcommit_hugepages to the same value, like 64, 128, etc. > > > > > > > > > > mmap address hints are not respected. Looking at the mmap code in > the > > > > > kernel this is not true entirely however under some circumstances > > > the hint > > > > > can be ignored ( > > > > > https://urldefense.proofpoint.com/v2/url? > > > > > > > > > > u=http-3A__elixir.free-2Delectrons.com_linux_latest_source_arch_powerpc_mm_mmap.c-23L103&d=DwICaQ&c=jf_iaSHvJObTbx- > > > > > siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN- > > > > pXjigIjRW0&m=cttQcHlAYixhsYS3lz- > > > > > BAdEeg4dpbwGdPnj2R3I8Do0&s=Gp0TIjUtIed05Jgb7XnlocpCYZdFXZXiH0LqIWiNMhA&e= > > > > > ). However I believe we can remove the extra case for PPC if we > use > > > > > MAP_FIXED when doing the secondary process mappings because we > need > > > them to > > > > > be identical anyway. We could also use MAP_FIXED when doing the > primary > > > > > process mappings resp. get_virtual_area if we want to have any > > > guarantees > > > > > when specifying a base address. Any thoughts? > > > > > > > > > > Thanks, > > > > > Jonas > > > > > > > > > hi Jonas, > > > > > > > > MAP_FIXED is not used because it's dangerous, it unmaps anything > that is > > > > already mapped into that space. We would rather know that we can't > map > > > > something than unwittingly unmap something that was mapped before. > > > > > > Ok, I see. Maybe we can add a check to the primary process's memory > > > mappings whether the hint has been respected or not? At least warn if > it > > > hasn't. > > > > Hi Jonas, > > > > I'm unfamiliar with POWER platform, so i'm afraid you'd have to explain > > a bit more what you mean by "hint has been respected" :) > > Hi Anatoly, > > What I meant was the mmap address hint: > > "If addr is not NULL, then the kernel takes it as a hint > about where to place the mapping; on Linux, the mapping will be > created at a nearby page boundary." > > This is actually not true on POWER. It can happen that the address hint is > ignored and you get any address back that fits your mapping. > > Thanks, > Jonas Actually looking through the kernel code this is also not guaranteed on x86. ( http://elixir.free-electrons.com/linux/latest/source/arch/x86/kernel/sys_x86_64.c#L165 ) So in any case the address hint can be ignored by the kernel and you get any address that fits your mapping. My suggestion is to check when we do the initial mapping in get_virtual_area if the hint was respected or not, i.e. if the returned address == PAGE_ALIGN(address_hint). Thanks, Jonas > > > > > > > -- > > Thanks, > > Anatoly > > >
Re: [dpdk-dev] Huge mapping secondary process linux
"Burakov, Anatoly" wrote on 27/10/2017 18:00:27: > From: "Burakov, Anatoly" > To: Jonas Pfefferle1 > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > Date: 27/10/2017 18:00 > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > On 27-Oct-17 4:16 PM, Jonas Pfefferle1 wrote: > > "dev" wrote on 10/27/2017 04:58:01 PM: > > > > > From: "Jonas Pfefferle1" > > > To: "Burakov, Anatoly" > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > > > Date: 10/27/2017 04:58 PM > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > Sent by: "dev" > > > > > > > > > "Burakov, Anatoly" wrote on 10/27/2017 > > 04:44:52 > > > PM: > > > > > > > From: "Burakov, Anatoly" > > > > To: Jonas Pfefferle1 > > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, > > dev@dpdk.org > > > > Date: 10/27/2017 04:45 PM > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote: > > > > > "Burakov, Anatoly" wrote on 10/27/2017 > > > > > 04:06:44 PM: > > > > > > > > > > Â > From: "Burakov, Anatoly" > > > > > Â > To: Jonas Pfefferle1 , dev@dpdk.org > > > > > Â > Cc: chao...@linux.vnet.ibm.com, bruce.richard...@intel.com > > > > > Â > Date: 10/27/2017 04:06 PM > > > > > Â > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > Â > > > > > > Â > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote: > > > > > Â > > > > > > > Â > > > > > > > Â > > Hi @all, > > > > > Â > > > > > > > Â > > I'm trying to make sense of the hugepage memory mappings in > > > > > Â > > librte_eal/linuxapp/eal/eal_memory.c: > > > > > Â > > * In rte_eal_hugepage_attach (line 1347) when we try to do a > > > private > > > > > Â > > mapping on /dev/zero (line 1393) why do we not use MAP_FIXED > > if we > > > > > > > > need the > > > > > Â > > addresses to be identical with the primary process? > > > > > Â > > * On POWER we have this weird business going on where we use > > > > > MAP_HUGETLB > > > > > Â > > because according to this commit: > > > > > Â > > > > > > > Â > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 > > > > > Â > > Author: Chao Zhu > > > > > Â > > Date: Â Thu Apr 6 15:36:09 2017 +0530 > > > > > Â > > > > > > > Â > > Â Â Â eal/ppc: fix mmap for memory initialization > > > > > Â > > > > > > > Â > > Â Â Â On IBM POWER platform, when mapping /dev/zero file to > > > hugepage > > > > > memory > > > > > Â > > Â Â Â space, mmap will not respect the requested address > > hint.This > > > will > > > > > Â > > cause > > > > > Â > > Â Â Â the memory initialization for the second > process fails. > > This > > > > > patch adds > > > > > Â > > Â Â Â the required mmap flags to make it work. > Beside this, users > > > > > need to set > > > > > Â > > Â Â Â the nr_overcommit_hugepages to expand the VA > range. When > > > > > Â > > Â Â Â doing the initialization, users need to set both > > nr_hugepages > > > and > > > > > Â > > Â Â Â nr_overcommit_hugepages to the same value, like 64, > > 128, etc. > > > > > Â > > > > > > > Â > > mmap address hints are not respected. Looking at the mmap > > code in > > > the > > > > > Â > > kernel this is not true entirely however under some > > circumstances > > > > > the hint > > > > > Â > > can be ignored ( > > > > > Â > > https://urldefense.proofpoint.com/v2/url? > > > > > Â > > > > > > > > > > > > > > > > u=http-3A__elixir.free-2Delectrons.com_linux_l
Re: [dpdk-dev] [PATCH] bus/pci: fix vfio device reset
Jerin Jacob wrote on 28/10/2017 08:22:55: > From: Jerin Jacob > To: dev@dpdk.org > Cc: tho...@monjalon.net, Jerin Jacob > , Jonas Pfefferle > , Anatoly Burakov > Date: 28/10/2017 08:23 > Subject: [dpdk-dev] [PATCH] bus/pci: fix vfio device reset > > If the device is not capable of resetting, then Linux kernel updates > the errno as EINVAL. > https://urldefense.proofpoint.com/v2/url? > u=http-3A__elixir.free-2Delectrons.com_linux_v4. > 9_source_drivers_vfio_pci_vfio-5Fpci.c-23L887&d=DwIBAg&c=jf_iaSHvJObTbx- > siA1ZOg&r=rOdXhRsgn8Iur7bDE0vgwvo6TC8OpoDN-pXjigIjRW0&m=V- > sbOkvx7qxOMbwyk3n1Fb_1NCjAhl0io- > hqldJ0r6M&s=OXV7lrxTpahVIsA3J3nCNUlLqW21nlMiQiYveAzyQhc&e= > > Honor the EINVAL errno value to avoid pci vfio setup failure. > > Fixes: f25f8f367644 ("bus/pci: check VFIO reset ioctl error") > > Cc: Jonas Pfefferle > Cc: Anatoly Burakov > > Signed-off-by: Jerin Jacob > --- > drivers/bus/pci/linux/pci_vfio.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/ > linux/pci_vfio.c > index 360eed380..11df64846 100644 > --- a/drivers/bus/pci/linux/pci_vfio.c > +++ b/drivers/bus/pci/linux/pci_vfio.c > @@ -338,8 +338,11 @@ pci_vfio_setup_device(struct rte_pci_device > *dev, int vfio_dev_fd) >return -1; > } > > - /* Reset the device */ > - if (ioctl(vfio_dev_fd, VFIO_DEVICE_RESET)) { > + /* > +* Reset the device. If the device is not capable of resetting, > +* then it updates errno as EINVAL. > +*/ > + if (ioctl(vfio_dev_fd, VFIO_DEVICE_RESET) && errno != EINVAL) { >RTE_LOG(ERR, EAL, "Unable to reset device! Error: %d (%s)\n", > errno, strerror(errno)); >return -1; > -- > 2.14.3 > Looks good to me. Reviewed-by: Jonas Pfefferle Thanks, Jonas
Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova mode before mapping
Thomas Monjalon wrote on 11/02/2017 11:17:10 AM: > From: Thomas Monjalon > To: Jonas Pfefferle1 > Cc: dev@dpdk.org, Santosh Shukla > , jerin.ja...@caviumnetworks.com, > hemant.agra...@nxp.com, olivier.m...@6wind.com, > maxime.coque...@redhat.com, sergio.gonzalez.mon...@intel.com, > bruce.richard...@intel.com, shreyansh.j...@nxp.com, > gaetan.ri...@6wind.com, anatoly.bura...@intel.com, > step...@networkplumber.org, acon...@redhat.com > Date: 11/02/2017 11:17 AM > Subject: Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova > mode before mapping > > Hi > > 26/10/2017 14:57, Jonas Pfefferle1: > > > > Hi @all > > > > I just stumbled upon this patch while testing on POWER. RTE_IOVA_VA will > > not work for the sPAPR code since the dma window size is currently > > determined by the physical address only. > > Is itaffecting POWER8? It is. > > > I'm preparing a patch to address this. > > Any news? > Can you use virtual addresses? After a long discussion with Alexey (CC) we came to the conclusion that with the current sPAPR iommu driver we cannot use virtual addresses since the iova is restricted to lay in the DMA window which itself is restricted to physical RAM addresses resp. with the current code 0 to hotplug memory max. However, Alexey is working on a patch to lift this restriction on the DMA window size which should allow us to do VA:VA mappings in the future. For now we should fall back to PA in the dynamic iova mode check. I will send an according patch later today. > >
Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova mode before mapping
"dev" wrote on 11/02/2017 11:26:57 AM: > From: "Jonas Pfefferle1" > To: Thomas Monjalon > Cc: acon...@redhat.com, anatoly.bura...@intel.com, > bruce.richard...@intel.com, dev@dpdk.org, gaetan.ri...@6wind.com, > hemant.agra...@nxp.com, jerin.ja...@caviumnetworks.com, > maxime.coque...@redhat.com, olivier.m...@6wind.com, Santosh Shukla > , > sergio.gonzalez.mon...@intel.com, shreyansh.j...@nxp.com, > step...@networkplumber.org, "Alexey Kardashevskiy" > Date: 11/02/2017 11:27 AM > Subject: Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova > mode before mapping > Sent by: "dev" > > > Thomas Monjalon wrote on 11/02/2017 11:17:10 AM: > > > From: Thomas Monjalon > > To: Jonas Pfefferle1 > > Cc: dev@dpdk.org, Santosh Shukla > > , jerin.ja...@caviumnetworks.com, > > hemant.agra...@nxp.com, olivier.m...@6wind.com, > > maxime.coque...@redhat.com, sergio.gonzalez.mon...@intel.com, > > bruce.richard...@intel.com, shreyansh.j...@nxp.com, > > gaetan.ri...@6wind.com, anatoly.bura...@intel.com, > > step...@networkplumber.org, acon...@redhat.com > > Date: 11/02/2017 11:17 AM > > Subject: Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova > > mode before mapping > > > > Hi > > > > 26/10/2017 14:57, Jonas Pfefferle1: > > > > > > Hi @all > > > > > > I just stumbled upon this patch while testing on POWER. RTE_IOVA_VA > will > > > not work for the sPAPR code since the dma window size is currently > > > determined by the physical address only. > > > > Is itaffecting POWER8? > > It is. > > > > > > I'm preparing a patch to address this. > > > > Any news? > > Can you use virtual addresses? > > After a long discussion with Alexey (CC) we came to the conclusion that > with the current sPAPR iommu driver we cannot use virtual addresses since > the iova is restricted to lay in the DMA window which itself is restricted > to physical RAM addresses resp. with the current code 0 to hotplug memory > max. However, Alexey is working on a patch to lift this restriction on the > DMA window size which should allow us to do VA:VA mappings in the future. > For now we should fall back to PA in the dynamic iova mode check. I will > send an according patch later today. I looked into this yesterday but I'm not sure what the right solution is here. At the time rte_pci_get_iommu_class is called we already know which IOMMU types are supported because vfio_get_container_fd resp. vfio_has_supported_extensions has been called however we do not know which one is going to be used (Decided later in vfio_setup_device resp. vfio_set_iommu_type). We can choose a iova mode which is supported by all types but if the modes are exclusive to the types we have to guess which one is going to be used. Or let the user decide? Thanks, Jonas > > > > > >
Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova mode before mapping
Thomas Monjalon wrote on 11/03/2017 11:28:10 AM: > From: Thomas Monjalon > To: Jonas Pfefferle1 > Cc: acon...@redhat.com, anatoly.bura...@intel.com, > bruce.richard...@intel.com, dev@dpdk.org, gaetan.ri...@6wind.com, > hemant.agra...@nxp.com, jerin.ja...@caviumnetworks.com, > maxime.coque...@redhat.com, olivier.m...@6wind.com, Santosh Shukla > , > sergio.gonzalez.mon...@intel.com, shreyansh.j...@nxp.com, > step...@networkplumber.org, Alexey Kardashevskiy > Date: 11/03/2017 11:28 AM > Subject: Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova > mode before mapping > > 03/11/2017 10:56, Jonas Pfefferle1: > > Thomas Monjalon wrote on 11/02/2017 11:17:10 AM: > > > > 26/10/2017 14:57, Jonas Pfefferle1: > > > > > > > > > > Hi @all > > > > > > > > > > I just stumbled upon this patch while testing on POWER. RTE_IOVA_VA > > > will > > > > > not work for the sPAPR code since the dma window size is currently > > > > > determined by the physical address only. > > > > > > > > Is itaffecting POWER8? > > > > > > It is. > > > > > > > > > > > > I'm preparing a patch to address this. > > > > > > > > Any news? > > > > Can you use virtual addresses? > > > > > > After a long discussion with Alexey (CC) we came to the conclusion that > > > with the current sPAPR iommu driver we cannot use virtual addresses since > > > the iova is restricted to lay in the DMA window which itself is > > restricted > > > to physical RAM addresses resp. with the current code 0 to hotplug memory > > > max. However, Alexey is working on a patch to lift this restriction on > > the > > > DMA window size which should allow us to do VA:VA mappings in the future. > > > For now we should fall back to PA in the dynamic iova mode check. I will > > > send an according patch later today. > > > > I looked into this yesterday but I'm not sure what the right solution is > > here. > > At the time rte_pci_get_iommu_class is called we already know which IOMMU > > types are supported because vfio_get_container_fd resp. > > vfio_has_supported_extensions has been called however we do not know which > > one is going to be used (Decided later in vfio_setup_device resp. > > vfio_set_iommu_type). We can choose a iova mode which is supported by all > > types but if the modes are exclusive to the types we have to guess which > > one is going to be used. Or let the user decide? > > You can keep the old behaviour, restricting to physical memory, > until you support virtual addressing. > It can be just a #ifdef RTE_ARCH_PPC_64. > Ok but we might want to refine this in the future. IMO It looks much cleaner to decide this on the iommu type plus this would also cover the noiommu case without having this extra check reading the sysfs variable.
Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova mode before mapping
Thomas Monjalon wrote on 11/03/2017 11:54:45 AM: > From: Thomas Monjalon > To: Jonas Pfefferle1 > Cc: acon...@redhat.com, Alexey Kardashevskiy , > anatoly.bura...@intel.com, bruce.richard...@intel.com, dev@dpdk.org, > gaetan.ri...@6wind.com, hemant.agra...@nxp.com, > jerin.ja...@caviumnetworks.com, maxime.coque...@redhat.com, > olivier.m...@6wind.com, Santosh Shukla > , > sergio.gonzalez.mon...@intel.com, shreyansh.j...@nxp.com, > step...@networkplumber.org > Date: 11/03/2017 11:55 AM > Subject: Re: [dpdk-dev] [PATCH v7 7/9] linuxapp/eal_vfio: honor iova > mode before mapping > > 03/11/2017 11:44, Jonas Pfefferle1: > > Thomas Monjalon wrote on 11/03/2017 11:28:10 AM: > > > 03/11/2017 10:56, Jonas Pfefferle1: > > > > Thomas Monjalon wrote on 11/02/2017 11:17:10 AM: > > > > > > 26/10/2017 14:57, Jonas Pfefferle1: > > > > > > > > > > > > > > Hi @all > > > > > > > > > > > > > > I just stumbled upon this patch while testing on POWER. > > RTE_IOVA_VA > > > > > will > > > > > > > not work for the sPAPR code since the dma window size is > > currently > > > > > > > determined by the physical address only. > > > > > > > > > > > > Is itaffecting POWER8? > > > > > > > > > > It is. > > > > > > > > > > > > > > > > > > I'm preparing a patch to address this. > > > > > > > > > > > > Any news? > > > > > > Can you use virtual addresses? > > > > > > > > > > After a long discussion with Alexey (CC) we came to the conclusion > > that > > > > > with the current sPAPR iommu driver we cannot use virtual addresses > > since > > > > > the iova is restricted to lay in the DMA window which itself is > > > > restricted > > > > > to physical RAM addresses resp. with the current code 0 to hotplug > > memory > > > > > max. However, Alexey is working on a patch to lift this restriction > > on > > > > the > > > > > DMA window size which should allow us to do VA:VA mappings in the > > future. > > > > > For now we should fall back to PA in the dynamic iova mode check. I > > will > > > > > send an according patch later today. > > > > > > > > I looked into this yesterday but I'm not sure what the right solution > > is > > > > here. > > > > At the time rte_pci_get_iommu_class is called we already know which > > IOMMU > > > > types are supported because vfio_get_container_fd resp. > > > > vfio_has_supported_extensions has been called however we do not know > > which > > > > one is going to be used (Decided later in vfio_setup_device resp. > > > > vfio_set_iommu_type). We can choose a iova mode which is supported by > > all > > > > types but if the modes are exclusive to the types we have to guess > > which > > > > one is going to be used. Or let the user decide? > > > > > > You can keep the old behaviour, restricting to physical memory, > > > until you support virtual addressing. > > > It can be just a #ifdef RTE_ARCH_PPC_64. > > > > > > > Ok but we might want to refine this in the future. IMO It looks much > > cleaner > > to decide this on the iommu type plus this would also cover the noiommu > > case without having this extra check reading the sysfs variable. > > You are using the word "this" too many times to help me understand :) What I meant is a fix that decides which iova mode to use based on the iommu types supported (determined by vfio_get_container_fd) instead of another extra case for PPC much like the noiommu check. Both should be covered by the supported types based check. IMO much cleaner and simpler to support new iommu types. > > Anyway, please send a quick fix today for 17.11. > The RC3 will be probably closed before Monday. > Will do.
Re: [dpdk-dev] [PATCH v3 4/6] eal/memory: rename memory API to iovatypes
"dev" wrote on 11/03/2017 02:58:43 PM: > From: Thomas Monjalon > To: santosh > Cc: dev@dpdk.org, olivier.m...@6wind.com, > jerin.ja...@caviumnetworks.com, hemant.agra...@nxp.com, > anatoly.bura...@intel.com > Date: 11/03/2017 02:59 PM > Subject: Re: [dpdk-dev] [PATCH v3 4/6] eal/memory: rename memory API > to iova types > Sent by: "dev" > > 03/11/2017 12:35, santosh: > > On Friday 03 November 2017 04:41 PM, Thomas Monjalon wrote: > > > 20/10/2017 14:31, Santosh Shukla: > > >> Renamed memory translational api to _iova types. > > >> The following api renamed from: > > >> > > >> rte_mempool_populate_phys() > > >> rte_mempool_populate_phys_tab() > > > These functions still have "physical addresses" in their description. > > > It is not consistent. > > > > > > Please provide ABI compatibility for mempool functions. > > > > > >> rte_eal_using_phys_addrs() > > > Why renaming rte_eal_using_phys_addrs? > > > I think we need to review how it is used. > > > Maybe it requires a rework. > > > > > >> rte_mem_virt2phy() > > >> rte_dump_physmem_layout() > > >> rte_eal_get_physmem_layout() > > >> rte_eal_get_physmem_size() > > > Those 3 functions deal with physical memory layout. > > > I don't see a need to rename them. > > > > In iova=va mode, those api will have va address. > > Yes addresses can be virtual. > But it is still physically segmented. > > > > But the dump function needs a change to avoid printing > > > "phys" even in VA case. > > > > > >> rte_malloc_virt2phy() > > >> rte_mem_phy2mch() > > > This last function was removed with Xen. > > > It is wrong to rename it in the release notes. > > > It should just be removed from the map file (I will send a patch). > > > > > >> To the following iova types api: > > >> > > >> rte_mempool_populate_iova() > > >> rte_mempool_populate_iova_tab() > > >> rte_eal_using_iova_addrs() > > >> rte_mem_virt2iova() > > > [...] > > >> rte_malloc_virt2iova() > > > I am not convinced by the names virt2iova. > > > I sounds like "virt to virt". > > > What about "virt2io" or "virt2io_addr"? > > > > no, iova can be _pa or _va, its an io_addr indeed but > > I prefer virt2iova, same mentioned in deprecation notice (no > strong opinion), > > > > More suggestion on naming pl.? > > I like virt2io_addr but virt2iova is not so bad. > I agree with Santosh that we need more opinions. virt2iova +1 Another suggestion va2iova to be consistent with the names > > > > As the ABI is broken in EAL 17.11, we do not care about compatibility. > > > But we must keep an alias to the old function name in order to allow > > > a smooth API transition for applications. > > > I suggest to add static inline functions with the old names and set > > > the deprecated attribute. > > > > ok, Will address in 18.02. > > I would prefer these changes made in 17.11 as announced. > As you are not willing to work on it, I am trying to send some > updated patches. >
Re: [dpdk-dev] [PATCH v2] vfio: noiommu check error handling
Thomas Monjalon wrote on 11/06/2017 09:25:15 PM: > From: Thomas Monjalon > To: Jonas Pfefferle , anatoly.bura...@intel.com > Cc: dev@dpdk.org > Date: 11/06/2017 09:55 PM > Subject: Re: [dpdk-dev] [PATCH v2] vfio: noiommu check error handling > > 31/10/2017 16:59, Jonas Pfefferle: > > Check and report errors on open/read in noiommu check. > > > > Signed-off-by: Jonas Pfefferle > > I cannot decide to apply this patch as it does not explain what > it is fixing, and as it is not reviewed. > This patch adds error handling and logging to the noiommu check. Also, on older kernels when the noiommu_enable file does not exist it assumes noiommu is not enabled instead of returning -1. Note that in rte_pci_get_iommu_class (drivers/bus/pci/linux/pci.c) is the only usage of the function and it assumes return == 1: noiommu is enabled any other return value noiommu disabled, i.e. my code change does not change the behavior of this function. We might want to check for errors in rte_pci_get_iommu_class as well since assuming it is not enabled when we cannot open and read it might lead to iova == VA being used even if noiommu is enabled. All this comes back to what I proposed before: instead of the noiommu and PPC64 check we should decide which iova mode to use depending on the iommu types available. The type should be already available at the point where we decide on the iova type. (iommu types supported is checked by vfio_get_container_fd)
Re: [dpdk-dev] [PATCH v2] vfio: noiommu check error handling
Thomas Monjalon wrote on 11/07/2017 10:40:15 AM: > From: Thomas Monjalon > To: Jonas Pfefferle1 , anatoly.bura...@intel.com > Cc: dev@dpdk.org > Date: 11/07/2017 10:40 AM > Subject: Re: [dpdk-dev] [PATCH v2] vfio: noiommu check error handling > > 07/11/2017 10:05, Jonas Pfefferle1: > > Thomas Monjalon wrote on 11/06/2017 09:25:15 PM: > > > > > From: Thomas Monjalon > > > To: Jonas Pfefferle , anatoly.bura...@intel.com > > > Cc: dev@dpdk.org > > > Date: 11/06/2017 09:55 PM > > > Subject: Re: [dpdk-dev] [PATCH v2] vfio: noiommu check error handling > > > > > > 31/10/2017 16:59, Jonas Pfefferle: > > > > Check and report errors on open/read in noiommu check. > > > > > > > > Signed-off-by: Jonas Pfefferle > > > > > > I cannot decide to apply this patch as it does not explain what > > > it is fixing, and as it is not reviewed. > > > > > > > This patch adds error handling and logging to the noiommu check. > > Also, on older kernels when the noiommu_enable file does not exist it > > assumes noiommu is not enabled instead of returning -1. > > Note that in rte_pci_get_iommu_class (drivers/bus/pci/linux/pci.c) > > is the only usage of the function and it assumes return == 1: > > noiommu is enabled any other return value noiommu disabled, i.e. > > my code change does not change the behavior of this function. > > We might want to check for errors in rte_pci_get_iommu_class > > as well since assuming it is not enabled when we cannot open > > and read it might lead to iova == VA being used even if noiommu is > > enabled. > > > > All this comes back to what I proposed before: instead of > > the noiommu and PPC64 check we should decide which iova mode > > to use depending on the iommu types available. > > The type should be already available at the point where we > > decide on the iova type. > > (iommu types supported is checked by vfio_get_container_fd) > > Is there something urgent for 17.11? > Or can it be refined in 18.02? Nothing urgent. We can refine this for 18.02. > > Anatoly, any thought? >
Re: [dpdk-dev] Huge mapping secondary process linux
"Chao Zhu" wrote on 11/07/2017 09:25:26 AM: > From: "Chao Zhu" > To: "'Jonas Pfefferle1'" , "'Burakov, Anatoly'" > > Cc: , > Date: 11/07/2017 11:00 AM > Subject: RE: [dpdk-dev] Huge mapping secondary process linux > > > > From: Jonas Pfefferle1 [mailto:j...@zurich.ibm.com] > Sent: 2017年10月28日 3:23 > To: Burakov, Anatoly > Cc: bruce.richard...@intel.com; chao...@linux.vnet.ibm.com; dev@dpdk.org > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > "Burakov, Anatoly" wrote on 27/10/2017 18:00:27: > > > From: "Burakov, Anatoly" > > To: Jonas Pfefferle1 > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > > Date: 27/10/2017 18:00 > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > On 27-Oct-17 4:16 PM, Jonas Pfefferle1 wrote: > > > "dev" wrote on 10/27/2017 04:58:01 PM: > > > > > > > From: "Jonas Pfefferle1" > > > > To: "Burakov, Anatoly" > > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, > dev@dpdk.org > > > > Date: 10/27/2017 04:58 PM > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > Sent by: "dev" > > > > > > > > > > > > "Burakov, Anatoly" wrote on 10/27/2017 > > > 04:44:52 > > > > PM: > > > > > > > > > From: "Burakov, Anatoly" > > > > > To: Jonas Pfefferle1 > > > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, > > > dev@dpdk.org > > > > > Date: 10/27/2017 04:45 PM > > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > > > > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote: > > > > > > "Burakov, Anatoly" wrote on 10/27/2017 > > > > > > 04:06:44 PM: > > > > > > > > > > > > Â > From: "Burakov, Anatoly" > > > > > > Â > To: Jonas Pfefferle1 , dev@dpdk.org > > > > > > Â > Cc: chao...@linux.vnet.ibm.com, bruce.richard...@intel.com > > > > > > Â > Date: 10/27/2017 04:06 PM > > > > > > Â > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > Â > > > > > > > Â > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote: > > > > > > Â > > > > > > > > Â > > > > > > > > Â > > Hi @all, > > > > > > Â > > > > > > > > Â > > I'm trying to make sense of the hugepage memory mappings in > > > > > > Â > > librte_eal/linuxapp/eal/eal_memory.c: > > > > > > Â > > * In rte_eal_hugepage_attach (line 1347) when we try to do a > > > > private > > > > > > Â > > mapping on /dev/zero (line 1393) why do we not use MAP_FIXED > > > if we > > > > > > > > > > need the > > > > > > Â > > addresses to be identical with the primary process? > > > > > > Â > > * On POWER we have this weird business going on where we use > > > > > > MAP_HUGETLB > > > > > > Â > > because according to this commit: > > > > > > Â > > > > > > > > Â > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 > > > > > > Â > > Author: Chao Zhu > > > > > > Â > > Date: Â Thu Apr 6 15:36:09 2017 +0530 > > > > > > Â > > > > > > > > Â > > Â Â Â eal/ppc: fix mmap for memory initialization > > > > > > Â > > > > > > > > Â > > Â Â Â On IBM POWER platform, when mapping /dev/zero file to > > > > hugepage > > > > > > memory > > > > > > Â > > Â Â Â space, mmap will not respect the requested address > > > hint.This > > > > will > > > > > > Â > > cause > > > > > > Â > > Â Â Â the memory initialization for the second > > process fails. > > > This > > > > > > patch adds > > > > > > Â > > Â Â Â the required mmap flags to make it work. > > Beside this, users > > > > > > need to set > > > > > > Â >
Re: [dpdk-dev] [PATCH] pci: get IOMMU class sPAPR iommu fix
Thomas Monjalon wrote on 11/07/2017 12:38:11 AM: > From: Thomas Monjalon > To: Jonas Pfefferle > Cc: dev@dpdk.org, anatoly.bura...@intel.com > Date: 11/07/2017 12:38 AM > Subject: Re: [dpdk-dev] [PATCH] pci: get IOMMU class sPAPR iommu fix > > 03/11/2017 13:05, Jonas Pfefferle: > > PPC64 sPAPR iommu does not support iova as va. > > Use pa mode instead. > > > > Signed-off-by: Jonas Pfefferle > > Applied, thanks > Hi Thomas, I just noticed that I send in an old patch by mistake. It uses the wrong define RTE_ARCH_PPC64 instead of RTE_ARCH_PPC_64. Sorry for this, I will send in a fix for this in the next hour. Thanks, Jonas
Re: [dpdk-dev] [PATCH] mem: warn if address hint is not respected
"Burakov, Anatoly" wrote on 11/07/2017 02:54:24 PM: > From: "Burakov, Anatoly" > To: Thomas Monjalon > Cc: dev@dpdk.org, Jonas Pfefferle , jianfeng@intel.com > Date: 11/07/2017 02:54 PM > Subject: Re: [dpdk-dev] [PATCH] mem: warn if address hint is not respected > > On 06-Nov-17 8:26 PM, Thomas Monjalon wrote: > > 31/10/2017 10:08, Jonas Pfefferle: > >> Print a warning if the --base-virtaddr hint is not respected > >> since this might lead to problems when mapping memory in > >> the secondary process. > >> > >> Signed-off-by: Jonas Pfefferle > > > > Anatoly, please review this patch. > > It does not seem to fix something, so it is candidate for 18.02. > > > > For some reason my Thunderbird ate the original email, so i'll reply to > this one. > > One nitpick would be that we're calling get_virtual_area many times and > it would probably be a good idea to make pagesize static and call > sysconf only once. Otherwise, We should address this in a separate patch and introduce a pagesize function for everyone to use. sysconf is used like this all over the place. > > Acked-by: Anatoly Burakov > > -- > Thanks, > Anatoly >
Re: [dpdk-dev] Huge mapping secondary process linux
"Chao Zhu" wrote on 11/09/2017 04:08:36 AM: > From: "Chao Zhu" > To: "'Jonas Pfefferle1'" > Cc: "'Burakov, Anatoly'" , > , > Date: 11/09/2017 04:08 AM > Subject: RE: [dpdk-dev] Huge mapping secondary process linux > > > > From: Jonas Pfefferle1 [mailto:j...@zurich.ibm.com] > Sent: 2017年11月7日 18:16 > To: Chao Zhu > Cc: 'Burakov, Anatoly' ; > bruce.richard...@intel.com; dev@dpdk.org > Subject: RE: [dpdk-dev] Huge mapping secondary process linux > > "Chao Zhu" wrote on 11/07/2017 09:25:26 AM: > > > From: "Chao Zhu" > > To: "'Jonas Pfefferle1'" , "'Burakov, Anatoly'" > > > > Cc: , > > Date: 11/07/2017 11:00 AM > > Subject: RE: [dpdk-dev] Huge mapping secondary process linux > > > > > > > > From: Jonas Pfefferle1 [mailto:j...@zurich.ibm.com] > > Sent: 2017年10月28日 3:23 > > To: Burakov, Anatoly > > Cc: bruce.richard...@intel.com; chao...@linux.vnet.ibm.com; dev@dpdk.org > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > "Burakov, Anatoly" wrote on 27/10/201718:00:27: > > > > > From: "Burakov, Anatoly" > > > To: Jonas Pfefferle1 > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, dev@dpdk.org > > > Date: 27/10/2017 18:00 > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > On 27-Oct-17 4:16 PM, Jonas Pfefferle1 wrote: > > > > "dev" wrote on 10/27/2017 04:58:01 PM: > > > > > > > > > From: "Jonas Pfefferle1" > > > > > To: "Burakov, Anatoly" > > > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, > > dev@dpdk.org > > > > > Date: 10/27/2017 04:58 PM > > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > Sent by: "dev" > > > > > > > > > > > > > > > "Burakov, Anatoly" wrote on 10/27/2017 > > > > 04:44:52 > > > > > PM: > > > > > > > > > > > From: "Burakov, Anatoly" > > > > > > To: Jonas Pfefferle1 > > > > > > Cc: bruce.richard...@intel.com, chao...@linux.vnet.ibm.com, > > > > dev@dpdk.org > > > > > > Date: 10/27/2017 04:45 PM > > > > > > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > > > > > > > On 27-Oct-17 3:28 PM, Jonas Pfefferle1 wrote: > > > > > > > "Burakov, Anatoly" wrote on10/27/2017 > > > > > > > 04:06:44 PM: > > > > > > > > > > > > > > Â > From: "Burakov, Anatoly" > > > > > > > Â > To: Jonas Pfefferle1 , dev@dpdk.org > > > > > > > Â > Cc: chao...@linux.vnet.ibm.com, bruce.richard...@intel.com > > > > > > > Â > Date: 10/27/2017 04:06 PM > > > > > > > Â > Subject: Re: [dpdk-dev] Huge mapping secondary process linux > > > > > > > Â > > > > > > > > Â > On 27-Oct-17 1:43 PM, Jonas Pfefferle1 wrote: > > > > > > > Â > > > > > > > > > Â > > > > > > > > > Â > > Hi @all, > > > > > > > Â > > > > > > > > > Â > > I'm trying to make sense of the hugepage memory mappings in > > > > > > > Â > > librte_eal/linuxapp/eal/eal_memory.c: > > > > > > > Â > > * In rte_eal_hugepage_attach (line 1347) when we > try to do a > > > > > private > > > > > > > Â > > mapping on /dev/zero (line 1393) why do we not > use MAP_FIXED > > > > if we > > > > > > > > > > > > need the > > > > > > > Â > > addresses to be identical with the primary process? > > > > > > > Â > > * On POWER we have this weird business going on > where we use > > > > > > > MAP_HUGETLB > > > > > > > Â > > because according to this commit: > > > > > > > Â > > > > > > > > > Â > > commit 284ae3e9ff9a92575c28c858efd2c85c8de6d440 > > > > > > > Â > > Author: Chao Zhu > > > > > > > Â > &g