[dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64

2018-07-11 Thread Takeshi Yoshimura
SPDK blobfs encountered a crash around rte_ring dequeues on ppc64.
It uses a single consumer and multiple producers for an rte_ring.
The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().

The reordered loads happened on r->prod.tail in
__rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
DEQUEUE_PTRS() (rte_ring.h). There is a control dependency between
the two loads, but it does not prevent the CPU from reordering them.
Note that they are not reordered when __rte_ring_move_cons_head() is
called with is_sc != 1, because cmpset invokes a read barrier.

The stores paired with these loads are in ENQUEUE_PTRS() and
update_tail(). Simplified code around the reorder is the following.

Consumer                  Producer
load idx[ring]
                          store idx[ring]
                          store r->prod.tail
load r->prod.tail

In this case, the consumer loads the old idx[ring] but validates the
load against the new r->prod.tail, so it consumes stale data.
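The required ordering can be sketched with C11 atomics (a minimal
illustrative ring, not the real rte_ring code; the rte_smp_rmb() in
this patch plays the role of the acquire-load in the sketch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Minimal illustrative ring; names only mimic the report. */
struct ring {
	_Atomic uint32_t prod_tail;
	uint32_t slots[4];
};

/* Producer: store the slot, then release-store the new tail so the
 * slot contents are visible before the tail advances. */
static void produce(struct ring *r, uint32_t idx, uint32_t val)
{
	r->slots[idx & 3] = val;
	atomic_store_explicit(&r->prod_tail, idx + 1, memory_order_release);
}

/* Consumer: acquire-load the tail before reading the slot, so the
 * slot read cannot be hoisted above the tail check. */
static int consume(struct ring *r, uint32_t idx, uint32_t *val)
{
	uint32_t tail = atomic_load_explicit(&r->prod_tail,
					     memory_order_acquire);
	if (idx >= tail)
		return -1;	/* nothing published yet */
	*val = r->slots[idx & 3];
	return 0;
}
```

Without the acquire (or the explicit read barrier added below in the
is_sc/is_sp paths), the consumer's slot load may complete before the
tail load, which is exactly the crash SPDK observed.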

I added a read barrier in the case where __IS_SC is passed to
__rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
to avoid similar problems with a single producer.

Cc: sta...@dpdk.org

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_ring/rte_ring_generic.h | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/librte_ring/rte_ring_generic.h 
b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..477326180 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int 
is_sp,
return 0;
 
*new_head = *old_head + n;
-   if (is_sp)
+   if (is_sp) {
+   rte_smp_rmb();
r->prod.head = *new_head, success = 1;
-   else
+   } else
success = rte_atomic32_cmpset(&r->prod.head,
*old_head, *new_head);
} while (unlikely(success == 0));
@@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int 
is_sc,
return 0;
 
*new_head = *old_head + n;
-   if (is_sc)
+   if (is_sc) {
+   rte_smp_rmb();
r->cons.head = *new_head, success = 1;
-   else
+   } else
success = rte_atomic32_cmpset(&r->cons.head, *old_head,
*new_head);
} while (unlikely(success == 0));
-- 
2.17.1



[dpdk-dev] [PATCH] vfio: fix workaround of BAR0 mapping

2018-07-11 Thread Takeshi Yoshimura
The workaround of BAR0 mapping does not work if the BAR0 area is
smaller than the page size (64 KB on ppc64). In addition, we no
longer need the workaround on recent Linux because VFIO allows MSI-X
mapping (*). This fix simply skips the workaround if BAR0 is smaller
than a page.

(*): "vfio-pci: Allow mapping MSIX BAR", 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/vfio/pci/vfio_pci.c?h=v4.18-rc3&id=a32295c612c57990d17fb0f41e7134394b2f35f6
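The skip condition follows from the page-alignment arithmetic the
code uses; below is a standalone sketch (PAGE_SIZE is hardcoded to
the 64 KB ppc64 value, and can_map_around() is a hypothetical helper,
not a DPDK function):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: 64 KB pages as on ppc64; PAGE_MASK as in the EAL. */
#define PAGE_SIZE 0x10000u
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Returns 1 if the map-around workaround can split the BAR around the
 * page-aligned MSI-X table, 0 if the whole BAR would be skipped. */
static int can_map_around(uint32_t bar_size, uint32_t msix_off,
			  uint32_t msix_size)
{
	uint32_t table_start = msix_off & PAGE_MASK;
	uint32_t table_end = (msix_off + msix_size + ~PAGE_MASK) & PAGE_MASK;

	return !(table_start == 0 && table_end >= bar_size);
}
```

With a 16 KB BAR0, the page-aligned table spans the whole BAR, so the
workaround cannot apply and the old code wrongly skipped the BAR.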

Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")

Signed-off-by: Takeshi Yoshimura 
---
 drivers/bus/pci/linux/pci_vfio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..facda64bb 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -337,7 +337,7 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
mapped_pci_resource *vfio_res,
/* Skip this BAR */
return 0;
 
-   if (msix_table->bar_index == bar_index) {
+   if (msix_table->bar_index == bar_index && (uint64_t)PAGE_SIZE < 
bar->size) {
/*
 * VFIO will not let us map the MSI-X table,
 * but we can map around it.
-- 
2.17.1



[dpdk-dev] [PATCH v2] vfio: fix workaround of BAR0 mapping

2018-07-11 Thread Takeshi Yoshimura
The workaround of BAR0 mapping does not work if the BAR0 area is
smaller than the page size (64 KB on ppc64). In addition, we no
longer need the workaround on recent Linux because VFIO allows MSI-X
mapping (*). This fix simply skips the workaround if BAR0 is smaller
than a page.

(*): "vfio-pci: Allow mapping MSIX BAR",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6

Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")

Signed-off-by: Takeshi Yoshimura 
---

Fixed checkpatch warnings

 drivers/bus/pci/linux/pci_vfio.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..b8f4a11f4 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -337,7 +337,8 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
mapped_pci_resource *vfio_res,
/* Skip this BAR */
return 0;
 
-   if (msix_table->bar_index == bar_index) {
+   if (msix_table->bar_index == bar_index &&
+   bar->size > (uint64_t)PAGE_SIZE) {
/*
 * VFIO will not let us map the MSI-X table,
 * but we can map around it.
-- 
2.17.1



[dpdk-dev] [PATCH v3] vfio: fix workaround of BAR0 mapping

2018-07-13 Thread Takeshi Yoshimura
The workaround of BAR0 mapping gives up and immediately returns an
error if it cannot map around the MSI-X table. However, recent
versions of VFIO allow MSI-X mapping (*).

I changed the code not to return immediately but to try the mapping.
On old Linux, mmap just fails and returns the same error as the code
before my fix. On recent Linux, mmap succeeds, and this patch enables
running DPDK in specific environments (e.g., ppc64le with HGST NVMe).

(*): "vfio-pci: Allow mapping MSIX BAR",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6

Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")

Signed-off-by: Takeshi Yoshimura 
---

Thanks, Anatoly.

I updated the patch so that it does not affect the behavior of older
Linux and other environments as far as possible. This patch adds
another chance to mmap BAR0.

I noticed that the check at line 350 already includes the page-size
check, so this patch does not change that check.

Regards,
Takeshi

drivers/bus/pci/linux/pci_vfio.c | 35 ++-
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..eb9b8031d 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -348,24 +348,25 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
mapped_pci_resource *vfio_res,
table_start &= PAGE_MASK;
 
if (table_start == 0 && table_end >= bar->size) {
-   /* Cannot map this BAR */
-   RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", bar_index);
-   bar->size = 0;
-   bar->addr = 0;
-   return 0;
+   /* Cannot map around this BAR, but try */
+   RTE_LOG(DEBUG, EAL,
+   "Trying to map BAR%d that contains the MSI-X\n",
+   bar_index);
+   memreg[0].offset = bar->offset;
+   memreg[0].size = bar->size;
+   } else {
+   memreg[0].offset = bar->offset;
+   memreg[0].size = table_start;
+   memreg[1].offset = bar->offset + table_end;
+   memreg[1].size = bar->size - table_end;
+
+   RTE_LOG(DEBUG, EAL,
+   "Trying to map BAR%d that contains the MSI-X "
+   "table. Trying offsets: "
+   "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,
+   memreg[0].offset, memreg[0].size,
+   memreg[1].offset, memreg[1].size);
}
-
-   memreg[0].offset = bar->offset;
-   memreg[0].size = table_start;
-   memreg[1].offset = bar->offset + table_end;
-   memreg[1].size = bar->size - table_end;
-
-   RTE_LOG(DEBUG, EAL,
-   "Trying to map BAR%d that contains the MSI-X "
-   "table. Trying offsets: "
-   "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,
-   memreg[0].offset, memreg[0].size,
-   memreg[1].offset, memreg[1].size);
} else {
memreg[0].offset = bar->offset;
memreg[0].size = bar->size;
-- 
2.15.1



Re: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64

2018-07-16 Thread Takeshi Yoshimura
> Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> Having said that, load-load barrier can be expressed very  well with C11 
> memory
> model. I guess ppc64 supports C11 memory model. If so,
> Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> original issue?

Yes, the performance regression happens on non-x86 with a single
producer/consumer. The average latency of an enqueue increased from
21 ns to 24 ns in my simple experiment, but I think it is worth it.


I also tested the C11 rte_ring; however, it caused the same race
condition on ppc64. I tried to fix the C11 problem as well, but I
found the C11 rte_ring also had potentially incorrect choices of
memory orders, which caused another race condition on ppc64.

For example, __ATOMIC_ACQUIRE is passed to
__atomic_compare_exchange_n(), but I am not sure why a load-acquire
is used for the compare-exchange. Also, in update_tail(), rte_pause()
can be called before the data copy because ht->tail is loaded without
__atomic_load_n().

Memory orders are simply difficult to get right, so it might take a
bit longer to check whether the code is correct. I think I can fix
the C11 rte_ring in another patch.
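To illustrate, a minimal sketch of the kind of change I mean for
update_tail() (illustrative struct and names, not the actual C11
rte_ring code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative head/tail pair; names only mimic the rte_ring ones. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

/* Sketch of update_tail(): loading ht->tail with __atomic_load_n()
 * instead of a plain load keeps the wait loop from being reordered
 * with the surrounding data copy. */
static void update_tail(struct headtail *ht, uint32_t old_val,
			uint32_t new_val)
{
	/* wait for prior enqueues/dequeues to publish their tail update */
	while (__atomic_load_n(&ht->tail, __ATOMIC_ACQUIRE) != old_val)
		; /* rte_pause() in the real code */
	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
}
```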

2018-07-13 2:08 GMT+09:00 Jerin Jacob :
> -Original Message-
>> Date: Thu, 12 Jul 2018 11:44:14 +0900
>> From: Takeshi Yoshimura 
>> To: dev@dpdk.org
>> Cc: Takeshi Yoshimura , sta...@dpdk.org, Takeshi
>>  Yoshimura 
>> Subject: [dpdk-dev] [PATCH] rte_ring: fix racy dequeue/enqueue in ppc64
>> X-Mailer: git-send-email 2.15.1
>>
>> External Email
>>
>> SPDK blobfs encountered a crash around rte_ring dequeues in ppc64.
>> It uses a single consumer and multiple producers for a rte_ring.
>> The problem was a load-load reorder in rte_ring_sc_dequeue_bulk().
>
> Adding rte_smp_rmb() cause performance regression on non x86 platforms.
> Having said that, load-load barrier can be expressed very  well with C11 
> memory
> model. I guess ppc64 supports C11 memory model. If so,
> Could you try CONFIG_RTE_RING_USE_C11_MEM_MODEL=y for ppc64 and check
> original issue?
>
>>
>> The reordered loads happened on r->prod.tail in
>> __rte_ring_move_cons_head() (rte_ring_generic.h) and ring[idx] in
>> DEQUEUE_PTRS() (rte_ring.h). They have a load-load control
>> dependency, but the code does not satisfy it. Note that they are
>> not reordered if __rte_ring_move_cons_head() with is_sc != 1 because
>> cmpset invokes a read barrier.
>>
>> The paired stores on these loads are in ENQUEUE_PTRS() and
>> update_tail(). Simplified code around the reorder is the following.
>>
>> Consumer Producer
>> load idx[ring]
>>  store idx[ring]
>>  store r->prod.tail
>> load r->prod.tail
>>
>> In this case, the consumer loads old idx[ring] and confirms the load
>> is valid with the new r->prod.tail.
>>
>> I added a read barrier in the case where __IS_SC is passed to
>> __rte_ring_move_cons_head(). I also fixed __rte_ring_move_prod_head()
>> to avoid similar problems with a single producer.
>>
>> Cc: sta...@dpdk.org
>>
>> Signed-off-by: Takeshi Yoshimura 
>> ---
>>  lib/librte_ring/rte_ring_generic.h | 10 ++
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/librte_ring/rte_ring_generic.h 
>> b/lib/librte_ring/rte_ring_generic.h
>> index ea7dbe5b9..477326180 100644
>> --- a/lib/librte_ring/rte_ring_generic.h
>> +++ b/lib/librte_ring/rte_ring_generic.h
>> @@ -90,9 +90,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned 
>> int is_sp,
>> return 0;
>>
>> *new_head = *old_head + n;
>> -   if (is_sp)
>> +   if (is_sp) {
>> +   rte_smp_rmb();
>> r->prod.head = *new_head, success = 1;
>> -   else
>> +   } else
>> success = rte_atomic32_cmpset(&r->prod.head,
>> *old_head, *new_head);
>> } while (unlikely(success == 0));
>> @@ -158,9 +159,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned 
>> int is_sc,
>> return 0;
>>
>> *new_head = *old_head + n;
>> -   if (is_sc)
>> +   if (is_sc) {
>> +   rte_smp_rmb();
>> r->cons.head = *new_head, success = 1;
>> -   else
>> +   } else
>> success = rte_atomic32_cmpset(&r->cons.head, 
>> *old_head,
>> *new_head);
>> } while (unlikely(success == 0));
>> --
>> 2.17.1
>>


Re: [dpdk-dev] [PATCH v3] vfio: fix workaround of BAR0 mapping

2018-07-17 Thread Takeshi Yoshimura
2018-07-13 20:08 GMT+09:00 Burakov, Anatoly :
> On 13-Jul-18 12:00 PM, Burakov, Anatoly wrote:
>>
>> On 13-Jul-18 11:11 AM, Takeshi Yoshimura wrote:
>>>
>>> The workaround of BAR0 mapping gives up and immediately returns an
>>> error if it cannot map around the MSI-X. However, recent version
>>> of VFIO allows MSIX mapping (*).
>>>
>>> I fixed not to return immediately but try mapping. In old Linux, mmap
>>> just fails and returns the same error as the code before my fix . In
>>> recent Linux, mmap succeeds and this patch enables running DPDK in
>>> specific environments (e.g., ppc64le with HGST NVMe)
>>>
>>> (*): "vfio-pci: Allow mapping MSIX BAR",
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
>>> commit/id=a32295c612c57990d17fb0f41e7134394b2f35f6
>>>
>>> Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")
>>>
>>> Signed-off-by: Takeshi Yoshimura 
>>> ---
>>>
>>> Thanks, Anatoly.
>>>
>>> I updated the patch not to affect behaviors of older Linux and
>>> other environments as well as possible. This patch adds another
>>> chance to mmap BAR0.
>>>
>>> I noticed that the check at line 350 already includes the check
>>> of page size, so this patch does not fix the check.
>>>
>>> Regards,
>>> Takeshi
>>
>>
>> Hi Takeshi,
>>
>> Please correct me if i'm wrong, but i'm not sure the old behavior is kept.
>>
>> Let's say we're running an old kernel, which doesn't allow mapping MSI-X
>> BARs. If MSI-X starts at beginning of the BAR (floor-aligned to page size),
>> and ends at or beyond end of BAR (ceiling-aligned to page size). In that
>> situation, old code just skipped the BAR and returned 0.
>>
>> We then exited the function, and there's a check for return value right
>> after pci_vfio_mmap_bar() that stop continuing if we fail to map something.
>> In the old code, we would continue as we went, and finish the rest of our
>> mappings. With your new code, you're attempting to map the BAR, it fails,
>> and you will return -1 on older kernels.
>>
>> I believe what we really need here is the following:
>>
>> 1) If this is a BAR containing MSI-X vector, first try mapping the entire
>> BAR. If it succeeds, great - that would be your new kernel behavior.
>> 2) If we failed on step 1), check to see if we can map around the BAR. If
>> we can, try to map around it like the current code does. If we cannot map
>> around it (i.e. if MSI-X vector, page aligned, occupies entire BAR), then we
>> simply return 0 and skip the BAR.
>>
>> That, i would think, would keep the old behavior and enable the new one.
>>
>> Does that make sense?
>>
>
> I envision this to look something like this:
>
> bool again = false;
> do {
> if (again) {
> // set up mmap-around
> if (cannot map around)
> return 0;
> }
> // try mapping
> if (map_failed && msix_table->bar_index == bar_index) {
> again = true;
> continue;
> }
> if (map_failed)
> return -1;
> break/return 0;
> } while (again);
>
> --
> Thanks,
> Anatoly

That makes sense. The return code was not the same as the old one in
some paths.

I wrote the code based on your idea. It works at least on my ppc64
and x86 machines, but I am concerned that the error messages from
pci_map_resource() may confuse users on old Linux. I saw a message
like this (even though I could mmap):
EAL: pci_map_resource(): cannot mmap(15, 0x728ee3a3, 0x4000, 0x0):
Invalid argument (0x)

Anyway, I will send it in the next email; please check whether there
are any other problems in the code.

Thanks,
Takeshi


[dpdk-dev] [PATCH v4] vfio: fix workaround of BAR0 mapping

2018-07-17 Thread Takeshi Yoshimura
The workaround of BAR0 mapping gives up and immediately returns an
error if it cannot map around the MSI-X table. However, recent
versions of VFIO allow MSI-X mapping (*).

I changed the code not to return immediately but to try the mapping.
On old Linux, mmap just fails and returns the same error as the code
before my fix. On recent Linux, mmap succeeds, and this patch enables
running DPDK in specific environments (e.g., ppc64le with HGST NVMe).

(*): "vfio-pci: Allow mapping MSIX BAR",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6

Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")

Signed-off-by: Takeshi Yoshimura 
---
 drivers/bus/pci/linux/pci_vfio.c | 92 ++--
 1 file changed, 51 insertions(+), 41 deletions(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..afdf0f6d5 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -332,50 +332,58 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
mapped_pci_resource *vfio_res,
void *bar_addr;
struct pci_msix_table *msix_table = &vfio_res->msix_table;
struct pci_map *bar = &vfio_res->maps[bar_index];
+   bool again = false;
 
if (bar->size == 0)
/* Skip this BAR */
return 0;
 
-   if (msix_table->bar_index == bar_index) {
-   /*
-* VFIO will not let us map the MSI-X table,
-* but we can map around it.
-*/
-   uint32_t table_start = msix_table->offset;
-   uint32_t table_end = table_start + msix_table->size;
-   table_end = (table_end + ~PAGE_MASK) & PAGE_MASK;
-   table_start &= PAGE_MASK;
-
-   if (table_start == 0 && table_end >= bar->size) {
-   /* Cannot map this BAR */
-   RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", bar_index);
-   bar->size = 0;
-   bar->addr = 0;
-   return 0;
-   }
-
-   memreg[0].offset = bar->offset;
-   memreg[0].size = table_start;
-   memreg[1].offset = bar->offset + table_end;
-   memreg[1].size = bar->size - table_end;
-
-   RTE_LOG(DEBUG, EAL,
-   "Trying to map BAR%d that contains the MSI-X "
-   "table. Trying offsets: "
-   "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,
-   memreg[0].offset, memreg[0].size,
-   memreg[1].offset, memreg[1].size);
-   } else {
-   memreg[0].offset = bar->offset;
-   memreg[0].size = bar->size;
-   }
-
/* reserve the address using an inaccessible mapping */
bar_addr = mmap(bar->addr, bar->size, 0, MAP_PRIVATE |
MAP_ANONYMOUS | additional_flags, -1, 0);
-   if (bar_addr != MAP_FAILED) {
+   if (bar_addr == MAP_FAILED) {
+   RTE_LOG(ERR, EAL,
+   "Failed to create inaccessible mapping for BAR%d\n",
+   bar_index);
+   return -1;
+   }
+
+   memreg[0].offset = bar->offset;
+   memreg[0].size = bar->size;
+   do {
void *map_addr = NULL;
+   if (again) {
+   /*
+* VFIO did not let us map the MSI-X table,
+* but we can map around it.
+*/
+   uint32_t table_start = msix_table->offset;
+   uint32_t table_end = table_start + msix_table->size;
+   table_end = (table_end + ~PAGE_MASK) & PAGE_MASK;
+   table_start &= PAGE_MASK;
+
+   if (table_start == 0 && table_end >= bar->size) {
+   /* Cannot map this BAR */
+   RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n",
+   bar_index);
+   bar->size = 0;
+   bar->addr = 0;
+   return 0;
+   }
+
+   memreg[0].offset = bar->offset;
+   memreg[0].size = table_start;
+   memreg[1].offset = bar->offset + table_end;
+   memreg[1].size = bar->size - table_end;
+
+   RTE_LOG(DEBUG, EAL,
+   "Trying to map BAR%d that contains the MSI-X "
+   "table. Trying offsets: "
+   "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,

[dpdk-dev] [PATCH v5] vfio: fix workaround of BAR mapping

2018-07-20 Thread Takeshi Yoshimura
Currently, VFIO will try to map around the MSI-X table in the BARs.
When the (page-aligned) MSI-X table size is equal to the
(page-aligned) BAR size, VFIO will just skip the BAR.

Recent kernel versions allow VFIO to map the entire BAR containing
the MSI-X table (*), so instead of trying to map around the MSI-X
table, or skipping the BAR entirely if that is not possible, we can
now try mapping the entire BAR first. If mapping the entire BAR does
not succeed, fall back to the old behavior of mapping around the
MSI-X table or skipping the BAR.
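The fallback order can be sketched with mocked mmap results
(allow_msix_mmap models the kernel version; mock_map() and map_bar()
are illustrative helpers, not the real pci_vfio_mmap_bar()):

```c
#include <assert.h>

/* 1 on kernels 4.16+, which allow mmap over the MSI-X table. */
static int allow_msix_mmap;

static int mock_map(int covers_msix_table)
{
	if (covers_msix_table && !allow_msix_mmap)
		return -1; /* old kernels reject mmap over the MSI-X table */
	return 0;
}

/* returns 1 = whole BAR mapped, 2 = mapped around MSI-X, 0 = skipped */
static int map_bar(int bar_has_msix, int can_map_around)
{
	if (!bar_has_msix)
		return mock_map(0) == 0 ? 1 : -1;
	if (mock_map(1) == 0)
		return 1;	/* new kernel: entire BAR mapped first try */
	if (!can_map_around)
		return 0;	/* MSI-X fills the BAR: skip, as before */
	return mock_map(0) == 0 ? 2 : -1; /* old behavior: map around it */
}
```

On an old kernel this degrades to exactly the previous map-around or
skip behavior, so only new kernels see the new whole-BAR mapping.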

(*): "vfio-pci: Allow mapping MSIX BAR",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6

Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables")

Signed-off-by: Takeshi Yoshimura 
Reviewed-by: Anatoly Burakov 
---

Thanks, Anatoly.
I updated the code with munmap in an error path.
I also fixed the message and the wrong link.

Regards,
Takeshi

 drivers/bus/pci/linux/pci_vfio.c | 93 ++--
 1 file changed, 52 insertions(+), 41 deletions(-)

diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c
index aeeaa9ed8..07188c071 100644
--- a/drivers/bus/pci/linux/pci_vfio.c
+++ b/drivers/bus/pci/linux/pci_vfio.c
@@ -332,50 +332,59 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
mapped_pci_resource *vfio_res,
void *bar_addr;
struct pci_msix_table *msix_table = &vfio_res->msix_table;
struct pci_map *bar = &vfio_res->maps[bar_index];
+   bool again = false;
 
if (bar->size == 0)
/* Skip this BAR */
return 0;
 
-   if (msix_table->bar_index == bar_index) {
-   /*
-* VFIO will not let us map the MSI-X table,
-* but we can map around it.
-*/
-   uint32_t table_start = msix_table->offset;
-   uint32_t table_end = table_start + msix_table->size;
-   table_end = (table_end + ~PAGE_MASK) & PAGE_MASK;
-   table_start &= PAGE_MASK;
-
-   if (table_start == 0 && table_end >= bar->size) {
-   /* Cannot map this BAR */
-   RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", bar_index);
-   bar->size = 0;
-   bar->addr = 0;
-   return 0;
-   }
-
-   memreg[0].offset = bar->offset;
-   memreg[0].size = table_start;
-   memreg[1].offset = bar->offset + table_end;
-   memreg[1].size = bar->size - table_end;
-
-   RTE_LOG(DEBUG, EAL,
-   "Trying to map BAR%d that contains the MSI-X "
-   "table. Trying offsets: "
-   "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index,
-   memreg[0].offset, memreg[0].size,
-   memreg[1].offset, memreg[1].size);
-   } else {
-   memreg[0].offset = bar->offset;
-   memreg[0].size = bar->size;
-   }
-
/* reserve the address using an inaccessible mapping */
bar_addr = mmap(bar->addr, bar->size, 0, MAP_PRIVATE |
MAP_ANONYMOUS | additional_flags, -1, 0);
-   if (bar_addr != MAP_FAILED) {
+   if (bar_addr == MAP_FAILED) {
+   RTE_LOG(ERR, EAL,
+   "Failed to create inaccessible mapping for BAR%d\n",
+   bar_index);
+   return -1;
+   }
+
+   memreg[0].offset = bar->offset;
+   memreg[0].size = bar->size;
+   do {
void *map_addr = NULL;
+   if (again) {
+   /*
+* VFIO did not let us map the MSI-X table,
+* but we can map around it.
+*/
+   uint32_t table_start = msix_table->offset;
+   uint32_t table_end = table_start + msix_table->size;
+   table_end = (table_end + ~PAGE_MASK) & PAGE_MASK;
+   table_start &= PAGE_MASK;
+
+   if (table_start == 0 && table_end >= bar->size) {
+   /* Cannot map this BAR */
+   RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n",
+   bar_index);
+   munmap(bar_addr, bar->size);
+   bar->size = 0;
+   bar->addr = 0;
+   return 0;
+   }
+
+   memreg[0].offset = bar->offset;
+   memreg[0].size = table_start;
+   memreg[1]

Re: [dpdk-dev] [PATCH 18.11] pci/vfio: allow mapping MSI-X BARs if kernel allows it

2018-07-31 Thread Takeshi Yoshimura
2018-07-30 20:17 GMT+09:00 Anatoly Burakov :
> Currently, DPDK will skip mapping some areas (or even an entire BAR)
> if MSI-X happens to be in it but is smaller than page address.
>
> Kernels 4.16+ will allow mapping MSI-X BARs [1], and will report this
> as a capability flag. Capability flags themselves are also only
> supported since kernel 4.6 [2].
>
> This commit will introduce support for checking VFIO capabilities,
> and will use it to check if we are allowed to map BARs with MSI-X
> tables in them, along with backwards compatibility for older
> kernels, including a workaround for a variable rename in VFIO
> region info structure [3].
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=c84982adb23bcf3b99b79ca33527cd2625fbe279
>
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=ff63eb638d63b95e489f976428f1df01391e15e4
>
> Signed-off-by: Anatoly Burakov 
> ---
>  drivers/bus/pci/linux/pci_vfio.c | 127 ---
>  lib/librte_eal/common/include/rte_vfio.h |  26 +
>  2 files changed, 140 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/bus/pci/linux/pci_vfio.c 
> b/drivers/bus/pci/linux/pci_vfio.c
> index 686386d6a..e7765ee11 100644
> --- a/drivers/bus/pci/linux/pci_vfio.c
> +++ b/drivers/bus/pci/linux/pci_vfio.c
> @@ -415,6 +415,88 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
> mapped_pci_resource *vfio_res,
> return 0;
>  }
>
> +/*
> + * region info may contain capability headers, so we need to keep 
> reallocating
> + * the memory until we match allocated memory size with argsz.
> + */
> +static int
> +pci_vfio_get_region_info(int vfio_dev_fd, struct vfio_region_info **info,
> +   int region)
> +{
> +   struct vfio_region_info *ri;
> +   size_t argsz = sizeof(*ri);
> +   int ret;
> +
> +   ri = malloc(sizeof(*ri));
> +   if (ri == NULL) {
> +   RTE_LOG(ERR, EAL, "Cannot allocate memory for region info\n");
> +   return -1;
> +   }
> +again:
> +   memset(ri, 0, argsz);
> +   ri->argsz = argsz;
> +   ri->index = region;
> +
> +   ret = ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, info);
> +   if (ret) {
> +   free(ri);
> +   return ret;
> +   }
> +   if (ri->argsz != argsz) {
> +   argsz = ri->argsz;
> +   ri = realloc(ri, argsz);
> +
> +   if (ri == NULL) {
> +   RTE_LOG(ERR, EAL, "Cannot reallocate memory for 
> region info\n");
> +   return -1;
> +   }
> +   goto again;
> +   }
> +   *info = ri;
> +
> +   return 0;
> +}
> +
> +static struct vfio_info_cap_header *
> +pci_vfio_info_cap(struct vfio_region_info *info, int cap)
> +{
> +   struct vfio_info_cap_header *h;
> +   size_t offset;
> +
> +   if ((info->flags & RTE_VFIO_INFO_FLAG_CAPS) == 0) {
> +   /* VFIO info does not advertise capabilities */
> +   return NULL;
> +   }
> +
> +   offset = VFIO_CAP_OFFSET(info);
> +   while (offset != 0) {
> +   h = RTE_PTR_ADD(info, offset);
> +   if (h->id == cap)
> +   return h;
> +   offset = h->next;
> +   }
> +   return NULL;
> +}
> +
> +static int
> +pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region)
> +{
> +   struct vfio_region_info *info;
> +   int ret;
> +
> +   ret = pci_vfio_get_region_info(vfio_dev_fd, &info, msix_region);
> +   if (ret < 0)
> +   return -1;
> +
> +   ret = pci_vfio_info_cap(info, RTE_VFIO_CAP_MSIX_MAPPABLE) != NULL;
> +
> +   /* cleanup */
> +   free(info);
> +
> +   return ret;
> +}
> +
> +
>  static int
>  pci_vfio_map_resource_primary(struct rte_pci_device *dev)
>  {
> @@ -464,56 +546,75 @@ pci_vfio_map_resource_primary(struct rte_pci_device 
> *dev)
> if (ret < 0) {
> RTE_LOG(ERR, EAL, "  %s cannot get MSI-X BAR number!\n",
> pci_addr);
> -   goto err_vfio_dev_fd;
> +   goto err_vfio_res;
> +   }
> +   /* if we found our MSI-X BAR region, check if we can mmap it */
> +   if (vfio_res->msix_table.bar_index != -1) {
> +   int ret = pci_vfio_msix_is_mappable(vfio_dev_fd,
> +   vfio_res->msix_table.bar_index);
> +   if (ret < 0) {
> +   RTE_LOG(ERR, EAL, "Couldn't check if MSI-X BAR is 
> mappable\n");
> +   goto err_vfio_res;
> +   } else if (ret != 0) {
> +   /* we can map it, so we don't care where it is */
> +   RTE_LOG(DEBUG, EAL, "VFIO reports MSI-X BAR as 
> mappable\n");
> +   vfio_res->msix_tab

Re: [dpdk-dev] [PATCH 18.11 v2] pci/vfio: allow mapping MSI-X BARs if kernel allows it

2018-08-01 Thread Takeshi Yoshimura
2018-07-31 20:28 GMT+09:00 Anatoly Burakov :
> Currently, DPDK will skip mapping some areas (or even an entire BAR)
> if MSI-X table happens to be in them but is smaller than page size.
>
> Kernels 4.16+ will allow mapping MSI-X BARs [1], and will report this
> as a capability flag. Capability flags themselves are also only
> supported since kernel 4.6 [2].
>
> This commit will introduce support for checking VFIO capabilities,
> and will use it to check if we are allowed to map BARs with MSI-X
> tables in them, along with backwards compatibility for older
> kernels, including a workaround for a variable rename in VFIO
> region info structure [3].
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=c84982adb23bcf3b99b79ca33527cd2625fbe279
>
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
> linux.git/commit/?id=ff63eb638d63b95e489f976428f1df01391e15e4
>
> Signed-off-by: Anatoly Burakov 
> ---
>
> Notes:
> v2->v1:
> - Fix pointer in pci_vfio_get_region_info
> - Fix commit message
>
>  drivers/bus/pci/linux/pci_vfio.c | 127 ---
>  lib/librte_eal/common/include/rte_vfio.h |  26 +
>  2 files changed, 140 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/bus/pci/linux/pci_vfio.c 
> b/drivers/bus/pci/linux/pci_vfio.c
> index 686386d6a..24f665c20 100644
> --- a/drivers/bus/pci/linux/pci_vfio.c
> +++ b/drivers/bus/pci/linux/pci_vfio.c
> @@ -415,6 +415,88 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct 
> mapped_pci_resource *vfio_res,
> return 0;
>  }
>
> +/*
> + * region info may contain capability headers, so we need to keep 
> reallocating
> + * the memory until we match allocated memory size with argsz.
> + */
> +static int
> +pci_vfio_get_region_info(int vfio_dev_fd, struct vfio_region_info **info,
> +   int region)
> +{
> +   struct vfio_region_info *ri;
> +   size_t argsz = sizeof(*ri);
> +   int ret;
> +
> +   ri = malloc(sizeof(*ri));
> +   if (ri == NULL) {
> +   RTE_LOG(ERR, EAL, "Cannot allocate memory for region info\n");
> +   return -1;
> +   }
> +again:
> +   memset(ri, 0, argsz);
> +   ri->argsz = argsz;
> +   ri->index = region;
> +
> +   ret = ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, ri);
> +   if (ret) {
> +   free(ri);
> +   return ret;
> +   }
> +   if (ri->argsz != argsz) {
> +   argsz = ri->argsz;
> +   ri = realloc(ri, argsz);
> +
> +   if (ri == NULL) {
> +   RTE_LOG(ERR, EAL, "Cannot reallocate memory for 
> region info\n");
> +   return -1;
> +   }
> +   goto again;
> +   }
> +   *info = ri;
> +
> +   return 0;
> +}
> +
> +static struct vfio_info_cap_header *
> +pci_vfio_info_cap(struct vfio_region_info *info, int cap)
> +{
> +   struct vfio_info_cap_header *h;
> +   size_t offset;
> +
> +   if ((info->flags & RTE_VFIO_INFO_FLAG_CAPS) == 0) {
> +   /* VFIO info does not advertise capabilities */
> +   return NULL;
> +   }
> +
> +   offset = VFIO_CAP_OFFSET(info);
> +   while (offset != 0) {
> +   h = RTE_PTR_ADD(info, offset);
> +   if (h->id == cap)
> +   return h;
> +   offset = h->next;
> +   }
> +   return NULL;
> +}
> +
> +static int
> +pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region)
> +{
> +   struct vfio_region_info *info;
> +   int ret;
> +
> +   ret = pci_vfio_get_region_info(vfio_dev_fd, &info, msix_region);
> +   if (ret < 0)
> +   return -1;
> +
> +   ret = pci_vfio_info_cap(info, RTE_VFIO_CAP_MSIX_MAPPABLE) != NULL;
> +
> +   /* cleanup */
> +   free(info);
> +
> +   return ret;
> +}
> +
> +
>  static int
>  pci_vfio_map_resource_primary(struct rte_pci_device *dev)
>  {
> @@ -464,56 +546,75 @@ pci_vfio_map_resource_primary(struct rte_pci_device 
> *dev)
> if (ret < 0) {
> RTE_LOG(ERR, EAL, "  %s cannot get MSI-X BAR number!\n",
> pci_addr);
> -   goto err_vfio_dev_fd;
> +   goto err_vfio_res;
> +   }
> +   /* if we found our MSI-X BAR region, check if we can mmap it */
> +   if (vfio_res->msix_table.bar_index != -1) {
> +   int ret = pci_vfio_msix_is_mappable(vfio_dev_fd,
> +   vfio_res->msix_table.bar_index);
> +   if (ret < 0) {
> +   RTE_LOG(ERR, EAL, "Couldn't check if MSI-X BAR is 
> mappable\n");
> +   goto err_vfio_res;
> +   } else if (ret != 0) {
> +   /* we can map it, so we don't care where it is */
> +   RTE

[dpdk-dev] [PATCH] eal/vfio: fix sPAPR IOMMU mapping

2018-08-06 Thread Takeshi Yoshimura
Commit 73a639085938 ("vfio: allow to map other memory regions")
introduced a bug in sPAPR IOMMU mapping: it removed the necessary
VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl. Also, vfio_spapr_map_walk
should call vfio_spapr_dma_do_map instead of vfio_spapr_dma_mem_map.
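The restored ordering can be sketched with a mocked ioctl (the
request codes and helpers below are illustrative, not the kernel
UAPI; only the register-before-map ordering is the point):

```c
#include <assert.h>

/* With the sPAPR IOMMU, a vaddr range must be registered via
 * VFIO_IOMMU_SPAPR_REGISTER_MEMORY before VFIO_IOMMU_MAP_DMA
 * may succeed; the mock below just tracks that state. */
enum req { REGISTER_MEMORY, MAP_DMA, UNMAP_DMA, UNREGISTER_MEMORY };

static int registered; /* is the vaddr range registered with the IOMMU? */

static int mock_ioctl(enum req request)
{
	switch (request) {
	case REGISTER_MEMORY:   registered = 1; return 0;
	case MAP_DMA:           return registered ? 0 : -1; /* kernel rejects */
	case UNMAP_DMA:         return 0;
	case UNREGISTER_MEMORY: registered = 0; return 0;
	}
	return -1;
}

/* mirrors the shape of vfio_spapr_dma_do_map() after the fix */
static int spapr_dma_do_map(int do_map)
{
	if (do_map) {
		if (mock_ioctl(REGISTER_MEMORY))
			return -1;	/* the ioctl the commit restores */
		return mock_ioctl(MAP_DMA);
	}
	if (mock_ioctl(UNMAP_DMA))
		return -1;
	return mock_ioctl(UNREGISTER_MEMORY);
}
```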

Fixes: 73a639085938 ("vfio: allow to map other memory regions")

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linuxapp/eal/eal_vfio.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c 
b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index c68dc38e0..68e862946 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1145,8 +1145,22 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
struct vfio_iommu_type1_dma_map dma_map;
struct vfio_iommu_type1_dma_unmap dma_unmap;
int ret;
+   struct vfio_iommu_spapr_register_memory reg = {
+   .argsz = sizeof(reg),
+   .flags = 0
+   };
+   reg.vaddr = (uintptr_t) vaddr;
+   reg.size = len;
 
if (do_map != 0) {
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_REGISTER_MEMORY, ®);
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot register vaddr for IOMMU, "
+   "error %i (%s)\n", errno, strerror(errno));
+   return -1;
+   }
+
memset(&dma_map, 0, sizeof(dma_map));
dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
dma_map.vaddr = vaddr;
@@ -1163,13 +1177,6 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
}
 
} else {
-   struct vfio_iommu_spapr_register_memory reg = {
-   .argsz = sizeof(reg),
-   .flags = 0
-   };
-   reg.vaddr = (uintptr_t) vaddr;
-   reg.size = len;
-
ret = ioctl(vfio_container_fd,
VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
if (ret) {
@@ -1201,7 +1208,7 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl 
__rte_unused,
 {
int *vfio_container_fd = arg;
 
-   return vfio_spapr_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
+   return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
ms->len, 1);
 }
 
-- 
2.15.1



[dpdk-dev] [PATCH v2] eal/vfio: fix sPAPR IOMMU mapping

2018-08-06 Thread Takeshi Yoshimura
Commit 73a639085938 ("vfio: allow to map other memory regions")
introduced a bug in sPAPR IOMMU mapping: it removed the necessary
VFIO_IOMMU_SPAPR_REGISTER_MEMORY ioctl. Also, vfio_spapr_map_walk
should call vfio_spapr_dma_do_map instead of vfio_spapr_dma_mem_map.

Fixes: 73a639085938 ("vfio: allow to map other memory regions")

Cc: sta...@dpdk.org

Signed-off-by: Takeshi Yoshimura 
---

v2: Added Cc in message

 lib/librte_eal/linuxapp/eal/eal_vfio.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c 
b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index c68dc38e0..68e862946 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -1145,8 +1145,22 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
struct vfio_iommu_type1_dma_map dma_map;
struct vfio_iommu_type1_dma_unmap dma_unmap;
int ret;
+   struct vfio_iommu_spapr_register_memory reg = {
+   .argsz = sizeof(reg),
+   .flags = 0
+   };
+   reg.vaddr = (uintptr_t) vaddr;
+   reg.size = len;
 
if (do_map != 0) {
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_REGISTER_MEMORY, ®);
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot register vaddr for IOMMU, "
+   "error %i (%s)\n", errno, strerror(errno));
+   return -1;
+   }
+
memset(&dma_map, 0, sizeof(dma_map));
dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
dma_map.vaddr = vaddr;
@@ -1163,13 +1177,6 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
}
 
} else {
-   struct vfio_iommu_spapr_register_memory reg = {
-   .argsz = sizeof(reg),
-   .flags = 0
-   };
-   reg.vaddr = (uintptr_t) vaddr;
-   reg.size = len;
-
ret = ioctl(vfio_container_fd,
VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
if (ret) {
@@ -1201,7 +1208,7 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl 
__rte_unused,
 {
int *vfio_container_fd = arg;
 
-   return vfio_spapr_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
+   return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
ms->len, 1);
 }
 
-- 
2.15.1



[dpdk-dev] [PATCH] vfio: fix VFIO mapping failures in ppc64le

2020-01-16 Thread Takeshi Yoshimura
ppc64le failed when using large physical memory. I found problems in two of
my past commits.

In commit e072d16f8920 ("vfio: fix expanding DMA area in ppc64le"), I added
a sanity check using a mapped address to resolve an issue around expanding
the IOMMU window, but this was not enough, since memory allocation can
return memory anywhere depending on memory fragmentation. DPDK could still
skip DMA mapping and attempt to unmap non-mapped DMA while expanding the
IOMMU window. As a result, SPDK apps using large physical memory frequently
failed to communicate with NVMe and/or went into an infinite loop.

The root cause of the bug was a gap between the memory segments managed by
DPDK and the firmware-level DMA mapping. DPDK's memory segments don't track
the state of DMA mapping, so memseg_walk cannot determine whether an
iterated memory segment is mapped or not. This resulted in incorrect DMA
maps and unmaps.

This time, I added code to avoid iterating over non-mapped memory segments
during DMA mapping. memseg_walk iterates over memory segments marked as
"used", so the code transiently marks the segments that will be mapped or
unmapped as "free".

The commit db90b4969e2e ("vfio: retry creating sPAPR DMA window") allows
retrying different page levels and sizes when creating the DMA window.
However, this permits page sizes different from the hugepage sizes, and the
inconsistency caused failures during DMA mapping after window creation.
This patch changes the retry logic to vary only the page levels.

Fixes: e072d16f8920 ("vfio: fix expanding DMA area in ppc64le")
Fixes: db90b4969e2e ("vfio: retry creating sPAPR DMA window")

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 76 +
 1 file changed, 33 insertions(+), 43 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index 95f615c2e..01b5ef3f4 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -532,6 +532,17 @@ vfio_mem_event_callback(enum rte_mem_event type, const 
void *addr, size_t len,
return;
}
 
+#ifdef RTE_ARCH_PPC_64
+   ms = rte_mem_virt2memseg(addr, msl);
+   while (cur_len < len) {
+   int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+
+   rte_fbarray_set_free(&msl->memseg_arr, idx);
+   cur_len += ms->len;
+   ++ms;
+   }
+   cur_len = 0;
+#endif
/* memsegs are contiguous in memory */
ms = rte_mem_virt2memseg(addr, msl);
while (cur_len < len) {
@@ -551,6 +562,17 @@ vfio_mem_event_callback(enum rte_mem_event type, const 
void *addr, size_t len,
cur_len += ms->len;
++ms;
}
+#ifdef RTE_ARCH_PPC_64
+   cur_len = 0;
+   ms = rte_mem_virt2memseg(addr, msl);
+   while (cur_len < len) {
+   int idx = rte_fbarray_find_idx(&msl->memseg_arr, ms);
+
+   rte_fbarray_set_used(&msl->memseg_arr, idx);
+   cur_len += ms->len;
+   ++ms;
+   }
+#endif
 }
 
 static int
@@ -1416,16 +1438,11 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
return 0;
 }
 
-struct spapr_remap_walk_param {
-   int vfio_container_fd;
-   uint64_t addr_64;
-};
-
 static int
 vfio_spapr_map_walk(const struct rte_memseg_list *msl,
const struct rte_memseg *ms, void *arg)
 {
-   struct spapr_remap_walk_param *param = arg;
+   int *vfio_container_fd = arg;
 
/* skip external memory that isn't a heap */
if (msl->external && !msl->heap)
@@ -1435,10 +1452,7 @@ vfio_spapr_map_walk(const struct rte_memseg_list *msl,
if (ms->iova == RTE_BAD_IOVA)
return 0;
 
-   if (ms->addr_64 == param->addr_64)
-   return 0;
-
-   return vfio_spapr_dma_do_map(param->vfio_container_fd, ms->addr_64, 
ms->iova,
+   return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
ms->len, 1);
 }
 
@@ -1446,7 +1460,7 @@ static int
 vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
const struct rte_memseg *ms, void *arg)
 {
-   struct spapr_remap_walk_param *param = arg;
+   int *vfio_container_fd = arg;
 
/* skip external memory that isn't a heap */
if (msl->external && !msl->heap)
@@ -1456,17 +1470,13 @@ vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
if (ms->iova == RTE_BAD_IOVA)
return 0;
 
-   if (ms->addr_64 == param->addr_64)
-   return 0;
-
-   return vfio_spapr_dma_do_map(param->vfio_container_fd, ms->addr_64, 
ms->iova,
+   return vfio_spapr_dma_do_map(*vfio_container_fd, ms->a

[dpdk-dev] [PATCH] vfio: fix build errors on old Linux

2019-07-10 Thread Takeshi Yoshimura
The commit db90b4969e2e ("vfio: retry creating sPAPR DMA window")
introduced a build breakage on old Linux. Linux <4.2 does not define ddw in
struct vfio_iommu_spapr_tce_info. Without ddw, we cannot change the window
size and so should give up on the creation. I simply excluded the retry
code when ddw is not supported.

Fixes: db90b4969e2e ("vfio: retry creating sPAPR DMA window")

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index 7053ebe7d..fadef427f 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -1445,6 +1445,7 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
/* create new DMA window */
ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
if (ret) {
+#ifdef VFIO_IOMMU_SPAPR_INFO_DDW
/* try possible page_shift and levels for workaround */
uint32_t levels;
 
@@ -1463,6 +1464,7 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
if (!ret)
break;
}
+#endif
if (ret) {
RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
"error %i (%s)\n", errno, 
strerror(errno));
-- 
2.17.2 (Apple Git-113)



[dpdk-dev] [PATCH v3] vfio: fix expanding DMA area in ppc64le

2019-07-12 Thread Takeshi Yoshimura
In ppc64le, expanding DMA areas always fails because we cannot remove
a DMA window. As a result, we cannot allocate more than one memseg in
ppc64le. This is because vfio_spapr_dma_mem_map() doesn't unmap all
the mapped DMA before removing the window. This patch fixes this
incorrect behavior.

I also fixed the order of the unregister and unmap ioctls. The
unregister ioctl sometimes reports device-busy errors due to the
existence of a mapped area.

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 99 +++--
 1 file changed, 67 insertions(+), 32 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index fadef427f..ed04231b1 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -1354,14 +1354,6 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
}
 
} else {
-   ret = ioctl(vfio_container_fd,
-   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
-   if (ret) {
-   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
-   errno, strerror(errno));
-   return -1;
-   }
-
memset(&dma_unmap, 0, sizeof(dma_unmap));
dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
dma_unmap.size = len;
@@ -1374,28 +1366,56 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
errno, strerror(errno));
return -1;
}
+
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
+   errno, strerror(errno));
+   return -1;
+   }
}
 
return 0;
 }
 
+struct spapr_remap_walk_param {
+   int vfio_container_fd;
+   uint64_t addr_64;
+};
+
 static int
 vfio_spapr_map_walk(const struct rte_memseg_list *msl,
const struct rte_memseg *ms, void *arg)
 {
-   int *vfio_container_fd = arg;
+   struct spapr_remap_walk_param *param = arg;
 
-   if (msl->external)
+   if (msl->external || ms->addr_64 == param->addr_64)
return 0;
 
-   return vfio_spapr_dma_do_map(*vfio_container_fd, ms->addr_64, ms->iova,
+   return vfio_spapr_dma_do_map(param->vfio_container_fd, ms->addr_64, 
ms->iova,
ms->len, 1);
 }
 
+static int
+vfio_spapr_unmap_walk(const struct rte_memseg_list *msl,
+   const struct rte_memseg *ms, void *arg)
+{
+   struct spapr_remap_walk_param *param = arg;
+
+   if (msl->external || ms->addr_64 == param->addr_64)
+   return 0;
+
+   return vfio_spapr_dma_do_map(param->vfio_container_fd, ms->addr_64, 
ms->iova,
+   ms->len, 0);
+}
+
 struct spapr_walk_param {
uint64_t window_size;
uint64_t hugepage_sz;
+   uint64_t addr_64;
 };
+
 static int
 vfio_spapr_window_size_walk(const struct rte_memseg_list *msl,
const struct rte_memseg *ms, void *arg)
@@ -1406,6 +1426,10 @@ vfio_spapr_window_size_walk(const struct rte_memseg_list 
*msl,
if (msl->external)
return 0;
 
+   /* do not iterate ms we haven't mapped yet  */
+   if (param->addr_64 && ms->addr_64 == param->addr_64)
+   return 0;
+
if (max > param->window_size) {
param->hugepage_sz = ms->hugepage_sz;
param->window_size = max;
@@ -1503,6 +1527,7 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
 
/* check if window size needs to be adjusted */
memset(¶m, 0, sizeof(param));
+   param.addr_64 = vaddr;
 
/* we're inside a callback so use thread-unsafe version */
if (rte_memseg_walk_thread_unsafe(vfio_spapr_window_size_walk,
@@ -1516,7 +1541,7 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
for (i = 0; i < user_mem_maps->n_maps; i++) {
uint64_t max = user_mem_maps->maps[i].iova +
user_mem_maps->maps[i].len;
-   create.window_size = RTE_MAX(create.window_size, max);
+   param.window_size = RTE_MAX(param.window_size, max);
}
 
/* sPAPR requires window size to be a power of 2 */
@@ -1525,9 +1550,33 @@ vfio_spapr_dma_mem_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
create.levels = 1;
 
if (do_map) {
-   void *addr;
   

[dpdk-dev] [PATCH] eal: forcing IOVA as PA in ppc

2019-07-30 Thread Takeshi Yoshimura
Commit b76fafb174d2 ("eal: fix IOVA mode selection as VA for PCI
drivers") breaks ppc apps with no IOVA config (or RTE_IOVA_DC)
because of an inconsistency between the user's request and the
device-capability result for the IOVA mode. I updated the code to
force IOVA as PA on ppc, as before, because the current ppc driver
does not support VA mode.

Theoretically, ppc can support VA mode, but I suspect that ppc with
VA mode may have performance issues creating a big DMA window
(VA often uses higher addresses than PA). So, I didn't change the
code to check the device capability on ppc.

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/lib/librte_eal/linux/eal/eal.c b/lib/librte_eal/linux/eal/eal.c
index 946222ccd..db2dec922 100644
--- a/lib/librte_eal/linux/eal/eal.c
+++ b/lib/librte_eal/linux/eal/eal.c
@@ -1121,6 +1121,12 @@ rte_eal_init(int argc, char **argv)
RTE_LOG(DEBUG, EAL, "KNI can not work since 
physical addresses are unavailable\n");
}
}
+#endif
+#ifdef RTE_ARCH_PPC_64
+   if (iova_mode == RTE_IOVA_VA) {
+   iova_mode = RTE_IOVA_PA;
+   RTE_LOG(WARNING, EAL, "Forcing IOVA as 'PA' because PPC 
uses PA mode.\n");
+   }
 #endif
rte_eal_get_configuration()->iova_mode = iova_mode;
} else {
-- 
2.17.1



[dpdk-dev] [PATCH] vfio: retry creating sPAPR DMA window

2019-06-06 Thread Takeshi Yoshimura
sPAPR allows only the page_shift values reported by the
VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl. However, Linux 4.17 and earlier
return an incorrect page_shift for Power9. I added code to retry
creating the sPAPR DMA window.

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index 6892a2c14..f16c5c3c0 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -1448,9 +1448,29 @@ vfio_spapr_create_new_dma_window(int vfio_container_fd,
/* create new DMA window */
ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, create);
if (ret) {
-   RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
-   "error %i (%s)\n", errno, strerror(errno));
-   return -1;
+   /* try possible page_shift and levels for workaround */
+   uint32_t levels;
+
+   for (levels = 1; levels <= info.ddw.levels; levels++) {
+   uint32_t pgsizes = info.ddw.pgsizes;
+
+   while (pgsizes != 0) {
+   create->page_shift = 31 - 
__builtin_clz(pgsizes);
+   create->levels = levels;
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_TCE_CREATE, create);
+   if (!ret)
+   break;
+   pgsizes &= ~(1 << create->page_shift);
+   }
+   if (!ret)
+   break;
+   }
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot create new DMA window, "
+   "error %i (%s)\n", errno, 
strerror(errno));
+   return -1;
+   }
}
 
if (create->start_addr != 0) {
-- 
2.17.1



[dpdk-dev] [PATCH] vfio: fix expanding DMA area in ppc64le

2019-06-11 Thread Takeshi Yoshimura
In ppc64le, expanding DMA areas always fails because we cannot remove
a DMA window. As a result, we cannot allocate more than one memseg in
ppc64le. This is because vfio_spapr_dma_mem_map() doesn't unmap all
the mapped DMA before removing the window. This patch fixes this
incorrect behavior.

I added a global variable to track the current window size, since we
have no better way to get its exact size. The sPAPR IOMMU does not
seem to provide any ioctl interface for querying the window size.
rte_memseg_walk*() is currently used to calculate the window size,
but it walks memsegs that are marked as used, not mapped. So we need
to determine whether a given memseg is mapped or not; otherwise the
ioctl reports errors from attempting to unregister memory addresses
that are not registered. The global variable is excluded from
non-ppc64le binaries.

Similar problems happen with user maps. We need to avoid attempting
to unmap the address that is given as the function's parameter. The
compaction of user maps prevents us from passing the correct length
for unmapping DMA at window recreation, so I removed it in ppc64le.

I also fixed the order of the unregister and unmap ioctls. The
unregister ioctl sometimes reports device-busy errors due to the
existence of a mapped area.

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 154 +++-
 1 file changed, 103 insertions(+), 51 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index f16c5c3c0..6edbaaff5 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -93,6 +93,7 @@ is_null_map(const struct user_mem_map *map)
return map->addr == 0 && map->iova == 0 && map->len == 0;
 }
 
+#ifndef RTE_ARCH_PPC_64
 /* we may need to merge user mem maps together in case of user 
mapping/unmapping
  * chunks of memory, so we'll need a comparator function to sort segments.
  */
@@ -126,6 +127,7 @@ user_mem_map_cmp(const void *a, const void *b)
 
return 0;
 }
+#endif
 
 /* adjust user map entry. this may result in shortening of existing map, or in
  * splitting existing map in two pieces.
@@ -162,6 +164,7 @@ adjust_map(struct user_mem_map *src, struct user_mem_map 
*end,
}
 }
 
+#ifndef RTE_ARCH_PPC_64
 /* try merging two maps into one, return 1 if succeeded */
 static int
 merge_map(struct user_mem_map *left, struct user_mem_map *right)
@@ -177,6 +180,7 @@ merge_map(struct user_mem_map *left, struct user_mem_map 
*right)
 
return 1;
 }
+#endif
 
 static struct user_mem_map *
 find_user_mem_map(struct user_mem_maps *user_mem_maps, uint64_t addr,
@@ -211,6 +215,16 @@ find_user_mem_map(struct user_mem_maps *user_mem_maps, 
uint64_t addr,
return NULL;
 }
 
+#ifdef RTE_ARCH_PPC_64
+/* Recreation of DMA window requires unregistering DMA memory.
+ * Compaction confuses the logic and causes false reports in the recreation.
+ * For now, we do not compact user maps in ppc64le.
+ */
+static void
+compact_user_maps(__rte_unused struct user_mem_maps *user_mem_maps)
+{
+}
+#else
 /* this will sort all user maps, and merge/compact any adjacent maps */
 static void
 compact_user_maps(struct user_mem_maps *user_mem_maps)
@@ -256,6 +270,7 @@ compact_user_maps(struct user_mem_maps *user_mem_maps)
user_mem_maps->n_maps = cur_idx;
}
 }
+#endif
 
 static int
 vfio_open_group_fd(int iommu_group_num)
@@ -1357,14 +1372,6 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
}
 
} else {
-   ret = ioctl(vfio_container_fd,
-   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
-   if (ret) {
-   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
-   errno, strerror(errno));
-   return -1;
-   }
-
memset(&dma_unmap, 0, sizeof(dma_unmap));
dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
dma_unmap.size = len;
@@ -1377,24 +1384,50 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
errno, strerror(errno));
return -1;
}
+
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
+   errno, strerror(errno));
+   return -1;
+   }
}
 
return 0;
 }
 
+struct spapr_remap_walk_param {
+   int vfio_container_fd;
+   uint64_t window_size;
+};
+
 static int
 vfio_spapr_map_walk(const struct rte_memseg_list *msl,
   

[dpdk-dev] [PATCH] vfio: fix expanding DMA area in ppc64le

2019-06-12 Thread Takeshi Yoshimura
In ppc64le, expanding DMA areas always fails because we cannot remove
a DMA window. As a result, we cannot allocate more than one memseg in
ppc64le. This is because vfio_spapr_dma_mem_map() doesn't unmap all
the mapped DMA before removing the window. This patch fixes this
incorrect behavior.

I added a global variable to track the current window size, since we
have no better way to get its exact size. The sPAPR IOMMU does not
seem to provide any ioctl interface for querying the window size.
rte_memseg_walk*() is currently used to calculate the window size,
but it walks memsegs that are marked as used, not mapped. So we need
to determine whether a given memseg is mapped or not; otherwise the
ioctl reports errors from attempting to unregister memory addresses
that are not registered. The global variable is excluded from
non-ppc64le binaries.

Similar problems happen with user maps. We need to avoid attempting
to unmap the address that is given as the function's parameter. The
compaction of user maps prevents us from passing the correct length
for unmapping DMA at window recreation, so I removed it in ppc64le.

I also fixed the order of the unregister and unmap ioctls. The
unregister ioctl sometimes reports device-busy errors due to the
existence of a mapped area.

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 154 +++-
 1 file changed, 103 insertions(+), 51 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index f16c5c3c0..c1b275b56 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -93,6 +93,7 @@ is_null_map(const struct user_mem_map *map)
return map->addr == 0 && map->iova == 0 && map->len == 0;
 }
 
+#ifndef RTE_ARCH_PPC_64
 /* we may need to merge user mem maps together in case of user 
mapping/unmapping
  * chunks of memory, so we'll need a comparator function to sort segments.
  */
@@ -126,6 +127,7 @@ user_mem_map_cmp(const void *a, const void *b)
 
return 0;
 }
+#endif
 
 /* adjust user map entry. this may result in shortening of existing map, or in
  * splitting existing map in two pieces.
@@ -162,6 +164,7 @@ adjust_map(struct user_mem_map *src, struct user_mem_map 
*end,
}
 }
 
+#ifndef RTE_ARCH_PPC_64
 /* try merging two maps into one, return 1 if succeeded */
 static int
 merge_map(struct user_mem_map *left, struct user_mem_map *right)
@@ -177,6 +180,7 @@ merge_map(struct user_mem_map *left, struct user_mem_map 
*right)
 
return 1;
 }
+#endif
 
 static struct user_mem_map *
 find_user_mem_map(struct user_mem_maps *user_mem_maps, uint64_t addr,
@@ -211,6 +215,16 @@ find_user_mem_map(struct user_mem_maps *user_mem_maps, 
uint64_t addr,
return NULL;
 }
 
+#ifdef RTE_ARCH_PPC_64
+/* Recreation of DMA window requires unregistering DMA memory.
+ * Compaction confuses the logic and causes false reports in the recreation.
+ * For now, we do not compact user maps in ppc64le.
+ */
+static void
+compact_user_maps(__rte_unused struct user_mem_maps *user_mem_maps)
+{
+}
+#else
 /* this will sort all user maps, and merge/compact any adjacent maps */
 static void
 compact_user_maps(struct user_mem_maps *user_mem_maps)
@@ -256,6 +270,7 @@ compact_user_maps(struct user_mem_maps *user_mem_maps)
user_mem_maps->n_maps = cur_idx;
}
 }
+#endif
 
 static int
 vfio_open_group_fd(int iommu_group_num)
@@ -1306,6 +1321,7 @@ vfio_type1_dma_map(int vfio_container_fd)
return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+#ifdef RTE_ARCH_PPC_64
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
uint64_t len, int do_map)
@@ -1357,14 +1373,6 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
}
 
} else {
-   ret = ioctl(vfio_container_fd,
-   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
-   if (ret) {
-   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
-   errno, strerror(errno));
-   return -1;
-   }
-
memset(&dma_unmap, 0, sizeof(dma_unmap));
dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
dma_unmap.size = len;
@@ -1377,24 +1385,50 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
errno, strerror(errno));
return -1;
}
+
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
+  

[dpdk-dev] [PATCH v2] vfio: fix expanding DMA area in ppc64le

2019-06-14 Thread Takeshi Yoshimura
In ppc64le, expanding DMA areas always fails because we cannot remove
a DMA window. As a result, we cannot allocate more than one memseg in
ppc64le. This is because vfio_spapr_dma_mem_map() doesn't unmap all
the mapped DMA before removing the window. This patch fixes this
incorrect behavior.

I added a global variable to track the current window size, since we
have no better way to get its exact size. The sPAPR IOMMU does not
seem to provide any ioctl interface for querying the window size.
rte_memseg_walk*() is currently used to calculate the window size,
but it walks memsegs that are marked as used, not mapped. So we need
to determine whether a given memseg is mapped or not; otherwise the
ioctl reports errors from attempting to unregister memory addresses
that are not registered. The global variable is excluded from
non-ppc64le binaries.

Similar problems happen with user maps. We need to avoid attempting
to unmap the address that is given as the function's parameter. The
compaction of user maps prevents us from passing the correct length
for unmapping DMA at window recreation, so I removed it in ppc64le.

I also fixed the order of the unregister and unmap ioctls. The
unregister ioctl sometimes reports device-busy errors due to the
existence of a mapped area.

Signed-off-by: Takeshi Yoshimura 
---
 lib/librte_eal/linux/eal/eal_vfio.c | 154 +++-
 1 file changed, 103 insertions(+), 51 deletions(-)

diff --git a/lib/librte_eal/linux/eal/eal_vfio.c 
b/lib/librte_eal/linux/eal/eal_vfio.c
index 6892a2c14..5587854b8 100644
--- a/lib/librte_eal/linux/eal/eal_vfio.c
+++ b/lib/librte_eal/linux/eal/eal_vfio.c
@@ -93,6 +93,7 @@ is_null_map(const struct user_mem_map *map)
return map->addr == 0 && map->iova == 0 && map->len == 0;
 }
 
+#ifndef RTE_ARCH_PPC_64
 /* we may need to merge user mem maps together in case of user 
mapping/unmapping
  * chunks of memory, so we'll need a comparator function to sort segments.
  */
@@ -126,6 +127,7 @@ user_mem_map_cmp(const void *a, const void *b)
 
return 0;
 }
+#endif
 
 /* adjust user map entry. this may result in shortening of existing map, or in
  * splitting existing map in two pieces.
@@ -162,6 +164,7 @@ adjust_map(struct user_mem_map *src, struct user_mem_map 
*end,
}
 }
 
+#ifndef RTE_ARCH_PPC_64
 /* try merging two maps into one, return 1 if succeeded */
 static int
 merge_map(struct user_mem_map *left, struct user_mem_map *right)
@@ -177,6 +180,7 @@ merge_map(struct user_mem_map *left, struct user_mem_map 
*right)
 
return 1;
 }
+#endif
 
 static struct user_mem_map *
 find_user_mem_map(struct user_mem_maps *user_mem_maps, uint64_t addr,
@@ -211,6 +215,16 @@ find_user_mem_map(struct user_mem_maps *user_mem_maps, 
uint64_t addr,
return NULL;
 }
 
+#ifdef RTE_ARCH_PPC_64
+/* Recreation of DMA window requires unregistering DMA memory.
+ * Compaction confuses the logic and causes false reports in the recreation.
+ * For now, we do not compact user maps in ppc64le.
+ */
+static void
+compact_user_maps(__rte_unused struct user_mem_maps *user_mem_maps)
+{
+}
+#else
 /* this will sort all user maps, and merge/compact any adjacent maps */
 static void
 compact_user_maps(struct user_mem_maps *user_mem_maps)
@@ -256,6 +270,7 @@ compact_user_maps(struct user_mem_maps *user_mem_maps)
user_mem_maps->n_maps = cur_idx;
}
 }
+#endif
 
 static int
 vfio_open_group_fd(int iommu_group_num)
@@ -1306,6 +1321,7 @@ vfio_type1_dma_map(int vfio_container_fd)
return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
 
+#ifdef RTE_ARCH_PPC_64
 static int
 vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
uint64_t len, int do_map)
@@ -1357,14 +1373,6 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
}
 
} else {
-   ret = ioctl(vfio_container_fd,
-   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
-   if (ret) {
-   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
-   errno, strerror(errno));
-   return -1;
-   }
-
memset(&dma_unmap, 0, sizeof(dma_unmap));
dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
dma_unmap.size = len;
@@ -1377,24 +1385,50 @@ vfio_spapr_dma_do_map(int vfio_container_fd, uint64_t 
vaddr, uint64_t iova,
errno, strerror(errno));
return -1;
}
+
+   ret = ioctl(vfio_container_fd,
+   VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY, ®);
+   if (ret) {
+   RTE_LOG(ERR, EAL, "  cannot unregister vaddr for IOMMU, 
error %i (%s)\n",
+