On 4/4/2019 6:03 AM, Kiran Kumar Kokkilagadda wrote:
> 
> 
>> -----Original Message-----
>> From: Ferruh Yigit <ferruh.yi...@intel.com>
>> Sent: Wednesday, April 3, 2019 9:59 PM
>> To: Kiran Kumar Kokkilagadda <kirankum...@marvell.com>
>> Cc: dev@dpdk.org; Jerin Jacob <jerin.ja...@caviumnetworks.com>
>> Subject: [EXT] Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
>>
>> External Email
>>
>> ----------------------------------------------------------------------
>> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>>> From: Kiran Kumar K <kirankum...@marvell.com>
>>>
>>> With current KNI implementation kernel module will work only in
>>> IOVA=PA mode. This patch will add support for kernel module to work
>>> with IOVA=VA mode.
>>
>> Thanks Kiran for removing the limitation. I have a few questions; can you
>> please help me understand?
>>
>> And when this patch is ready, the restriction in 'linux/eal/eal.c', in
>> 'rte_eal_init', should be removed, perhaps with this patch. I assume you are
>> already doing that to be able to test this patch.
>>
> 
> The user can choose the mode by passing --iova-mode=<va/pa>. I will remove
> the rte_kni module restriction.
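For illustration, selecting the mode on the EAL command line would look roughly
like this (the application name and the other EAL arguments are placeholders,
not taken from the patch):

    ./testpmd -l 0-3 -n 4 --iova-mode=va -- -i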
>  
> 
>>>
>>> The idea is to maintain a mapping in the KNI module between user pages and
>>> kernel pages, and in the fast path perform a lookup in this table to get
>>> the kernel virtual address for the corresponding user virtual address.
>>>
>>> In IOVA=VA mode, the memory allocated to the pool is physically and
>>> virtually contiguous. We will take advantage of this and create a
>>> mapping in the kernel. In the kernel we need mappings for the queues
>>> (tx_q, rx_q, ... slow path) and for the mbuf memory (fast path).
>>
>> Is it?
>> As far as I know a mempool can have multiple chunks and they can be both
>> virtually and physically separated.
>>
>> And even for a single chunk, it will be virtually contiguous, but will it
>> be physically contiguous?
>>
> 
> You are right, it does not have to be physically contiguous. I will change
> the description.
>  
> 
>>>
>>> At KNI init time, in the slow path, we will create a mapping for the
>>> queues and the mbufs using get_user_pages, similar to af_xdp. Using the
>>> pool memory base address, we will create a page map table for the mbufs,
>>> which we will use in the fast path for kernel page translation.
>>>
>>> At KNI init time, we will pass the base address of the pool and the size
>>> of the pool to the kernel. In the kernel, using the get_user_pages API, we
>>> will get the pages with size PAGE_SIZE and store the mapping and the start
>>> address of the user space region in a table.
>>>
>>> In the fast path, for any user address, shift right by PAGE_SHIFT
>>> (user_addr >> PAGE_SHIFT) and subtract the start page from this value;
>>> this gives the index of the kernel page within the page map table.
>>> Adding the offset to this kernel page address gives the kernel
>>> address for this user virtual address.
>>>
>>> For example, the user pool base address is X and the size is S, which we
>>> passed to the kernel. In the kernel we will create a mapping for this using
>>> get_user_pages. Our page map table will look like [Y, Y+PAGE_SIZE,
>>> Y+(PAGE_SIZE*2) ....] and the user start page will be U (we get it from
>>> X >> PAGE_SHIFT).
>>>
>>> For any user address Z, we will get the index into the page map table
>>> using ((Z >> PAGE_SHIFT) - U). Adding the offset (Z & (PAGE_SIZE - 1)) to
>>> this address will give the kernel virtual address.
>>>
>>> Signed-off-by: Kiran Kumar K <kirankum...@marvell.com>
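As a small illustrative sketch of the translation arithmetic described in the
commit message above (the concrete addresses are made up for the example and
are not taken from the patch):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                 /* assume 4K pages for the example */
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
        /* X: hypothetical user pool base address passed to the kernel */
        uint64_t x = 0x7f0000000000ULL;
        /* Z: a user address inside the pool, 3 pages + 0x80 past the base */
        uint64_t z = x + 3 * PAGE_SIZE + 0x80;

        uint64_t u      = x >> PAGE_SHIFT;       /* user start page U         */
        uint64_t index  = (z >> PAGE_SHIFT) - u; /* index into page map table */
        uint64_t offset = z & (PAGE_SIZE - 1);   /* offset within that page   */

        /* The kernel address would be page_map[index].addr + offset */
        printf("index=%llu offset=0x%llx\n",
               (unsigned long long)index, (unsigned long long)offset);
        return 0;
    }

This prints index=3 and offset=0x80 for the values above.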
>>
>> <...>
>>
>>> +int
>>> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
>>> +{
>>> +   unsigned int gup_flags = FOLL_WRITE;
>>> +   long npgs;
>>> +   int err;
>>> +
>>> +   /* Get at least one page */
>>> +   if (size < PAGE_SIZE)
>>> +           size = PAGE_SIZE;
>>> +
>>> +   /* Compute number of user pages based on page size */
>>> +   mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
>>> +
>>> +   /* Allocate memory for the pages */
>>> +   mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
>>> +                 GFP_KERNEL | __GFP_NOWARN);
>>> +   if (!mem->pgs) {
>>> +           pr_err("%s: -ENOMEM\n", __func__);
>>> +           return -ENOMEM;
>>> +   }
>>> +
>>> +   down_write(&current->mm->mmap_sem);
>>> +
>>> +   /* Get the user pages from the user address */
>>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
>>> +   npgs = get_user_pages((u64)address, mem->npgs,
>>> +                           gup_flags, &mem->pgs[0], NULL);
>>> +#else
>>> +   npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
>>> +                           gup_flags, 0, &mem->pgs[0], NULL);
>>> +#endif
>>> +   up_write(&current->mm->mmap_sem);
>>
>> This should work even if the memory is not physically contiguous, right?
>> Where exactly does the physically-contiguous requirement come from?
>>
> 
> Yes, it is not necessary to be physically contiguous.
> 
>  
> 
> 
>> <...>
>>
>>> +
>>> +/* Get the kernel address from the user address using
>>> + * page map table. Will be used only in IOVA=VA mode
>>> + */
>>> +static inline void*
>>> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
>>> +{
>>> +   uint32_t index;
>>> +   /* User page - start user page will give the index
>>> +    * with in the page map table
>>> +    */
>>> +   index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
>>> +
>>> +   /* Add the offset to the page address */
>>> +   return (kni->va_info.page_map[index].addr +
>>> +           (usr_addr & kni->va_info.page_mask));
>>> +
>>> +}
>>> +
>>>  /* physical address to kernel virtual address */
>>>  static void *
>>>  pa2kva(void *pa)
>>> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>>>                     return;
>>>
>>>             for (i = 0; i < num_rx; i++) {
>>> -                   kva = pa2kva(kni->pa[i]);
>>> +                   if (likely(kni->iova_mode == 1))
>>> +                           kva = get_kva((u64)(kni->pa[i]), kni);
>>
>> kni->pa[] now has iova addresses; for 'get_kva()' to work, shouldn't
>> 'va_info.start_page' be calculated from 'mempool_memhdr->iova' instead of
>> 'mempool_memhdr->addr'?
>>
>> If this is working, I must be missing something, but I am not able to find
>> what it is.
>>
>> <...>
>>
>  
> In IOVA=VA mode, both values will be the same, right?

I don't know for sure, but according to my understanding 'mempool_memhdr->addr'
is the application virtual address, and 'mempool_memhdr->iova' is the device
virtual address, which can be anything as long as the IOMMU is configured
correctly.
So technically ->addr and ->iova can be the same, but I don't know if that is
always the case in DPDK.
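Purely as an illustration of the question here (this is not part of the patch,
and it only assumes the standard rte_mempool memhdr fields), a check along
these lines could confirm whether ->addr and ->iova actually match for every
chunk of a given pool:

    #include <stdint.h>
    #include <rte_mempool.h>

    /* Return 1 if every chunk of the pool has its application VA equal to
     * its IOVA, 0 otherwise.
     */
    static int
    pool_va_equals_iova(struct rte_mempool *mp)
    {
        struct rte_mempool_memhdr *hdr;

        STAILQ_FOREACH(hdr, &mp->mem_list, next) {
            if ((rte_iova_t)(uintptr_t)hdr->addr != hdr->iova)
                return 0;
        }
        return 1;
    }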

> 
> 
>>> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>>>     kni->group_id = conf->group_id;
>>>     kni->mbuf_size = conf->mbuf_size;
>>>
>>> +   dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
>>> +   if (dev_info.iova_mode) {
>>> +           struct rte_mempool_memhdr *hdr;
>>> +           uint64_t pool_size = 0;
>>> +
>>> +           /* In each pool header chunk, we will maintain the
>>> +            * base address of the pool. This chunk is physically and
>>> +            * virtually contiguous.
>>> +            * This approach will work, only if the allocated pool
>>> +            * memory is contiguous, else it won't work
>>> +            */
>>> +           hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
>>> +           dev_info.mbuf_va = (void *)(hdr->addr);
>>> +
>>> +           /* Traverse the list and get the total size of the pool */
>>> +           STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
>>> +                   pool_size += hdr->len;
>>> +           }
>>
>> This code is aware that there may be multiple chunks, but assumes they are
>> all contiguous; I don't know if this assumption is correct.
>>
>> Also I guess there is another assumption that there will be a single
>> pktmbuf_pool in the application which is passed into kni?
>> What if there are multiple pktmbuf_pools, like one for each PMD, will this
>> work?
>> Now some mbufs in the kni Rx fifo will come from a different pktmbuf_pool
>> whose pages we don't know, so we won't be able to get their kernel virtual
>> addresses.
> 
> All these chunks have to be virtually contiguous, otherwise this approach
> will not work. One thing we can do here is create the mapping for the
> complete huge page and make sure all the mbuf mempools fall within it.
> As the huge page size is big (on x86_64 it is 1G, on ARM 512MB), we may not
> exceed that size even with multiple pktmbuf_pools. If an mbuf pool were
> created outside this huge page, this approach would not work and we would
> fall back to PA mode.
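As a rough sketch of how that contiguity assumption could at least be checked
for a single pool (not part of the patch; it also assumes the chunks appear in
mem_list in ascending address order, which the mempool API does not promise):

    #include <stdint.h>
    #include <rte_mempool.h>

    /* Return 1 if the pool's chunks form one virtually contiguous region,
     * 0 otherwise. Assumes mem_list is ordered by address.
     */
    static int
    pool_chunks_va_contiguous(struct rte_mempool *mp)
    {
        struct rte_mempool_memhdr *hdr;
        uint8_t *expected = NULL;

        STAILQ_FOREACH(hdr, &mp->mem_list, next) {
            if (expected != NULL && (uint8_t *)hdr->addr != expected)
                return 0;           /* virtual gap between chunks */
            expected = (uint8_t *)hdr->addr + hdr->len;
        }
        return 1;
    }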
> 

Those were also my two points:
- This requires all chunks to be virtually contiguous
- There should be only one pktmbuf_pool in the system for this to work

But neither of the above assumptions holds in general.
And as far as I can see there is no way to detect these cases and act accordingly.
