On Thu, Oct 21, 2021 at 2:33 PM Dmitry Kozlyuk <dmitry.kozl...@gmail.com> wrote:
> > Hi All,
> >
> > I came across 2 issues introduced with auto detection mechanism.
> > 1. In case of primary secondary model.  Primary application is started 
> > which makes lots of allocations via
> > rte_malloc*
> >
> >     Secondary side:
> >     a. Secondary starts, in its "rte_eal_init()" it makes some allocation 
> > via rte_*, and in one of the allocation
> > request for heap expand is made as current memseg got exhausted. 
> > (malloc_heap_alloc_on_heap_id ()->
> >    alloc_more_mem_on_socket()->try_expand_heap())
> >    b. A request to primary for heap expand is sent. Please note secondary 
> > holds the spinlock while making
> > the request. (malloc_heap_alloc_on_heap_id 
> > ()->rte_spinlock_lock(&(heap->lock));)
> >
> >    Primary side:
> >    a. Primary receives the request, install a new hugepage and setups up 
> > the heap (handle_alloc_request())
> >    b. To inform all the secondaries about the new memseg, primary sends a 
> > sync notice where it sets up an
> > alarm (rte_mp_request_async ()->mp_request_async()).
> >    c. Inside alarm setup API, we register an interrupt callback.
> >    d. Inside rte_intr_callback_register(), a new interrupt instance 
> > allocation is requested for "src->intr_handle"
> >    e. Since memory management is detected as up, inside 
> > "rte_intr_instance_alloc()", call to "rte_zmalloc" for
> > allocating memory and further inside "malloc_heap_alloc_on_heap_id()", 
> > primary will experience a deadlock
> > while taking up the spinlock because this spinlock is already hold by 
> > secondary.
> >
> >
> > 2. "eal_flags_file_prefix_autotest" is failing because the spawned process 
> > by this tests are expected to cleanup
> > their hugepage traces from respective directories (eg /dev/hugepage).
> > a. Inside eal_cleanup, rte_free()->malloc_heap_free(), where element to be 
> > freed is added to the free list and
> > checked if nearby elements can be joined together and form a big free chunk 
> > (malloc_elem_free()).
> > b. If this free chunk is big enough than the hugepage size, respective 
> > hugepage can be uninstalled after making
> > sure no allocation from this hugepage exists. 
> > (malloc_heap_free()->malloc_heap_free_pages()->eal_memalloc_free_seg())
> >
> > But because of interrupt allocations made for pci intr handles (used for 
> > VFIO) and other driver specific interrupt
> > handles are not cleaned up in "rte_eal_cleanup()", these hugepage files are 
> > not removed and test fails.
>
> Sad to hear. But it's a great and thorough analysis.
>
> > There could be more such issues, I think we should firstly fix the DPDK.
> > 1. Memory management should be made independent and should be the first 
> > thing to come up in rte_eal_init()
>
> As I have explained, buses must be able to report IOVA requirement
> at this point (`get_iommu_class()` bus method).
> Either `scan()` must complete before that
> or `get_iommu_class()` must be able to work before `scan()` is called.
>
> > 2. rte_eal_cleanup() should be exactly opposite to rte_eal_init(), just 
> > like bus_probe, we should have bus_remove
> > to clean up all the memory allocations.
>
> Yes. For most buses it will be just "unplug each device".
> In fact, EAL could do it with `unplug()`, but it is not mandatory.
>
> >
> > Regarding this IRQ series, I would like to fall back to our original design 
> > i.e. rte_intr_instance_alloc() should take
> > an argument whether its memory should be allocated using glibc malloc or 
> > rte_malloc*.
>
> Seems there's no other option to make it on time.

- Sorry, my memory is too short, did we describe where we need to
share rte_intr_handle objects?

I spent some time looking at uses of rte_intr_handle objects.

In many cases intr_handle objects are referenced in malloc() objects.
The cases where rte_intr_handle are shared is in per device private
bits in drivers.

A intr_handle often contains fds.
For them to be used in mp setups, there needs to be a big machinery
with SCM_RIGHTS but I see only 3 drivers which actually reference
this.
So if intr_handle fds are accessed by multiple processes, their
content probably makes no sense wrt fds.


>From these two hints, I think we are going backwards, and the main
usecase is that those rte_intr_instance objects are not used in mp.
I even think they are never accessed from other processes.
But I am not sure.


- Seeing how time it short for rc1, I am ok with
rte_intr_instance_alloc() taking a flag argument.
And we can still go back on this API later.

Can we agree on the flag name?
rte_malloc() interest is that it makes objects shared for mp, so how
about RTE_INTR_INSTANCE_F_SHARED ?


-- 
David Marchand

Reply via email to