On Thu, Oct 21, 2021 at 2:33 PM Dmitry Kozlyuk <dmitry.kozl...@gmail.com> wrote:
> > Hi All,
> >
> > I came across 2 issues introduced with the auto detection mechanism.
> > 1. In case of the primary/secondary model, the primary application is
> > started and makes lots of allocations via rte_malloc*.
> >
> > Secondary side:
> > a. The secondary starts; in its "rte_eal_init()" it makes some
> > allocations via rte_*, and in one of the allocations a request for heap
> > expansion is made as the current memseg got exhausted.
> > (malloc_heap_alloc_on_heap_id()->
> > alloc_more_mem_on_socket()->try_expand_heap())
> > b. A request to the primary for heap expansion is sent. Please note the
> > secondary holds the spinlock while making the request.
> > (malloc_heap_alloc_on_heap_id()->rte_spinlock_lock(&(heap->lock));)
> >
> > Primary side:
> > a. The primary receives the request, installs a new hugepage and sets
> > up the heap (handle_alloc_request()).
> > b. To inform all the secondaries about the new memseg, the primary sends
> > a sync notice where it sets up an alarm
> > (rte_mp_request_async()->mp_request_async()).
> > c. Inside the alarm setup API, we register an interrupt callback.
> > d. Inside rte_intr_callback_register(), a new interrupt instance
> > allocation is requested for "src->intr_handle".
> > e. Since memory management is detected as up, "rte_intr_instance_alloc()"
> > calls "rte_zmalloc" to allocate memory, and further inside
> > "malloc_heap_alloc_on_heap_id()" the primary deadlocks while taking the
> > spinlock, because this spinlock is already held by the secondary.
> >
> >
> > 2. "eal_flags_file_prefix_autotest" is failing because the processes
> > spawned by this test are expected to clean up their hugepage traces
> > from the respective directories (e.g. /dev/hugepage).
> > a. Inside eal_cleanup, rte_free()->malloc_heap_free(), the element to
> > be freed is added to the free list, and it is checked whether nearby
> > elements can be joined together to form a big free chunk
> > (malloc_elem_free()).
> > b. If this free chunk is at least as big as the hugepage size, the
> > respective hugepage can be uninstalled after making sure no allocation
> > from this hugepage exists.
> > (malloc_heap_free()->malloc_heap_free_pages()->eal_memalloc_free_seg())
> >
> > But because interrupt allocations made for PCI intr handles (used for
> > VFIO) and other driver-specific interrupt handles are not cleaned up in
> > "rte_eal_cleanup()", these hugepage files are not removed and the test
> > fails.
>
> Sad to hear. But it's a great and thorough analysis.
>
> > There could be more such issues, I think we should firstly fix the DPDK.
> > 1. Memory management should be made independent and should be the first
> > thing to come up in rte_eal_init()
>
> As I have explained, buses must be able to report IOVA requirement
> at this point (`get_iommu_class()` bus method).
> Either `scan()` must complete before that
> or `get_iommu_class()` must be able to work before `scan()` is called.
>
> > 2. rte_eal_cleanup() should be exactly opposite to rte_eal_init(), just
> > like bus_probe, we should have bus_remove
> > to clean up all the memory allocations.
>
> Yes. For most buses it will be just "unplug each device".
> In fact, EAL could do it with `unplug()`, but it is not mandatory.
>
> >
> > Regarding this IRQ series, I would like to fall back to our original
> > design, i.e. rte_intr_instance_alloc() should take an argument saying
> > whether its memory should be allocated using glibc malloc or
> > rte_malloc*.
>
> Seems there's no other option to make it on time.
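
To make the "argument selecting the allocator" idea concrete, here is a
minimal sketch of what such an allocator could look like. It is only an
illustration under assumptions: intr_instance_alloc(), intr_instance_free(),
INTR_INSTANCE_F_SHARED and struct intr_instance are placeholder names, not
existing DPDK API, and shared_zalloc()/shared_free() merely stand in for
rte_zmalloc()/rte_free() so that the sketch builds without DPDK.

```c
#include <stdint.h>
#include <stdlib.h>

/* Placeholder flag: set when the instance must be visible to other
 * processes, i.e. when it has to live in shared (hugepage) memory. */
#define INTR_INSTANCE_F_SHARED (UINT32_C(1) << 0)

/* Simplified stand-in for the real interrupt handle. */
struct intr_instance {
	int fd;
	uint32_t alloc_flags; /* remembered so free uses the matching allocator */
};

/* Stand-ins for rte_zmalloc()/rte_free(); backed by calloc()/free() here
 * only so that the sketch compiles without DPDK. */
static void *shared_zalloc(size_t size) { return calloc(1, size); }
static void shared_free(void *p) { free(p); }

/* Allocate a zeroed instance from the allocator selected by flags. */
struct intr_instance *intr_instance_alloc(uint32_t flags)
{
	struct intr_instance *h;

	if (flags & INTR_INSTANCE_F_SHARED)
		h = shared_zalloc(sizeof(*h));
	else
		h = calloc(1, sizeof(*h));
	if (h == NULL)
		return NULL;

	h->fd = -1;
	h->alloc_flags = flags;
	return h;
}

/* Release an instance with the allocator it was created with. */
void intr_instance_free(struct intr_instance *h)
{
	if (h == NULL)
		return;
	if (h->alloc_flags & INTR_INSTANCE_F_SHARED)
		shared_free(h);
	else
		free(h);
}
```

With something of this shape, callers that run before the heap is usable,
or that must not touch the heap lock (such as the alarm path described in
issue 1), would pass 0 and never enter the rte_malloc heap.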

- Sorry, my memory is too short, did we describe where we need to share
rte_intr_handle objects?

I spent some time looking at uses of rte_intr_handle objects.

In many cases intr_handle objects are referenced in malloc() objects.
The cases where rte_intr_handle objects are shared are in per-device
private bits in drivers.

An intr_handle often contains fds.
For them to be used in mp setups, there needs to be a big machinery with
SCM_RIGHTS, but I see only 3 drivers which actually reference this.
So if intr_handle fds are accessed by multiple processes, their content
probably makes no sense wrt fds.

From these two hints, I think we are going backwards, and the main
usecase is that those rte_intr_instance objects are not used in mp.
I even think they are never accessed from other processes.
But I am not sure.

- Seeing how time is short for rc1, I am ok with
rte_intr_instance_alloc() taking a flag argument.
And we can still go back on this API later.

Can we agree on the flag name?
The interest of rte_malloc() is that it makes objects shared for mp, so
how about RTE_INTR_INSTANCE_F_SHARED?

--
David Marchand
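
For reference, the SCM_RIGHTS machinery mentioned above boils down to
something like the following. This is a generic POSIX sketch, not DPDK
code; send_fd() and recv_fd() are hypothetical helper names. It shows why
an fd number stored in shared memory is not enough: the kernel has to
install a new descriptor in the receiving process.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char dummy = 'F'; /* at least one byte of regular data must be sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from a connected AF_UNIX socket.
 * Returns the new descriptor valid in this process, or -1 on error. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The descriptor returned by recv_fd() refers to the same open file as the
one passed to send_fd(), but its numeric value is generally different.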