2021-10-21 09:16 (UTC+0000), Harman Kalra:
> > -----Original Message-----
> > From: Dmitry Kozlyuk <dmitry.kozl...@gmail.com>
> > Sent: Wednesday, October 20, 2021 9:01 PM
> > To: Harman Kalra <hka...@marvell.com>
> > Cc: Stephen Hemminger <step...@networkplumber.org>; Thomas
> > Monjalon <tho...@monjalon.net>; david.march...@redhat.com;
> > dev@dpdk.org; Ray Kinsella <m...@ashroe.eu>
> > Subject: Re: [EXT] Re: [dpdk-dev] [PATCH v3 2/7] eal/interrupts: implement
> > get set APIs
> >   
> > > >  
> > > > > +     /* Detect if DPDK malloc APIs are ready to be used. */
> > > > > +     mem_allocator = rte_malloc_is_ready();
> > > > > +     if (mem_allocator)
> > > > > > +             intr_handle = rte_zmalloc(NULL, sizeof(struct rte_intr_handle),
> > > > > > +                                       0);
> > > > > > +     else
> > > > > > +             intr_handle = calloc(1, sizeof(struct rte_intr_handle));
> > > >
> > > > This is problematic way to do this.
> > > > The reason to use rte_malloc vs malloc should be determined by usage.
> > > >
> > > > > If the pointer will be shared between primary/secondary process then
> > > > > it has to be in hugepages (i.e. rte_malloc). If it is not shared
> > > > > then use regular malloc.
> > > >
> > > > But what you have done is created a method which will be a latent
> > > > bug for anyone using primary/secondary process.
> > > >
> > > > Either:
> > > >     intr_handle is not allowed to be used in secondary.
> > > >       Then always use malloc().
> > > > Or.
> > > >     intr_handle can be used by both primary and secondary.
> > > >     Then always use rte_malloc().
> > > >     Any code path that allocates intr_handle before pool is
> > > >     ready is broken.  
> > >
> > > > Hi Stephen,
> > > >
> > > > Till V2, I implemented this API so that the user of the API can
> > > > choose whether the intr handle is allocated using malloc or
> > > > rte_malloc by passing a flag arg to the rte_intr_instance_alloc API.
> > > > The user of the API knows best whether the intr handle is to be shared
> > > > with the secondary or not.
> > >
> > > > But after some discussions and suggestions from the community we
> > > > decided to drop that flag argument and auto-detect whether
> > > > rte_malloc APIs are ready to be used, and thereafter make all further
> > > > allocations via rte_malloc.
> > > > Currently the alarm subsystem (or any driver doing allocation in a
> > > > constructor) gets its interrupt instance allocated using glibc malloc,
> > > > because rte_malloc* is not ready by rte_eal_alarm_init(), while
> > > > all further consumers get their instances allocated via rte_malloc.
> > 
> > Just as a comment, bus scanning is the real issue, not the alarms.
> > Alarms could be initialized after the memory management (but it's irrelevant
> > because their handle is not accessed from the outside).
> > However, MM needs to know bus IOVA requirements to initialize, which is
> > usually determined by at least bus device requirements.
> >   
> > > > I think this should not cause any issue in the primary/secondary model,
> > > > as all interrupt instance pointers will be shared.
> > 
> > What do you mean? Aren't we discussing the issue that those allocated early
> > are not shared?
> >   
> > > > In fact, to avoid any surprises of primary/secondary not working, we
> > > > thought of making all allocations via rte_malloc.
> > 
> > I don't see why anyone would not make them shared.
> > In order to only use rte_malloc(), we need:
> > 1. In bus drivers, move handle allocation from scan to probe stage.
> > 2. In EAL, move alarm initialization to after the MM.
> > It all can be done later with v3 design---but there are out-of-tree drivers.
> > We need to force them to make step 1 at some point.
> > I see two options:
> > a) Right now have an external API that only works with rte_malloc()
> >    and internal API with autodetection. Fix DPDK and drop internal API.
> > b) Have external API with autodetection. Fix DPDK.
> >    At the next ABI breakage drop autodetection and libc-malloc.
> >   
> > > David, Thomas, Dmitry, please add if I missed anything.
> > >
> > > > Can we please conclude on this series' APIs, as the API freeze deadline
> > > > (rc1) is very near.
> > 
> > I support v3 design with no options and autodetection, because that's the
> > interface we want in the end.
> > Implementation can be improved later.  
> 
> Hi All,
> 
> I came across 2 issues introduced with the auto-detection mechanism.
> 1. In the primary/secondary model, the primary application is started and
> makes lots of allocations via rte_malloc*.
>
>    Secondary side:
>    a. Secondary starts; in its "rte_eal_init()" it makes some allocations
>       via rte_*, and in one of the allocations a request for heap expansion
>       is made as the current memseg got exhausted
>       (malloc_heap_alloc_on_heap_id() -> alloc_more_mem_on_socket() ->
>       try_expand_heap()).
>    b. A request to the primary for heap expansion is sent. Please note that
>       the secondary holds the spinlock while making the request
>       (malloc_heap_alloc_on_heap_id() -> rte_spinlock_lock(&(heap->lock));).
>
>    Primary side:
>    a. Primary receives the request, installs a new hugepage and sets up
>       the heap (handle_alloc_request()).
>    b. To inform all the secondaries about the new memseg, primary sends a
>       sync notice where it sets up an alarm
>       (rte_mp_request_async() -> mp_request_async()).
>    c. Inside the alarm setup API, we register an interrupt callback.
>    d. Inside rte_intr_callback_register(), a new interrupt instance
>       allocation is requested for "src->intr_handle".
>    e. Since memory management is detected as up, "rte_intr_instance_alloc()"
>       calls "rte_zmalloc" to allocate memory, and further inside
>       "malloc_heap_alloc_on_heap_id()" the primary will deadlock while
>       taking the spinlock, because this spinlock is already held by the
>       secondary.
> 
> 
> 2. "eal_flags_file_prefix_autotest" is failing because the processes spawned
> by this test are expected to clean up their hugepage traces from the
> respective directories (e.g. /dev/hugepage).
>    a. Inside eal_cleanup, rte_free() -> malloc_heap_free() adds the element
>       to be freed to the free list and checks whether nearby elements can be
>       joined together to form a bigger free chunk (malloc_elem_free()).
>    b. If this free chunk is larger than the hugepage size, the respective
>       hugepage can be uninstalled after making sure no allocation from this
>       hugepage exists
>       (malloc_heap_free() -> malloc_heap_free_pages() ->
>       eal_memalloc_free_seg()).
>
> But because the interrupt allocations made for PCI intr handles (used for
> VFIO) and other driver-specific interrupt handles are not cleaned up in
> "rte_eal_cleanup()", these hugepage files are not removed and the test fails.

Sad to hear. But it's a great and thorough analysis.
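
To make sure I read issue 1 correctly, here is a minimal single-process
model of it. This is only a sketch: every `toy_*` name is hypothetical and
nothing here is DPDK code. The heap lock is modelled with a C11 `atomic_flag`,
and a try-lock replaces the real spinning lock so that the sketch terminates
and reports the conflict instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the heap spinlock living in shared memory. */
struct toy_heap {
	atomic_flag lock;
};

/* Try once instead of spinning, so the sketch can report the conflict. */
static bool toy_heap_trylock(struct toy_heap *h)
{
	return !atomic_flag_test_and_set_explicit(&h->lock,
						  memory_order_acquire);
}

static void toy_heap_unlock(struct toy_heap *h)
{
	atomic_flag_clear_explicit(&h->lock, memory_order_release);
}

/*
 * Primary side: serving the expand request ends up allocating an
 * interrupt instance from the same heap, so it needs the heap lock.
 */
static bool toy_primary_handle_expand(struct toy_heap *h)
{
	if (!toy_heap_trylock(h))
		return false; /* a real spinlock would spin here forever */
	toy_heap_unlock(h);
	return true;
}

/* Secondary side: keeps the heap lock while waiting for the primary. */
static bool toy_secondary_alloc(struct toy_heap *h)
{
	bool primary_ok;

	if (!toy_heap_trylock(h))
		return false;
	primary_ok = toy_primary_handle_expand(h); /* request sent under lock */
	toy_heap_unlock(h);
	return primary_ok;
}
```

With an unlocked heap, `toy_primary_handle_expand()` succeeds on its own, but
`toy_secondary_alloc()` returns false: the primary can never get the lock
that the requester holds.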

> There could be more such issues; I think we should first fix DPDK.
> 1. Memory management should be made independent and should be the first thing
> to come up in rte_eal_init().

As I have explained, buses must be able to report IOVA requirement
at this point (`get_iommu_class()` bus method).
Either `scan()` must complete before that
or `get_iommu_class()` must be able to work before `scan()` is called.
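
A rough sketch of that ordering constraint, in case it helps. All `toy_*`
names and the mode-picking policy are assumptions made for illustration; the
real EAL logic is more involved. The point is only that a bus asked before
its scan has nothing to report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IOVA modes; DC means "don't care / not known yet". */
enum toy_iova_mode { TOY_IOVA_DC, TOY_IOVA_PA, TOY_IOVA_VA };

struct toy_bus {
	int scanned;               /* set once scan() has found the devices */
	enum toy_iova_mode wanted; /* what the devices on this bus need */
};

static void toy_bus_scan(struct toy_bus *bus)
{
	bus->scanned = 1;
}

/* Before scan() there are no devices to ask, so nothing is constrained. */
static enum toy_iova_mode toy_bus_get_iommu_class(const struct toy_bus *bus)
{
	return bus->scanned ? bus->wanted : TOY_IOVA_DC;
}

static void toy_scan_all(struct toy_bus *buses, size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_bus_scan(&buses[i]);
}

/* Simplified policy (an assumption): PA wins over VA, VA wins over DC. */
static enum toy_iova_mode toy_pick_iova_mode(const struct toy_bus *buses,
					     size_t n)
{
	enum toy_iova_mode mode = TOY_IOVA_DC;

	for (size_t i = 0; i < n; i++) {
		enum toy_iova_mode m = toy_bus_get_iommu_class(&buses[i]);

		if (m == TOY_IOVA_PA)
			return TOY_IOVA_PA;
		if (m == TOY_IOVA_VA)
			mode = TOY_IOVA_VA;
	}
	return mode;
}
```

If memory management picked the mode before `toy_scan_all()` ran, it would
see "don't care" from every bus and could choose a mode that a device later
rejects.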

> 2. rte_eal_cleanup() should be the exact opposite of rte_eal_init(); just like
> bus_probe, we should have bus_remove to clean up all the memory allocations.

Yes. For most buses it will be just "unplug each device".
In fact, EAL could do it with `unplug()`, but it is not mandatory.
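
Roughly what I mean by "unplug each device", as a sketch with made-up
`toy_*` names (no real bus API is used, and libc `calloc()` stands in for
whatever allocator the handle uses). Probe allocates one handle per device,
cleanup walks the devices in reverse and releases them, and a counter shows
that nothing is left behind to pin a hugepage.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical device owning one interrupt handle allocation. */
struct toy_dev {
	void *intr_handle;
};

/* Number of handles currently allocated; must be 0 after cleanup. */
static int toy_live_handles;

static int toy_dev_probe(struct toy_dev *dev)
{
	dev->intr_handle = calloc(1, 64); /* size is arbitrary here */
	if (dev->intr_handle == NULL)
		return -1;
	toy_live_handles++;
	return 0;
}

static void toy_dev_unplug(struct toy_dev *dev)
{
	if (dev->intr_handle == NULL)
		return; /* never probed or already unplugged */
	free(dev->intr_handle);
	dev->intr_handle = NULL;
	toy_live_handles--;
}

/* Cleanup mirrors probe: unplug devices in reverse probe order. */
static void toy_bus_cleanup(struct toy_dev *devs, size_t n)
{
	for (size_t i = n; i-- > 0; )
		toy_dev_unplug(&devs[i]);
}

static int toy_bus_probe(struct toy_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (toy_dev_probe(&devs[i]) != 0) {
			toy_bus_cleanup(devs, i); /* roll back what succeeded */
			return -1;
		}
	}
	return 0;
}
```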

> 
> Regarding this IRQ series, I would like to fall back to our original design,
> i.e. rte_intr_instance_alloc() should take an argument saying whether its
> memory should be allocated using glibc malloc or rte_malloc*.

Seems there's no other option to make it on time.

> Can the decision for allocation (malloc or rte_malloc) be made based on
> whether the interrupt handle is shared in the existing code?
> E.g.:
> a. In the case of alarms, intr_handle was a global entry not confined to any
> structure, so it can be allocated with normal malloc.
> b. A PCI device had a static entry for intr_handle inside "struct
> rte_pci_device", and memory for struct rte_pci_device comes from normal
> malloc, so its intr_handle can also be malloc'ed.
> c. Some drivers have intr_handle inside their priv structure, and this priv
> structure gets allocated via rte_malloc, so intr_handle can also be
> rte_malloc'ed.
>
> Later, once DPDK is fixed up, this argument can be removed and all allocations
> can go via the rte_malloc family without any auto-detection.
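
For the record, the shape I would expect from such an argument is sketched
below. The flag names, the struct layout and the `toy_*` functions are all
hypothetical, not the API of the series; the shared path is stubbed with
`calloc()` where real code would call `rte_zmalloc()`. One detail I would
keep is storing the flags in the instance, so that freeing always uses the
matching deallocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flags: who needs to see the handle decides the allocator. */
#define TOY_INTR_ALLOC_PRIVATE 0u /* this process only: libc heap */
#define TOY_INTR_ALLOC_SHARED  1u /* primary and secondary: shared heap */

struct toy_intr_handle {
	unsigned int alloc_flags; /* remembered for the free path */
	int fd;
};

/* Stand-ins for rte_zmalloc()/rte_free(); backed by libc in this sketch. */
static void *toy_shared_zmalloc(size_t size)
{
	return calloc(1, size);
}

static void toy_shared_free(void *ptr)
{
	free(ptr);
}

static struct toy_intr_handle *toy_intr_instance_alloc(unsigned int flags)
{
	struct toy_intr_handle *h;

	if (flags & TOY_INTR_ALLOC_SHARED)
		h = toy_shared_zmalloc(sizeof(*h));
	else
		h = calloc(1, sizeof(*h));
	if (h == NULL)
		return NULL;

	h->alloc_flags = flags;
	h->fd = -1; /* no descriptor attached yet */
	return h;
}

static void toy_intr_instance_free(struct toy_intr_handle *h)
{
	if (h == NULL)
		return;
	if (h->alloc_flags & TOY_INTR_ALLOC_SHARED)
		toy_shared_free(h);
	else
		free(h);
}
```

Under this scheme the alarm and PCI cases (a, b) would pass the private flag
and a driver with an rte_malloc'ed priv structure (c) would pass the shared
one.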
