> On Aug 24, 2018, at 9:44 AM, Matteo Lanzuisi <m.lanzu...@resi.it> wrote:
>
> Hi,
>
> I used valgrind again for a very long time, and it told me nothing strange is
> happening in my code.
> After that, I changed my code as follows:
>
> unsigned lcore_id_start = rte_lcore_id();
> RTE_LCORE_FOREACH(lcore_id)
> {
>     if (lcore_id_start != lcore_id)  // <--------- before this change, every lcore
>                                      // could use its own mempool and enqueue to its own ring
Something in the back of my head tells me this is correct, but I have no real
reason :-(
If this works then I guess it is OK, but it would be nice to understand why this
fix works. Unless you have another thread running on this lcore doing a get/put,
I do not see the problem.
>     {
>         new_work = NULL;
>         result = rte_mempool_get(cea_main_lcore_conf[lcore_id].de_conf.cmd_pool,
>                                  (VOID_P *) &new_work);  // mempools are created one for each logical core
>         if (result == 0)
>         {
>             if (((uint64_t)(new_work)) < 0x7f0000000000)
>                 printf("Result %d, lcore di partenza %u, lcore di ricezione %u, pointer %p\n",
>                        result, rte_lcore_id(), lcore_id, new_work);
>                 // debug print; on my server it should never happen, but with multi-thread
>                 // it always happens on the last logical core!!!!
>             new_work->command = command;  // usage of the memory gotten from the mempool...
>                                           // <<<<<- here is where the application crashes!!!!
>             result = rte_ring_enqueue(cea_main_lcore_conf[lcore_id].de_conf.cmd_ring,
>                                       (VOID_P) new_work);  // enqueues the gotten buffer on the rings of all lcores
>             // check on result value ...
>         }
>         else
>         {
>             // do something if result != 0 ...
>         }
>     }
>     else
>     {
>         // don't use mempool but call a function instead ....
>     }
> }
>
> and now it all goes well.
> Is it possible that sending to itself could generate this issue?
>
> Regards,
> Matteo
>
> On 21/08/2018 16:46, Matteo Lanzuisi wrote:
>> On 21/08/2018 14:51, Wiles, Keith wrote:
>>>
>>>> On Aug 21, 2018, at 7:44 AM, Matteo Lanzuisi <m.lanzu...@resi.it> wrote:
>>>>
>>>> On 21/08/2018 14:17, Wiles, Keith wrote:
>>>>>> On Aug 21, 2018, at 7:01 AM, Matteo Lanzuisi <m.lanzu...@resi.it> wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On 20/08/2018 18:03, Wiles, Keith wrote:
>>>>>>>> On Aug 20, 2018, at 9:47 AM, Matteo Lanzuisi <m.lanzu...@resi.it>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello Olivier,
>>>>>>>>
>>>>>>>> On 13/08/2018 23:54, Olivier Matz wrote:
>>>>>>>>
>>>>>>>>> Hello Matteo,
>>>>>>>>>
>>>>>>>>> On Mon, Aug 13, 2018 at 03:20:44PM +0200, Matteo Lanzuisi wrote:
>>>>>>>>>
>>>>>>>>>> Any suggestion? any idea about this behaviour?
>>>>>>>>>>
>>>>>>>>>> On 08/08/2018 11:56, Matteo Lanzuisi wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> recently I began using the "dpdk-17.11-11.el7.x86_64" rpm (RedHat rpm) on
>>>>>>>>>>> RedHat 7.5, kernel 3.10.0-862.6.3.el7.x86_64, while porting an
>>>>>>>>>>> application from RH6 to RH7. On RH6 I used dpdk-2.2.0.
>>>>>>>>>>>
>>>>>>>>>>> This application is made up of one or more threads (each one on a
>>>>>>>>>>> different logical core) reading packets from i40e interfaces.
>>>>>>>>>>>
>>>>>>>>>>> Each thread can call the following code lines when receiving a
>>>>>>>>>>> specific
>>>>>>>>>>> packet:
>>>>>>>>>>>
>>>>>>>>>>> RTE_LCORE_FOREACH(lcore_id)
>>>>>>>>>>> {
>>>>>>>>>>>     result = rte_mempool_get(cea_main_lcore_conf[lcore_id].de_conf.cmd_pool,
>>>>>>>>>>>                              (VOID_P *) &new_work); // mempools are created one for each logical core
>>>>>>>>>>>     if (((uint64_t)(new_work)) < 0x7f0000000000)
>>>>>>>>>>>         printf("Result %d, lcore di partenza %u, lcore di ricezione %u, pointer %p\n",
>>>>>>>>>>>                result, rte_lcore_id(), lcore_id, new_work);
>>>>>>>>>>>         // debug print; on my server it should never happen, but with multi-thread
>>>>>>>>>>>         // it always happens on the last logical core!!!!
>>>>>>>>>>>
>>>>>>>>> Here, checking the value of new_work looks wrong to me, before
>>>>>>>>> ensuring that result == 0. At least, new_work should be set to
>>>>>>>>> NULL before calling rte_mempool_get().
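>>>>>>>>>
>>>>>>>>> Something like this (just a sketch, reusing the names from your snippet):
>>>>>>>>>
>>>>>>>>> new_work = NULL;
>>>>>>>>> result = rte_mempool_get(cea_main_lcore_conf[lcore_id].de_conf.cmd_pool,
>>>>>>>>>                          (VOID_P *) &new_work);
>>>>>>>>> if (result == 0 && new_work != NULL) {
>>>>>>>>>     /* only look at (or use) new_work once the get has succeeded */
>>>>>>>>>     if (((uint64_t)new_work) < 0x7f0000000000)
>>>>>>>>>         printf("suspicious pointer %p from lcore %u\n", new_work, lcore_id);
>>>>>>>>>     new_work->command = command;
>>>>>>>>> }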
>>>>>>>>>
>>>>>>>> I put the check after result == 0, and just before the
>>>>>>>> rte_mempool_get() I set new_work to NULL, but nothing changed.
>>>>>>>> The first time something goes wrong, the print is:
>>>>>>>>
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 635, pointer 0x880002
>>>>>>>>
>>>>>>>> Sorry for the Italian-language print :) it means that the application is
>>>>>>>> sending a message from logical core 1 to logical core 2, it's the 635th
>>>>>>>> time, the result is 0 and the pointer is 0x880002, while all the pointers
>>>>>>>> before were 0x7ffxxxxxx.
>>>>>>>> One strange thing is that this behaviour always happens from logical core 1
>>>>>>>> to logical core 2 when the counter is 635!!! (Sending messages from 2 to 1,
>>>>>>>> 1 to 1 or 2 to 2 is all ok.)
>>>>>>>> Another strange thing is that the pointers from counter 636 to 640 are
>>>>>>>> NULL, and from 641 on they are good again... as you can see below (I
>>>>>>>> attached the result of a test without the "if" check on the value of
>>>>>>>> new_work, and only for messages from lcore 1 to lcore 2):
>>>>>>>>
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 627, pointer 0x7ffe8a261880
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 628, pointer 0x7ffe8a261900
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 629, pointer 0x7ffe8a261980
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 630, pointer 0x7ffe8a261a00
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 631, pointer 0x7ffe8a261a80
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 632, pointer 0x7ffe8a261b00
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 633, pointer 0x7ffe8a261b80
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 634, pointer 0x7ffe8a261c00
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 635, pointer 0x880002
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 636, pointer (nil)
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 637, pointer (nil)
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 638, pointer (nil)
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 639, pointer (nil)
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 640, pointer (nil)
>>>>>>>>
>>>>>>> This sure does seem like a memory overwrite problem, with maybe a
>>>>>>> memset(0) in the mix as well. Have you tried using hardware breakpoints
>>>>>>> with the 0x880002 or 0x00 being written into this range?
>>>>>> I put some breakpoints and found this:
>>>>>>
>>>>>> 1 - using pointer 0x880002, the output is (the pointer falls in the middle
>>>>>> of two rwlocks):
>>>>>>
>>>>>> (gdb) awatch *0x880002
>>>>>> Hardware access (read/write) watchpoint 1: *0x880002
>>>>>> (gdb) c
>>>>>> Continuing.
>>>>>> [New Thread 0x7fffeded5700 (LWP 19969)]
>>>>>> [New Thread 0x7fffed6d4700 (LWP 19970)]
>>>>>> [New Thread 0x7fffeced3700 (LWP 19971)]
>>>>>> [New Thread 0x7fffec6d2700 (LWP 19972)]
>>>>>> [New Thread 0x7fffebed1700 (LWP 19973)]
>>>>>> [New Thread 0x7fffeb6d0700 (LWP 19974)]
>>>>>> Hardware access (read/write) watchpoint 1: *0x880002
>>>>>>
>>>>>> Value = 0
>>>>>> rte_rwlock_init (rwl=0x880000 <ikco_sdkif_actlist_lock+677024>)
>>>>>> at
>>>>>> /usr/share/dpdk/x86_64-default-linuxapp-gcc/include/generic/rte_rwlock.h:81
>>>>>> 81 }
>>>>>> (gdb) c
>>>>>> Continuing.
>>>>>> Hardware access (read/write) watchpoint 1: *0x880002
>>>>> These are most likely false positive hits and not the issue.
>>>>>> Value = 0
>>>>>> rte_rwlock_init (rwl=0x880004 <ikco_sdkif_actlist_lock+677028>)
>>>>>> at
>>>>>> /usr/share/dpdk/x86_64-default-linuxapp-gcc/include/generic/rte_rwlock.h:81
>>>>>> 81 }
>>>>>> (gdb) c
>>>>>> Continuing.
>>>>>>
>>>>>> 2 - when using pointers less than or equal to 0x7ffe8a261d64 (in the range
>>>>>> of the mempool), gdb tells me nothing about them (I don't use them, I just
>>>>>> get them from the pool and then put them back in the pool);
>>>>>>
>>>>>> 3 - when using pointer 0x7ffe8a261d65 or greater, this is the output of
>>>>>> gdb:
>>>>>>
>>>>>> (gdb) awatch *(int *)0x7ffe8a261d65
>>>>>> Hardware access (read/write) watchpoint 1: *(int *)0x7ffe8a261d65
>>>>>> (gdb) c
>>>>>> Continuing.
>>>>>> [New Thread 0x7fffeded5700 (LWP 17689)]
>>>>>> [New Thread 0x7fffed6d4700 (LWP 17690)]
>>>>>> [New Thread 0x7fffeced3700 (LWP 17691)]
>>>>>> [New Thread 0x7fffec6d2700 (LWP 17692)]
>>>>>> [New Thread 0x7fffebed1700 (LWP 17693)]
>>>>>> [New Thread 0x7fffeb6d0700 (LWP 17694)]
>>>>>> Hardware access (read/write) watchpoint 1: *(int *)0x7ffe8a261d65
>>>>>>
>>>>>> Value = 0
>>>>>> 0x00007ffff3798c21 in mempool_add_elem (mp=mp@entry=0x7ffebfd8d6c0,
>>>>>> obj=obj@entry=0x7ffe8a261d80,
>>>>>> iova=iova@entry=4465237376) at
>>>>>> /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:140
>>>>>> 140 STAILQ_INSERT_TAIL(&mp->elt_list, hdr, next);
>>>>>> (gdb) where
>>>>>> #0 0x00007ffff3798c21 in mempool_add_elem (mp=mp@entry=0x7ffebfd8d6c0,
>>>>>> obj=obj@entry=0x7ffe8a261d80,
>>>>>> iova=iova@entry=4465237376) at
>>>>>> /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:140
>>>>>> #1 0x00007ffff37990f0 in rte_mempool_populate_iova (mp=0x7ffebfd8d6c0,
>>>>>> vaddr=0x7ffe8a23d540 "",
>>>>>> iova=4465087808, len=8388480, free_cb=<optimized out>,
>>>>>> opaque=<optimized out>)
>>>>>> at /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:424
>>>>>> #2 0x00007ffff379967d in rte_mempool_populate_default
>>>>>> (mp=mp@entry=0x7ffebfd8d6c0)
>>>>>> at /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:624
>>>>>> #3 0x00007ffff3799e89 in rte_mempool_create (name=<optimized out>,
>>>>>> n=<optimized out>,
>>>>>> elt_size=<optimized out>, cache_size=<optimized out>,
>>>>>> private_data_size=<optimized out>,
>>>>>> mp_init=0x7ffff444e410 <rte_pktmbuf_pool_init>, mp_init_arg=0x0,
>>>>>> obj_init=0x7ffff444e330 <rte_pktmbuf_init>, obj_init_arg=0x0,
>>>>>> socket_id=0, flags=0)
>>>>>> at /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:952
>>>>>> #4 0x0000000000548a52 in main (argc=16, argv=0x7fffffffe3c8)
>>>>>> at
>>>>>> /root/gemini-cea-4.6.0/msrc/sys/com/linux-dpdk/cea-app/../../../../sys/com/linux-dpdk/cea-app/main.c:2360
>>>>>> (gdb) c
>>>>>> Continuing.
>>>>>> Hardware access (read/write) watchpoint 1: *(int *)0x7ffe8a261d65
>>>>> This seems to be just creating a pktmbuf pool. The STAILQ_INSERT_TAIL is
>>>>> just putting the mempool on the main tailq list for mempools in DPDK.
>>>>>
>>>>>> Old value = 0
>>>>>> New value = -402653184
>>>>>> 0x00007ffff3798c24 in mempool_add_elem (mp=mp@entry=0x7ffebfd8d6c0,
>>>>>> obj=obj@entry=0x7ffe8a261e00,
>>>>>> iova=iova@entry=4465237504) at
>>>>>> /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:140
>>>>>> 140 STAILQ_INSERT_TAIL(&mp->elt_list, hdr, next);
>>>>>> (gdb) where
>>>>>> #0 0x00007ffff3798c24 in mempool_add_elem (mp=mp@entry=0x7ffebfd8d6c0,
>>>>>> obj=obj@entry=0x7ffe8a261e00,
>>>>>> iova=iova@entry=4465237504) at
>>>>>> /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:140
>>>>>> #1 0x00007ffff37990f0 in rte_mempool_populate_iova (mp=0x7ffebfd8d6c0,
>>>>>> vaddr=0x7ffe8a23d540 "",
>>>>>> iova=4465087808, len=8388480, free_cb=<optimized out>,
>>>>>> opaque=<optimized out>)
>>>>>> at /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:424
>>>>>> #2 0x00007ffff379967d in rte_mempool_populate_default
>>>>>> (mp=mp@entry=0x7ffebfd8d6c0)
>>>>>> at /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:624
>>>>>> #3 0x00007ffff3799e89 in rte_mempool_create (name=<optimized out>,
>>>>>> n=<optimized out>,
>>>>>> elt_size=<optimized out>, cache_size=<optimized out>,
>>>>>> private_data_size=<optimized out>,
>>>>>> mp_init=0x7ffff444e410 <rte_pktmbuf_pool_init>, mp_init_arg=0x0,
>>>>>> obj_init=0x7ffff444e330 <rte_pktmbuf_init>, obj_init_arg=0x0,
>>>>>> socket_id=0, flags=0)
>>>>>> at /usr/src/debug/dpdk-17.11/lib/librte_mempool/rte_mempool.c:952
>>>>>> #4 0x0000000000548a52 in main (argc=16, argv=0x7fffffffe3c8)
>>>>>> at
>>>>>> /root/gemini-cea-4.6.0/msrc/sys/com/linux-dpdk/cea-app/../../../../sys/com/linux-dpdk/cea-app/main.c:2360
>>>>>> (gdb) c
>>>>>> Continuing.
>>>>>>
>>>>>> What do you think? Is it normal that mempool_add_elem is called only on
>>>>>> certain pointers of the mempool?
>>>>>> I attached the initialization of the mempool. Can this be wrong?
>>>>> All mempools with a cache size will have two queues to put memory on: one
>>>>> is the per-lcore list, which is used as a fast-access queue. When the cache
>>>>> becomes empty, or has more entries than the cache was created with, it
>>>>> pushes the extra entries to the main list of mempool data.
>>>> Why do you say "mempools with a cache size"? In my initialization this
>>>> mempool has cache_size = 0.
>>> If you give a cache size then you will have a cache list per lcore; in your
>>> case you do not have a cache. BTW, not having a cache will affect performance
>>> a great deal.
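>>>
>>> As a rough sketch (the pool name, element count and cache value below are made
>>> up, not taken from your code), the per-lcore cache is enabled simply by passing
>>> a non-zero cache_size to rte_mempool_create():
>>>
>>> /* cache_size must be <= RTE_MEMPOOL_CACHE_MAX_SIZE; 256 is just an example */
>>> struct rte_mempool *cmd_pool_cached =
>>>         rte_mempool_create("cmd_pool_cached",        /* hypothetical name */
>>>                            65535,                    /* number of elements */
>>>                            sizeof(CEA_DECODE_CMD_T), /* element size */
>>>                            256,                      /* per-lcore cache size */
>>>                            0,                        /* private data size */
>>>                            NULL, NULL,               /* pool constructor + arg */
>>>                            NULL, NULL,               /* object constructor + arg */
>>>                            rte_socket_id(), 0);      /* flags 0 = MP put / MC get */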
>>>
>>>>> The only time that rwlock is touched is to get/put items on the main
>>>>> mempool.
>>>>>
>>>>> Just as a data point, have you tried this app on 18.08 yet? I do not see
>>>>> the problem yet, sorry.
>>>> I'll try 18.08 and let you know
>>
>> Hi,
>>
>> I tried 18.08 but nothing changed about the described behaviour. I'm thinking
>> about some overflow in my code, but valgrind on my application tells me
>> nothing more, and that seems strange to me.
>> Is there any particular way to debug memory issues in a dpdk application apart
>> from valgrind?
>>
>> Regards,
>> Matteo
>>
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 641, pointer 0x7ffe8a262b00
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 642, pointer 0x7ffe8a262b80
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 643, pointer 0x7ffe8a262d00
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 644, pointer 0x7ffe8a262d80
>>>>>>>> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 645, pointer 0x7ffe8a262e00
>>>>>>>>
>>>>>>>>
>>>>>>>>>>>     if (result == 0)
>>>>>>>>>>>     {
>>>>>>>>>>>         new_work->command = command; // usage of the memory gotten from the mempool...
>>>>>>>>>>>                                      // <<<<<- here is where the application crashes!!!!
>>>>>>>>>>>
>>>>>>>>> Do you know why it crashes? Is it that new_work is NULL?
>>>>>>>>>
>>>>>>>> The pointer is not NULL, but it is not sequential to the others (0x880002,
>>>>>>>> as written earlier in this email). It seems to be in a memory zone that is
>>>>>>>> not in the DPDK hugepages, or something similar.
>>>>>>>> If I use this pointer the application crashes.
>>>>>>>>
>>>>>>>>> Can you check how the mempool is initialized? It should be multi-consumer
>>>>>>>>> and, depending on your use case, single- or multi-producer.
>>>>>>>>>
>>>>>>>> Here is the initialization of this mempool
>>>>>>>>
>>>>>>>> cea_main_cmd_pool[i] = rte_mempool_create(pool_name,
>>>>>>>>         (unsigned int) (ikco_cmd_buffers - 1),  // 65536 - 1 in this case
>>>>>>>>         sizeof (CEA_DECODE_CMD_T),              // 24 bytes
>>>>>>>>         0, 0,
>>>>>>>>         rte_pktmbuf_pool_init, NULL,
>>>>>>>>         rte_pktmbuf_init, NULL,
>>>>>>>>         rte_socket_id(), 0);
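>>>>>>>>
>>>>>>>> For comparison, rte_mempool_create() also accepts NULL constructors, so a
>>>>>>>> pool of plain 24-byte command structures could be created without the
>>>>>>>> pktmbuf callbacks, and flags = 0 already means multi-producer put /
>>>>>>>> multi-consumer get. A sketch only, reusing the same names:
>>>>>>>>
>>>>>>>> cea_main_cmd_pool[i] = rte_mempool_create(pool_name,
>>>>>>>>         (unsigned int) (ikco_cmd_buffers - 1),  // number of elements
>>>>>>>>         sizeof (CEA_DECODE_CMD_T),              // 24-byte element
>>>>>>>>         0, 0,                                   // no cache, no private data
>>>>>>>>         NULL, NULL,                             // no pool constructor
>>>>>>>>         NULL, NULL,                             // no per-object constructor
>>>>>>>>         rte_socket_id(), 0);                    // flags 0 = MP put / MC get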
>>>>>>>>
>>>>>>>>> Another thing that could be checked: at all the places where you
>>>>>>>>> return your work object to the mempool, you should add a check
>>>>>>>>> that it is not NULL. Or just enabling RTE_LIBRTE_MEMPOOL_DEBUG
>>>>>>>>> could do the trick: it adds some additional checks when doing
>>>>>>>>> mempool operations.
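>>>>>>>>>
>>>>>>>>> For example, at the place where the object goes back to the pool (a sketch;
>>>>>>>>> "work" and the surrounding code are hypothetical, only rte_mempool_put()
>>>>>>>>> is the real call):
>>>>>>>>>
>>>>>>>>> if (work != NULL)
>>>>>>>>>     rte_mempool_put(cea_main_lcore_conf[lcore_id].de_conf.cmd_pool, (VOID_P) work);
>>>>>>>>> else
>>>>>>>>>     printf("BUG: returning a NULL work object to the mempool\n");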
>>>>>>>>>
>>>>>>>> I think I have already answered this point with the prints earlier in this
>>>>>>>> email.
>>>>>>>>
>>>>>>>> What do you think about this behaviour?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Matteo
>>>>>>>>
>>>>>>>>>>>         result = rte_ring_enqueue(cea_main_lcore_conf[lcore_id].de_conf.cmd_ring,
>>>>>>>>>>>                                   (VOID_P) new_work); // enqueues the gotten buffer on the rings of all lcores
>>>>>>>>>>>         // check on result value ...
>>>>>>>>>>>     }
>>>>>>>>>>>     else
>>>>>>>>>>>     {
>>>>>>>>>>>         // do something if result != 0 ...
>>>>>>>>>>>     }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> This code worked perfectly (never had an issue) on dpdk-2.2.0, while if
>>>>>>>>>>> I use more than 1 thread doing these operations on dpdk-17.11 it happens
>>>>>>>>>>> that after some time the "new_work" pointer is not a good one, and the
>>>>>>>>>>> application crashes when using that pointer.
>>>>>>>>>>>
>>>>>>>>>>> It seems that these lines cannot be used by more than one thread
>>>>>>>>>>> simultaneously. I also used many 2017 and 2018 dpdk versions without
>>>>>>>>>>> success.
>>>>>>>>>>>
>>>>>>>>>>> Is this code possible on the new dpdk versions? Or do I have to change my
>>>>>>>>>>> application so that this code is called by just one lcore at a time?
>>>>>>>>>>>
>>>>>>>>> Assuming the mempool is properly initialized, I don't see any reason
>>>>>>>>> why it would not work. There have been a lot of changes in mempool between
>>>>>>>>> dpdk-2.2.0 and dpdk-17.11, but this behavior should remain the same.
>>>>>>>>>
>>>>>>>>> If the comments above do not help to solve the issue, it could be
>>>>>>>>> helpful
>>>>>>>>> to try to reproduce the issue in a minimal program, so we can help to
>>>>>>>>> review it.
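>>>>>>>>>
>>>>>>>>> As a starting point for such a minimal program, something along these lines
>>>>>>>>> (a rough sketch only: the element size, counts and the get/enqueue pattern
>>>>>>>>> are guesses based on the snippets above, and error handling is minimal):
>>>>>>>>>
>>>>>>>>> /* minimal_repro.c - rough sketch: each lcore owns a mempool and a ring;
>>>>>>>>>  * every lcore gets an object from every lcore's pool and enqueues it to
>>>>>>>>>  * that lcore's ring, and each lcore drains its own ring back to its pool. */
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <stdlib.h>
>>>>>>>>> #include <rte_eal.h>
>>>>>>>>> #include <rte_lcore.h>
>>>>>>>>> #include <rte_mempool.h>
>>>>>>>>> #include <rte_ring.h>
>>>>>>>>> #include <rte_debug.h>
>>>>>>>>>
>>>>>>>>> #define NB_CMD   65535  /* guessed element count */
>>>>>>>>> #define CMD_SIZE 24     /* guessed element size */
>>>>>>>>>
>>>>>>>>> static struct rte_mempool *pool[RTE_MAX_LCORE];
>>>>>>>>> static struct rte_ring   *ring[RTE_MAX_LCORE];
>>>>>>>>>
>>>>>>>>> static int lcore_main(void *arg)
>>>>>>>>> {
>>>>>>>>>     unsigned self = rte_lcore_id(), lcore_id;
>>>>>>>>>     void *obj;
>>>>>>>>>
>>>>>>>>>     (void)arg;
>>>>>>>>>     for (;;) {
>>>>>>>>>         RTE_LCORE_FOREACH(lcore_id) {
>>>>>>>>>             obj = NULL;
>>>>>>>>>             if (rte_mempool_get(pool[lcore_id], &obj) == 0) {
>>>>>>>>>                 if (rte_ring_enqueue(ring[lcore_id], obj) != 0)
>>>>>>>>>                     rte_mempool_put(pool[lcore_id], obj); /* ring full */
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>         while (rte_ring_dequeue(ring[self], &obj) == 0)
>>>>>>>>>             rte_mempool_put(pool[self], obj);
>>>>>>>>>     }
>>>>>>>>>     return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>> {
>>>>>>>>>     unsigned lcore_id;
>>>>>>>>>     char name[32];
>>>>>>>>>
>>>>>>>>>     if (rte_eal_init(argc, argv) < 0)
>>>>>>>>>         rte_exit(EXIT_FAILURE, "EAL init failed\n");
>>>>>>>>>
>>>>>>>>>     RTE_LCORE_FOREACH(lcore_id) {
>>>>>>>>>         snprintf(name, sizeof(name), "pool_%u", lcore_id);
>>>>>>>>>         pool[lcore_id] = rte_mempool_create(name, NB_CMD, CMD_SIZE, 0, 0,
>>>>>>>>>                                             NULL, NULL, NULL, NULL,
>>>>>>>>>                                             rte_socket_id(), 0);
>>>>>>>>>         snprintf(name, sizeof(name), "ring_%u", lcore_id);
>>>>>>>>>         ring[lcore_id] = rte_ring_create(name, 65536, rte_socket_id(), 0);
>>>>>>>>>         if (pool[lcore_id] == NULL || ring[lcore_id] == NULL)
>>>>>>>>>             rte_exit(EXIT_FAILURE, "pool/ring creation failed for lcore %u\n",
>>>>>>>>>                      lcore_id);
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     rte_eal_mp_remote_launch(lcore_main, NULL, CALL_MASTER); /* 17.11/18.08 API */
>>>>>>>>>     rte_eal_mp_wait_lcore(); /* workers loop forever in this sketch */
>>>>>>>>>     return 0;
>>>>>>>>> }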
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Olivier
>>>>>>>>>
>>>>>>> Regards,
>>>>>>> Keith
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Matteo
>>>>>>
>>>>> Regards,
>>>>> Keith
>>>>>
>>>>>
>>>>>
>>>> Regards,
>>>> Matteo
>>> Regards,
>>> Keith
>>>
>>>
>>
>>
Regards,
Keith