> On Aug 20, 2018, at 9:47 AM, Matteo Lanzuisi <m.lanzu...@resi.it> wrote:
> 
> Hello Olivier,
> 
> On 13/08/2018 23:54, Olivier Matz wrote:
>> Hello Matteo,
>> 
>> On Mon, Aug 13, 2018 at 03:20:44PM +0200, Matteo Lanzuisi wrote:
>>> Any suggestion? Any idea about this behaviour?
>>> 
>>>> On 08/08/2018 11:56, Matteo Lanzuisi wrote:
>>>> Hi all,
>>>> 
>>>> recently I began using the "dpdk-17.11-11.el7.x86_64" rpm (the RedHat
>>>> rpm) on RedHat 7.5, kernel 3.10.0-862.6.3.el7.x86_64, while porting an
>>>> application from RH6 to RH7. On RH6 I used dpdk-2.2.0.
>>>> 
>>>> This application is made up of one or more threads (each one on a
>>>> different logical core) reading packets from i40e interfaces.
>>>> 
>>>> Each thread can call the following code lines when receiving a specific
>>>> packet:
>>>> 
>>>> RTE_LCORE_FOREACH(lcore_id)
>>>> {
>>>>         // mempools are created one for each logical core
>>>>         result = rte_mempool_get(cea_main_lcore_conf[lcore_id].de_conf.cmd_pool,
>>>>                                  (VOID_P *) &new_work);
>>>>         // debug print: on my server this should never happen, but with
>>>>         // multiple threads it always happens on the last logical core!!!!
>>>>         if (((uint64_t)(new_work)) < 0x7f0000000000)
>>>>             printf("Result %d, lcore di partenza %u, lcore di ricezione %u, pointer %p\n",
>>>>                    result, rte_lcore_id(), lcore_id, new_work);
>> Here, checking the value of new_work before ensuring that result == 0
>> looks wrong to me. At least, new_work should be set to NULL before
>> calling rte_mempool_get().
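
For reference, a minimal sketch of the pattern Olivier describes, reusing
the variable names from the code quoted above:

        new_work = NULL;    /* known value if the get fails */
        result = rte_mempool_get(cea_main_lcore_conf[lcore_id].de_conf.cmd_pool,
                                 (VOID_P *) &new_work);
        if (result == 0 && new_work != NULL) {
                /* only here is new_work known to be a valid element */
        }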
> I put the check after result == 0, and just before the rte_mempool_get() I 
> set new_work to NULL, but nothing changed.
> The first time something goes wrong, the print is
> 
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 635, pointer 0x880002
> 
> Sorry for the Italian in the print :) it means that the application is
> sending a message from logical core 1 to logical core 2, it's the 635th
> time, the result is 0, and the pointer is 0x880002, while all the pointers
> before were 0x7ffxxxxxx.
> One strange thing is that this behaviour always happens from logical core
> 1 to logical core 2 when the counter is 635!!! (Sending messages from 2
> to 1, or 1 to 1, or 2 to 2, is all fine.)
> Another strange thing is that the pointers from counter 636 to 640 are
> NULL, and from 641 they become good again... as you can see below (I
> attached the result of a test without the "if" check on the value of
> new_work, and only for messages from lcore 1 to lcore 2)
> 
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 627, pointer 0x7ffe8a261880
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 628, pointer 0x7ffe8a261900
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 629, pointer 0x7ffe8a261980
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 630, pointer 0x7ffe8a261a00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 631, pointer 0x7ffe8a261a80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 632, pointer 0x7ffe8a261b00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 633, pointer 0x7ffe8a261b80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 634, pointer 0x7ffe8a261c00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 635, pointer 0x880002
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 636, pointer (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 637, pointer (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 638, pointer (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 639, pointer (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 640, pointer (nil)

This sure does seem like a memory overwrite problem, maybe with a memset(0)
in the mix as well. Have you tried setting hardware breakpoints (watchpoints)
to catch the 0x880002 or the zeros being written into this range?
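
If hardware watchpoints are hard to arrange on that box, a software-side
triage can at least narrow down when the bad value shows up. A sketch
(the helper name and the address bounds are assumptions, taken from the
0x7ffe8a26xxxx values in the logs; adjust them to the real pool range):

        #include <stdint.h>

        /* hypothetical triage helper: flag any pointer outside the
         * address range observed for valid pool elements */
        static inline int
        pointer_looks_sane(const void *p)
        {
                uintptr_t a = (uintptr_t) p;
                return a >= 0x7ffe8a000000UL && a < 0x7ffe8b000000UL;
        }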

> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 641, pointer 0x7ffe8a262b00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 642, pointer 0x7ffe8a262b80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 643, pointer 0x7ffe8a262d00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 644, pointer 0x7ffe8a262d80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 645, pointer 0x7ffe8a262e00
> 
>> 
>>>>         if (result == 0)
>>>>         {
>>>>             // use the memory gotten from the mempool...
>>>>             // <<<<<- here is where the application crashes!!!!
>>>>             new_work->command = command;
>> Do you know why it crashes? Is it that new_work is NULL?
> The pointer is not NULL, but it is not sequential to the others (0x880002,
> as written earlier in this email). It seems to be in a memory zone outside
> the DPDK hugepages, or something similar.
> If I use this pointer, the application crashes.
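
One way to confirm that suspicion on DPDK 17.11, sketched here as an
assumption rather than something from the thread: rte_mem_virt2phy()
returns RTE_BAD_PHYS_ADDR when a virtual address cannot be translated
(e.g. it is not backed by a mapped page), so a quick check would tell
whether 0x880002 is even mapped:

        #include <rte_memory.h>

        /* sketch: report pointers that do not translate to a
         * physical address (likely outside any mapped region) */
        if (rte_mem_virt2phy(new_work) == RTE_BAD_PHYS_ADDR)
                printf("pointer %p does not translate to a physical address\n",
                       new_work);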
>> 
>> Can you check how the mempool is initialized? It should be multi-consumer
>> and, depending on your use case, single- or multi-producer.
> Here is the initialization of this mempool
> 
> cea_main_cmd_pool[i] = rte_mempool_create(pool_name,
>             (unsigned int) (ikco_cmd_buffers - 1), // 65536 - 1 in this case
>             sizeof (CEA_DECODE_CMD_T), // 24 bytes
>             0, 0,
>             rte_pktmbuf_pool_init, NULL,
>             rte_pktmbuf_init, NULL,
>             rte_socket_id(), 0);
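
One detail worth flagging here, with the caveat that only this fragment is
visible: rte_pktmbuf_pool_init and rte_pktmbuf_init are initializers for
mbuf pools and assume each element is a full rte_mbuf, so with 24-byte
CEA_DECODE_CMD_T elements they may write past the objects. For a pool of
plain command structs, something like the following sketch would be more
usual (flags = 0 already selects the default multi-producer/multi-consumer
behavior):

        /* sketch: a non-mbuf mempool needs no mbuf init callbacks */
        cea_main_cmd_pool[i] = rte_mempool_create(pool_name,
                    (unsigned int) (ikco_cmd_buffers - 1),
                    sizeof (CEA_DECODE_CMD_T),
                    0, 0,           /* no per-lcore cache, no private data */
                    NULL, NULL,     /* no pool init: not an mbuf pool */
                    NULL, NULL,     /* no per-object init */
                    rte_socket_id(),
                    0);             /* 0 = default MP/MC flags */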
>> 
>> Another thing that could be checked: at all the places where you
>> return your work object to the mempool, you should add a check
>> that it is not NULL. Or just enabling RTE_LIBRTE_MEMPOOL_DEBUG
>> could do the trick: it adds some additional checks when doing
>> mempool operations.
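
A sketch of the check Olivier means at the return sites (the wrapper is
hypothetical; RTE_LIBRTE_MEMPOOL_DEBUG adds cookie checks that go further
than this):

        #include <rte_debug.h>
        #include <rte_mempool.h>

        /* hypothetical wrapper: catch a NULL pointer at the moment it
         * is returned to the pool instead of at the next get() */
        static inline void
        checked_mempool_put(struct rte_mempool *mp, void *obj)
        {
                if (obj == NULL)
                        rte_panic("returning NULL to mempool %s\n", mp->name);
                rte_mempool_put(mp, obj);
        }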
> I think I have already answered this point with the prints above in this email.
> 
> What do you think about this behaviour?
> 
> Regards,
> Matteo
>> 
>>>>             // enqueues the gotten buffer on the rings of all lcores
>>>>             result = rte_ring_enqueue(cea_main_lcore_conf[lcore_id].de_conf.cmd_ring,
>>>>                                       (VOID_P) new_work);
>>>>             // check on result value ...
>>>>         }
>>>>         else
>>>>         {
>>>>             // do something if result != 0 ...
>>>>         }
>>>> }
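
The ring side deserves the same scrutiny, although its creation code is not
shown: rte_ring_enqueue() from several lcores is only safe if the ring was
created without RING_F_SP_ENQ. A sketch of a multi-producer-safe creation,
with assumed names:

        /* sketch: flags = 0 keeps the default multi-producer/
         * multi-consumer ring, required when several lcores enqueue
         * to the same ring (ring_name and ring_size are assumed) */
        cea_main_lcore_conf[lcore_id].de_conf.cmd_ring =
                rte_ring_create(ring_name, ring_size /* power of two */,
                                rte_socket_id(), 0);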
>>>> 
>>>> This code worked perfectly (never had an issue) on dpdk-2.2.0, but if I
>>>> use more than one thread doing these operations on dpdk-17.11, after a
>>>> while the "new_work" pointer is not a good one, and the application
>>>> crashes when using that pointer.
>>>> 
>>>> It seems that these lines cannot be used by more than one thread
>>>> simultaneously. I also tried many 2017 and 2018 dpdk versions, without
>>>> success.
>>>> 
>>>> Is this code valid on the new dpdk versions? Or do I have to change my
>>>> application so that this code is called by just one lcore at a time?
>> Assuming the mempool is properly initialized, I don't see any reason
>> why it would not work. There have been a lot of changes in mempool between
>> dpdk-2.2.0 and dpdk-17.11, but this behavior should remain the same.
>> 
>> If the comments above do not help solve the issue, it could be helpful
>> to try to reproduce it in a minimal program, so we can help review it.
>> 
>> Regards,
>> Olivier
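
Along the lines of Olivier's suggestion, a minimal reproducer could look
like the sketch below. Everything in it is assumed (the pool size and
element size mirror the fragment quoted above); each lcore just hammers one
shared pool with get/put and validates the pointer:

        #include <stdio.h>
        #include <stdint.h>
        #include <rte_eal.h>
        #include <rte_lcore.h>
        #include <rte_mempool.h>
        #include <rte_debug.h>

        static struct rte_mempool *mp;

        static int
        worker(void *arg)
        {
                void *obj;
                unsigned int i;

                (void) arg;
                for (i = 0; i < 1000000; i++) {
                        obj = NULL;
                        if (rte_mempool_get(mp, &obj) != 0)
                                continue;    /* pool empty, try again */
                        if (obj == NULL)
                                rte_panic("bad object %p on lcore %u\n",
                                          obj, rte_lcore_id());
                        rte_mempool_put(mp, obj);
                }
                return 0;
        }

        int
        main(int argc, char **argv)
        {
                unsigned int lcore_id;

                if (rte_eal_init(argc, argv) < 0)
                        rte_panic("cannot init EAL\n");
                /* element count and size mirror the quoted pool */
                mp = rte_mempool_create("cmd_pool", 65535, 24, 0, 0,
                                        NULL, NULL, NULL, NULL,
                                        rte_socket_id(), 0);
                if (mp == NULL)
                        rte_panic("cannot create mempool\n");
                RTE_LCORE_FOREACH_SLAVE(lcore_id)
                        rte_eal_remote_launch(worker, NULL, lcore_id);
                worker(NULL);    /* master lcore participates too */
                rte_eal_mp_wait_lcore();
                return 0;
        }

If the crash reproduces here, it points at the mempool usage itself; if
not, suspicion shifts back to the mbuf init callbacks or a stray write
elsewhere in the application.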

Regards,
Keith
