> On Aug 20, 2018, at 9:47 AM, Matteo Lanzuisi <m.lanzu...@resi.it> wrote:
>
> Hello Olivier,
>
> On 13/08/2018 23:54, Olivier Matz wrote:
>> Hello Matteo,
>>
>> On Mon, Aug 13, 2018 at 03:20:44PM +0200, Matteo Lanzuisi wrote:
>>> Any suggestions? Any idea about this behaviour?
>>>
>>> On 08/08/2018 11:56, Matteo Lanzuisi wrote:
>>>> Hi all,
>>>>
>>>> recently I began using the "dpdk-17.11-11.el7.x86_64" rpm (RedHat rpm) on
>>>> RedHat 7.5, kernel 3.10.0-862.6.3.el7.x86_64, while porting an
>>>> application from RH6 to RH7. On RH6 I used dpdk-2.2.0.
>>>>
>>>> This application is made up of one or more threads (each one on a
>>>> different logical core) reading packets from i40e interfaces.
>>>>
>>>> Each thread can call the following code lines when receiving a specific
>>>> packet:
>>>>
>>>> RTE_LCORE_FOREACH(lcore_id)
>>>> {
>>>>     // mempools are created one for each logical core
>>>>     result = rte_mempool_get(
>>>>         cea_main_lcore_conf[lcore_id].de_conf.cmd_pool,
>>>>         (VOID_P *) &new_work);
>>>>     // debug print, on my server it should never happen but with
>>>>     // multi-thread happens always on the last logical core!!!!
>>>>     if (((uint64_t)(new_work)) < 0x7f0000000000)
>>>>         printf("Result %d, lcore di partenza %u, lcore di ricezione %u, "
>>>>                "pointer %p\n",
>>>>                result, rte_lcore_id(), lcore_id, new_work);
>> Here, checking the value of new_work looks wrong to me, before
>> ensuring that result == 0. At least, new_work should be set to
>> NULL before calling rte_mempool_get().
> I put the check after result == 0, and just before the rte_mempool_get() I
> set new_work to NULL, but nothing changed.
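Just to be sure we are talking about the same pattern, the order of checks
Olivier describes can be sketched as follows. This is plain C with stand-ins,
not DPDK code: toy_pool, toy_pool_get and get_checked are hypothetical names
that only mimic the 0-on-success convention of rte_mempool_get().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, not DPDK API. */
struct cmd { int command; };
struct toy_pool { struct cmd *items[4]; size_t count; };

/* Returns 0 and stores an object in *obj, or -1 when the pool is empty. */
static int toy_pool_get(struct toy_pool *p, void **obj)
{
    if (p->count == 0)
        return -1;              /* *obj is left untouched on failure */
    *obj = p->items[--p->count];
    return 0;
}

/* Start from NULL and look at the pointer only after the call succeeded. */
static struct cmd *get_checked(struct toy_pool *p)
{
    void *obj = NULL;

    assert(p != NULL);
    if (toy_pool_get(p, &obj) != 0)
        return NULL;            /* failure: never inspect or use obj */
    return obj;
}
```

If the bad pointer still shows up with this ordering, as you report, the
caller is probably not the culprit and the pool contents themselves are being
damaged.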
> The first time something goes wrong the print is
>
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 635, pointer
> 0x880002
>
> Sorry for the Italian in the print :) It means that the application is
> sending a message from logical core 1 to logical core 2, it's the 635th
> time, the result is 0 and the pointer is 0x880002, while all pointers before
> were 0x7ffxxxxxx.
> One strange thing is that this always happens from logical core 1 to
> logical core 2 when the counter is 635!!! (Sending messages from 2 to 1,
> 1 to 1, or 2 to 2 is all OK.)
> Another strange thing is that the pointers from counter 636 to 640 are NULL,
> and from 641 they are good again, as you can see below (I attached the
> result of a test without the "if" check on the value of new_work, and only
> for messages from lcore 1 to lcore 2)
>
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 627, pointer
> 0x7ffe8a261880
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 628, pointer
> 0x7ffe8a261900
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 629, pointer
> 0x7ffe8a261980
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 630, pointer
> 0x7ffe8a261a00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 631, pointer
> 0x7ffe8a261a80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 632, pointer
> 0x7ffe8a261b00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 633, pointer
> 0x7ffe8a261b80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 634, pointer
> 0x7ffe8a261c00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 635, pointer
> 0x880002
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 636, pointer
> (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 637, pointer
> (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 638, pointer
> (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 639, pointer
> (nil)
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 640, pointer
> (nil)
This sure does seem like a memory overwrite problem, with maybe a memset(0) in
the mix as well. Have you tried using hardware breakpoints (watchpoints) to
catch 0x880002 or 0x00 being written into this range?
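If watchpoints are awkward to set up, a cheap sanity check at the call site
can at least flag the first bad pointer. The sketch below is plain C, not DPDK
API: obj_looks_valid, its parameters, and the idea that the pool's objects sit
in one contiguous region are assumptions for illustration (a real mempool can
span several memory chunks). It replaces the magic 0x7f0000000000 comparison
with an explicit range and alignment test. The good pointers in your log are
all multiples of 0x80, so that is a plausible alignment to test against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of DPDK): returns true when obj lies inside
 * [base, base + len) and is aligned to `align` bytes.
 * `align` must be a power of two.
 */
static bool obj_looks_valid(const void *obj, const void *base, size_t len,
                            size_t align)
{
    uintptr_t p = (uintptr_t)obj;
    uintptr_t lo = (uintptr_t)base;

    assert(align != 0 && (align & (align - 1)) == 0);

    if (p < lo || p - lo >= len)
        return false;               /* outside the pool's memory */
    return (p & (align - 1)) == 0;  /* misaligned pointers are suspect */
}
```

A value such as 0x880002 fails both tests (out of range and not even 8-byte
aligned), and NULL fails the range test, so logging the first failure gives a
point in time to work backwards from.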
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 641, pointer
> 0x7ffe8a262b00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 642, pointer
> 0x7ffe8a262b80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 643, pointer
> 0x7ffe8a262d00
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 644, pointer
> 0x7ffe8a262d80
> Result 0, lcore di partenza 1, lcore di ricezione 2, counter 645, pointer
> 0x7ffe8a262e00
>
>>
>>>>     if (result == 0)
>>>>     {
>>>>         // usage of the memory gotten from the mempool...
>>>>         // <<<<<- here is where the application crashes!!!!
>>>>         new_work->command = command;
>> Do you know why it crashes? Is it that new_work is NULL?
> The pointer is not NULL, but it does not follow the sequence of the others
> (0x880002, as written earlier in this email). It seems to point to a memory
> region outside the DPDK hugepages or something similar.
> If I use this pointer the application crashes.
>>
>> Can you check how the mempool is initialized? It should be multi-consumer
>> and, depending on your use case, single- or multi-producer.
> Here is the initialization of this mempool
>
> cea_main_cmd_pool[i] = rte_mempool_create(pool_name,
> (unsigned int) (ikco_cmd_buffers - 1), // 65536 - 1 in this case
> sizeof (CEA_DECODE_CMD_T), // 24 bytes
> 0, 0,
> rte_pktmbuf_pool_init, NULL,
> rte_pktmbuf_init, NULL,
> rte_socket_id(), 0);
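One thing in this call looks suspicious to me: rte_pktmbuf_pool_init and
rte_pktmbuf_init are the initializers meant for mbuf pools. As far as I
recall, rte_pktmbuf_init clears and fills in a whole struct rte_mbuf (two
cache lines, 128 bytes on x86) in every object, and both callbacks expect
pktmbuf private data in the pool, which is 0 bytes here. With 24-byte elements
that would write well past each object. Please verify this against the 17.11
sources. The toy model below is plain C, not DPDK code: clobbered_neighbours,
the back-to-back slot layout and the 128-byte footprint are assumptions chosen
only to show the effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOTS 8u   /* number of slots in the toy arena */

/*
 * Hypothetical model, not DPDK: slots of elt_size bytes are laid out back to
 * back, and an initializer zeroes init_footprint bytes starting at slot 0.
 * Returns how many of the other slots had their first byte overwritten.
 */
static size_t clobbered_neighbours(size_t elt_size, size_t init_footprint)
{
    unsigned char arena[1024];
    size_t i, hit = 0;

    assert(elt_size > 0 && SLOTS * elt_size <= sizeof(arena));
    assert(init_footprint <= sizeof(arena));

    memset(arena, 0xAB, sizeof(arena));  /* mark every byte as live data */
    memset(arena, 0, init_footprint);    /* the initializer runs on slot 0 */

    for (i = 1; i < SLOTS; i++) {
        if (arena[i * elt_size] == 0)    /* start of slot i was zeroed */
            hit++;
    }
    return hit;
}
```

In this model, clobbered_neighbours(24, 128) reports that the initializer for
one slot zeroes the start of the following five slots, while a footprint equal
to the element size damages none. For a pool of plain structs, passing NULL
for the four init arguments of rte_mempool_create() avoids running any
initializer at all.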
>>
>> Another thing that could be checked: at all the places where you
>> return your work object to the mempool, you should add a check
>> that it is not NULL. Or just enabling RTE_LIBRTE_MEMPOOL_DEBUG
>> could do the trick: it adds some additional checks when doing
>> mempool operations.
> I think I have already answered this point with the prints earlier in this
> email.
>
> What do you think about this behaviour?
>
> Regards,
> Matteo
>>
>>>>         // enqueues the gotten buffer on the rings of all lcores
>>>>         result = rte_ring_enqueue(
>>>>             cea_main_lcore_conf[lcore_id].de_conf.cmd_ring,
>>>>             (VOID_P) new_work);
>>>>         // check on result value ...
>>>> }
>>>> else
>>>> {
>>>> // do something if result != 0 ...
>>>> }
>>>> }
>>>>
>>>> This code worked perfectly (never had an issue) on dpdk-2.2.0, while if
>>>> I use more than 1 thread doing these operations on dpdk-17.11, after
>>>> some time the "new_work" pointer is not a good one, and the
>>>> application crashes when using that pointer.
>>>>
>>>> It seems that these lines cannot be used by more than one thread
>>>> simultaneously. I also used many 2017 and 2018 dpdk versions without
>>>> success.
>>>>
>>>> Is this code supported on the new dpdk versions? Or do I have to change my
>>>> application so that this code is called just by one lcore at a time?
>> Assuming the mempool is properly initialized, I don't see any reason
>> why it would not work. There have been a lot of changes in mempool between
>> dpdk-2.2.0 and dpdk-17.11, but this behavior should remain the same.
>>
>> If the comments above do not help to solve the issue, it could be helpful
>> to try to reproduce the issue in a minimal program, so we can help to
>> review it.
>>
>> Regards,
>> Olivier
Regards,
Keith