Hi all. I wanted to ask, just to clarify the current status: will this patch be included in the current release (it is acked by the maintainer), with me upgrading it to the hybrid logic afterwards, or should I just prepare a v3 with the hybrid logic for 17.08?
Best regards, Ilya Maximets.

On 27.03.2017 17:43, Ilya Maximets wrote:
> On 27.03.2017 16:01, Sergio Gonzalez Monroy wrote:
>> On 09/03/2017 12:57, Ilya Maximets wrote:
>>> On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
>>>> Hi Ilya,
>>>>
>>>> I have done similar tests and, as you already pointed out, 'numactl
>>>> --interleave' does not seem to work as expected.
>>>> I have also checked that the issue can be reproduced with a quota limit
>>>> on the hugetlbfs mount point.
>>>>
>>>> I would be inclined towards *adding libnuma as a dependency* to DPDK to
>>>> make memory allocation a bit more reliable.
>>>>
>>>> Currently, at a high level, hugepages per numa node are handled like this:
>>>> 1) Try to map all free hugepages. The total number of mapped hugepages
>>>> depends on whether there were any limits, such as cgroups or a quota on
>>>> the mount point.
>>>> 2) Find out the numa node of each hugepage.
>>>> 3) Check if we have enough hugepages for the requested memory in each
>>>> numa socket/node.
>>>>
>>>> Using libnuma, we could try to allocate hugepages per numa node:
>>>> 1) Try to map as many hugepages as possible from numa 0.
>>>> 2) Check if we have enough hugepages for the requested memory in numa 0.
>>>> 3) Try to map as many hugepages as possible from numa 1.
>>>> 4) Check if we have enough hugepages for the requested memory in numa 1.
>>>>
>>>> This approach would improve failing scenarios caused by limits, but it
>>>> would still not fix issues regarding non-contiguous hugepages (worst
>>>> case, each hugepage is a memseg).
>>>> The non-contiguous hugepage issues are not as critical now that mempools
>>>> can span multiple memsegs/hugepages, but it is still a problem for any
>>>> other library requiring big chunks of memory.
>>>>
>>>> Potentially, if we were to add an option such as 'iommu-only' for when
>>>> all devices are bound to vfio-pci, we could have a reliable way to
>>>> allocate hugepages by just requesting the number of pages from each
>>>> numa node.
>>>>
>>>> Thoughts?
>>> Hi Sergio,
>>>
>>> Thanks for your attention to this.
>>>
>>> For now, as we have some issues with non-contiguous
>>> hugepages, I'm thinking about the following hybrid schema:
>>> 1) Allocate essential hugepages:
>>>    1.1) Allocate only as many hugepages from numa N as needed
>>>         to fit the requested memory for this numa node.
>>>    1.2) Repeat 1.1 for all numa nodes.
>>> 2) Try to map all remaining free hugepages in a round-robin
>>>    fashion, like in this patch.
>>> 3) Sort the pages and choose the most suitable ones.
>>>
>>> This solution should decrease the number of issues connected with
>>> non-contiguous memory.
>>
>> Sorry for the late reply; I was hoping for more comments from the community.
>>
>> IMHO this should be the default behavior, which means no config option and
>> libnuma as an EAL dependency.
>> I think your proposal is good; could you consider implementing such an
>> approach in the next release?
>
> Sure, I can implement this for the 17.08 release.
>
>>>
>>>> On 06/03/2017 09:34, Ilya Maximets wrote:
>>>>> Hi all.
>>>>>
>>>>> So, what about this change?
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>>
>>>>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>>>>> Currently, EAL allocates hugepages one by one, not paying
>>>>>> attention to which NUMA node the allocation was done from.
>>>>>>
>>>>>> Such behaviour leads to allocation failure if the number of
>>>>>> hugepages available to the application is limited by cgroups
>>>>>> or hugetlbfs and memory is requested not only from the first
>>>>>> socket.
>>>>>>
>>>>>> Example:
>>>>>> # 90 x 1GB hugepages available in the system
>>>>>>
>>>>>> cgcreate -g hugetlb:/test
>>>>>> # Limit to 32GB of hugepages
>>>>>> cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>>>>> # Request 4GB from each of 2 sockets
>>>>>> cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>>>>
>>>>>> EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>>>>> EAL: 32 not 90 hugepages of size 1024 MB allocated
>>>>>> EAL: Not enough memory available on socket 1!
>>>>>>      Requested: 4096MB, available: 0MB
>>>>>> PANIC in rte_eal_init():
>>>>>> Cannot init memory
>>>>>>
>>>>>> This happens because all allocated pages are on socket 0.
>>>>>>
>>>>>> Fix this issue by setting the mempolicy MPOL_PREFERRED for each
>>>>>> hugepage to one of the requested nodes in a round-robin fashion.
>>>>>> In this case all allocated pages will be fairly distributed
>>>>>> between all requested nodes.
>>>>>>
>>>>>> A new config option, RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES, is
>>>>>> introduced and disabled by default because of the external
>>>>>> dependency on libnuma.
>>>>>>
>>>>>> Cc: <sta...@dpdk.org>
>>>>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>>>>
>>>>>> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
>>>>>> ---
>>>>>>  config/common_base                       |  1 +
>>>>>>  lib/librte_eal/Makefile                  |  4 ++
>>>>>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>>>>  mk/rte.app.mk                            |  3 ++
>>>>>>  4 files changed, 74 insertions(+)
>>
>> Acked-by: Sergio Gonzalez Monroy <sergio.gonzalez.mon...@intel.com>
>
> Thanks.
>
> Best regards, Ilya Maximets.
>
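
[Editor's note] For readers not familiar with the mechanism discussed above, the technique both in the commit message and in the per-node/hybrid proposals boils down to biasing each hugepage mapping toward a chosen NUMA node before the page is faulted in. Below is a minimal standalone sketch of that idea using libnuma's set_mempolicy() wrapper; it is not the DPDK patch itself. The file name, the helper map_hugepages_round_robin(), the /dev/hugepages mount point, the 1GB page size, and the example node list in main() are assumptions made purely for illustration.

/*
 * hp_round_robin.c - illustrative sketch only, NOT the actual DPDK patch.
 *
 * Prefer a different NUMA node (round-robin over the requested nodes)
 * for each hugepage mapping by setting MPOL_PREFERRED before the page
 * is faulted in.  Build with:  gcc -Wall hp_round_robin.c -lnuma
 *
 * Assumptions: a hugetlbfs mount at /dev/hugepages, 1GB pages,
 * node numbers below 64.
 */
#include <fcntl.h>
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SZ (1024ULL * 1024 * 1024)  /* 1GB pages */

static int map_hugepages_round_robin(const int *nodes, int n_nodes, int n_pages)
{
    int i, mapped = 0;

    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return -1;
    }

    for (i = 0; i < n_pages; i++) {
        int node = nodes[i % n_nodes];
        unsigned long nodemask = 1UL << node;
        char path[64];
        void *va;
        int fd;

        /* Prefer (but do not force) allocation from the chosen node. */
        if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                          sizeof(nodemask) * 8) < 0) {
            perror("set_mempolicy");
            break;
        }

        snprintf(path, sizeof(path), "/dev/hugepages/rr_page_%d", i);
        fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            break;
        }

        va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
        close(fd);
        if (va == MAP_FAILED) {
            /* Out of free pages, or a cgroup/quota limit was hit. */
            unlink(path);
            break;
        }

        /* Touch the page so it is actually allocated under the policy. */
        memset(va, 0, HUGEPAGE_SZ);
        mapped++;
    }

    /* Restore the default policy for the rest of the process. */
    set_mempolicy(MPOL_DEFAULT, NULL, 0);
    return mapped;
}

int main(void)
{
    int nodes[] = { 0, 1 };  /* sockets requested, e.g. via --socket-mem */
    int n;

    /* Try to spread 8 x 1GB pages fairly across nodes 0 and 1. */
    n = map_hugepages_round_robin(nodes, 2, 8);
    printf("mapped %d hugepages\n", n);
    return 0;
}

The hybrid schema described earlier would wrap a loop like this one: first map only as many pages per node as that node's requested memory needs, then map the remaining free pages round-robin and sort/select the most suitable ones.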