On 08.03.2017 16:46, Sergio Gonzalez Monroy wrote:
> Hi Ilya,
>
> I have done similar tests and, as you already pointed out, 'numactl
> --interleave' does not seem to work as expected.
> I have also checked that the issue can be reproduced with a quota limit
> on the hugetlbfs mount point.
>
> I would be inclined towards *adding libnuma as a dependency* to DPDK to
> make memory allocation a bit more reliable.
>
> Currently, at a high level, hugepages per NUMA node are handled as follows:
> 1) Try to map all free hugepages. The total number of mapped hugepages
>    depends on whether there are any limits, such as cgroups or a quota
>    on the mount point.
> 2) Find out the NUMA node of each hugepage.
> 3) Check if we have enough hugepages for the requested memory on each
>    NUMA socket/node.
>
> Using libnuma we could instead try to allocate hugepages per NUMA node:
> 1) Try to map as many hugepages as possible from NUMA 0.
> 2) Check if we have enough hugepages for the requested memory on NUMA 0.
> 3) Try to map as many hugepages as possible from NUMA 1.
> 4) Check if we have enough hugepages for the requested memory on NUMA 1.
>
> This approach would improve the failure scenarios caused by limits, but
> it would still not fix the issues with non-contiguous hugepages (worst
> case: each hugepage is a memseg).
> The non-contiguous hugepage issues are not as critical now that mempools
> can span multiple memsegs/hugepages, but they are still a problem for
> any other library requiring big chunks of memory.
>
> Potentially, if we were to add an option such as 'iommu-only' for when
> all devices are bound to vfio-pci, we could have a reliable way to
> allocate hugepages by just requesting the number of pages from each
> NUMA node.
>
> Thoughts?
Hi Sergio,

Thanks for your attention to this.

For now, as we have some issues with non-contiguous hugepages, I'm
thinking about the following hybrid scheme:
1) Allocate the essential hugepages:
   1.1) Allocate only as many hugepages from NUMA node N as needed to
        fit the memory requested for this node.
   1.2) Repeat 1.1 for all NUMA nodes.
2) Try to map all remaining free hugepages in a round-robin fashion,
   as in this patch.
3) Sort the pages and choose the most suitable ones.

This solution should decrease the number of issues connected with
non-contiguous memory.

Best regards, Ilya Maximets.

> On 06/03/2017 09:34, Ilya Maximets wrote:
>> Hi all.
>>
>> So, what about this change?
>>
>> Best regards, Ilya Maximets.
>>
>> On 16.02.2017 16:01, Ilya Maximets wrote:
>>> Currently EAL allocates hugepages one by one, not paying
>>> attention to the NUMA node the allocation was done from.
>>>
>>> Such behaviour leads to allocation failure if the number of
>>> hugepages available to the application is limited by cgroups
>>> or hugetlbfs and memory is requested not only from the first
>>> socket.
>>>
>>> Example:
>>> # 90 x 1GB hugepages available in the system
>>>
>>> cgcreate -g hugetlb:/test
>>> # Limit to 32GB of hugepages
>>> cgset -r hugetlb.1GB.limit_in_bytes=34359738368 test
>>> # Request 4GB from each of 2 sockets
>>> cgexec -g hugetlb:test testpmd --socket-mem=4096,4096 ...
>>>
>>> EAL: SIGBUS: Cannot mmap more hugepages of size 1024 MB
>>> EAL: 32 not 90 hugepages of size 1024 MB allocated
>>> EAL: Not enough memory available on socket 1!
>>>      Requested: 4096MB, available: 0MB
>>> PANIC in rte_eal_init():
>>> Cannot init memory
>>>
>>> This happens because all allocated pages are
>>> on socket 0.
>>>
>>> Fix this issue by setting the mempolicy MPOL_PREFERRED for each
>>> hugepage to one of the requested nodes in a round-robin fashion.
>>> In this case all allocated pages will be fairly distributed
>>> between all requested nodes.
>>>
>>> A new config option, RTE_LIBRTE_EAL_NUMA_AWARE_HUGEPAGES, is
>>> introduced and disabled by default because of the external
>>> dependency on libnuma.
>>>
>>> Cc: <sta...@dpdk.org>
>>> Fixes: 77988fc08dc5 ("mem: fix allocating all free hugepages")
>>>
>>> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
>>> ---
>>>  config/common_base                       |  1 +
>>>  lib/librte_eal/Makefile                  |  4 ++
>>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 66 ++++++++++++++++++++++++++++++++
>>>  mk/rte.app.mk                            |  3 ++
>>>  4 files changed, 74 insertions(+)