Hi Anatoly,

Thank you very much for the useful explanations!

Thanks,
Asaf

-----Original Message-----
From: Burakov, Anatoly <anatoly.bura...@intel.com>
Sent: Monday, December 10, 2018 12:10 PM
To: Asaf Sinai <asa...@radware.com>; Ilya Maximets <i.maxim...@samsung.com>;
    Hemant Agrawal <hemant.agra...@nxp.com>; dev@dpdk.org;
    Thomas Monjalon <tho...@monjalon.net>
Cc: Ilia Ferdman <il...@radware.com>; Sasha Hodos <sas...@radware.com>
Subject: Re: [dpdk-dev] CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES: no difference in
    memory pool allocations, when enabling/disabling this configuration

On 09-Dec-18 8:14 AM, Asaf Sinai wrote:
> Hi all,
>
> Thanks for the detailed explanations!
>
> So, what we understood from that, is the following (please correct, if it is
> wrong):
> Before 18.05 version:
> - Dividing huge pages between NUMAs was based, by default, on Linux good will.
> - Enforcing Linux to divide huge pages between NUMAs, required enabling
>   configuration option "CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES".
> - The enforcement was done via "libnuma" library.
>
> From 18.05 version:
> - The mentioned configuration option is ignored, so that by default, all huge
>   pages are allocated on NUMA 0.
> - if "libnuma" library exists in system, then huge pages will be divided
>   between NUMAs, without any special configuration.
> - The above is relevant to architectures that support NUMA, e.g. X86 (which
>   we use).
>
> Thanks,
> Asaf

Hi Asaf,

Before 18.05, the above description is correct.

Since 18.05, it's not _quite_ like that. There are two memory modes in 18.05 -
default and legacy. Legacy mode pretty much behaves like pre-18.05 code.
Default memory mode without the CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES option
should, for all intents and purposes, be considered unsupported for post-18.05
code, and libnuma should be considered to be a hard dependency for non-legacy,
NUMA-aware code.
Without this option, EAL will disallow allocations on sockets other than 0,
but on a NUMA-enabled system, you won't necessarily get memory from socket 0 -
it will *say* it is on socket 0, but it may not *actually* be the case,
because without libnuma we do not check where it was allocated.

The reason for the above behavior is simple: legacy mem mode preallocates all
memory in advance. This gives us an opportunity to figure out page socket
affinity at initialization, and not worry about it afterwards. Non-legacy mode
doesn't have the luxury of preallocating all memory in advance. Instead, we
allocate memory on the fly - which means that whenever an allocation is
requested, we need memory not just anywhere (like in legacy init case), but
located on a specific socket - we cannot "sort it out later" like we do with
legacy mem. Without libnuma, we cannot get this functionality.

>
> -----Original Message-----
> From: Ilya Maximets <i.maxim...@samsung.com>
> Sent: Tuesday, November 27, 2018 06:50 PM
> To: Burakov, Anatoly <anatoly.bura...@intel.com>; Hemant Agrawal
> <hemant.agra...@nxp.com>; Asaf Sinai <asa...@radware.com>;
> dev@dpdk.org; Thomas Monjalon <tho...@monjalon.net>
> Cc: Ilia Ferdman <il...@radware.com>; Sasha Hodos <sas...@radware.com>
> Subject: Re: [dpdk-dev] CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES: no
> difference in memory pool allocations, when enabling/disabling this
> configuration
>
> On 27.11.2018 13:33, Burakov, Anatoly wrote:
>> On 27-Nov-18 10:26 AM, Hemant Agrawal wrote:
>>>
>>> On 11/26/2018 8:55 PM, Asaf Sinai wrote:
>>>> +CC Ilia & Sasha.
>>>>
>>>> -----Original Message-----
>>>> From: Burakov, Anatoly <anatoly.bura...@intel.com>
>>>> Sent: Monday, November 26, 2018 04:57 PM
>>>> To: Ilya Maximets <i.maxim...@samsung.com>; Asaf Sinai
>>>> <asa...@radware.com>; dev@dpdk.org; Thomas Monjalon
>>>> <tho...@monjalon.net>
>>>> Subject: Re: [dpdk-dev] CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES: no
>>>> difference in memory pool allocations, when enabling/disabling this
>>>> configuration
>>>>
>>>> On 26-Nov-18 2:32 PM, Ilya Maximets wrote:
>>>>> On 26.11.2018 17:21, Burakov, Anatoly wrote:
>>>>>> On 26-Nov-18 2:10 PM, Ilya Maximets wrote:
>>>>>>> On 26.11.2018 16:42, Burakov, Anatoly wrote:
>>>>>>>> On 26-Nov-18 1:20 PM, Ilya Maximets wrote:
>>>>>>>>> On 26.11.2018 16:16, Ilya Maximets wrote:
>>>>>>>>>> On 26.11.2018 15:50, Burakov, Anatoly wrote:
>>>>>>>>>>> On 26-Nov-18 11:43 AM, Burakov, Anatoly wrote:
>>>>>>>>>>>> On 26-Nov-18 11:33 AM, Asaf Sinai wrote:
>>>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>>>
>>>>>>>>>>>>> We did not check it with "testpmd", only with our application.
>>>>>>>>>>>>> From the beginning, we did not enable this configuration
>>>>>>>>>>>>> (look at attached files), and everything works fine.
>>>>>>>>>>>>> Of course we rebuild DPDK, when we change configuration.
>>>>>>>>>>>>> Please note that we use DPDK 17.11.3, maybe this is why it works
>>>>>>>>>>>>> fine?
>>>>>>>>>>>> Just tested with DPDK 17.11, and yes, it does work the way you are
>>>>>>>>>>>> describing. This is not intended behavior. I will look into it.
>>>>>>>>>>>>
>>>>>>>>>>> +CC author of commit introducing
>>>>>>>>>>> CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES.
>>>>>>>>>>>
>>>>>>>>>>> Looking at the code, i think this config option needs to be
>>>>>>>>>>> reworked and we should clarify what we mean by this option. It
>>>>>>>>>>> appears that i've misunderstood what this option was actually
>>>>>>>>>>> intended to do, and i also think its naming could be improved
>>>>>>>>>>> because it's confusing and misleading.
>>>>>>>>>>>
>>>>>>>>>>> In 17.11, this option does *not* prevent EAL from using NUMA - it
>>>>>>>>>>> merely disables using libnuma to perform memory allocation. This
>>>>>>>>>>> looks like intended (if counter-intuitive) behavior - disabling
>>>>>>>>>>> this option will simply revert DPDK to working as it did before
>>>>>>>>>>> this option was introduced (i.e. best-effort allocation). This is
>>>>>>>>>>> why your code still works - because EAL still does allocate memory
>>>>>>>>>>> on socket 1, and *knows* that it's socket 1 memory. It still
>>>>>>>>>>> supports NUMA.
>>>>>>>>>>>
>>>>>>>>>>> The commit message for these changes states that the actual purpose
>>>>>>>>>>> of this option is to enable "balanced" hugepage allocation. In case
>>>>>>>>>>> of cgroups limitations, previously, DPDK would've exhausted all
>>>>>>>>>>> hugepages on master core's socket before attempting to allocate
>>>>>>>>>>> from other sockets, but by the time we've reached cgroups limits on
>>>>>>>>>>> numbers of hugepages, we might not have reached socket 1 and thus
>>>>>>>>>>> missed out on the pages we could've allocated, but didn't. Using
>>>>>>>>>>> libnuma solves this issue, because now we can allocate pages on
>>>>>>>>>>> sockets we want, instead of hoping we won't run out of hugepages
>>>>>>>>>>> before we get the memory we need.
>>>>>>>>>>>
>>>>>>>>>>> In 18.05 onwards, this option works differently (and arguably
>>>>>>>>>>> wrong). More specifically, it disallows allocations on sockets
>>>>>>>>>>> other than 0, and it also makes it so that EAL does not check which
>>>>>>>>>>> socket the memory *actually* came from. So, not only allocating
>>>>>>>>>>> memory from socket 1 is disabled, but allocating from socket 0 may
>>>>>>>>>>> even get you memory from socket 1!
>>>>>>>>>> I'd consider this as a bug.
>>>>>>>>>>
>>>>>>>>>>> +CC Thomas
>>>>>>>>>>>
>>>>>>>>>>> The CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES option is a misnomer,
>>>>>>>>>>> because it makes it seem like this option disables NUMA support,
>>>>>>>>>>> which is not the case.
>>>>>>>>>>>
>>>>>>>>>>> I would also argue that it is not relevant to 18.05+ memory
>>>>>>>>>>> subsystem, and should only work in legacy mode, because it is
>>>>>>>>>>> *impossible* to make it work right in the new memory subsystem, and
>>>>>>>>>>> here's why:
>>>>>>>>>>>
>>>>>>>>>>> Without libnuma, we have no way of "asking" the kernel to allocate
>>>>>>>>>>> a hugepage on a specific socket - instead, any allocation will most
>>>>>>>>>>> likely happen on the socket from which the allocation came. For
>>>>>>>>>>> example, if user program's lcore is on socket 1, allocation on
>>>>>>>>>>> socket 0 will actually allocate a page on socket 1.
>>>>>>>>>>>
>>>>>>>>>>> If we don't check for page's NUMA node affinity (which is what
>>>>>>>>>>> currently happens) - we get performance degradation because we may
>>>>>>>>>>> unintentionally allocate memory on wrong NUMA node. If we do check
>>>>>>>>>>> for this - then allocation of memory on socket 1 from lcore on
>>>>>>>>>>> socket 0 will almost never succeed, because kernel will always give
>>>>>>>>>>> us pages on socket 0.
>>>>>>>>>>>
>>>>>>>>>>> Put simply, there is no sane way to make this option work for
>>>>>>>>>>> the new memory subsystem - IMO it should be dropped, and libnuma
>>>>>>>>>>> should be made a hard dependency on Linux.
>>>>>>>>>> I agree that new memory model could not work without libnuma,
>>>>>>>>>> i.e. will lead to unpredictable memory allocations with no
>>>>>>>>>> respect to requested socket_id's. I also agree that
>>>>>>>>>> CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES is only sane for a legacy memory
>>>>>>>>>> model.
>>>>>>>>>> It looks like we have no other choice than just drop the
>>>>>>>>>> option and make the code unconditional, i.e. have hard dependency on
>>>>>>>>>> libnuma.
>>>>>>>>>>
>>>>>>>>> We, probably, could compile this code and have hard dependency
>>>>>>>>> only for platforms with 'RTE_MAX_NUMA_NODES > 1'.
>>>>>>>> Well, as long as legacy mode stays supported, we have to keep the
>>>>>>>> option. The "drop" part was referring to supporting it under the new
>>>>>>>> memory system, not a literal drop from config files.
>>>>>>> The option was introduced because we didn't want to introduce
>>>>>>> the new hard dependency. Since we'll have it anyway, I'm not
>>>>>>> sure if keeping the option for legacy mode makes any sense.
>>>>>> Oh yes, you're right. Drop it is!
>>>>>>
>>>>>>>> As for using RTE_MAX_NUMA_NODES, i don't think it's merited.
>>>>>>>> Distributions cannot deliver different DPDK versions based on the
>>>>>>>> number of sockets on a particular machine - so it would have to be a
>>>>>>>> hard dependency for distributions anyway (does any distribution ship
>>>>>>>> DPDK without libnuma?).
>>>>>>> At least ARMv7 builds commonly do not ship libnuma package.
>>>>>> Do you mean libnuma builds for ARMv7 are not available? Or do you mean
>>>>>> the libnuma package is not installed by default?
>>>>>>
>>>>>> If it's the latter, then i believe it's not installed by default
>>>>>> anywhere, but if using distribution version of DPDK, libnuma will be
>>>>>> taken care of via package manager. Presumably building from source can
>>>>>> be taken care of with pkg-config/meson.
>>>>>>
>>>>>> Or do you mean ARMv7 does not have libnuma for their arch at all, in any
>>>>>> distro?
>>>>> libnuma builds for ARMv7 are not available in most of the distros.
>>>>> I didn't check all, but here are the results for Ubuntu:
>>>>>
>>>>> https://packages.ubuntu.com/search?suite=bionic&arch=armhf&searchon=names&keywords=libnuma
>>>>>
>>>>> You may see that Ubuntu 18.04 (bionic) has no libnuma package for
>>>>> 'armhf' and also 'powerpc' platforms.
>>>>>
>>>> That's a difficulty. Do these platforms support NUMA? In other words,
>>>> could we replace this flag with just outright disabling NUMA support?
>>>
>>> Many platforms don't support NUMA, so they don't really need libnuma.
>>>
>>> Mandating libnuma will also break several things:
>>>
>>> - cross build for ARM on x86 - which is among the preferred
>>>   method for build by many in ARM community.
>>>
>>> - many of the embedded SoCs are without NUMA support, they use
>>>   smaller rootfs (e.g. Yocto). It will be a burden to add libnuma there.
>>>
>>
>> OK, point taken.
>>
>> So, the alternative would be to have the ability to outright disable NUMA
>> support (either with a new option, or reworking this one - i would prefer a
>> new one, since this one is confusingly named). Meaning, report all cores as
>> socket 0, report all hardware as socket 0, report all memory as socket 0 and
>> never care about NUMA nodes anywhere.
>>
>> Would that work? E.g. by default, make libnuma a hard dependency on x86
>> Linux (but allow to disable it), but disable it everywhere else?
>
> I think, you may just rename the RTE_EAL_NUMA_AWARE_HUGEPAGES to something
> like RTE_EAL_NUMA_SUPPORT and keep all the defaults as is, i.e.
> * globally disabled
> * enabled for linux
> * disabled for armv7a, dpaa, dpaa2 and stingray.
> Meson could handle everything dynamically.
>
>>>
>>>>
>>>>>>>> For those compiling from source - are there any supported
>>>>>>>> distributions which don't package libnuma? I don't see much
>>>>>>>> sense in keeping libnuma optional, IMO. This is of course up to
>>>>>>>> the tech board to decide, but the "without libnuma it's
>>>>>>>> basically broken" argument is very strong in my opinion :)
>>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>> --
>> Thanks,
>> Anatoly
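
The placement problem discussed in this thread can be condensed into a toy
model. The sketch below is plain C and calls neither DPDK nor libnuma. Every
name in it (struct page, kernel_place, alloc_unchecked, alloc_checked,
alloc_bound) is hypothetical, and the "kernel" is simply assumed to put an
unbound page on the caller's socket, which is the behavior the thread
describes for allocations made without libnuma. It is an illustration of the
argument, not the EAL implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Result of one simulated page allocation (hypothetical type). */
struct page {
    bool ok;       /* did the allocation succeed? */
    int reported;  /* socket the allocator claims the page is on */
    int actual;    /* socket the page really landed on */
};

/*
 * Toy kernel (assumption of this sketch): if no socket is bound
 * (bound_socket < 0), the page lands on the caller's socket.
 */
static int kernel_place(int caller_socket, int bound_socket)
{
    return bound_socket >= 0 ? bound_socket : caller_socket;
}

/* No binding, no check: the page is labelled with the requested socket. */
static struct page alloc_unchecked(int requested, int caller)
{
    struct page p = { true, requested, kernel_place(caller, -1) };
    return p;
}

/* No binding, but placement is verified: fails when caller != requested. */
static struct page alloc_checked(int requested, int caller)
{
    int actual = kernel_place(caller, -1);
    struct page p = { actual == requested, requested, actual };
    return p;
}

/* With a binding step (the role libnuma plays in the discussion). */
static struct page alloc_bound(int requested, int caller)
{
    struct page p = { true, requested, kernel_place(caller, requested) };
    assert(p.actual == p.reported); /* label and placement agree */
    return p;
}
```

In this model, alloc_unchecked(0, 1) returns a page that reports socket 0 but
sits on socket 1, alloc_checked(1, 0) fails outright, and only
alloc_bound(1, 0) both succeeds and lands on the requested socket.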