Sourabh Jain <sourabhj...@linux.ibm.com> writes:
> On 01/02/22 17:14, Michael Ellerman wrote:
>> Sourabh Jain <sourabhj...@linux.ibm.com> writes:
>>> On large config LPARs (having 192 and more cores), Linux fails to boot
>>> due to insufficient memory in the first memblock. It is due to the
>>> memory reservation for the crash kernel, which starts at a 128MB offset
>>> into the first memblock. This memory reservation for the crash kernel
>>> doesn't leave enough space in the first memblock to accommodate other
>>> essential system resources.
>>>
>>> The crash kernel start address was set to a 128MB offset by default to
>>> ensure that the crash kernel gets some memory below the RMA region,
>>> which used to be 256MB in size. But given that the RMA region size can
>>> be 512MB or more, setting the crash kernel offset to the middle of the
>>> RMA leaves enough space for the kernel to allocate memory for other
>>> system resources.
>>>
>>> Since the above crash kernel offset change is only applicable to the
>>> LPAR platform, the LPAR feature detection is pushed before the crash
>>> kernel reservation. The rest of the LPAR-specific initialization will
>>> still be done during pseries_probe_fw_features as usual.
>>>
>>> Signed-off-by: Sourabh Jain <sourabhj...@linux.ibm.com>
>>> Reported-and-tested-by: Abdul haleem <abdha...@linux.vnet.ibm.com>
>>>
>>> ---
>>>  arch/powerpc/kernel/rtas.c |  4 ++++
>>>  arch/powerpc/kexec/core.c  | 15 +++++++++++----
>>>  2 files changed, 15 insertions(+), 4 deletions(-)
>>>
>>> ---
>>> Change in v3:
>>>   Dropped the 1st and 2nd patch from v2. The 1st and 2nd patch from the
>>>   v2 patch series [1] tried to discover 1T segment MMU feature support
>>>   BEFORE boot CPU paca allocation ([1] describes why it is needed).
>>>   MPE has posted a patch [2] that achieves a similar objective by moving
>>>   boot CPU paca allocation after mmu_early_init_devtree().
>>>
>>>   NOTE: This patch is dependent on the patch [2].
>>> >>> [1] >>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20211018084434.217772-3-sourabhj...@linux.ibm.com/ >>> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html >>> --- >>> >>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c >>> index 733e6ef36758..06df7464fb57 100644 >>> --- a/arch/powerpc/kernel/rtas.c >>> +++ b/arch/powerpc/kernel/rtas.c >>> @@ -1313,6 +1313,10 @@ int __init early_init_dt_scan_rtas(unsigned long >>> node, >>> entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL); >>> sizep = of_get_flat_dt_prop(node, "rtas-size", NULL); >>> >>> + /* need this feature to decide the crashkernel offset */ >>> + if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL)) >>> + powerpc_firmware_features |= FW_FEATURE_LPAR; >>> + >> As you'd have seen this breaks the 32-bit build. It will need an #ifdef >> CONFIG_PPC64 around it. >> >>> if (basep && entryp && sizep) { >>> rtas.base = *basep; >>> rtas.entry = *entryp; >>> diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c >>> index 8b68d9f91a03..abf5897ae88c 100644 >>> --- a/arch/powerpc/kexec/core.c >>> +++ b/arch/powerpc/kexec/core.c >>> @@ -134,11 +134,18 @@ void __init reserve_crashkernel(void) >>> if (!crashk_res.start) { >>> #ifdef CONFIG_PPC64 >>> /* >>> - * On 64bit we split the RMO in half but cap it at half of >>> - * a small SLB (128MB) since the crash kernel needs to place >>> - * itself and some stacks to be in the first segment. >>> + * On the LPAR platform place the crash kernel to mid of >>> + * RMA size (512MB or more) to ensure the crash kernel >>> + * gets enough space to place itself and some stack to be >>> + * in the first segment. At the same time normal kernel >>> + * also get enough space to allocate memory for essential >>> + * system resource in the first segment. Keep the crash >>> + * kernel starts at 128MB offset on other platforms. 
>>>  		 */
>>> -		crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
>>> +		if (firmware_has_feature(FW_FEATURE_LPAR))
>>> +			crashk_res.start = ppc64_rma_size / 2;
>>> +		else
>>> +			crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
>>
>> I think this will break on machines using Radix won't it? At this point
>> in boot ppc64_rma_size will be == 0. Because we won't call into
>> hash__setup_initial_memory_limit().
>>
>> That's not changed by your patch, but seems like this code needs to be
>> more careful/clever.
>
> Interesting, but in my testing, I found that ppc64_rma_size
> did get initialized before reserve_crashkernel() using radix on LPAR.
>
> I am not sure why, but the hash__setup_initial_memory_limit() function
> gets called regardless of radix or hash. Not sure whether it is by
> design, but here is the flow:
It sort of is by design. See:

  103a8542cb35 ("powerpc/book3s64/radix: Fix boot failure with large amount of guest memory")

Basically the hash restrictions are more strict, so we apply them until
we know we will use radix.

But ...

> setup_initial_memory_limit()
>
>     static inline void setup_initial_memory_limit()
>     (arch/powerpc/include/asm/book3s/64/mmu.h)
>
>         if (!early_radix_enabled())    // FALSE regardless of radix is
>                                        // enabled or not

You mean early_radix_enabled() is False regardless.

But that's not true in all cases. We can now build the kernel without
hash MMU support at all, see:

  387e220a2e5e ("powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU")

In which case early_radix_enabled() will be true here, because it's hard
coded to be true at build time.

>             hash__setup_initial_memory_limit()    // initialize ppc64_rma_size
>
> reserve_crashkernel()    // initialize crashkernel offset to mid of RMA size
>
> For the sake of understanding, even if we restrict setting the
> crashkernel offset to mid RMA (i.e. ppc64_rma_size/2) for only hash, it
> may not save radix, because even today we are assigning the crashkernel
> offset using the ppc64_rma_size variable.

Yes. There's already a bug there, your patch doesn't make it better or
worse.

> Is the current flow of initializing the ppc64_rma_size variable before
> reserve_crashkernel() for radix expected?
>
> Please provide your input.

I wonder if we're better off moving the crash kernel reservation later,
once we've discovered what MMU we're using.

I can't immediately see why that would be a problem, as long as we do
the reservation before we do any (many?) allocations.

I'll have to think about it a bit more though, these boot ordering
things are always subtle.

For now I think this patch is OK if you send a v2 to fix the compile
error.

cheers