Sourabh Jain <sourabhj...@linux.ibm.com> writes:
> On 01/02/22 17:14, Michael Ellerman wrote:
>> Sourabh Jain <sourabhj...@linux.ibm.com> writes:
>>> On large-configuration LPARs (192 cores or more), Linux fails to boot
>>> due to insufficient memory in the first memblock. The cause is the
>>> memory reservation for the crash kernel, which starts at a 128MB offset
>>> into the first memblock and doesn't leave enough space there to
>>> accommodate other essential system resources.
>>>
>>> The crash kernel start address was set to a 128MB offset by default to
>>> ensure that the crash kernel gets some memory below the RMA region,
>>> which used to be 256MB in size. But given that the RMA region can now
>>> be 512MB or more, setting the crash kernel offset to the middle of the
>>> RMA leaves enough space for the kernel to allocate memory for other
>>> system resources.
>>>
>>> Since the above crash kernel offset change is only applicable to the
>>> LPAR platform, the LPAR feature detection is moved ahead of the crash
>>> kernel reservation. The rest of the LPAR-specific initialization will
>>> still be done during pseries_probe_fw_features as usual.
>>>
>>> Signed-off-by: Sourabh Jain <sourabhj...@linux.ibm.com>
>>> Reported-and-tested-by: Abdul haleem <abdha...@linux.vnet.ibm.com>
>>>
>>> ---
>>>   arch/powerpc/kernel/rtas.c |  4 ++++
>>>   arch/powerpc/kexec/core.c  | 15 +++++++++++----
>>>   2 files changed, 15 insertions(+), 4 deletions(-)
>>>
>>>   ---
>>>   Changes in v3:
>>>     Dropped the 1st and 2nd patches from v2. Those patches in the v2
>>>     series [1] tried to discover 1T segment MMU feature support
>>>     BEFORE the boot CPU paca allocation ([1] describes why it is needed).
>>>     MPE has posted a patch [2] that achieves a similar objective by moving
>>>     the boot CPU paca allocation after mmu_early_init_devtree().
>>>
>>> NOTE: This patch is dependent on the patch [2].
>>>
>>> [1] https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20211018084434.217772-3-sourabhj...@linux.ibm.com/
>>> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html
>>>   ---
>>>
>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>> index 733e6ef36758..06df7464fb57 100644
>>> --- a/arch/powerpc/kernel/rtas.c
>>> +++ b/arch/powerpc/kernel/rtas.c
>>> @@ -1313,6 +1313,10 @@ int __init early_init_dt_scan_rtas(unsigned long node,
>>>     entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
>>>     sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
>>>   
>>> +   /* need this feature to decide the crashkernel offset */
>>> +   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
>>> +           powerpc_firmware_features |= FW_FEATURE_LPAR;
>>> +
>> As you'd have seen, this breaks the 32-bit build. It will need an #ifdef
>> CONFIG_PPC64 around it.
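
ie. something like this (untested sketch, exact guard placement up to you):

	/* need this feature to decide the crashkernel offset */
#ifdef CONFIG_PPC64
	if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
		powerpc_firmware_features |= FW_FEATURE_LPAR;
#endif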
>>
>>>     if (basep && entryp && sizep) {
>>>             rtas.base = *basep;
>>>             rtas.entry = *entryp;
>>> diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
>>> index 8b68d9f91a03..abf5897ae88c 100644
>>> --- a/arch/powerpc/kexec/core.c
>>> +++ b/arch/powerpc/kexec/core.c
>>> @@ -134,11 +134,18 @@ void __init reserve_crashkernel(void)
>>>     if (!crashk_res.start) {
>>>   #ifdef CONFIG_PPC64
>>>             /*
>>> -            * On 64bit we split the RMO in half but cap it at half of
>>> -            * a small SLB (128MB) since the crash kernel needs to place
>>> -            * itself and some stacks to be in the first segment.
>>> +            * On the LPAR platform place the crash kernel at the middle
>>> +            * of the RMA size (512MB or more) to ensure the crash kernel
>>> +            * gets enough space to place itself and some stacks in the
>>> +            * first segment. At the same time the normal kernel also
>>> +            * gets enough space to allocate memory for essential system
>>> +            * resources in the first segment. Keep the crash kernel at
>>> +            * the 128MB offset on other platforms.
>>>              */
>>> -           crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
>>> +           if (firmware_has_feature(FW_FEATURE_LPAR))
>>> +                   crashk_res.start = ppc64_rma_size / 2;
>>> +           else
>>> +                   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
>> I think this will break on machines using Radix, won't it? At this point
>> in boot ppc64_rma_size will be == 0. Because we won't call into
>> hash__setup_initial_memory_limit().
>>
>> That's not changed by your patch, but seems like this code needs to be
>> more careful/clever.
>
> Interesting, but in my testing I found that ppc64_rma_size did get
> initialized before reserve_crashkernel() when using radix on an LPAR.
>
> I am not sure why, but hash__setup_initial_memory_limit() gets called
> regardless of radix or hash. Not sure whether it is by design, but here
> is the flow:

It sort of is by design. See:

  103a8542cb35 ("powerpc/book3s64/radix: Fix boot failure with large amount of guest memory")

Basically the hash restrictions are more strict, so we apply them until
we know we will use radix.
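
ie. the 64s setup_initial_memory_limit() is roughly this (paraphrased from
arch/powerpc/include/asm/book3s/64/mmu.h, comment reworded):

	static inline void setup_initial_memory_limit(phys_addr_t first_memblock_base,
						      phys_addr_t first_memblock_size)
	{
		/*
		 * Hash has stricter limits, and at this point we don't yet
		 * know which MMU we'll end up using, so apply the hash
		 * restrictions unless radix is already known to be enabled.
		 */
		if (!early_radix_enabled())
			hash__setup_initial_memory_limit(first_memblock_base,
							 first_memblock_size);
	}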

But ...

> setup_initial_memory_limit()
>
>       static inline void setup_initial_memory_limit()  (arch/powerpc/include/asm/book3s/64/mmu.h)
>
>              if (!early_radix_enabled())  // FALSE regardless of whether radix is enabled or not

You mean early_radix_enabled() is False regardless. But that's not true
in all cases.

We can now build the kernel without hash MMU support at all, see:

  387e220a2e5e ("powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU")

In which case early_radix_enabled() will be true here, because it's hard
coded to be true at build time.
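
From memory the mechanics are roughly this (paraphrased, not the exact
source):

	static inline bool early_radix_enabled(void)
	{
		return early_mmu_has_feature(MMU_FTR_TYPE_RADIX);
	}

	static __always_inline bool early_mmu_has_feature(unsigned long feature)
	{
		/*
		 * With CONFIG_PPC_64S_HASH_MMU=n, MMU_FTR_TYPE_RADIX is part
		 * of the always-present feature mask, so this folds to true
		 * at compile time.
		 */
		if (MMU_FTRS_ALWAYS & feature)
			return true;

		return !!(MMU_FTRS_POSSIBLE & cur_cpu_spec->mmu_features & feature);
	}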

>                  hash__setup_initial_memory_limit()  // initialize ppc64_rma_size
>
>       reserve_crashkernel()  // initialize crashkernel offset to mid of RMA size.
>
>
> For the sake of understanding: even if we restrict setting the
> crashkernel offset to mid RMA (i.e. ppc64_rma_size/2) to hash only, it
> may not save radix, because even today we are assigning the crashkernel
> offset using the ppc64_rma_size variable.

Yes. There's already a bug there; your patch doesn't make it better or worse.

> Is the current flow of initializing the ppc64_rma_size variable before
> reserve_crashkernel() expected for radix?
>
> Please provide your input.

I wonder if we're better off moving the crash kernel reservation later,
once we've discovered what MMU we're using.

I can't immediately see why that would be a problem, as long as we do
the reservation before we do any (many?) allocations. I'll have to think
about it a bit more though; these boot ordering things are always
subtle.

For now I think this patch is OK if you send a v2 to fix the compile
error.

cheers
