On 03/02/22 16:37, Michael Ellerman wrote:
Sourabh Jain <sourabhj...@linux.ibm.com> writes:
On 01/02/22 17:14, Michael Ellerman wrote:
Sourabh Jain <sourabhj...@linux.ibm.com> writes:
On large config LPARs (having 192 or more cores), Linux fails to boot
due to insufficient memory in the first memblock. This is caused by the
memory reservation for the crash kernel, which starts at a 128MB offset
into the first memblock. That reservation doesn't leave enough space in
the first memblock to accommodate other essential system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region,
which used to be 256MB in size. But given that the RMA region can now
be 512MB or more, setting the crash kernel offset to the middle of the
RMA leaves enough space for the kernel to allocate memory for other
system resources.
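For illustration (sizes assumed, not from the patch): with
ppc64_rma_size = 512MB, the old calculation min(128MB, 512MB / 2)
starts the reservation at 128MB, so early allocations must fit below
128MB; the new calculation 512MB / 2 starts it at 256MB, doubling the
room left for the kernel.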
Since the above crash kernel offset change only applies to the LPAR
platform, LPAR feature detection is moved ahead of the crash kernel
reservation. The rest of the LPAR-specific initialization is still
done during pseries_probe_fw_features() as usual.
Signed-off-by: Sourabh Jain <sourabhj...@linux.ibm.com>
Reported-and-tested-by: Abdul haleem <abdha...@linux.vnet.ibm.com>
---
arch/powerpc/kernel/rtas.c | 4 ++++
arch/powerpc/kexec/core.c | 15 +++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)
---
Changes in v3:
Dropped the 1st and 2nd patches from v2. The 1st and 2nd patches of the
v2 series [1] tried to discover 1T segment MMU feature support
BEFORE boot CPU paca allocation ([1] describes why it is needed).
MPE has posted a patch [2] that achieves a similar objective by moving
boot CPU paca allocation after mmu_early_init_devtree().
NOTE: This patch depends on patch [2].
[1]
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20211018084434.217772-3-sourabhj...@linux.ibm.com/
[2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html
---
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 733e6ef36758..06df7464fb57 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1313,6 +1313,10 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep = of_get_flat_dt_prop(node, "rtas-size", NULL);
+ /* need this feature to decide the crashkernel offset */
+ if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+ powerpc_firmware_features |= FW_FEATURE_LPAR;
+
As you'd have seen, this breaks the 32-bit build. It will need an
#ifdef CONFIG_PPC64 around it.
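Something like this, perhaps (an untested sketch of the guarded hunk):

	/* need this feature to decide the crashkernel offset */
#ifdef CONFIG_PPC64
	if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
		powerpc_firmware_features |= FW_FEATURE_LPAR;
#endif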
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 8b68d9f91a03..abf5897ae88c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -134,11 +134,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
#ifdef CONFIG_PPC64
/*
- * On 64bit we split the RMO in half but cap it at half of
- * a small SLB (128MB) since the crash kernel needs to place
- * itself and some stacks to be in the first segment.
+ * On the LPAR platform, place the crash kernel at the middle of
+ * the RMA (512MB or more) so that the crash kernel gets enough
+ * space in the first segment to place itself and some stacks,
+ * while the normal kernel also gets enough space in the first
+ * segment to allocate memory for essential system resources.
+ * Keep the crash kernel start at the 128MB offset on other
+ * platforms.
*/
- crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+ if (firmware_has_feature(FW_FEATURE_LPAR))
+ crashk_res.start = ppc64_rma_size / 2;
+ else
+ crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
I think this will break on machines using Radix, won't it? At this
point in boot ppc64_rma_size will be == 0, because we won't call into
hash__setup_initial_memory_limit().
That's not changed by your patch, but it seems like this code needs to
be more careful/clever.
Interesting, but in my testing I found that ppc64_rma_size did get
initialized before reserve_crashkernel() when using radix on an LPAR.
I am not sure why, but hash__setup_initial_memory_limit() gets called
regardless of radix or hash. I'm not sure whether that is by design,
but here is the flow:
It sort of is by design. See:
103a8542cb35 ("powerpc/book3s64/radix: Fix boot failure with large amount of
guest memory")
Basically the hash restrictions are more strict, so we apply them until
we know we will use radix.
But ...
setup_initial_memory_limit()    // static inline, arch/powerpc/include/asm/book3s/64/mmu.h
    if (!early_radix_enabled()) // FALSE regardless of whether radix is enabled or not
You mean early_radix_enabled() is False regardless. But that's not true
in all cases.
We can now build the kernel without hash MMU support at all, see:
387e220a2e5e ("powerpc/64s: Move hash MMU support code under
CONFIG_PPC_64S_HASH_MMU")
In which case early_radix_enabled() will be true here, because it's hard
coded to be true at build time.
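Roughly, the mechanism is (a simplified sketch, not the exact mmu.h
code):

/* arch/powerpc/include/asm/mmu.h, simplified */
#ifdef CONFIG_PPC_64S_HASH_MMU
/*
 * Both MMUs are built in: the answer depends on the MMU features
 * parsed from the device tree, so very early in boot it can still
 * read as false even on radix systems.
 */
#define early_radix_enabled()	early_mmu_has_feature(MMU_FTR_TYPE_RADIX)
#else
/*
 * Hash support is compiled out: radix is the only choice, so the
 * feature test constant-folds to true at build time.
 */
#define early_radix_enabled()	true
#endif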
Sorry, my bad, I was not aware of that. But it is confusing to me
that, when both hash and radix are enabled, early_radix_enabled()
returns FALSE regardless of whether radix is disabled on the kernel
command line or not. Maybe it is too early in the boot sequence...
        hash__setup_initial_memory_limit()  // initializes ppc64_rma_size
reserve_crashkernel()   // initializes crashkernel offset to mid of RMA size
For the sake of understanding: even if we restrict the crashkernel
offset setting to mid-RMA (i.e. ppc64_rma_size / 2) to hash only, it
may not save radix, because even today we assign the crashkernel
offset using the ppc64_rma_size variable.
Yes. There's already a bug there, your patch doesn't make it better or worse.
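Concretely: if ppc64_rma_size is still 0 when reserve_crashkernel()
runs, min(0x8000000ULL, 0 / 2) evaluates to 0, so crashk_res.start
ends up at address zero on both the old and the new code paths.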
Is the current flow, where the ppc64_rma_size variable gets
initialized before reserve_crashkernel() even on radix, expected?
Please provide your input.
I wonder if we're better off moving the crash kernel reservation later,
once we've discovered what MMU we're using.
I can't immediately see why that would be a problem, as long as we do
the reservation before we do any (many?) allocations. I'll have to think
about it a bit more though, these boot ordering things are always
subtle.
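For concreteness, the reordering might look something like this in
early_init_devtree() (arch/powerpc/kernel/prom.c); an untested sketch
that assumes the current call order:

	early_reserve_mem();
-	reserve_crashkernel();		/* today: ppc64_rma_size may still hold
					   the conservative hash limit, or 0 */
	...
	mmu_early_init_devtree();	/* decides hash vs radix */
+	reserve_crashkernel();		/* reserve once the MMU is known */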
Agreed, there is room to improve this piece of code. Let me know if I
can help you make this better.
For now I think this patch is OK if you send a v2 to fix the compile
error.
I sent the next version to the mailing list. Thanks for the support.
- Sourabh Jain