On 12/4/25 18:03, Shuah Khan wrote:
On 12/3/25 23:35, Mike Rapoport wrote:
On Thu, Dec 04, 2025 at 07:17:06AM +0100, David Hildenbrand (Red Hat) wrote:
Hi,
On 12/4/25 03:33, Shuah Khan wrote:
This reverts commit 39231e8d6ba7f794b566fd91ebd88c0834a23b98.
That was supposed to fix powerpc handling though. So I think we have to
understand what is happening here.
This patch changes include/linux/mm.h and mm/Kconfig in addition to
arch/powerpc/Kconfig and arch/powerpc/platforms/Kconfig.cputype.
With this patch, HAVE_GIGANTIC_FOLIOS is enabled on x86_64 configs.
The following mm/Kconfig change isn't arch-specific, so it is not
powerpc-specific and ends up enabled on x86_64:
Yes, and as the patch explains that's expected. See below.
+#
+# We can end up creating gigantic folios.
+#
+config HAVE_GIGANTIC_FOLIOS
+ def_bool (HUGETLB_PAGE && ARCH_HAS_GIGANTIC_PAGE) || \
+ (ZONE_DEVICE && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+
The following change in include/linux/mm.h is also generic
and applies to x86_64 as well.
-#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
+#if !defined(CONFIG_HAVE_GIGANTIC_FOLIOS)
Is this intended to apply to all architectures?
All expected. See below.
Enabling HAVE_GIGANTIC_FOLIOS broke the kernel build and git clone on two
systems. git fetch-pack fails when cloning large repos, and make hangs
or errors out in Makefile.build with Error 139 (i.e., SIGSEGV). These
failures are random: git clone fails after fetching 1% of the objects,
and make hangs while compiling random files.
On which architecture do we see these issues and with which kernel configs?
Can you share one?
Config attached.
Okay, let's walk this through. The config has:
CONFIG_HAVE_GIGANTIC_FOLIOS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_ZONE_DEVICE=y
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_VMEMMAP=y
So HAVE_GIGANTIC_FOLIOS gets selected via HUGETLB_PAGE &&
ARCH_HAS_GIGANTIC_PAGE. In the old code:
#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
/*
* We don't expect any folios that exceed buddy sizes (and consequently
* memory sections).
*/
#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
/*
* Only pages within a single memory section are guaranteed to be
* contiguous. By limiting folios to a single memory section, all folio
* pages are guaranteed to be contiguous.
*/
#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
#else
/*
* There is no real limit on the folio size. We limit them to the maximum we
* currently expect (e.g., hugetlb, dax).
*/
#define MAX_FOLIO_ORDER PUD_ORDER
#endif
We would get MAX_FOLIO_ORDER = PUD_ORDER = 18 (a PUD maps 1 GiB with 4 KiB
base pages: order 30 - 12 = 18).
In the new code we will get:
#if !defined(CONFIG_HAVE_GIGANTIC_FOLIOS)
/*
* We don't expect any folios that exceed buddy sizes (and consequently
* memory sections).
*/
#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
/*
* Only pages within a single memory section are guaranteed to be
* contiguous. By limiting folios to a single memory section, all folio
* pages are guaranteed to be contiguous.
*/
#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
#elif defined(CONFIG_HUGETLB_PAGE)
/*
* There is no real limit on the folio size. We limit them to the maximum we
* currently expect (see CONFIG_HAVE_GIGANTIC_FOLIOS): with hugetlb, we expect
* no folios larger than 16 GiB on 64bit and 1 GiB on 32bit.
*/
#define MAX_FOLIO_ORDER get_order(IS_ENABLED(CONFIG_64BIT) ? SZ_16G : SZ_1G)
#else
/*
* Without hugetlb, gigantic folios that are bigger than a single PUD are
* currently impossible.
*/
#define MAX_FOLIO_ORDER PUD_ORDER
#endif
MAX_FOLIO_ORDER = get_order(SZ_16G) = 34 - 12 = 22
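For reference, a minimal userspace sketch of that arithmetic (not kernel
code; order_for_size() is a made-up stand-in that mirrors the kernel's
get_order() for sizes of at least one page, assuming x86_64 with
PAGE_SHIFT = 12):

#include <stdio.h>

#define PAGE_SHIFT 12 /* 4 KiB base pages on x86_64 */

/* Smallest order such that (1 << order) pages cover @size. */
static int order_for_size(unsigned long long size)
{
	int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

int main(void)
{
	printf("PUD_ORDER         = %d\n", order_for_size(1ULL << 30)); /* 18 */
	printf("get_order(SZ_16G) = %d\n", order_for_size(1ULL << 34)); /* 22 */
	return 0;
}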
That's expected and okay (raising the maximum we expect), as we only want to
set a rough upper cap on the maximum folio size.
As I raised, observe how MAX_FOLIO_ORDER is only used to (sketched below):
* trigger warnings if we observe an unexpectedly large folio size (safety
  checks)
* detect possible folio corruption on unexpected folio sizes when dumping a
  folio
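To illustrate the shape of those checks (a sketch only; the function name is
made up and these are not the exact kernel call sites):

/*
 * MAX_FOLIO_ORDER only feeds defensive checks like this one; it never
 * drives allocation decisions, so raising it can only make the warning
 * fire less often.
 */
static inline void folio_order_sanity_check(unsigned int order)
{
	WARN_ON_ONCE(order > MAX_FOLIO_ORDER);
}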
Below is one of the git clone failures:
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux_6.19
Cloning into 'linux_6.19'...
remote: Enumerating objects: 11173575, done.
remote: Counting objects: 100% (785/785), done.
remote: Compressing objects: 100% (373/373), done.
remote: Total 11173575 (delta 534), reused 505 (delta 411), pack-reused 11172790 (from 1)
Receiving objects: 100% (11173575/11173575), 3.00 GiB | 7.08 MiB/s, done.
Resolving deltas: 100% (9195212/9195212), done.
fatal: did not receive expected object 0002003e951b5057c16de5a39140abcbf6e44e50
fatal: fetch-pack: invalid index-pack output
If I had to guess, these symptoms match what we saw between commit
adfb6609c680 ("mm/huge_memory: initialise the tags of the huge zero folio")
and commit 5bebe8de1926 ("mm/huge_memory: Fix initialization of huge zero
folio").
5bebe8de1926 went into v6.18-rc7.
Just to be sure: did you actually reproduce this issue with v6.18-rc7, or
even v6.18, which contain 5bebe8de1926?
Bisecting might give you wrong results, as the problems of adfb6609c680 do not
reproduce reliably.
I can confirm that bisecting gives odd results between v6.18-rc5 and
v6.18-rc6. I was seeing failures in some tests, bisected a few times and
got a bunch of bogus commits including 3470715e5c22 ("MAINTAINERS: update
David Hildenbrand's email address") :)
I am sure this patch is the cause of the problems I have seen on my two
systems. Reverting this commit solved the issues, since this commit impacts
all architectures, enabling HAVE_GIGANTIC_FOLIOS when the conditions are
right.
And 5bebe8de1926 actually solved the issue for me.
Were you seeing the problems I reported without 5bebe8de1926?
Is 5bebe8de1926 in 6.18?
We were seeing all kinds of different segmentation faults and corruptions.
In my case, every time I tried to log in, something would segfault. For
others, compilers stopped working or they got different random segfaults.
Assume you think you have a shared zero page, but every time you reboot it's
filled with other garbage data. Not good when your app assumes something
contains zeros.
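For illustration, a small userspace check of the invariant that broke back
then (a hypothetical test, not taken from the report): untouched anonymous
mappings must read back as zeros, and with a corrupted huge zero folio such
checks fail at random across the system.

#include <assert.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20; /* 2 MiB, a typical THP size */
	char *p = mmap(NULL, len, PROT_READ,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	assert(p != MAP_FAILED);
	for (size_t i = 0; i < len; i++)
		assert(p[i] == 0); /* the kernel guarantees zero-fill */
	munmap(p, len);
	return 0;
}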
I can try this commit together with 39231e8d6ba7f794b566fd91ebd88c0834a23b98
and see what happens on my system.
Yes, please. I cannot yet make sense of how MAX_FOLIO_ORDER would make any
difference.
Unless you were actually seeing one of the WARNINGs that are based on
MAX_FOLIO_ORDER / MAX_FOLIO_NR_PAGES. But I guess none showed up in dmesg?
--
Cheers
David