that for safety on some AMD CPUs, this relies on recent commit
> 86e6b1547b3d ("x86: fix user address masking non-canonical speculation
> issue").
>
> Link: https://lore.kernel.org/202410281344.d02c72a2-oliver.s...@intel.com
> Signed-off-by: Josh Poimboeuf
Acked-by: Kirill A. Shutemov
--
Kiryl Shutsemau / Kirill A. Shutemov
for safety).
Okay, fair enough.
--
Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Oct 16, 2024 at 03:34:11PM -0700, Linus Torvalds wrote:
> On Wed, 16 Oct 2024 at 15:13, Kirill A. Shutemov wrote:
> >
> > It is worse than that. If we get LAM_SUP enabled (there's KASAN patchset
> > in works) this check will allow arbitrary kernel addresses.
On Tue, Oct 22, 2024 at 01:16:58AM -0700, Pawan Gupta wrote:
> On Mon, Oct 21, 2024 at 01:48:15PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Oct 20, 2024 at 03:59:25PM -0700, Linus Torvalds wrote:
> > > On Sun, 20 Oct 2024 at 15:44, Josh Poimboeuf wrote:
> > > >
On Mon, Oct 21, 2024 at 07:36:50PM -0700, Linus Torvalds wrote:
> On Mon, 21 Oct 2024 at 03:48, Kirill A. Shutemov wrote:
> >
> > LAM brings own speculation issues[1] that is going to be addressed by
> > LASS[2]. There was a patch[3] to disable LAM until LASS is landed,
but it never got applied for some reason.
[1] https://download.vusec.net/papers/slam_sp24.pdf
[2]
https://lore.kernel.org/all/20240710160655.3402786-1-alexander.shish...@linux.intel.com
[3]
https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gu...@linux.intel.com
--
Kiryl Shutsemau / Kirill A. Shutemov
ut_user_2)
> EXPORT_SYMBOL(__put_user_2)
This patch provides an opportunity to give these labels more meaningful
names, so that future rearrangements do not require as much boilerplate.
For example, we can rename this label 2: to .Luser_2 or something similar.
--
Kiryl Shutsemau / Kirill A. Shutemov
semantics, does it?
>
> Consider userspace passing an otherwise-good pointer with bit 60 set.
> Previously that would have resulted in a failure, whereas now it will
> succeed.
It is worse than that. If we get LAM_SUP enabled (there's a KASAN patchset
in the works), this check will allow arbitrary kernel addresses.
--
Kiryl Shutsemau / Kirill A. Shutemov
e ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
+#define SHIFT_LEFT_TO_MSB ALTERNATIVE \
+ "shl $(64 - 48), %rdx", \
+ "shl $(64 - 57), %rdx", X86_FEATURE_LA57
+
.macro check_range size:req
.if IS_ENABLED(CONFIG_X86_64)
mov %rax, %rdx
+ SHIFT_LEFT_TO_MSB
sar $63, %rdx
or %rdx, %rax
.else
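For reference, here is a C sketch of what this masking does with 4-level
paging (the helper name is made up, purely for illustration):

	static inline unsigned long mask_noncanonical(unsigned long addr)
	{
		/* Shift bit 47 (bit 56 with LA57) up into bit 63, then use an
		 * arithmetic shift to replicate it across the whole word. */
		long mask = (long)(addr << (64 - 48)) >> 63;

		/* Addresses with that bit set become all-ones and are
		 * guaranteed to fault; user addresses pass through unchanged. */
		return addr | mask;
	}

With X86_FEATURE_LA57 the shift count becomes 64 - 57, so bit 56 is the one
that gets replicated, matching the ALTERNATIVE above.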
--
Kiryl Shutsemau / Kirill A. Shutemov
's out there that actually
> have LAM enabled.
Actually LAM is fine with the __VIRTUAL_MASK_SHIFT check. LAM enforces bit
47 (or bit 56 for 5-level paging) to be equal to bit 63. Otherwise it is a
canonicality violation.
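In other words (an illustrative pseudo-helper, not actual kernel code):

	/* A LAM-tagged pointer is only canonical if the topmost implemented
	 * address bit matches bit 63; vaddr_bits is 48 or 57. */
	static inline bool lam_canonical(unsigned long addr, unsigned int vaddr_bits)
	{
		return ((addr >> (vaddr_bits - 1)) & 1) == (addr >> 63);
	}

So a user pointer with metadata in the ignored bits still has to keep bit 47
(or bit 56) clear, and the __VIRTUAL_MASK_SHIFT-based check keeps working.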
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 05, 2024 at 10:26:52AM -0700, Charlie Jenkins wrote:
> On Thu, Sep 05, 2024 at 09:47:47AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Aug 29, 2024 at 12:15:57AM -0700, Charlie Jenkins wrote:
> > > Some applications rely on placing data in free bits addresses alloc
e got tested on
x86 with 47-bit VA.
We can consider more options to opt in to the wider address space, like a
personality or a prctl() handle. But opt-out is a no-go from what I see.
--
Kiryl Shutsemau / Kirill A. Shutemov
et = 0;
return vm_unmapped_area(&info);
}
--
Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Jul 25, 2023 at 01:51:55PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 25, 2023 at 01:24:03PM +0300, Kirill A . Shutemov wrote:
> > On Tue, Jul 18, 2023 at 04:44:53PM -0700, Sean Christopherson wrote:
> > > diff --git a/mm/compaction.c b/mm/compaction.c
the
mapping is still tied to the folio).
Vlastimil, any comments?
--
Kiryl Shutsemau / Kirill A. Shutemov
arch/m68k/Kconfig.cpu | 16 +---
> arch/nios2/Kconfig| 17 +
> arch/powerpc/Kconfig | 22 +-
> arch/sh/mm/Kconfig| 19 +--
> arch/sparc/Kconfig| 16 +++++---
> arch/xtensa/Kconfig | 16 +---
> 10 files changed, 76 insertions(+), 80 deletions(-)
Acked-by: Kirill A. Shutemov
--
Kiryl Shutsemau / Kirill A. Shutemov
On Thu, Sep 23, 2021 at 08:21:03PM +0200, Borislav Petkov wrote:
> On Thu, Sep 23, 2021 at 12:05:58AM +0300, Kirill A. Shutemov wrote:
> > Unless we find other way to guarantee RIP-relative access, we must use
> > fixup_pointer() to access any global variables.
>
> Yah, I
On Wed, Sep 22, 2021 at 09:52:07PM +0200, Borislav Petkov wrote:
> On Wed, Sep 22, 2021 at 05:30:15PM +0300, Kirill A. Shutemov wrote:
> > Not fine, but waiting to blowup with random build environment change.
>
> Why is it not fine?
>
> Are you suspecting that the co
On Wed, Sep 22, 2021 at 08:40:43AM -0500, Tom Lendacky wrote:
> On 9/21/21 4:58 PM, Kirill A. Shutemov wrote:
> > On Tue, Sep 21, 2021 at 04:43:59PM -0500, Tom Lendacky wrote:
> > > On 9/21/21 4:34 PM, Kirill A. Shutemov wrote:
> > > > On Tue, Sep 21, 2021 at 11:
On Tue, Sep 21, 2021 at 04:43:59PM -0500, Tom Lendacky wrote:
> On 9/21/21 4:34 PM, Kirill A. Shutemov wrote:
> > On Tue, Sep 21, 2021 at 11:27:17PM +0200, Borislav Petkov wrote:
> > > On Wed, Sep 22, 2021 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > >
On Tue, Sep 21, 2021 at 11:27:17PM +0200, Borislav Petkov wrote:
> On Wed, Sep 22, 2021 at 12:20:59AM +0300, Kirill A. Shutemov wrote:
> > I still believe calling cc_platform_has() from __startup_64() is totally
> > broken as it lacks proper wrapping while accessing global varia
mm/mem_encrypt_identity.c
@@ -288,7 +288,7 @@ void __init sme_encrypt_kernel(struct boot_params *bp)
unsigned long pgtable_area_len;
unsigned long decrypted_base;
- if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ if (1 || !cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
return;
/*
--
Kirill A. Shutemov
ave a special version of
the helper). Note that only AMD requires these cc_platform_has() calls to
return true.
--
Kirill A. Shutemov
On Wed, Aug 11, 2021 at 10:52:55AM -0500, Tom Lendacky wrote:
> On 8/11/21 7:19 AM, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 02:48:54PM -0500, Tom Lendacky wrote:
> >> On 8/10/21 1:45 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>
> >>>
thing with this shared/unencrypted
> area, though? Or since it is shared, there's actually nothing you need to
> do (the bss decrpyted section exists even if CONFIG_AMD_MEM_ENCRYPT is not
> configured)?
AFAICS, only kvmclock uses __bss_decrypted. We don't enable kvmclock in
TDX at the moment. It may change in the future.
--
Kirill A. Shutemov
On Tue, Jun 08, 2021 at 04:47:19PM +0530, Aneesh Kumar K.V wrote:
> On 6/8/21 3:12 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 08, 2021 at 01:22:23PM +0530, Aneesh Kumar K.V wrote:
> > >
> > > Hi Hugh,
> > >
> > > Hugh Dickins writes:
>
> and old pfn
>
> unlock(pud_ptl)
> ptep_clear_flush()
> old pfn is free.
>
> Stale TLB entry
>
> Both the above race condition can be fixed if we force mremap path to
> take rmap lock.
>
> Signed-off-by: Aneesh Kumar K.V
Looks like it should be enough to address the race.
It would be nice to understand what the performance overhead of the
additional locking is. Is it still faster to move a single PMD page table
under these locks compared to moving PTE page table entries without the locks?
--
Kirill A. Shutemov
ut you need to check it per
distro. For Debian it would be here:
https://distrowatch.com/table.php?distribution=debian
--
Kirill A. Shutemov
Not sure it's an issue, but strictly speaking, the size of a page according
to the page table tree doesn't mean the page walk will fill a TLB entry of
that size. A CPU may support 1G pages in the page table tree without having
any 1G TLB entries at all.
IIRC, current Intel CPUs still don't have any 1G iTLB entries and fill 2M
iTLB entries instead.
--
Kirill A. Shutemov
On Tue, Nov 03, 2020 at 02:13:50PM +0200, Mike Rapoport wrote:
> On Tue, Nov 03, 2020 at 02:08:16PM +0300, Kirill A. Shutemov wrote:
> > On Sun, Nov 01, 2020 at 07:08:13PM +0200, Mike Rapoport wrote:
> > > diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
arch, mm: make kernel_page_present() always available
The series looks good to me (apart from the minor nit):
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
} else {
> + debug_pagealloc_map_pages(page, 1, enable);
> + }
> +}
> +
> static int swsusp_page_is_free(struct page *);
> static void swsusp_set_page_forbidden(struct page *);
> static void swsusp_unset_page_forbidden(struct page *);
--
Kirill A. Shutemov
On Wed, Dec 18, 2019 at 02:15:53PM -0800, John Hubbard wrote:
> On 12/18/19 7:52 AM, Kirill A. Shutemov wrote:
> > On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:
> > > +static void put_compound_head(struct page *page, int refs)
> > > +{
> > > + /*
vmas arg is NULL)
> + * and return -ENOTSUPP if DAX isn't allowed in this case:
> + */
> + return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
> + vmas, gup_flags | FOLL_TOUCH |
> + FOLL_REMOTE);
> + }
>
> return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> locked,
> --
> 2.24.1
>
--
Kirill A. Shutemov
e condition would save you an indentation level.
> + int count = page_ref_dec_return(page);
> +
> + /*
> + * devmap page refcounts are 1-based, rather than 0-based: if
> + * refcount is 1, then the page is free and the refcount is
> + * stable because nobody holds a reference on the page.
> + */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +EXPORT_SYMBOL(put_devmap_managed_page);
> +#endif
> --
> 2.24.1
>
>
--
Kirill A. Shutemov
page);
> +}
It's not terribly efficient. Maybe something like:

	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
	if (refs > 1)
		page_ref_sub(page, refs - 1);
	put_page(page);

?
--
Kirill A. Shutemov
.
Yes we do. MADV_DONTNEED is used a lot by userspace memory allocators and
it would be a very noticeable performance regression if we switched it to
down_write(mmap_sem).
--
Kirill A. Shutemov
On Mon, Oct 07, 2019 at 03:51:58PM +0200, Ingo Molnar wrote:
>
> * Kirill A. Shutemov wrote:
>
> > On Mon, Oct 07, 2019 at 03:06:17PM +0200, Ingo Molnar wrote:
> > >
> > > * Anshuman Khandual wrote:
> > >
> > > > This adds a t
ritten
as inline function + define. Something like:
#define mm_p4d_folded mm_p4d_folded
static inline bool mm_p4d_folded(struct mm_struct *mm)
{
	return !pgtable_l5_enabled();
}
But I don't see much reason to be more verbose here than needed.
--
Kirill A. Shutemov
On Fri, Sep 27, 2019 at 08:40:00PM -0300, Leonardo Bras wrote:
> As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
^ typo
--
Kirill A. Shutemov
wn) in pud_clear_tests() as there were no available
> __pgd() definitions.
>
> - ARM32
> - IA64
Hm. Grep shows __pgd() definitions for both of them. Is it for a specific
config?
--
Kirill A. Shutemov
it but its just a single line. Kirill suggested this in the
> previous version. There is a generic fallback definition but s390 has it's
> own. This change overrides the generic one for x86 probably as a fix or as
> an improvement. Kirill should be able to help classify it in which case it
> can be a separate patch.
I don't think it's worth a separate patch.
--
Kirill A. Shutemov
/highmem_32.c#L34
> >>
> >> I have not checked others, but I guess it is like that for all.
> >>
> >
> >
> > Seems like I answered too quickly. All kmap_atomic() do preempt_disable(),
> > but not all pte_alloc_map() call kmap_atomic().
> >
> > However, for instance ARM does:
> >
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/arm/include/asm/pgtable.h#L200
> >
> > And X86 as well:
> >
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/x86/include/asm/pgtable_32.h#L51
> >
> > Microblaze also:
> >
> > https://elixir.bootlin.com/linux/v5.3-rc8/source/arch/microblaze/include/asm/pgtable.h#L495
>
> All the above platforms checks out to be using k[un]map_atomic(). I am
> wondering whether
> any of the intermediate levels will have similar problems on any these 32 bit
> platforms
> or any other platforms which might be using generic k[un]map_atomic().
No. The kernel only allocates PTE page tables from highmem. All other page
tables are always visible in the kernel address space.
Kirill A. Shutemov
ll code here __init (or its variants) so it
can be discarded after boot. It has no use after that.
--
Kirill A. Shutemov
ry from generic code like this test case is bit tricky. That
> >> is because there are not enough helpers to create entries with an absolute
> >> value. This would have been easier if all the platforms provided functions
> >> like __pxx() which is not the case now. Otherwise something like this
> >> should
> >> have worked.
> >>
> >>
> >> pud_t pud = READ_ONCE(*pudp);
> >> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> >> WRITE_ONCE(*pudp, pud);
> >>
> >> But __pud() will fail to build in many platforms.
> >
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> >
> > pud_val(*pudp) |= RANDOM_NZVALUE;
>
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there.
Use this instead:

	*pudp = __pud(pud_val(*pudp) | RANDOM_NZVALUE);

It *should* be more portable.
--
Kirill A. Shutemov
s(mm, pmdp, (pgtable_t) page);
- pud_populate_tests(mm, pudp, pmdp);
- p4d_populate_tests(mm, p4dp, pudp);
- pgd_populate_tests(mm, pgdp, p4dp);
+ pud_populate_tests(mm, pudp, saved_pmdp);
+ p4d_populate_tests(mm, p4dp, saved_pudp);
+ pgd_populate_tests(mm, pgdp, saved_p4dp);
p4d_free(mm, saved_p4dp);
pud_free(mm, saved_pudp);
--
Kirill A. Shutemov
table
> + * entries will be used for testing with random or garbage
> + * values. These saved addresses will be used for freeing
> + * page table pages.
> + */
> + saved_p4dp = p4d_offset(pgdp, 0UL);
> + saved_pudp = pud_offset(p4dp, 0UL);
> + saved_pmdp = pmd_offset(pudp, 0UL);
> + saved_ptep = pte_offset_map(pmdp, 0UL);
> +
> + pte_basic_tests(page, prot);
> + pmd_basic_tests(page, prot);
> + pud_basic_tests(page, prot);
> + p4d_basic_tests(page, prot);
> + pgd_basic_tests(page, prot);
> +
> + pte_clear_tests(ptep);
> + pmd_clear_tests(pmdp);
> + pud_clear_tests(pudp);
> + p4d_clear_tests(p4dp);
> + pgd_clear_tests(pgdp);
> +
> + pmd_populate_tests(mm, pmdp, (pgtable_t) page);
This is not correct for architectures that define pgtable_t as a pte_t
pointer rather than a struct page pointer.
> + pud_populate_tests(mm, pudp, pmdp);
> + p4d_populate_tests(mm, p4dp, pudp);
> + pgd_populate_tests(mm, pgdp, p4dp);
This is wrong. Each p?dp here points to the second entry of its page table,
which is not a valid page table pointer and triggers p?d_bad() on x86.
Use saved_p?dp instead.
> +
> + p4d_free(mm, saved_p4dp);
> + pud_free(mm, saved_pudp);
> + pmd_free(mm, saved_pmdp);
> + pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));
> +
> + mm_dec_nr_puds(mm);
> + mm_dec_nr_pmds(mm);
> + mm_dec_nr_ptes(mm);
> + __mmdrop(mm);
> +
> + free_mapped_page(page);
> + return 0;
> +}
> +
> +static void __exit arch_pgtable_tests_exit(void) { }
> +
> +module_init(arch_pgtable_tests_init);
> +module_exit(arch_pgtable_tests_exit);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Anshuman Khandual ");
> +MODULE_DESCRIPTION("Test archicture page table helpers");
> --
> 2.20.1
>
>
--
Kirill A. Shutemov
ges in that case?
>
> The problem with the transparent_hugepage/enabled interface is that it
> conflates performing compaction work to produce THP-pages with the
> ability to map huge pages at all.
That's not [entirely] true. transparent_hugepage/defrag gates heavy-duty
compaction. We only do very limited compaction if it's not advised by
transparent_hugepage/defrag.
I believe DAX has to respect transparent_hugepage/enabled, or not
advertise its huge pages as THP. It's confusing for users.
--
Kirill A. Shutemov
ocated out of /dev/dax/ or
> /dev/pmem*. Do we have a reason not to use hugepages for mapping pages in
> that case?
Yes. Like when you don't want DAX to compete for TLB entries with a
mission-critical application (one which uses hugetlb, for instance).
--
Kirill A. Shutemov
address space.
It's probably worth recommending (void *)-1 as such an address.
> .\" Before Linux 2.6.24, the address was rounded up to the next page
> .\" boundary; since 2.6.24, it is rounded down!
> The address of the new mapping is returned as the result of the call.
> --
> 2.20.1.791.gb4d0f1c61a-goog
>
--
Kirill A. Shutemov
efine OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
Have you tested it with CONFIG_X86_5LEVEL=y?
AFAICS, the patch makes OBJ_INDEX_BITS and everything that depends on it
dynamic -- it depends on what paging mode we are booting in. ZS_SIZE_CLASSES
depends indirectly on OBJ_INDEX_BITS and I don't see how the struct zs_pool
definition can compile with a dynamic ZS_SIZE_CLASSES.
Hm?
--
Kirill A. Shutemov
On Wed, Oct 24, 2018 at 07:09:07PM -0700, Joel Fernandes wrote:
> On Wed, Oct 24, 2018 at 03:57:24PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> > > On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
>
d the address
for coloring. It's not needed anymore. The page allocator and SL?B are good
enough now.
See 3c936465249f ("[SPARC64]: Kill pgtable quicklists and use SLAB.")
--
Kirill A. Shutemov
On Wed, Oct 24, 2018 at 10:57:33PM +1100, Balbir Singh wrote:
> On Wed, Oct 24, 2018 at 01:12:56PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 06:31:58PM -0700, Joel Fernandes (Google) wrote:
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index
/* Set the new pmd */
> + set_pmd_at(mm, new_addr, new_pmd, pmd);
> + if (new_ptl != old_ptl)
> + spin_unlock(new_ptl);
> + spin_unlock(old_ptl);
> +
> + *need_flush = true;
> + return true;
> + }
> + return false;
> +}
> +
--
Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:42:24PM +0100, Anton Ivanov wrote:
>
> On 10/12/18 3:48 PM, Anton Ivanov wrote:
> > On 12/10/2018 15:37, Kirill A. Shutemov wrote:
> > > On Fri, Oct 12, 2018 at 03:09:49PM +0100, Anton Ivanov wrote:
> > > > On 10/12/18 2:37
On Fri, Oct 12, 2018 at 09:57:19AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 04:19:46PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> > > On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
>
+
> > + /* Set the new pmd */
> > + set_pmd_at(mm, new_addr, new_pmd, pmd);
>
> UML does not have set_pmd_at at all
Every architecture does. :)
But it may not come from the arch code.
> If I read the code right, MIPS completely ignores the address argument so
> set_pmd_at there may not have the effect which this patch is trying to
> achieve.
Ignoring the address is fine. Most architectures do that.
The idea is to move the page table to the new pmd slot. It has nothing to do
with the address passed to set_pmd_at().
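For reference, a simplified sketch of the pmd-level move being discussed
(based on the hunk quoted above, not the exact patch):

	static bool move_normal_pmd_sketch(struct vm_area_struct *vma,
			unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
	{
		struct mm_struct *mm = vma->vm_mm;
		spinlock_t *old_ptl, *new_ptl;
		pmd_t pmd;

		old_ptl = pmd_lock(mm, old_pmd);
		new_ptl = pmd_lockptr(mm, new_pmd);
		if (new_ptl != old_ptl)
			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

		/* Detach the whole PTE page from the old slot... */
		pmd = *old_pmd;
		pmd_clear(old_pmd);

		/* ...and re-attach it at the new one. The address argument only
		 * tells the arch where the entry lives; ignoring it is fine. */
		set_pmd_at(mm, new_addr, new_pmd, pmd);

		if (new_ptl != old_ptl)
			spin_unlock(new_ptl);
		spin_unlock(old_ptl);

		return true;
	}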
--
Kirill A. Shutemov
On Fri, Oct 12, 2018 at 05:50:46AM -0700, Joel Fernandes wrote:
> On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> > On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > > Android needs to mremap large regions of memory during
On Fri, Oct 12, 2018 at 02:30:56PM +0300, Kirill A. Shutemov wrote:
> On Thu, Oct 11, 2018 at 06:37:56PM -0700, Joel Fernandes (Google) wrote:
> > @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct
> > *vma,
> > split_huge_pmd(v
continue;
> + } else if (extent == PMD_SIZE) {
Hm. What guarantees that new_addr is PMD_SIZE-aligned?
It's not obvious to me.
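Something along these lines would make the assumption explicit (an
illustrative helper, not part of the patch):

	/* The pmd-level move is only valid when both source and destination
	 * are PMD-aligned, not just when the extent covers a full PMD. */
	static inline bool can_move_pmd(unsigned long old_addr,
					unsigned long new_addr,
					unsigned long extent)
	{
		return extent == PMD_SIZE &&
		       IS_ALIGNED(old_addr, PMD_SIZE) &&
		       IS_ALIGNED(new_addr, PMD_SIZE);
	}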
--
Kirill A. Shutemov
pte_quicklist = (unsigned long *)(*ret);
> - ret[0] = 0;
> - pgtable_cache_size--;
> - }
> - return (pte_t *)ret;
> -}
> -
Ditto.
--
Kirill A. Shutemov
ss)
{
tlb_flush_pgtable(tlb, address);
- pgtable_page_dtor(table);
pgtable_free_tlb(tlb, page_address(table), 0);
}
#endif /* _ASM_POWERPC_PGALLOC_32_H */
--
Kirill A. Shutemov
ts any
address, not restricted to the 47-bit address space. It doesn't mean the
application *requires* the address to be above the 47-bit boundary.
At least on x86, -1 just shifts the upper boundary of the address range
where we can look for an unmapped area.
--
Kirill A. Shutemov
> >
> > The code was first introduced with commit( 83e3c48: mm/sparsemem:
> > Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y)
Any chance to bisect it?
Could you check if the commit just before 83e3c48729d9 is fine?
--
Kirill A. Shutemov
Worst case scenario? Like when we go far enough into the speculative code
path on every page fault and then fall back to the normal page fault?
--
Kirill A. Shutemov
n't switch to large address space if hint_addr + len > 128TB.
> > The decision to switch to large address space is primarily based on hint
> > addr
>
> But does the mmap succeed in that case or not?
>
> ie: mmap(0x7000, 0x2000, ...) = ?
It does, but the resulting address doesn't match the hint. It's somewhere
below the 47-bit border.
--
Kirill A. Shutemov
t; 2) For everything else we search in < 128TB space if hint addr is below
> 128TB
>
> 3) We don't switch to large address space if hint_addr + len > 128TB. The
> decision to switch to large address space is primarily based on hint addr
>
> Is there any other rule we need to outline? Or is any of the above not
> correct?
That's correct.
--
Kirill A. Shutemov
On Tue, Nov 07, 2017 at 02:05:42PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:44 PM, Kirill A. Shutemov wrote:
> > On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> > > On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
> > >
> > > > >
ttempts MAP_FIXED allocation
> of addr + len above 128TB might use high bits of pointer returned by
> that library because those are never satisfied today and the library
> would fall back.
If you want to point out that it's an ABI break, then yes, it is.
But we allow ABI breaks as long as nobody notices. I think it's reasonable
to expect that nobody relies on such corner cases.
If we found any piece of software affected by the change, we would need
to reconsider.
--
Kirill A. Shutemov
On Tue, Nov 07, 2017 at 12:26:12PM +0100, Florian Weimer wrote:
> On 11/07/2017 12:15 PM, Kirill A. Shutemov wrote:
>
> > > First of all, using addr and MAP_FIXED to develop our heuristic can
> > > never really give unchanged ABI. It's an in-band signal. brk()
l see
> out-of-range addresses, but I expected a full opt-out based on RLIMIT_AS
> would be sufficient for them.
Just use mmap(-1), without MAP_FIXED, to get the full address space.
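For example (illustrative userspace code):

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		/* A hint above the 47-bit boundary (here simply -1) opts this one
		 * allocation into the full address space; no MAP_FIXED needed. */
		void *p = mmap((void *)-1, 1UL << 20, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p != MAP_FAILED)
			printf("mapped at %p\n", p);
		return 0;
	}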
--
Kirill A. Shutemov
et
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.
Well, we've discussed all the options you are proposing before.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to the full address space later.
--
Kirill A. Shutemov
p: enable thp migration in generic path")
> Reported-and-tested-by: Abdul Haleem
> Signed-off-by: Zi Yan
> Cc: "Kirill A. Shutemov"
> Cc: Anshuman Khandual
> Cc: Andrew Morton
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
/*
> + * We need to re-validate the VMA after checking the bounds, otherwise
> + * we might have a false positive on the bounds.
> + */
> + if (read_seqcount_retry(&vma->vm_sequence, seq))
> + goto unlock;
> +
> + ret = handle_pte_fault(&vmf);
> +
> +unlock:
> + srcu_read_unlock(&vma_srcu, idx);
> + return ret;
> +
> +out_walk:
> + local_irq_enable();
> + goto unlock;
> +}
> +#endif /* __HAVE_ARCH_CALL_SPF */
> +
> /*
> * By the time we get here, we already hold the mm semaphore
> *
> --
> 2.7.4
>
--
Kirill A. Shutemov
On Thu, Aug 10, 2017 at 10:27:50AM +0200, Laurent Dufour wrote:
> On 10/08/2017 02:58, Kirill A. Shutemov wrote:
> > On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> >> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> >>> On Tue, Aug 08, 2017 at 04
On Wed, Aug 09, 2017 at 12:43:33PM +0200, Laurent Dufour wrote:
> On 09/08/2017 12:12, Kirill A. Shutemov wrote:
> > On Tue, Aug 08, 2017 at 04:35:38PM +0200, Laurent Dufour wrote:
> >> The VMA sequence count has been introduced to allow fast detection of
> >> VMA modif
's anywhere near a complete list of places where we touch
vm_flags. What is your plan for the rest?
--
Kirill A. Shutemov
pte, vmf->orig_pte))) {
> if (old_page) {
> if (!PageAnon(old_page)) {
--
Kirill A. Shutemov
tp://lkml.kernel.org/r/20170615145224.66200-1-kirill.shute...@linux.intel.com
--
Kirill A. Shutemov
ven't had a chance to narrow it down yet.
Please check if the patch at this link helps:
http://lkml.kernel.org/r/20170313052213.11411-1-kirill.shute...@linux.intel.com
--
Kirill A. Shutemov
Residue of an earlier implementation, perhaps? Delete it.
>
> Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
> Signed-off-by: Hugh Dickins
Sorry that I missed this initially.
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
..@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Signed-off-by: Dmitry Safonov
I've noticed this too.
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
hp_split_page 51518
> thp_split_page_failed 1
> thp_deferred_split_page 73566
> thp_split_pmd 665
> thp_zero_page_alloc 3
> thp_zero_page_alloc_failed 0
>
> Signed-off-by: Aneesh Kumar K.V
One nit-pick below, but otherwise
Acked-by: Kirill A. Shutemov
> @@ -2975,6 +3004,1
On Fri, Nov 11, 2016 at 05:42:11PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" writes:
>
> > On Mon, Nov 07, 2016 at 02:04:41PM +0530, Aneesh Kumar K.V wrote:
> >> @@ -2953,6 +2966,13 @@ static int do_set_pmd(struct fault_env *fe, struc
MEM handling?
I think we should do this way before this point. Maybe in do_fault() or
something.
--
Kirill A. Shutemov
ned-off-by: Aneesh Kumar K.V
Acked-by: Kirill A. Shutemov
--
Kirill A. Shutemov
long addr)
> {
> @@ -1359,6 +1367,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct
> vm_area_struct *vma,
> atomic_long_dec(&tlb->mm->nr_ptes);
> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> } else {
> + if (arch_needs_pgtable_deposit())
Just hide the arch_needs_pgtable_deposit() check in zap_deposited_table().
> + zap_deposited_table(tlb->mm, pmd);
> add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> }
> spin_unlock(ptl);
--
Kirill A. Shutemov
ld Schaefer wrote:
> >> On Tue, 23 Feb 2016 13:32:21 +0300
> >> "Kirill A. Shutemov" wrote:
> >> > The theory is that the splitting bit effetely masked bogus pmd_present():
> >> > we had pmd_trans_splitting() in all code path and that prevented mm
AGE_SIZE) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
page_remove_rmap(page + i, false);
put_page(page + i);
}
--
Kirill A. Shutemov
is the purpose behind the BUG_ON.
I would guess that requesting a pin on a non-reclaimable page is considered
useless, meaning suspicious behavior. BUG_ON() is overkill, I think.
WARN_ON_ONCE() would do.
Not that this follow_huge_addr() on Power is not reachable via
do_move_page_to_node_array(), becaus
00644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -490,7 +490,7 @@ static inline int pud_bad(pud_t pud)
static inline int pmd_present(pmd_t pmd)
{
- return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
+ return !(pmd_val(pmd) & _SEGMEN
On Thu, Feb 18, 2016 at 04:00:37PM +0100, Gerald Schaefer wrote:
> On Thu, 18 Feb 2016 01:58:08 +0200
> "Kirill A. Shutemov" wrote:
>
> > On Wed, Feb 17, 2016 at 08:13:40PM +0100, Gerald Schaefer wrote:
> > > On Sat, 13 Feb 2016 12:58:31 +010
memory.c that check the same?
>
> This behavior is not new, it was the same before the THP rework, so I do not
> assume that it is related to the current problems, maybe with the exception
> of this specific crash. I never saw the BUG at mm/huge_
On Tue, Feb 16, 2016 at 05:24:44PM +0100, Gerald Schaefer wrote:
> On Mon, 15 Feb 2016 23:35:26 +0200
> "Kirill A. Shutemov" wrote:
>
> > Is there any chance that I'll be able to trigger the bug using QEMU?
> > Does anybody have an QEMU image I can use?
>
On Mon, Feb 15, 2016 at 07:37:02PM +0100, Gerald Schaefer wrote:
> On Mon, 15 Feb 2016 13:31:59 +0200
> "Kirill A. Shutemov" wrote:
>
> > On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> > >
> > > On Sat, 13 Feb 2016, Kirill A. Shutem
On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
>
> On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > Could you check if revert of fecffad25458 helps?
>
> I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
>
> [ 1851.721062] Unable
On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> On Fri, 12 Feb 2016 16:57:27 +0100
> Christian Borntraeger wrote:
>
> > On 02/12/2016 04:41 PM, Kirill A. Shutemov wrote:
> > > On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> > &
On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> On Thu, 11 Feb 2016 21:09:42 +0200
> "Kirill A. Shutemov" wrote:
>
> > On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > > Hi,
> > >
> > > Sebastian Ott re
On Thu, Feb 11, 2016 at 09:09:42PM +0200, Kirill A. Shutemov wrote:
> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > Hi,
> >
> > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> > he also bisected this to commit 61f5d698 &