Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-16 Thread Gerald Schaefer
On Mon, 15 Feb 2016 23:35:26 +0200
"Kirill A. Shutemov"  wrote:

> On Mon, Feb 15, 2016 at 07:37:02PM +0100, Gerald Schaefer wrote:
> > On Mon, 15 Feb 2016 13:31:59 +0200
> > "Kirill A. Shutemov"  wrote:
> > 
> > > On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> > > > 
> > > > On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > > > > Could you check if revert of fecffad25458 helps?
> > > > 
> > > > I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
> > > > 
> > > > ¢ 1851.721062! Unable to handle kernel pointer dereference in virtual 
> > > > kernel address space
> > > > ¢ 1851.721075! failing address:  TEID: 0483
> > > > ¢ 1851.721078! Fault in home space mode while using kernel ASCE.
> > > > ¢ 1851.721085! AS:00d5c007 R3:0007 
> > > > S:a800 P:003d
> > > > ¢ 1851.721128! Oops: 0004 ilc:3 ¢#1! PREEMPT SMP DEBUG_PAGEALLOC
> > > > ¢ 1851.721135! Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en 
> > > > ib_sa ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core 
> > > > ib_addr ghash_s390raid6_pq prng ecb aes_s390 mlx4_core des_s390 
> > > > des_generic genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common 
> > > > crc_itu_t dm_mod scm_block vhost_net tun vhost eadm_sch macvtap macvlan 
> > > > kvm autofs4
> > > > ¢ 1851.721183! CPU: 7 PID: 256422 Comm: bash Not tainted 
> > > > 4.5.0-rc3-00058-g07923d7-dirty #178
> > > > ¢ 1851.721186! task: 7fbfd290 ti: 8c604000 task.ti: 
> > > > 8c604000
> > > > ¢ 1851.721189! Krnl PSW : 0704d0018000 0045d3b8 
> > > > (__rb_erase_color+0x280/0x308)
> > > > ¢ 1851.721200!R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > > > PM:0 EA:3
> > > >Krnl GPRS: 0001 0020 
> > > >  bd07eff1
> > > > ¢ 1851.721205!0027ca10  
> > > > 83e45898 77b61198
> > > > ¢ 1851.721207!7ce1a490 bd07eff0 
> > > > 7ce1a548 0027ca10
> > > > ¢ 1851.721210!bd07c350 bd07eff0 
> > > > 8c607aa8 8c607a68
> > > > ¢ 1851.721221! Krnl Code: 0045d3aa: e3c0d0080024   stg 
> > > > %%r12,8(%%r13)
> > > >   0045d3b0: b9040039   lgr 
> > > > %%r3,%%r9
> > > >  #0045d3b4: a53b0001   oill
> > > > %%r3,1
> > > >  >0045d3b8: e3301024   stg 
> > > > %%r3,0(%%r1)
> > > >   0045d3be: ec28000e007c   cgij
> > > > %%r2,0,8,45d3da
> > > >   0045d3c4: e3402004   lg  
> > > > %%r4,0(%%r2)
> > > >   0045d3ca: b904001c   lgr 
> > > > %%r1,%%r12
> > > >   0045d3ce: ec143f3f0056   rosbg   
> > > > %%r1,%%r4,63,63,0
> > > > ¢ 1851.721269! Call Trace:
> > > > ¢ 1851.721273! (¢<83e45898>! 0x83e45898)
> > > > ¢ 1851.721279!  ¢<0029342a>! unlink_anon_vmas+0x9a/0x1d8
> > > > ¢ 1851.721282!  ¢<00283f34>! free_pgtables+0xcc/0x148
> > > > ¢ 1851.721285!  ¢<0028c376>! exit_mmap+0xd6/0x300
> > > > ¢ 1851.721289!  ¢<00134db8>! mmput+0x90/0x118
> > > > ¢ 1851.721294!  ¢<002d76bc>! flush_old_exec+0x5d4/0x700
> > > > ¢ 1851.721298!  ¢<003369f4>! load_elf_binary+0x2f4/0x13e8
> > > > ¢ 1851.721301!  ¢<002d6e4a>! search_binary_handler+0x9a/0x1f8
> > > > ¢ 1851.721304!  ¢<002d8970>! 
> > > > do_execveat_common.isra.32+0x668/0x9a0
> > > > ¢ 1851.721307!  ¢<002d8cec>! do_execve+0x44/0x58
> > > > ¢ 1851.721310!  ¢<002d8f92>! SyS_execve+0x3a/0x48
> > > > ¢ 1851.721315!  ¢<006fb096>! system_call+0xd6/0x258
> > > > ¢ 1851.721317!  ¢<03ff997436d6>! 0x3ff997436d6
> > > > ¢ 1851.721319! INFO: lockdep is turned off.
> > > > ¢ 1851.721321! Last Breaking-Ev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-17 Thread Gerald Schaefer
On Sat, 13 Feb 2016 12:58:31 +0100 (CET)
Sebastian Ott  wrote:

> [   59.875935] [ cut here ]
> [   59.875937] kernel BUG at mm/huge_memory.c:2884!
> [   59.875979] illegal operation: 0001 ilc:1 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [   59.875986] Modules linked in: bridge stp llc btrfs xor mlx4_en vxlan 
> ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad ib_core ib_addr 
> ghash_s390 prng raid6_pq ecb aes_s390 des_s390 des_generic sha512_s390 
> sha256_s390 sha1_s390 mlx4_core sha_common genwqe_card scm_block crc_itu_t 
> vhost_net tun vhost dm_mod macvtap eadm_sch macvlan kvm autofs4
> [   59.876033] CPU: 2 PID: 5402 Comm: git Tainted: GW   
> 4.4.0-07794-ga4eff16-dirty #77
> [   59.876036] task: d2312948 ti: cfecc000 task.ti: 
> cfecc000
> [   59.876039] Krnl PSW : 0704d0018000 002bf3aa 
> (__split_huge_pmd_locked+0x562/0xa10)
> [   59.876045]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 
> EA:3
>Krnl GPRS: 01a7a1cf 03d10177c000 00044068 
> 5df00215
> [   59.876051]0001 0001  
> 774e6900
> [   59.876054]03ff5200 6d403b10 6e1eb800 
> 03ff51f0
> [   59.876058]03d10177c000 00715190 002bf234 
> cfecfb58
> [   59.876068] Krnl Code: 002bf39c: d507d010a000  clc 
> 16(8,%%r13),0(%%r10)
>   002bf3a2: a7840004  brc 8,2bf3aa
>  #002bf3a6: a7f40001  brc 
> 15,2bf3a8
>  >002bf3aa: 91407440  tm  
> 1088(%%r7),64
>   002bf3ae: a7840208  brc 8,2bf7be
>   002bf3b2: a7f401e9  brc 
> 15,2bf784
>   002bf3b6: 9104a006  tm  
> 6(%%r10),4
>   002bf3ba: a7740004  brc 7,2bf3c2
> [   59.876089] Call Trace:
> [   59.876092] ([<002bf234>] __split_huge_pmd_locked+0x3ec/0xa10)
> [   59.876095]  [<002c4310>] __split_huge_pmd+0x118/0x218
> [   59.876099]  [<002810e8>] unmap_single_vma+0x2d8/0xb40
> [   59.876102]  [<00282d66>] zap_page_range+0x116/0x318
> [   59.876105]  [<0029b834>] SyS_madvise+0x23c/0x5e8
> [   59.876108]  [<006f9f56>] system_call+0xd6/0x258
> [   59.876111]  [<03ff9bbfd282>] 0x3ff9bbfd282
> [   59.876113] INFO: lockdep is turned off.
> [   59.876115] Last Breaking-Event-Address:
> [   59.876118]  [<002bf3a6>] __split_huge_pmd_locked+0x55e/0xa10

The BUG at mm/huge_memory.c:2884 is interesting, it's the 
BUG_ON(!pte_none(*pte))
check in __split_huge_pmd_locked(). Obviously we expect the pre-allocated
pagetables to be empty, but in collapse_huge_page() we deposit the original
pagetable instead of allocating a new (empty) one. This saves an allocation,
which is good, but doesn't that mean that if such a collapsed hugepage will
ever be split, we will always run into the BUG_ON(!pte_none(*pte)), or one
of the two other VM_BUG_ONs in mm/huge_memory.c that check the same?

This behavior is not new, it was the same before the THP rework, so I do not
assume that it is related to the current problems, maybe with the exception
of this specific crash. I never saw the BUG at mm/huge_memory.c:2884 myself,
and the other crashes probably cannot be explained with this. Maybe I am
also missing something, but I do not see how collapse_huge_page() and the
(non-empty) pgtable deposit there can work out with the BUG_ON(!pte_none(*pte))
checks. Any thoughts?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-18 Thread Gerald Schaefer
On Thu, 18 Feb 2016 01:58:08 +0200
"Kirill A. Shutemov"  wrote:

> On Wed, Feb 17, 2016 at 08:13:40PM +0100, Gerald Schaefer wrote:
> > On Sat, 13 Feb 2016 12:58:31 +0100 (CET)
> > Sebastian Ott  wrote:
> > 
> > > [   59.875935] [ cut here ]
> > > [   59.875937] kernel BUG at mm/huge_memory.c:2884!
> > > [   59.875979] illegal operation: 0001 ilc:1 [#1] PREEMPT SMP 
> > > DEBUG_PAGEALLOC
> > > [   59.875986] Modules linked in: bridge stp llc btrfs xor mlx4_en vxlan 
> > > ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad ib_core 
> > > ib_addr ghash_s390 prng raid6_pq ecb aes_s390 des_s390 des_generic 
> > > sha512_s390 sha256_s390 sha1_s390 mlx4_core sha_common genwqe_card 
> > > scm_block crc_itu_t vhost_net tun vhost dm_mod macvtap eadm_sch macvlan 
> > > kvm autofs4
> > > [   59.876033] CPU: 2 PID: 5402 Comm: git Tainted: GW   
> > > 4.4.0-07794-ga4eff16-dirty #77
> > > [   59.876036] task: d2312948 ti: cfecc000 task.ti: 
> > > cfecc000
> > > [   59.876039] Krnl PSW : 0704d0018000 002bf3aa 
> > > (__split_huge_pmd_locked+0x562/0xa10)
> > > [   59.876045]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > > PM:0 EA:3
> > >Krnl GPRS: 01a7a1cf 03d10177c000 
> > > 00044068 5df00215
> > > [   59.876051]0001 0001 
> > >  774e6900
> > > [   59.876054]03ff5200 6d403b10 
> > > 6e1eb800 03ff51f0
> > > [   59.876058]03d10177c000 00715190 
> > > 002bf234 cfecfb58
> > > [   59.876068] Krnl Code: 002bf39c: d507d010a000  clc 
> > > 16(8,%%r13),0(%%r10)
> > >   002bf3a2: a7840004  brc 
> > > 8,2bf3aa
> > >  #002bf3a6: a7f40001  brc 
> > > 15,2bf3a8
> > >  >002bf3aa: 91407440  tm  
> > > 1088(%%r7),64
> > >   002bf3ae: a7840208  brc 
> > > 8,2bf7be
> > >   002bf3b2: a7f401e9  brc 
> > > 15,2bf784
> > >   002bf3b6: 9104a006  tm  
> > > 6(%%r10),4
> > >   002bf3ba: a7740004  brc 
> > > 7,2bf3c2
> > > [   59.876089] Call Trace:
> > > [   59.876092] ([<002bf234>] __split_huge_pmd_locked+0x3ec/0xa10)
> > > [   59.876095]  [<002c4310>] __split_huge_pmd+0x118/0x218
> > > [   59.876099]  [<002810e8>] unmap_single_vma+0x2d8/0xb40
> > > [   59.876102]  [<00282d66>] zap_page_range+0x116/0x318
> > > [   59.876105]  [<0029b834>] SyS_madvise+0x23c/0x5e8
> > > [   59.876108]  [<006f9f56>] system_call+0xd6/0x258
> > > [   59.876111]  [<03ff9bbfd282>] 0x3ff9bbfd282
> > > [   59.876113] INFO: lockdep is turned off.
> > > [   59.876115] Last Breaking-Event-Address:
> > > [   59.876118]  [<002bf3a6>] __split_huge_pmd_locked+0x55e/0xa10
> > 
> > The BUG at mm/huge_memory.c:2884 is interesting, it's the 
> > BUG_ON(!pte_none(*pte))
> > check in __split_huge_pmd_locked(). Obviously we expect the pre-allocated
> > pagetables to be empty, but in collapse_huge_page() we deposit the original
> > pagetable instead of allocating a new (empty) one. This saves an allocation,
> > which is good, but doesn't that mean that if such a collapsed hugepage will
> > ever be split, we will always run into the BUG_ON(!pte_none(*pte)), or one
> > of the two other VM_BUG_ONs in mm/huge_memory.c that check the same?
> > 
> > This behavior is not new, it was the same before the THP rework, so I do not
> > assume that it is related to the current problems, maybe with the exception
> > of this specific crash. I never saw the BUG at mm/huge_memory.c:2884 myself,
> > and the other crashes probably cannot be explained with this. Maybe I am
> > also missing something, but I do not see how collapse_huge_page() and the
> > (non-empty) pgtable deposit there can work out with the 
> > BUG_ON(!pte_none(*pte))
> > checks. Any thoughts?
> 
> I don't think there's a problem: ptes in the pgtable are cleared with
> pt

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-23 Thread Gerald Schaefer
On Tue, 23 Feb 2016 13:32:21 +0300
"Kirill A. Shutemov"  wrote:

> On Fri, Feb 12, 2016 at 06:16:40PM +0100, Gerald Schaefer wrote:
> > On Fri, 12 Feb 2016 16:57:27 +0100
> > Christian Borntraeger  wrote:
> > 
> > > > I'm also confused by pmd_none() is equal to !pmd_present() on s390. Hm?
> > > 
> > > Don't know, Gerald or Martin?
> > 
> > The implementation frequently changes depending on how many new bits Martin
> > needs to squeeze out :-)
> > We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if 
> > the
> > entry is not empty. pmd_none() of course does the opposite, it checks if it 
> > is
> > empty.
> 
> I still worry about pmd_present(). It looks wrong to me. I wounder if
> patch below makes a difference.
> 
> The theory is that the splitting bit effetely masked bogus pmd_present():
> we had pmd_trans_splitting() in all code path and that prevented mm from
> touching the pmd. Once pmd_trans_splitting() has gone, mm proceed with the
> pmd where it shouldn't and here's a boom.

Well, I don't think pmd_present() == true is bogus for a trans_huge pmd under
splitting, after all there is a page behind the the pmd. Also, if it was
bogus, and it would need to be false, why should it be marked !pmd_present()
only at the pmdp_invalidate() step before the pmd_populate()? It clearly
is pmd_present() before that, on all architectures, and if there was any
problem/race with that, setting it to !pmd_present() at this stage would
only (marginally) reduce the race window.

BTW, PowerPC and Sparc seem to do the same thing in pmdp_invalidate(),
i.e. they do not set pmd_present() == false, only mark it so that it would
not generate a new TLB entry, just like on s390. After all, the function
is called pmdp_invalidate(), and I think the comment in mm/huge_memory.c
before that call is just a little ambiguous in its wording. When it says
"mark the pmd notpresent" it probably means "mark it so that it will not
generate a new TLB entry", which is also what the comment is really about:
prevent huge and small entries in the TLB for the same page at the same
time.

FWIW, and since the ARM arch-list is already on cc, I think there is
an issue with pmdp_invalidate() on ARM, since it also seems to clear
the trans_huge (and formerly trans_splitting) bit, which actually makes
the pmd !pmd_present(), but it violates the other requirement from the
comment:
"the pmd_trans_huge and pmd_trans_splitting must remain set at all times
on the pmd until the split is complete for this pmd"

> 
> I'm not sure that the patch is correct wrt yound/old pmds and I have no
> way to test it...
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 64ead8091248..2eeb17ab68ac 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -490,7 +490,7 @@ static inline int pud_bad(pud_t pud)
> 
>  static inline int pmd_present(pmd_t pmd)
>  {
> - return pmd_val(pmd) != _SEGMENT_ENTRY_INVALID;
> + return !(pmd_val(pmd) & _SEGMENT_ENTRY_INVALID);
>  }
> 
>  static inline int pmd_none(pmd_t pmd)

No, that would not work well with young rw and ro pmds. We do now
have an extra free bit in the pmd on s390, after the removal of the
splitting bit, so we could try to implement pmd_present() with that
sw bit, but that would also require several not-so-trivial changes
to the other code in arch/s390/include/asm/pgtable.h.

I'll check with Martin, maybe it is actually trivial, then we can
do a quick test it to rule that one out.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-24 Thread Gerald Schaefer
On Tue, 23 Feb 2016 22:33:45 +0300
"Kirill A. Shutemov"  wrote:

> On Tue, Feb 23, 2016 at 07:19:07PM +0100, Gerald Schaefer wrote:
> > I'll check with Martin, maybe it is actually trivial, then we can
> > do a quick test it to rule that one out.
> 
> Oh. I found a bug in __split_huge_pmd_locked(). Although, not sure if it's
> _the_ bug.
> 
> pmdp_invalidate() is called for the wrong address :-/
> I guess that can be destructive on the architecture, right?

Thanks, that's it! We can no longer reproduce the crashes and calling
pmdp_invalidate() with a wrong address also perfectly explains the
memory corruption that I found in several dumps: 0x020 was ORed into
pte entries, which didn't make sense, and caused the list corruption
for example. 0x020 it is the invalid bit for pmd entries on s390 and
thus can be explained by this bug when a pte table lies before a pmd
table in memory.

> 
> Could you check this?
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1c317b85ea7d..4246bc70e55a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2865,7 +2865,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>   pmd_populate(mm, &_pmd, pgtable);
> 
> - for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
>   pte_t entry, *pte;
>   /*
>* Note that NUMA hinting access restrictions are not
> @@ -2886,9 +2886,9 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   }
>   if (dirty)
>   SetPageDirty(page + i);
> - pte = pte_offset_map(&_pmd, haddr);
> + pte = pte_offset_map(&_pmd, haddr + i * PAGE_SIZE);
>   BUG_ON(!pte_none(*pte));
> - set_pte_at(mm, haddr, pte, entry);
> + set_pte_at(mm, haddr + i * PAGE_SIZE, pte, entry);
>   atomic_inc(&page[i]._mapcount);
>   pte_unmap(pte);
>   }
> @@ -2938,7 +2938,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   pmd_populate(mm, pmd, pgtable);
> 
>   if (freeze) {
> - for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
>   page_remove_rmap(page + i, false);
>   put_page(page + i);
>   }

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-11 Thread Gerald Schaefer
Hi,

Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
review of the THP rework patches, which cannot be bisected, revealed
commit fecffad "s390, thp: remove infrastructure for handling splitting PMDs"
(and also similar commits for other archs).

This commit removes the THP splitting bit and also the architecture
implementation of pmdp_splitting_flush(), which took care of the IPI for
fast_gup serialization. The commit message says

pmdp_splitting_flush() is not needed too: on splitting PMD we will do
pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
needed for fast_gup

The assumption that a TLB flush will also produce an IPI is wrong on s390,
and maybe also on other architectures, and I thought that this was actually
the main reason for having an arch-specific pmdp_splitting_flush().

At least PowerPC and ARM also had an individual implementation of
pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
flush to send the IPI, and those were also removed. Putting the arch
maintainers and mailing lists on cc to verify.

On s390 this will break the IPI serialization against fast_gup, which
would certainly explain the random kernel crashes, please revert or fix
the pmdp_splitting_flush() removal.

Regards,
Gerald

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-11 Thread Gerald Schaefer
On Thu, 11 Feb 2016 21:09:42 +0200
"Kirill A. Shutemov"  wrote:

> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> > Hi,
> > 
> > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> > he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> > review of the THP rework patches, which cannot be bisected, revealed
> > commit fecffad "s390, thp: remove infrastructure for handling splitting 
> > PMDs"
> > (and also similar commits for other archs).
> > 
> > This commit removes the THP splitting bit and also the architecture
> > implementation of pmdp_splitting_flush(), which took care of the IPI for
> > fast_gup serialization. The commit message says
> > 
> > pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> > pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> > needed for fast_gup
> > 
> > The assumption that a TLB flush will also produce an IPI is wrong on s390,
> > and maybe also on other architectures, and I thought that this was actually
> > the main reason for having an arch-specific pmdp_splitting_flush().
> > 
> > At least PowerPC and ARM also had an individual implementation of
> > pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> > flush to send the IPI, and those were also removed. Putting the arch
> > maintainers and mailing lists on cc to verify.
> > 
> > On s390 this will break the IPI serialization against fast_gup, which
> > would certainly explain the random kernel crashes, please revert or fix
> > the pmdp_splitting_flush() removal.
> 
> Sorry for that.
> 
> I believe, the problem was already addressed for PowerPC:
> 
> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> 
> I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> the trick, right?

Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
fast_gup will still return false, because the pmd is not empty (at least
on s390). So I don't see spontaneously how it will help fast_gup to break
out to the slow path in case of THP splitting.

> 
> If yes, I'll prepare patch tomorrow (some sleep required).
> 

We'll check if adding kick_all_cpus_sync() to pmdp_invalidate() helps.
It would also be good if Martin has a look at this, he'll return on
Monday.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-12 Thread Gerald Schaefer
On Fri, 12 Feb 2016 09:34:33 +0530
"Aneesh Kumar K.V"  wrote:

> Gerald Schaefer  writes:
> 
> > On Thu, 11 Feb 2016 21:09:42 +0200
> > "Kirill A. Shutemov"  wrote:
> >
> >> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> >> > Hi,
> >> > 
> >> > Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> >> > he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> >> > review of the THP rework patches, which cannot be bisected, revealed
> >> > commit fecffad "s390, thp: remove infrastructure for handling splitting 
> >> > PMDs"
> >> > (and also similar commits for other archs).
> >> > 
> >> > This commit removes the THP splitting bit and also the architecture
> >> > implementation of pmdp_splitting_flush(), which took care of the IPI for
> >> > fast_gup serialization. The commit message says
> >> > 
> >> > pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> >> > pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> >> > needed for fast_gup
> >> > 
> >> > The assumption that a TLB flush will also produce an IPI is wrong on 
> >> > s390,
> >> > and maybe also on other architectures, and I thought that this was 
> >> > actually
> >> > the main reason for having an arch-specific pmdp_splitting_flush().
> >> > 
> >> > At least PowerPC and ARM also had an individual implementation of
> >> > pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> >> > flush to send the IPI, and those were also removed. Putting the arch
> >> > maintainers and mailing lists on cc to verify.
> >> > 
> >> > On s390 this will break the IPI serialization against fast_gup, which
> >> > would certainly explain the random kernel crashes, please revert or fix
> >> > the pmdp_splitting_flush() removal.
> >> 
> >> Sorry for that.
> >> 
> >> I believe, the problem was already addressed for PowerPC:
> >> 
> >> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> >> 
> >> I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> >> the trick, right?
> >
> > Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
> > fast_gup will still return false, because the pmd is not empty (at least
> > on s390).
> 
> Why can't we do this ? I did this for ppc64.
> 
>  void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>pmd_t *pmdp)
>  {
> - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
> + pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
> 

Wouldn't that semantically change what pmdp_invalidate() was supposed to
do? The comment before the call says "the pmd_trans_huge and
pmd_trans_splitting must remain set at all times on the pmd". So, after
removing pmd_trans_splitting, it seems to be necessary to at least keep
pmd_trans_huge set.

In your case, the pmd would be completely cleared, which may help to find
it in fast_gup with pmd_none(), but I'm not sure if this would open up
other problems, e.g. with concurrent page faults. But I must also admit that
my THP overview got a little rusty.

> >So I don't see spontaneously how it will help fast_gup to break
> > out to the slow path in case of THP splitting.
> >
> >> 
> >> If yes, I'll prepare patch tomorrow (some sleep required).
> >> 
> >
> > We'll check if adding kick_all_cpus_sync() to pmdp_invalidate() helps.
> > It would also be good if Martin has a look at this, he'll return on
> > Monday.
> 
> -aneesh
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-s390" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-12 Thread Gerald Schaefer
On Fri, 12 Feb 2016 16:57:27 +0100
Christian Borntraeger  wrote:

> On 02/12/2016 04:41 PM, Kirill A. Shutemov wrote:
> > On Thu, Feb 11, 2016 at 08:57:02PM +0100, Gerald Schaefer wrote:
> >> On Thu, 11 Feb 2016 21:09:42 +0200
> >> "Kirill A. Shutemov"  wrote:
> >>
> >>> On Thu, Feb 11, 2016 at 07:22:23PM +0100, Gerald Schaefer wrote:
> >>>> Hi,
> >>>>
> >>>> Sebastian Ott reported random kernel crashes beginning with v4.5-rc1 and
> >>>> he also bisected this to commit 61f5d698 "mm: re-enable THP". Further
> >>>> review of the THP rework patches, which cannot be bisected, revealed
> >>>> commit fecffad "s390, thp: remove infrastructure for handling splitting 
> >>>> PMDs"
> >>>> (and also similar commits for other archs).
> >>>>
> >>>> This commit removes the THP splitting bit and also the architecture
> >>>> implementation of pmdp_splitting_flush(), which took care of the IPI for
> >>>> fast_gup serialization. The commit message says
> >>>>
> >>>> pmdp_splitting_flush() is not needed too: on splitting PMD we will do
> >>>> pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
> >>>> needed for fast_gup
> >>>>
> >>>> The assumption that a TLB flush will also produce an IPI is wrong on 
> >>>> s390,
> >>>> and maybe also on other architectures, and I thought that this was 
> >>>> actually
> >>>> the main reason for having an arch-specific pmdp_splitting_flush().
> >>>>
> >>>> At least PowerPC and ARM also had an individual implementation of
> >>>> pmdp_splitting_flush() that used kick_all_cpus_sync() instead of a TLB
> >>>> flush to send the IPI, and those were also removed. Putting the arch
> >>>> maintainers and mailing lists on cc to verify.
> >>>>
> >>>> On s390 this will break the IPI serialization against fast_gup, which
> >>>> would certainly explain the random kernel crashes, please revert or fix
> >>>> the pmdp_splitting_flush() removal.
> >>>
> >>> Sorry for that.
> >>>
> >>> I believe, the problem was already addressed for PowerPC:
> >>>
> >>> http://lkml.kernel.org/g/454980831-16631-1-git-send-email-aneesh.ku...@linux.vnet.ibm.com
> >>>
> >>> I think kick_all_cpus_sync() in arch-specific pmdp_invalidate() would do
> >>> the trick, right?
> >>
> >> Hmm, not sure about that. After pmdp_invalidate(), a pmd_none() check in
> >> fast_gup will still return false, because the pmd is not empty (at least
> >> on s390). So I don't see spontaneously how it will help fast_gup to break
> >> out to the slow path in case of THP splitting.
> > 
> > What pmdp_flush_direct() does in pmdp_invalidate()? It's hard to unwrap for 
> > me :-/
> > Does it make the pmd !pmd_present()?
> 
> It uses the idte instruction, which in an atomic fashion flushes the 
> associated
> TLB entry and changes the value of the pmd entry to invalid. This comes from 
> the
> HW requirement to not  change a PTE/PMD that might be still in use, other 
> than 
> with special instructions that does the tlb handling and the invalidation 
> together.

Correct, and it does _not_ make the pmd !pmd_present(), that would only be the
case after a _clear_flush(). It only marks the pmd as invalid and flushes,
so that it cannot generate a new TLB entry before the following pmd_populate(),
but it keeps its other content. This is to fulfill the requirements outlined in
the comment in mm/huge_memory.c before the call to pmdp_invalidate(). And
independent from that comment, we would need such an _invalidate() or
_clear_flush() on s390 before the pmd_populate() because of the HW details
that Christian described.

Reading the comment again, I do now notice that it also says "mark the current
pmd notpresent", which we cannot do w/o losing the huge and (formerly) splitting
bits, but it also shouldn't be needed to provide the "single TLB guarantee" that
is required from the comment. So, a pmd_present() check on s390 in this state
would still return true. Not sure yet if this is a problem, need more thinking,
this behavior was already present before the THP rework but maybe it was OK
before and is not OK now.

At least for fast_gup this should not be a problem though.

> (It also does some some other magic to the attach_count, which might hold off
> finish_arc

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-15 Thread Gerald Schaefer
On Sat, 13 Feb 2016 01:15:10 +0200
"Kirill A. Shutemov"  wrote:

> 
> I'm trying to wrap my head around the issue and I don't think missing
> serialization with gup_fast is the cause -- we just don't need it
> anymore.
> 
> Previously, __split_huge_page_splitting() required serialization against
> gup_fast to make sure nobody can obtain new reference to the page after
> __split_huge_page_splitting() returns. This was a way to stabilize page
> references before starting to distribute them from head page to tail
> pages.
> 
> With new refcounting, we don't care about this. Splitting PMD is now
> decoupled from splitting underlying compound page. It's okay to get new
> pins after split_huge_pmd(). To stabilize page references during
> split_huge_page() we rely on setting up migration entries once all
> pmds are split into page table entries.
> 
> The theory that serialization against gup_fast is not a root cause of the
> crashes is consistent no crashes on arm64. Problem is somewhere else.

Hmm, ok, I just relied on the commit message of commit fecffad25458, which
talks about "pmdp_clear_flush() will do IPI as needed for fast_gup", as well
as the comments in mm/gup.c, which also still talk about IPIs and THP
splitting.

If IPI serialization with fast_gup is not needed anymore for THP splitting,
please fix at least the comments in mm/gup.c.

> 
> > > (It also does some some other magic to the attach_count, which might hold 
> > > off
> > > finish_arch_post_lock_switch while some flushing is happening, but this 
> > > should
> > > be unrelated here)
> > > 
> > > 
> > > > I'm also confused by pmd_none() is equal to !pmd_present() on s390. Hm?
> > > 
> > > Don't know, Gerald or Martin?
> > 
> > The implementation frequently changes depending on how many new bits Martin
> > needs to squeeze out :-)
> 
> One bit was freed up by the commit you've pointed to as a cause.
> I wounder If it's possible that screw up something while removing it? I
> don't see it, but who knows.
> 
> Could you check if revert of fecffad25458 helps?

I tried reverting fecffad25458, plus re-adding a call to pmdp_splitting_flush()
in __split_huge_pmd_locked(), and I could still reproduce the crashes, so I
guess it really isn't related to fast_gup vs. THP splitting.

> 
> And could you share how crashes looks like? I haven't seen backtraces yet.
> 
> > We don't have a _PAGE_PRESENT bit for pmds, so pmd_present() just checks if 
> > the
> > entry is not empty. pmd_none() of course does the opposite, it checks if it 
> > is
> > empty.
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-15 Thread Gerald Schaefer
On Mon, 15 Feb 2016 13:31:59 +0200
"Kirill A. Shutemov"  wrote:

> On Sat, Feb 13, 2016 at 12:58:31PM +0100, Sebastian Ott wrote:
> > 
> > On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> > > Could you check if revert of fecffad25458 helps?
> > 
> > I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:
> > 
> > ¢ 1851.721062! Unable to handle kernel pointer dereference in virtual 
> > kernel address space
> > ¢ 1851.721075! failing address:  TEID: 0483
> > ¢ 1851.721078! Fault in home space mode while using kernel ASCE.
> > ¢ 1851.721085! AS:00d5c007 R3:0007 S:a800 
> > P:003d
> > ¢ 1851.721128! Oops: 0004 ilc:3 ¢#1! PREEMPT SMP DEBUG_PAGEALLOC
> > ¢ 1851.721135! Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en 
> > ib_sa ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core 
> > ib_addr ghash_s390raid6_pq prng ecb aes_s390 mlx4_core des_s390 des_generic 
> > genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common crc_itu_t dm_mod 
> > scm_block vhost_net tun vhost eadm_sch macvtap macvlan kvm autofs4
> > ¢ 1851.721183! CPU: 7 PID: 256422 Comm: bash Not tainted 
> > 4.5.0-rc3-00058-g07923d7-dirty #178
> > ¢ 1851.721186! task: 7fbfd290 ti: 8c604000 task.ti: 
> > 8c604000
> > ¢ 1851.721189! Krnl PSW : 0704d0018000 0045d3b8 
> > (__rb_erase_color+0x280/0x308)
> > ¢ 1851.721200!R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 
> > PM:0 EA:3
> >Krnl GPRS: 0001 0020 
> >  bd07eff1
> > ¢ 1851.721205!0027ca10  
> > 83e45898 77b61198
> > ¢ 1851.721207!7ce1a490 bd07eff0 
> > 7ce1a548 0027ca10
> > ¢ 1851.721210!bd07c350 bd07eff0 
> > 8c607aa8 8c607a68
> > ¢ 1851.721221! Krnl Code: 0045d3aa: e3c0d0080024   stg 
> > %%r12,8(%%r13)
> >   0045d3b0: b9040039   lgr 
> > %%r3,%%r9
> >  #0045d3b4: a53b0001   oill
> > %%r3,1
> >  >0045d3b8: e3301024   stg 
> > %%r3,0(%%r1)
> >   0045d3be: ec28000e007c   cgij
> > %%r2,0,8,45d3da
> >   0045d3c4: e3402004   lg  
> > %%r4,0(%%r2)
> >   0045d3ca: b904001c   lgr 
> > %%r1,%%r12
> >   0045d3ce: ec143f3f0056   rosbg   
> > %%r1,%%r4,63,63,0
> > ¢ 1851.721269! Call Trace:
> > ¢ 1851.721273! (¢<83e45898>! 0x83e45898)
> > ¢ 1851.721279!  ¢<0029342a>! unlink_anon_vmas+0x9a/0x1d8
> > ¢ 1851.721282!  ¢<00283f34>! free_pgtables+0xcc/0x148
> > ¢ 1851.721285!  ¢<0028c376>! exit_mmap+0xd6/0x300
> > ¢ 1851.721289!  ¢<00134db8>! mmput+0x90/0x118
> > ¢ 1851.721294!  ¢<002d76bc>! flush_old_exec+0x5d4/0x700
> > ¢ 1851.721298!  ¢<003369f4>! load_elf_binary+0x2f4/0x13e8
> > ¢ 1851.721301!  ¢<002d6e4a>! search_binary_handler+0x9a/0x1f8
> > ¢ 1851.721304!  ¢<002d8970>! do_execveat_common.isra.32+0x668/0x9a0
> > ¢ 1851.721307!  ¢<002d8cec>! do_execve+0x44/0x58
> > ¢ 1851.721310!  ¢<002d8f92>! SyS_execve+0x3a/0x48
> > ¢ 1851.721315!  ¢<006fb096>! system_call+0xd6/0x258
> > ¢ 1851.721317!  ¢<03ff997436d6>! 0x3ff997436d6
> > ¢ 1851.721319! INFO: lockdep is turned off.
> > ¢ 1851.721321! Last Breaking-Event-Address:
> > ¢ 1851.721323!  ¢<0045d31a>! __rb_erase_color+0x1e2/0x308
> > ¢ 1851.721327!
> > ¢ 1851.721329! ---¢ end trace 0d80041ac00cfae2 !---
> > 
> > 
> > > 
> > > And could you share how crashes looks like? I haven't seen backtraces yet.
> > > 
> > 
> > Sure. I didn't because they really looked random to me. Most of the time
> > in rcu or list debugging but I thought these have just been the messenger
> > observing a corruption first. Anyhow, here is an older one that might look
> > interesting:
> > 
> > [   59.851421] list_del corruption. next->prev should be 6e1eb000, 
> > but was 0400
> 
> This kinda interesting: 0x400 is TAIL_MAPPING.. Hm..
> 
> Could you check if you see the problem on commit 1c290f642101 and its
> immediate parent?
> 

How should the page->mapping poison end up as next->prev in the list of
pre-allocated THP splitting page tables? Also, commit 1c290f642101
is before the THP rework, at least the non-bisectable part, so we should
expect not to see the problem there.

0x400 is also the value of an empty pte on s390, and the thp_deposit/withdraw
listheads are placed inside the pre-allocated pagetables instead of page->lru,
because we have 2K pagetables on s390 and cannot use struct page == pgtable_t.

So, for example, two concurrent withdraws could produce such a list
corruption, because the first withdr

Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-03-31 Thread Gerald Schaefer
On Tue, 24 Mar 2020 10:52:52 +0530
Anshuman Khandual  wrote:

> This series adds more arch page table helper tests. The new tests here are
> either related to core memory functions and advanced arch pgtable helpers.
> This also creates a documentation file enlisting all expected semantics as
> suggested by Mike Rapoport (https://lkml.org/lkml/2020/1/30/40).
> 
> This series has been tested on arm64 and x86 platforms. There is just one
> expected failure on arm64 that will be fixed when we enable THP migration.
> 
> [   21.741634] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:782
> 
> which corresponds to
> 
> WARN_ON(!pmd_present(pmd_mknotpresent(pmd_mkhuge(pmd
> 
> There are many TRANSPARENT_HUGEPAGE and ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD
> ifdefs scattered across the test. But consolidating all the fallback stubs
> is not very straight forward because ARCH_HAS_TRANSPARENT_HUGEPAGE_PUD is
> not explicitly dependent on ARCH_HAS_TRANSPARENT_HUGEPAGE.
> 
> This series has been build tested on many platforms including the ones that
> subscribe the test through ARCH_HAS_DEBUG_VM_PGTABLE.
> 

Hi Anshuman,

thanks for the update. There are a couple of issues on s390, some might
also affect other archs.

1) The pxd_huge_tests are using pxd_set/clear_huge, which defaults to
returning 0 if !CONFIG_HAVE_ARCH_HUGE_VMAP. As result, the checks for
!pxd_test/clear_huge in the pxd_huge_tests will always trigger the
warning. This should affect all archs w/o CONFIG_HAVE_ARCH_HUGE_VMAP.
Could be fixed like this:

@@ -923,8 +923,10 @@ void __init debug_vm_pgtable(void)
pmd_leaf_tests(pmd_aligned, prot);
pud_leaf_tests(pud_aligned, prot);
 
-   pmd_huge_tests(pmdp, pmd_aligned, prot);
-   pud_huge_tests(pudp, pud_aligned, prot);
+   if (IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) {
+   pmd_huge_tests(pmdp, pmd_aligned, prot);
+   pud_huge_tests(pudp, pud_aligned, prot);
+   }
 
pte_savedwrite_tests(pte_aligned, prot);
pmd_savedwrite_tests(pmd_aligned, prot);

BTW, please add some comments to the various #ifdef/#else stuff, especially
when the different parts are far away and/or nested.

2) The hugetlb_advanced_test will fail because it directly de-references
huge *ptep pointers instead of using huge_ptep_get() for this. We have
very different pagetable entry layout for pte and (large) pmd on s390,
and unfortunately the whole hugetlbfs code is using pte_t instead of pmd_t
like THP. For this reason, huge_ptep_get() was introduced, which will
return a "converted" pte, because directly reading from a *ptep (pointing
to a large pmd) will not return a proper pte. Only ARM has also an
implementation of huge_ptep_get(), so they could be affected, depending
on what exactly they need it for.

Could be fixed like this (the first de-reference is a bit special,
because at that point *ptep does not really point to a large (pmd) entry
yet, it is initially an invalid pte entry, which breaks our huge_ptep_get()
conversion logic. I also added PMD_MASK alignment for RANDOM_ORVALUE,
because we do have some special bits there in our large pmds. It seems
to also work w/o that alignment, but it feels a bit wrong):

@@ -731,26 +731,26 @@ static void __init hugetlb_advanced_test
  unsigned long vaddr, pgprot_t prot)
 {
struct page *page = pfn_to_page(pfn);
-   pte_t pte = READ_ONCE(*ptep);
+   pte_t pte;

-   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+   pte = pte_mkhuge(mk_pte_phys(RANDOM_ORVALUE & PMD_MASK, prot));
set_huge_pte_at(mm, vaddr, ptep, pte);
barrier();
WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
huge_pte_clear(mm, vaddr, ptep, PMD_SIZE);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(!huge_pte_none(pte));
 
pte = mk_huge_pte(page, prot);
set_huge_pte_at(mm, vaddr, ptep, pte);
huge_ptep_set_wrprotect(mm, vaddr, ptep);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(huge_pte_write(pte));
 
pte = mk_huge_pte(page, prot);
set_huge_pte_at(mm, vaddr, ptep, pte);
huge_ptep_get_and_clear(mm, vaddr, ptep);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(!huge_pte_none(pte));
 
pte = mk_huge_pte(page, prot);
@@ -759,7 +759,7 @@ static void __init hugetlb_advanced_test
pte = huge_pte_mkwrite(pte);
pte = huge_pte_mkdirty(pte);
huge_ptep_set_access_flags(vma, vaddr, ptep, pte, 1);
-   pte = READ_ONCE(*ptep);
+   pte = huge_ptep_get(ptep);
WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
 }
 #else

3) The pmd_protnone_tests() has an issue, because it passes a pmd to
pmd_protnone() which has not been marked as large. We check for large
pmd in the s390 implementation of pmd_protnone(), and will fail if a
pmd is not large. We had similar issues before, in other helpers, where
I 

Re: [PATCH V8] mm/debug: Add tests validating architecture page table helpers

2019-11-05 Thread Gerald Schaefer
On Mon, 28 Oct 2019 10:59:22 +0530
Anshuman Khandual  wrote:

> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> level along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> right after page_alloc_init_late().
> 
> This gets build and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_VM_DEBUG. Architectures willing to subscribe this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.

I've prepared a couple of commits to our arch code to make this work on s390,
they will go upstream in the next merge window. After that, we can add s390
to the supported architectures.

We had some issues, e.g. because we do not report large entries as bad in
pxd_bad(), do not check for folded page tables in pxd_free(), or assume
that primitives like pmd_mkdirty() will only be called after pmd_mkhuge().
None of those should have any impact on current code, but your test module
revealed that we do not behave like other architectures in some aspects,
and it's good to find and fix such things to prevent possible future issues.

Thanks a lot for the effort!

Regards,
Gerald



Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-04-07 Thread Gerald Schaefer
On Sun, 5 Apr 2020 17:58:14 +0530
Anshuman Khandual  wrote:

[...]
> > 
> > Could be fixed like this (the first de-reference is a bit special,
> > because at that point *ptep does not really point to a large (pmd) entry
> > yet, it is initially an invalid pte entry, which breaks our huge_ptep_get() 
> >  
> 
> There seems to be an inconsistency on s390 platform. Even though it defines
> a huge_ptep_get() override, it does not subscribe __HAVE_ARCH_HUGE_PTEP_GET
> which should have forced it fallback on generic huge_ptep_get() but it does
> not :) Then I realized that __HAVE_ARCH_HUGE_PTEP_GET only makes sense when
> an arch uses . s390 does not use that and hence gets
> away with it's own huge_ptep_get() without __HAVE_ARCH_HUGE_PTEP_GET. Sounds
> confusing ? But I might not have the entire context here.

Yes, that sounds very confusing. Also a bit ironic, since huge_ptep_get()
was initially introduced because of s390, and now we don't select
__HAVE_ARCH_HUGE_PTEP_GET...

As you realized, I guess this is because we do not use generic hugetlb.h.
And when __HAVE_ARCH_HUGE_PTEP_GET was introduced with commit 544db7597ad
("hugetlb: introduce generic version of huge_ptep_get"), that was probably
the reason why we did not get our share of __HAVE_ARCH_HUGE_PTEP_GET.

Nothing really wrong with that, but yes, very confusing. Maybe we could
also select it for s390, even though it wouldn't have any functional
impact (so far), just for less confusion. Maybe also thinking about
using the generic hugetlb.h, not sure if the original reasons for not
doing so would still apply. Now I only need to find the time...

> 
> > conversion logic. I also added PMD_MASK alignment for RANDOM_ORVALUE,
> > because we do have some special bits there in our large pmds. It seems
> > to also work w/o that alignment, but it feels a bit wrong):  
> 
> Sure, we can accommodate that.
> 
> > 
> > @@ -731,26 +731,26 @@ static void __init hugetlb_advanced_test
> >   unsigned long vaddr, pgprot_t 
> > prot)
> >  {
> > struct page *page = pfn_to_page(pfn);
> > -   pte_t pte = READ_ONCE(*ptep);
> > +   pte_t pte;
> > 
> > -   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > +   pte = pte_mkhuge(mk_pte_phys(RANDOM_ORVALUE & PMD_MASK, prot));  
> 
> So that keeps the existing value in 'ptep' pointer at bay and instead
> construct a PTE from scratch. I would rather have READ_ONCE(*ptep) at
> least provide the seed that can be ORed with RANDOM_ORVALUE before
> being masked with PMD_MASK. Do you see any problem ?

Yes, unfortunately. The problem is that the resulting pte is not marked
as present. The conversion pte -> (huge) pmd, which is done in
set_huge_pte_at() for s390, will establish an empty pmd for non-present
ptes, all the RANDOM_ORVALUE stuff is lost. And a subsequent
huge_ptep_get() will not result in the same original pte value. If you
want to preserve and check the RANDOM_ORVALUE, it has to be a present
pte, hence the mk_pte(_phys).

> 
> Some thing like this instead.
> 
> pte_t pte = READ_ONCE(*ptep);
> pte = pte_mkhuge(__pte((pte_val(pte) | RANDOM_ORVALUE) & PMD_MASK));
> 
> We cannot use mk_pte_phys() as it is defined only on some platforms
> without any generic fallback for others.

Oh, didn't know that, sorry. What about using mk_pte() instead, at least
it would result in a present pte:

pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), prot));

And if you also want to do some with the existing value, which seems
to be an empty pte, then maybe just check if writing and reading that
value with set_huge_pte_at() / huge_ptep_get() returns the same,
i.e. initially w/o RANDOM_ORVALUE.

So, in combination, like this (BTW, why is the barrier() needed, it
is not used for the other set_huge_pte_at() calls later?):

@@ -733,24 +733,28 @@ static void __init hugetlb_advanced_test
struct page *page = pfn_to_page(pfn);
pte_t pte = READ_ONCE(*ptep);
 
-   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
+   set_huge_pte_at(mm, vaddr, ptep, pte);
+   WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
+
+   pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), prot));
set_huge_pte_at(mm, vaddr, ptep, pte);
barrier();
WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));

This would actually add a new test "write empty pte with
set_huge_pte_at(), then verify with huge_ptep_get()", which happens
to trigger a warning on s390 :-)

That (new) warning actually points to misbehavior on s390, we do not
write a correct empty pmd in this case, but one that is empty and also
marked as large. huge_ptep_get() will then not correctly recognize it
as empty and do wrong conversion. It is also not consistent with
huge_ptep_get_and_clear(), where we write the empty pmd w/o marking
as large. Last but not least it would also break our pmd_protnone()
logic (see below). Another nice finding on s390 :-)

I don't think this has any effect in practice (y

Re: [PATCH V2 0/3] mm/debug: Add more arch page table helper tests

2020-04-08 Thread Gerald Schaefer
On Wed, 8 Apr 2020 12:41:51 +0530
Anshuman Khandual  wrote:

[...]
> >   
> >>
> >> Some thing like this instead.
> >>
> >> pte_t pte = READ_ONCE(*ptep);
> >> pte = pte_mkhuge(__pte((pte_val(pte) | RANDOM_ORVALUE) & PMD_MASK));
> >>
> >> We cannot use mk_pte_phys() as it is defined only on some platforms
> >> without any generic fallback for others.  
> > 
> > Oh, didn't know that, sorry. What about using mk_pte() instead, at least
> > it would result in a present pte:
> > 
> > pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), prot));  
> 
> Lets use mk_pte() here but can we do this instead
> 
> paddr = (__pfn_to_phys(pfn) | RANDOM_ORVALUE) & PMD_MASK;
> pte = pte_mkhuge(mk_pte(phys_to_page(paddr), prot));
> 

Sure, that will also work.

BTW, this RANDOM_ORVALUE is not really very random, the way it is
defined. For s390 we already changed it to mask out some arch bits,
but I guess there are other archs and bits that would always be
set with this "not so random" value, and I wonder if/how that would
affect all the tests using this value, see also below.

> > 
> > And if you also want to do some with the existing value, which seems
> > to be an empty pte, then maybe just check if writing and reading that
> > value with set_huge_pte_at() / huge_ptep_get() returns the same,
> > i.e. initially w/o RANDOM_ORVALUE.
> > 
> > So, in combination, like this (BTW, why is the barrier() needed, it
> > is not used for the other set_huge_pte_at() calls later?):  
> 
> Ahh missed, will add them. Earlier we faced problem without it after
> set_pte_at() for a test on powerpc (64) platform. Hence just added it
> here to be extra careful.
> 
> > 
> > @@ -733,24 +733,28 @@ static void __init hugetlb_advanced_test
> > struct page *page = pfn_to_page(pfn);
> > pte_t pte = READ_ONCE(*ptep);
> >  
> > -   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> > +   set_huge_pte_at(mm, vaddr, ptep, pte);
> > +   WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
> > +
> > +   pte = pte_mkhuge(mk_pte(phys_to_page(RANDOM_ORVALUE & PMD_MASK), 
> > prot));
> > set_huge_pte_at(mm, vaddr, ptep, pte);
> > barrier();
> > WARN_ON(!pte_same(pte, huge_ptep_get(ptep)));
> > 
> > This would actually add a new test "write empty pte with
> > set_huge_pte_at(), then verify with huge_ptep_get()", which happens
> > to trigger a warning on s390 :-)  
> 
> On arm64 as well which checks for pte_present() in set_huge_pte_at().
> But PTE present check is not really present in each set_huge_pte_at()
> implementation especially without __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT.
> Hence wondering if we should add this new test here which will keep
> giving warnings on s390 and arm64 (at the least).

Hmm, interesting. I forgot about huge swap / migration, which is not
(and probably cannot be) supported on s390. The pte_present() check
on arm64 seems to check for such huge swap / migration entries,
according to the comment.

The new test "write empty pte with set_huge_pte_at(), then verify
with huge_ptep_get()" would then probably trigger the
WARN_ON(!pte_present(pte)) in arm64 code. So I guess "writing empty
ptes with set_huge_pte_at()" is not really a valid use case in practice,
or else you would have seen this warning before. In that case, it
might not be a good idea to add this test.

I also do wonder now, why the original test with
"pte = __pte(pte_val(pte) | RANDOM_ORVALUE);"
did not also trigger that warning on arm64. On s390 this test failed
exactly because the constructed pte was not present (initially empty,
or'ing RANDOM_ORVALUE does not make it present for s390). I guess this
just worked by chance on arm64, because the bits from RANDOM_ORVALUE
also happened to mark the pte present for arm64.

This brings us back to the question above, regarding the "randomness"
of RANDOM_ORVALUE. Not really sure what the intention behind that was,
but maybe it would make sense to restrict this RANDOM_ORVALUE to
non-arch-specific bits, i.e. only bits that would be part of the
address value within a page table entry? Or was it intentionally
chosen to also mess with other bits?

Regards,
Gerald



Re: [PATCH v3 0/4] Clean up hugetlb boot command line processing

2020-04-21 Thread Gerald Schaefer
On Fri, 17 Apr 2020 11:50:45 -0700
Mike Kravetz  wrote:

> v3 -
>Used weak attribute method of defining arch_hugetlb_valid_size.
>  This eliminates changes to arch specific hugetlb.h files (Peter)
>Updated documentation (Peter, Randy)
>Fixed handling of implicitly specified gigantic page preallocation
>  in existing code and removed documentation of such.  There is now
>  no difference between handling of gigantic and non-gigantic pages.
>  (Peter, Nitesh).
>  This requires the most review as there is a small change to
>  undocumented behavior.  See patch 4 commit message for details.
>Added Acks and Reviews (Mina, Peter)
> 
> v2 -
>Fix build errors with patch 1 (Will)
>Change arch_hugetlb_valid_size arg to unsigned long and remove
>  irrelevant 'extern' keyword (Christophe)
>Documentation and other misc changes (Randy, Christophe, Mina)
>Do not process command line options if !hugepages_supported()
>  (Dave, but it sounds like we may want to additional changes to
>   hugepages_supported() for x86?  If that is needed I would prefer
>   a separate patch.)
> 
> Longpeng(Mike) reported a weird message from hugetlb command line processing
> and proposed a solution [1].  While the proposed patch does address the
> specific issue, there are other related issues in command line processing.
> As hugetlbfs evolved, updates to command line processing have been made to
> meet immediate needs and not necessarily in a coordinated manner.  The result
> is that some processing is done in arch specific code, some is done in arch
> independent code and coordination is problematic.  Semantics can vary between
> architectures.
> 
> The patch series does the following:
> - Define arch specific arch_hugetlb_valid_size routine used to validate
>   passed huge page sizes.
> - Move hugepagesz= command line parsing out of arch specific code and into
>   an arch independent routine.
> - Clean up command line processing to follow desired semantics and
>   document those semantics.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpe...@huawei.com
> 
> Mike Kravetz (4):
>   hugetlbfs: add arch_hugetlb_valid_size
>   hugetlbfs: move hugepagesz= parsing to arch independent code
>   hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate
>   hugetlbfs: clean up command line processing
> 
>  .../admin-guide/kernel-parameters.txt |  40 ++--
>  Documentation/admin-guide/mm/hugetlbpage.rst  |  35 
>  arch/arm64/mm/hugetlbpage.c   |  30 +--
>  arch/powerpc/mm/hugetlbpage.c |  30 +--
>  arch/riscv/mm/hugetlbpage.c   |  24 +--
>  arch/s390/mm/hugetlbpage.c|  24 +--
>  arch/sparc/mm/init_64.c   |  43 +---
>  arch/x86/mm/hugetlbpage.c |  23 +--
>  include/linux/hugetlb.h   |   2 +-
>  mm/hugetlb.c      | 190 +++---
>  10 files changed, 271 insertions(+), 170 deletions(-)
> 

Looks good and works fine for s390, thanks for cleaning up!

Acked-by: Gerald Schaefer  # s390



Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-29 Thread Gerald Schaefer
On Tue, 28 Jan 2020 06:57:53 +0530
Anshuman Khandual  wrote:

> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> level along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> right after page_alloc_init_late().
> 
> This gets build and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_VM_DEBUG. Architectures willing to subscribe this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.
> 
> Folks interested in making sure that a given platform's page table helpers
> conform to expected generic MM semantics should enable the above config
> which will just trigger this test during boot. Any non conformity here will
> be reported as an warning which would need to be fixed. This test will help
> catch any changes to the agreed upon semantics expected from generic MM and
> enable platforms to accommodate it thereafter.
> 

[...]

> 
> Tested-by: Christophe Leroy  #PPC32

Tested-by: Gerald Schaefer  # s390

Thanks again for this effort, and for keeping up the spirit against
all odds and even after 12 iterations :-)

> 
> diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt 
> b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> new file mode 100644
> index ..f3f8111edbe3
> --- /dev/null
> +++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
> @@ -0,0 +1,35 @@
> +#
> +# Feature name:  debug-vm-pgtable
> +# Kconfig:   ARCH_HAS_DEBUG_VM_PGTABLE
> +# description:   arch supports pgtable tests for semantics compliance
> +#
> +---
> +| arch |status|
> +---
> +|   alpha: | TODO |
> +| arc: |  ok  |
> +| arm: | TODO |
> +|   arm64: |  ok  |
> +| c6x: | TODO |
> +|csky: | TODO |
> +|   h8300: | TODO |
> +| hexagon: | TODO |
> +|ia64: | TODO |
> +|m68k: | TODO |
> +|  microblaze: | TODO |
> +|mips: | TODO |
> +|   nds32: | TODO |
> +|   nios2: | TODO |
> +|openrisc: | TODO |
> +|  parisc: | TODO |
> +|  powerpc/32: |  ok  |
> +|  powerpc/64: | TODO |
> +|   riscv: | TODO |
> +|s390: | TODO |

s390 is ok now, with my patches included in v5.5-rc1. So you can now add

--- a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -25,7 +25,7 @@
 |  powerpc/32: |  ok  |
 |  powerpc/64: | TODO |
 |   riscv: | TODO |
-|s390: | TODO |
+|s390: |  ok  |
 |  sh: | TODO |
 |   sparc: | TODO |
 |  um: | TODO |
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -59,6 +59,7 @@ config KASAN_SHADOW_OFFSET
 config S390
def_bool y
select ARCH_BINFMT_ELF_STATE
+   select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE



Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-29 Thread Gerald Schaefer
On Mon, 27 Jan 2020 22:33:08 -0500
Qian Cai  wrote:

> > 
> >> Did those tests ever find any regression or this is almost only useful for 
> >> new
> > 
> > The test has already found problems with s390 page table helpers.
> 
> Hmm, that is pretty weak where s390 is not even official supported with this 
> version.
> 

I first had to get the three patches upstream, each fixing non-conform
behavior on s390, and each issue was found by this extremely useful test:

2416cefc504b s390/mm: add mm_pxd_folded() checks to pxd_free()
2d1fc1eb9b54 s390/mm: simplify page table helpers for large entries
1c27a4bc817b s390/mm: make pmd/pud_bad() report large entries as bad

I did not see any direct effect of this misbehavior yet, but I am
very happy that this could be found and fixed in order to prevent
future issues. And this is exactly the value of this test, to make
sure that all architectures have a common understanding of how
the various page table helpers are supposed to work.

For example, who would have thought that pXd_bad() is supposed to
report large entries as bad? It's not really documented anywhere,
so we just checked them for sanity like normal entries, which
apparently worked fine so far, but for how long?



Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-02-12 Thread Gerald Schaefer
On Wed, 12 Feb 2020 15:12:54 +0530
Anshuman Khandual  wrote:

> >> +/*
> >> + * On s390 platform, the lower 12 bits are used to identify given page 
> >> table
> >> + * entry type and for other arch specific requirements. But these bits 
> >> might
> >> + * affect the ability to clear entries with pxx_clear(). So while loading 
> >> up
> >> + * the entries skip all lower 12 bits in order to accommodate s390 
> >> platform.
> >> + * It does not have affect any other platform.
> >> + */
> >> +#define RANDOM_ORVALUE(0xf000UL)  
> > 
> > I'd suggest you generate this mask with something like
> > GENMASK(BITS_PER_LONG, PAGE_SHIFT).  
> 
> IIRC the lower 12 bits constrains on s390 platform might not be really related
> to it's PAGE_SHIFT which can be a variable, but instead just a constant 
> number.
> But can definitely use GENMASK or it's variants here.
> 
> https://lkml.org/lkml/2019/9/5/862

PAGE_SHIFT would be fine, it is 12 on s390. However, in order to be
more precise, we do not really need all 12 bits, only the last 4 bits.
So, something like this would work:

#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, 4)

The text in the comment could then also be changed from 12 to 4, and
be a bit more specific on the fact that the impact on pxx_clear()
results from the dynamic page table folding logic on s390:

/*
 * On s390 platform, the lower 4 bits are used to identify given page table
 * entry type. But these bits might affect the ability to clear entries with
 * pxx_clear() because of how dynamic page table folding works on s390. So
 * while loading up the entries do not change the lower 4 bits.
 * It does not have affect any other platform.
 */



Re: [RFC PATCH v5 16/19] memory-hotplug: free memmap of sparse-vmemmap

2012-07-31 Thread Gerald Schaefer
On Fri, 27 Jul 2012 18:34:38 +0800
Wen Congyang  wrote:

> From: Yasuaki Ishimatsu 
> 
> All pages of virtual mapping in removed memory cannot be freed, since
> some pages used as PGD/PUD includes not only removed memory but also
> other memory. So the patch checks whether page can be freed or not.
> 
> How to check whether page can be freed or not?
>  1. When removing memory, the page structs of the revmoved memory are
> filled with 0FD.
>  2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be
> cleared. In this case, the page used as PT/PMD can be freed.
> 
> Applying patch, __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is
> integrated into one. So __remove_section() of
> CONFIG_SPARSEMEM_VMEMMAP is deleted.

There should also be generic or dummy versions of the functions
vmemmap_free_bootmem(), vmemmap_kfree() and
register_page_bootmem_memmap(). It doesn't compile on other
archtitectures than x86 as it is now:

mm/built-in.o: In function `sparse_remove_one_section':
(.text+0x49fa6): undefined reference to `vmemmap_free_bootmem'
mm/built-in.o: In function `sparse_remove_one_section':
(.text+0x49fcc): undefined reference to `vmemmap_kfree'
mm/built-in.o: In function `register_page_bootmem_info_node':
(.text+0x57c06): undefined reference to `register_page_bootmem_memmap'
mm/built-in.o: In function `sparse_add_one_section':
(.meminit.text+0x2506): undefined reference to `vmemmap_kfree'
mm/built-in.o: In function `sparse_add_one_section':
(.meminit.text+0x2528): undefined reference to `vmemmap_kfree'
make: *** [vmlinux] Error 1

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC PATCH v5 12/19] memory-hotplug: introduce new function arch_remove_memory()

2012-07-31 Thread Gerald Schaefer
On Mon, 30 Jul 2012 18:35:37 +0800
Wen Congyang  wrote:

> At 07/30/2012 06:23 PM, Heiko Carstens Wrote:
> > On Fri, Jul 27, 2012 at 06:32:15PM +0800, Wen Congyang wrote:
> >> We don't call __add_pages() directly in the function add_memory()
> >> because some other architecture related things need to be done
> >> before or after calling __add_pages(). So we should introduce
> >> a new function arch_remove_memory() to revert the things
> >> done in arch_add_memory().
> >>
> >> Note: the function for s390 is not implemented(I don't know how to
> >> implement it for s390).
> > 
> > There is no hardware or firmware interface which could trigger a
> > hot memory remove on s390. So there is nothing that needs to be
> > implemented.
> 
> Thanks for providing this information.
> 
> According to this, arch_remove_memory() for s390 can just return
> -EBUSY.

Yes, but there is a prototype mismatch for arch_remove_memory() on s390
and also other architectures (u64 vs. unsigned long).

arch/s390/mm/init.c:262: error: conflicting types for
‘arch_remove_memory’ include/linux/memory_hotplug.h:88: error: previous
declaration of ‘arch_remove_memory’ was here

In memory_hotplug.h you have:
extern int arch_remove_memory(unsigned long start, unsigned long size);

On all archs other than x86 you have:
int arch_remove_memory(u64 start, u64 size)

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/2][v2] powerpc: Make the CMM memory hotplug aware

2009-10-08 Thread Gerald Schaefer

Hi,

I am currently working on the s390 port for the cmm + hotplug
patch, and I'm a little confused about the memory allocation
policy, see below. Is it correct that the balloon cannot grow
into ZONE_MOVABLE, while the pages for the balloon page list
can?

Robert Jennings wrote:

@@ -110,6 +125,9 @@ static long cmm_alloc_pages(long nr)
cmm_dbg("Begin request for %ld pages\n", nr);

while (nr) {
+   if (atomic_read(&hotplug_active))
+   break;
+
addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
   __GFP_NORETRY | __GFP_NOMEMALLOC);
if (!addr)
@@ -119,8 +137,10 @@ static long cmm_alloc_pages(long nr)
if (!pa || pa->index >= CMM_NR_PAGES) {
/* Need a new page for the page list. */
spin_unlock(&cmm_lock);
-   npa = (struct cmm_page_array *)__get_free_page(GFP_NOIO 
| __GFP_NOWARN |
-  
__GFP_NORETRY | __GFP_NOMEMALLOC);
+   npa = (struct cmm_page_array *)__get_free_page(
+   GFP_NOIO | __GFP_NOWARN |
+   __GFP_NORETRY | __GFP_NOMEMALLOC |
+   __GFP_MOVABLE);
if (!npa) {
pr_info("%s: Can not allocate new page list\n", 
__func__);
free_page(addr);


Why is the __GFP_MOVABLE added here, for the page list alloc, and not
above for the balloon page alloc?

--
Regards,
Gerald

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 2/2][v2] powerpc: Make the CMM memory hotplug aware

2009-10-15 Thread Gerald Schaefer

Robert Jennings wrote:

@@ -110,6 +125,9 @@ static long cmm_alloc_pages(long nr)
cmm_dbg("Begin request for %ld pages\n", nr);

while (nr) {
+   if (atomic_read(&hotplug_active))
+   break;
+
addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
   __GFP_NORETRY | __GFP_NOMEMALLOC);
if (!addr)
@@ -119,8 +137,10 @@ static long cmm_alloc_pages(long nr)
if (!pa || pa->index >= CMM_NR_PAGES) {
/* Need a new page for the page list. */
spin_unlock(&cmm_lock);
-   npa = (struct cmm_page_array *)__get_free_page(GFP_NOIO 
| __GFP_NOWARN |
-  
__GFP_NORETRY | __GFP_NOMEMALLOC);
+   npa = (struct cmm_page_array *)__get_free_page(
+   GFP_NOIO | __GFP_NOWARN |
+   __GFP_NORETRY | __GFP_NOMEMALLOC |
+   __GFP_MOVABLE);
if (!npa) {
pr_info("%s: Can not allocate new page list\n", 
__func__);
free_page(addr);

Why is the __GFP_MOVABLE added here, for the page list alloc, and not
above for the balloon page alloc?


The pages allocated as __GFP_MOVABLE are used to store the list of pages
allocated by the balloon.  They reference virtual addresses and it would
be fine for the kernel to migrate the physical pages for those, the
balloon would not notice this.


Does page migration really work for kernel pages that were allocated
with __get_free_page()? I was wondering if we can do this on s390, where
we have a 1:1 mapping of kernel virtual to physical addresses, but
looking at migrate_pages() and friends, it seems that kernel pages
w/o mapping and rmap should not be migrateable at all. Any thoughts from
the memory migration experts?

BTW, since we have real memory hotplug support on s390, allowing us
to add and remove memory chunks to/from ZONE_MOVABLE, this basically
makes cmm ballooning in ZONE_NORMAL obsolete, so we decided not to
support memory hotplug aware cmm on s390.

Regards,
Gerald

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-04 Thread Gerald Schaefer
On Tue,  3 Sep 2019 13:31:46 +0530
Anshuman Khandual  wrote:

> This adds a test module which will validate architecture page table helpers
> and accessors regarding compliance with generic MM semantics expectations.
> This will help various architectures in validating changes to the existing
> page table helpers or addition of new ones.
> 
> Test page table and memory pages creating it's entries at various level are
> all allocated from system memory with required alignments. If memory pages
> with required size and alignment could not be allocated, then all depending
> individual tests are skipped.

This looks very useful, thanks. Of course, s390 is quite special and does
not work nicely with this patch (yet), mostly because of our dynamic page
table levels/folding. Still need to figure out what can be fixed in the arch
code and what would need to be changed in the test module. See below for some
generic comments/questions.

At least one real bug in the s390 code was already revealed by this, which
is very nice. In pmd/pud_bad(), we also check large pmds/puds for sanity,
instead of reporting them as bad, which is apparently not how it is expected.

[...]
> +/*
> + * Basic operations
> + *
> + * mkold(entry)  = An old and not a young entry
> + * mkyoung(entry)= A young and not an old entry
> + * mkdirty(entry)= A dirty and not a clean entry
> + * mkclean(entry)= A clean and not a dirty entry
> + * mkwrite(entry)= A write and not a write protected entry
> + * wrprotect(entry)  = A write protected and not a write entry
> + * pxx_bad(entry)= A mapped and non-table entry
> + * pxx_same(entry1, entry2)  = Both entries hold the exact same value
> + */
> +#define VADDR_TEST   (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)

Why is P4D_SIZE missing in the VADDR_TEST calculation?

[...]
> +
> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> +static void pud_clear_tests(pud_t *pudp)
> +{
> + memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> + pud_clear(pudp);
> + WARN_ON(!pud_none(READ_ONCE(*pudp)));
> +}

For pgd/p4d/pud_clear(), we only clear if the page table level is present
and not folded. The memset() here overwrites the table type bits, so
pud_clear() will not clear anything on s390 and the pud_none() check will
fail.
Would it be possible to OR a (larger) random value into the table, so that
the lower 12 bits would be preserved?

> +
> +static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t 
> *pmdp)
> +{
> + /*
> +  * This entry points to next level page table page.
> +  * Hence this must not qualify as pud_bad().
> +  */
> + pmd_clear(pmdp);
> + pud_clear(pudp);
> + pud_populate(mm, pudp, pmdp);
> + WARN_ON(pud_bad(READ_ONCE(*pudp)));
> +}

This will populate the pud with a pmd pointer that does not point to the
beginning of the pmd table, but to the second entry (because of how
VADDR_TEST is constructed). This will result in failing pud_bad() check
on s390. Not sure why/how it works on other archs, but would it be possible
to align pmdp down to the beginning of the pmd table (and similar for the
other pxd_populate_tests)?

[...]
> +
> + p4d_free(mm, saved_p4dp);
> + pud_free(mm, saved_pudp);
> + pmd_free(mm, saved_pmdp);
> + pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));

pgtable_t is arch-specific, and on s390 it is not a struct page pointer,
but a pte pointer. So this will go wrong, also on all other archs (if any)
where pgtable_t is not struct page.
Would it be possible to use pte_free_kernel() instead, and just pass
saved_ptep directly?

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Gerald Schaefer
On Thu, 5 Sep 2019 14:48:14 +0530
Anshuman Khandual  wrote:

> > [...]  
> >> +
> >> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >> +static void pud_clear_tests(pud_t *pudp)
> >> +{
> >> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >> +  pud_clear(pudp);
> >> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >> +}  
> > 
> > For pgd/p4d/pud_clear(), we only clear if the page table level is present
> > and not folded. The memset() here overwrites the table type bits, so
> > pud_clear() will not clear anything on s390 and the pud_none() check will
> > fail.
> > Would it be possible to OR a (larger) random value into the table, so that
> > the lower 12 bits would be preserved?  
> 
> So the suggestion is instead of doing memset() on entry with RANDOM_NZVALUE,
> it should OR a large random value preserving lower 12 bits. Hmm, this should
> still do the trick for other platforms, they just need non zero value. So on
> s390, the lower 12 bits on the page table entry already has valid value while
> entering this function which would make sure that pud_clear() really does
> clear the entry ?

Yes, in theory the table entry on s390 would have the type set in the last
4 bits, so preserving those would be enough. If it does not conflict with
others, I would still suggest preserving all 12 bits since those would contain
arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
would also work with the memset, but for consistency I think the same logic
should be used in all pxd_clear_tests.

However, there is another issue on s390 which will make this only work
for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
mm_alloc() will only give you a 3-level page table initially on s390.
This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
both see the pud level (of course this also affects other tests).

Not sure yet how to fix this, i.e. how to initialize/update the page table
to 5 levels. We can handle 5 level page tables, and it would be good if
all levels could be tested, but using mm_alloc() to establish the page
tables might not work on s390. One option could be to provide an arch-hook
or weak function to allocate/initialize the mm.

IIUC, the (dummy) mm is really only needed to provide an mm->pgd as starting
point, right?

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-06 Thread Gerald Schaefer
On Fri, 6 Sep 2019 11:58:59 +0530
Anshuman Khandual  wrote:

> On 09/05/2019 10:36 PM, Gerald Schaefer wrote:
> > On Thu, 5 Sep 2019 14:48:14 +0530
> > Anshuman Khandual  wrote:
> >   
> >>> [...]
> >>>> +
> >>>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >>>> +static void pud_clear_tests(pud_t *pudp)
> >>>> +{
> >>>> +memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >>>> +pud_clear(pudp);
> >>>> +WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >>>> +}
> >>>
> >>> For pgd/p4d/pud_clear(), we only clear if the page table level is present
> >>> and not folded. The memset() here overwrites the table type bits, so
> >>> pud_clear() will not clear anything on s390 and the pud_none() check will
> >>> fail.
> >>> Would it be possible to OR a (larger) random value into the table, so that
> >>> the lower 12 bits would be preserved?
> >>
> >> So the suggestion is instead of doing memset() on entry with 
> >> RANDOM_NZVALUE,
> >> it should OR a large random value preserving lower 12 bits. Hmm, this 
> >> should
> >> still do the trick for other platforms, they just need non zero value. So 
> >> on
> >> s390, the lower 12 bits on the page table entry already has valid value 
> >> while
> >> entering this function which would make sure that pud_clear() really does
> >> clear the entry ?  
> > 
> > Yes, in theory the table entry on s390 would have the type set in the last
> > 4 bits, so preserving those would be enough. If it does not conflict with
> > others, I would still suggest preserving all 12 bits since those would 
> > contain
> > arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
> > would also work with the memset, but for consistency I think the same logic
> > should be used in all pxd_clear_tests.  
> 
> Makes sense but..
> 
> There is a small challenge with this. Modifying individual bits on a given
> page table entry from generic code like this test case is bit tricky. That
> is because there are not enough helpers to create entries with an absolute
> value. This would have been easier if all the platforms provided functions
> like __pxx() which is not the case now. Otherwise something like this should
> have worked.
> 
> 
> pud_t pud = READ_ONCE(*pudp);
> pud = __pud(pud_val(pud) | RANDOM_VALUE (keeping lower 12 bits 0))
> WRITE_ONCE(*pudp, pud);
> 
> But __pud() will fail to build in many platforms.

Hmm, I simply used this on my system to make pud_clear_tests() work, not
sure if it works on all archs:

pud_val(*pudp) |= RANDOM_NZVALUE;

> 
> The other alternative will be to make sure memset() happens on all other
> bits except the lower 12 bits which will depend on endianness. If s390
> has a fixed endianness, we can still use either of them which will hold
> good for others as well.
> 
> memset(pudp, RANDOM_NZVALUE, sizeof(pud_t) - 3);
> 
> OR
> 
> memset(pudp + 3, RANDOM_NZVALUE, sizeof(pud_t) - 3);
> 
> > 
> > However, there is another issue on s390 which will make this only work
> > for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
> > mm_alloc() will only give you a 3-level page table initially on s390.
> > This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
> > both see the pud level (of course this also affects other tests).  
> 
> Got it.
> 
> > 
> > Not sure yet how to fix this, i.e. how to initialize/update the page table
> > to 5 levels. We can handle 5 level page tables, and it would be good if
> > all levels could be tested, but using mm_alloc() to establish the page
> > tables might not work on s390. One option could be to provide an arch-hook
> > or weak function to allocate/initialize the mm.  
> 
> Sure, got it. Though I plan to do add some arch specific tests or init 
> sequence
> like the above later on but for now the idea is to get the smallest possible 
> set
> of test cases which builds and runs on all platforms without requiring any 
> arch
> specific hooks or special casing (#ifdef) to be agreed upon broadly and 
> accepted.
> 
> Do you think this is absolutely necessary on s390 for the very first set of 
> test
> cases or we can add this later on as an improvement ?

It can be added later, no problem. I did not expect this to work flawlessly
on s390 right from the start anyway, with all our peculiarities, so don't
let this hinder you. I might come up with an add-on patch later.

Actually, using get_unmapped_area() as suggested by Kirill could also
solve this issue. We do create a new mm with 3-level page tables on s390,
and the dynamic upgrade to 4 or 5 levels is then triggered exactly by
arch_get_unmapped_area(), depending on the addr. But I currently don't
see how / where arch_get_unmapped_area() is set up for such a dummy mm
created by mm_alloc().

Regards,
Gerald



Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-09 Thread Gerald Schaefer
On Mon, 9 Sep 2019 11:56:50 +0530
Anshuman Khandual  wrote:

[..]
> > 
> > Hmm, I simply used this on my system to make pud_clear_tests() work, not
> > sure if it works on all archs:
> > 
> > pud_val(*pudp) |= RANDOM_NZVALUE;  
> 
> Which compiles on arm64 but then fails on x86 because of the way pmd_val()
> has been defined there. on arm64 and s390 (with many others) pmd_val() is
> a macro which still got the variable that can be used as lvalue but that is
> not true for some other platforms like x86.
> 
> arch/arm64/include/asm/pgtable-types.h:   #define pmd_val(x)  
> ((x).pmd)
> arch/s390/include/asm/page.h: #define pmd_val(x)  ((x).pmd)
> arch/x86/include/asm/pgtable.h:   #define pmd_val(x)   
> native_pmd_val(x)
> 
> static inline pmdval_t native_pmd_val(pmd_t pmd)
> {
> return pmd.pmd;
> }
> 
> Unless I am mistaken, the return value from this function can not be used as
> lvalue for future assignments.
> 
> mm/arch_pgtable_test.c: In function ‘pud_clear_tests’:
> mm/arch_pgtable_test.c:156:17: error: lvalue required as left operand of 
> assignment
>   pud_val(*pudp) |= RANDOM_ORVALUE;
>  ^~
> AFAICS pxx_val() were never intended to be used as lvalue and using it that 
> way
> might just happen to work on all those platforms which define them as macros.
> They meant to just provide values for an entry as being determined by the 
> platform.
> 
> In principle pxx_val() on an entry was not supposed to be modified directly 
> from
> generic code without going through (again) platform helpers for any specific 
> state
> change (write, old, dirty, special, huge etc). The current use case is a 
> deviation
> for that.
> 
> I originally went with memset() just to load up the entries with non-zero 
> value so
> that we know pxx_clear() are really doing the clearing. The same is being 
> followed
> for all pxx_same() checks.
> 
> Another way for fixing the problem would be to mark them with known attributes
> like write/young/huge etc instead which for sure will create non-zero entries.
> We can do that for pxx_clear() and pxx_same() tests and drop RANDOM_NZVALUE
> completely. Does that sound good ?

Umm, not really. Those mkwrite/young/huge etc. helpers do only exist for
page table levels where we can also have large mappings, at least on s390.
Also, we do (on s390) again check for certain sanity before actually setting
the bits.
Good news is that at least for the pxx_same() checks the memset() is no
problem, because pxx_same() does not do any checks other than the same check.

For the pxx_clear_tests(), maybe it could be an option to put them behind the
pxx_populate_tests(), and rely on them having properly populated (non-clear)
values after that?

[...]
> > 
> > Actually, using get_unmapped_area() as suggested by Kirill could also
> > solve this issue. We do create a new mm with 3-level page tables on s390,
> > and the dynamic upgrade to 4 or 5 levels is then triggered exactly by
> > arch_get_unmapped_area(), depending on the addr. But I currently don't
> > see how / where arch_get_unmapped_area() is set up for such a dummy mm
> > created by mm_alloc().  
> 
> Normally they are set during program loading but we can set it up explicitly
> for the test mm_struct if we need to but there are some other challenges.
> 
> load_[aout|elf|flat|..]_binary()
>   setup_new_exec()
>   arch_pick_mmap_layout().
> 
> I did some initial experiments around get_unmapped_area(). Seems bit tricky
> to get it working on a pure 'test' mm_struct. It expects a real user context
> in the form of current->mm.

Yes, that's where I stopped because it looked rather complicated :-)
Not sure why Kirill suggested it initially, but if using get_unmapped_area()
would only be necessary to get properly initialized page table levels
on s390, you could also defer this to a later add-on patch.

Regards,
Gerald



Re: [PATCH V2 2/2] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-18 Thread Gerald Schaefer
On Wed, 18 Sep 2019 18:26:03 +0200
Christophe Leroy  wrote:

[..] 
> My suggestion was not to completely drop the #ifdef but to do like you 
> did in pgd_clear_tests() for instance, ie to add the following test on 
> top of the function:
> 
>   if (mm_pud_folded(mm) || is_defined(__ARCH_HAS_5LEVEL_HACK))
>   return;
> 

Ah, very nice, this would also fix the remaining issues for s390. Since
we have dynamic page table folding, neither __PAGETABLE_PXX_FOLDED nor
__ARCH_HAS_XLEVEL_HACK is defined, but mm_pxx_folded() will work.

mm_alloc() returns with a 3-level page table by default on s390, so we
will run into issues in p4d_clear/populate_tests(), and also at the end
with p4d/pud_free() (double free).

So, adding the mm_pud_folded() check to p4d_clear/populate_tests(),
and also adding mm_p4d/pud_folded() checks at the end before calling
p4d/pud_free(), would make it all work on s390.

BTW, regarding p4d/pud_free(), I'm not sure if we should rather check
the folding inside our s390 functions, similar to how we do it for
p4d/pud_free_tlb(), instead of relying on not being called for folded
p4d/pud. So far, I see no problem with this behavior, all callers of
p4d/pud_free() should be fine because of our folding check within
p4d/pud_present/none(). But that doesn't mean that it is correct not
to check for the folding inside p4d/pud_free(). At least, with this
test module we do now have a caller of p4d/pud_free() on potentially
folded entries, so instead of adding pxx_folded() checks to this
test module, we could add them to our p4d/pud_free() functions.
Any thoughts on this?

Regards,
Gerald



Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-28 Thread Gerald Schaefer
he RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Gerald Schaefer 
---
 arch/s390/include/asm/pgalloc.h |4 +
 arch/s390/mm/pgalloc.c  |  136 +---
 2 files changed, 132 insertions(+), 8 deletions(-)

--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct m
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -185,11 +185,13 @@ void page_table_free_pgste(struct page *
  * The upper byte (bits 24-31) of the parent page _refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
  *
- *   PP  AA
+ *   PPHHAA
  * 01234567upper byte (bits 24-31) of struct page::_refcount
- *   ||  ||
- *   ||  |+--- upper 2KB-pgtable is allocated
- *   ||  + lower 2KB-pgtable is allocated
+ *   ||
+ *   |+--- upper 2KB-pgtable is allocated
+ *   + lower 2KB-pgtable is allocated
+ *   |||+- upper 2KB-pgtable is pending free by page->rcu_head
+ *   ||+-- lower 2KB-pgtable is pending free by page->rcu_head
  *   |+--- upper 2KB-pgtable is pending for removal
  *   + lower 2KB-pgtable is pending for removal
  *
@@ -229,6 +231,17 @@ void page_table_free_pgste(struct page *
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * The HH bits are used to prevent double use of page->rcu_head in
+ * pte_free_defer(), when both 2K pagetables inside a page happen to get
+ * freed by that path at the same time.
+ *
+ * pte_free_defer() also cannot add 2K fragments back to the list, because
+ * page->rcu_head overlays with page->lru. This introduces some asymmetry
+ * compared to the other pagetable freeing paths, and the missing list_add()
+ * in pte_free_defer() could result in incorrect list_del(). Therefore, track
+ * the the list status of a page with page->pt_frag_refcount, and check that
+ * before doing list_del() in any freeing path.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -262,6 +275,7 @@ unsigned long *page_table_alloc(struct m
atomic_xor_bits(&page->_refcount,
0x01U << (bit + 24));
list_del(&page->lru);
+   atomic_set(&page->pt_frag_refcount, 0);
}
}
spin_unlock_bh(&mm->context.lock);
@@ -290,6 +304,7 @@ unsigned long *page_table_alloc(struct m
memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
spin_lock_bh(&mm->context.lock);
list_add(&page->lru, &mm->context.pgtable_list);
+   atomic_set(&page->pt_frag_refcount, 1);
spin_unlock_bh(&mm->context.lock);
}
return table;
@@ -325,13 +340,24 @@ void page_table_free(struct mm_struct *m
 */
mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
mask >>= 24;
-   if (mask & 0x03U)
+   if (mask & 0x03U) {
+   /*
+* Other half is allocated, add to list
+*/
list_add(&page->lru, &mm->context.pgtable_list);
-   else
+   atomic_set(&page->pt_frag_refcount, 1);
+   } else if (atomic_read(&page->pt_frag_refcount)) {
+   /*
+* Other half is not allocated, and page is on the list,
+* remove from list
+*/
list_del(&page->lru);
+   atomic_set(&page->pt_frag_refcount, 0);
+   }
spin_unlock_bh(&mm

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-29 Thread Gerald Schaefer
On Thu, 29 Jun 2023 15:59:07 +0200
Alexander Gordeev  wrote:

> On Wed, Jun 28, 2023 at 09:16:24PM +0200, Gerald Schaefer wrote:
> > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > Hugh Dickins  wrote:  
> 
> Hi Gerald, Hugh!
> 
> ...
> > @@ -407,6 +445,88 @@ void __tlb_remove_table(void *_table)
> > __free_page(page);
> >  }
> >  
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static void pte_free_now0(struct rcu_head *head);
> > +static void pte_free_now1(struct rcu_head *head);  
> 
> What about pte_free_lower() / pte_free_upper()?

I actually like the 0/1 better, I always get confused what exactly we
mean with "lower / upper" in our code and comments. Is it the first
or second half? With 0/1 it is immediately clear to me.

> 
> ...
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +   unsigned int bit, mask;
> > +   struct page *page;
> > +
> > +   page = virt_to_page(pgtable);
> > +   if (mm_alloc_pgste(mm)) {
> > +   /*
> > +* TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> > +* page_table_free_rcu()?
> > +* If yes -> need addr parameter here, like in pte_free_tlb().
> > +*/
> > +   call_rcu(&page->rcu_head, pte_free_pgste);
> > +   return;
> > +}
> > +   bit = ((unsigned long)pgtable & ~PAGE_MASK) / (PTRS_PER_PTE * 
> > sizeof(pte_t));
> > +
> > +   spin_lock_bh(&mm->context.lock);
> > +   mask = atomic_xor_bits(&page->_refcount, 0x15U << (bit + 24));  
> 
> This  makes the bit logic increasingly complicated to me.

I think it is well in line with existing code in page_table_free[_rcu].
Only instead of doing xor with 0x11U, it does xor with 0x15U to also
switch on the H bit while at it.

> 
> What if instead we set the rule "one bit at a time only"?
> That means an atomic group bit flip is only allowed between
> pairs of bits, namely:
> 
> bit flip  initiated from
> ---   
> P  <- A   page_table_free(), page_table_free_rcu()
>  H <- A   pte_free_defer()
> P <- Hpte_free_half()
> 
> In the current model P bit could be on together with H
> bit simultaneously. That actually brings in equation
> nothing.

P bit has to be set at the latest when __tlb_remove_table() gets
called, because there it is checked / cleared. It might be possible
to not set it in pte_free_defer() already, but only later in
pte_free_half() RCU callback, before calling __tlb_remove_table().
But that would not be in line any more with existing code, where it
is already set before scheduling the RCU callback.

Therefore, I would rather stick to the current approach, unless
you see some bug in it.

> 
> Besides, this check in page_table_alloc() (while still
> correct) makes one (well, me) wonder "what about HH bits?":
> 
>   mask = (mask | (mask >> 4)) & 0x03U;
>   if (mask != 0x03U) {
>   ...
>   }

Without adding fragments back to the list, it is not necessary
to check any H bits page_table_alloc(), or so I hope. Actually,
I like that aspect most, i.e. we have as little impact on current
code as possible.

And H bits are only relevant for preventing double use of rcu_head,
which is what they were designed for, and only the new code has
to care about them.


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-29 Thread Gerald Schaefer
On Thu, 29 Jun 2023 12:22:24 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > On Wed, 28 Jun 2023, Gerald Schaefer wrote:  
> > > 
> > > As discussed in the other thread, we would rather go with less complexity,
> > > possibly switching to an approach w/o the list and fragment re-use in the
> > > future. For now, as a first step in that direction, we can try with not
> > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > version of your patch, copying most of your pte_free_defer() logic and
> > > also description, tested with LTP and all three of your patch series 
> > > applied:  
> > 
> > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > patch (posted with fewer Cc's to the s390 list last week), and switching
> > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > patch.
> > 
> > But I didn't get deep enough into it today to confirm it - and disappointed
> > that you've found it necessary to play with pt_frag_refcount in addition to
> > _refcount and HH bits.  No real problem with that, but my instinct says it
> > should be simpler.  

Yes, I also found it a bit awkward, but it seemed "good and simple enough",
to have something to go forward with, while my instinct was in line with yours.

> 
> Is there any reason it should be any different at all from what PPC is
> doing?
> 
> I still think the right thing to do here is make the PPC code common
> (with Hugh's proposed RCU modification) and just use it in both
> arches

With the current approach, we would not add back fragments _only_ for
the new pte_free_defer() path, while keeping our cleverness for the other
paths. Not having a good overview of the negative impact wrt potential
memory waste, I would rather take small steps, if possible.

If we later switch to never adding back fragments, of course we should
try to be in line with PPC implementation.


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-03 Thread Gerald Schaefer
On Thu, 29 Jun 2023 23:00:07 -0700 (PDT)
Hugh Dickins  wrote:

> On Thu, 29 Jun 2023, Gerald Schaefer wrote:
> > On Thu, 29 Jun 2023 12:22:24 -0300
> > Jason Gunthorpe  wrote:  
> > > On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:  
> > > > On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> > > > > 
> > > > > As discussed in the other thread, we would rather go with less 
> > > > > complexity,
> > > > > possibly switching to an approach w/o the list and fragment re-use in 
> > > > > the
> > > > > future. For now, as a first step in that direction, we can try with 
> > > > > not
> > > > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > > > version of your patch, copying most of your pte_free_defer() logic and
> > > > > also description, tested with LTP and all three of your patch series 
> > > > > applied:
> > > > 
> > > > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > > > patch (posted with fewer Cc's to the s390 list last week), and switching
> > > > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > > > patch.
> > > > 
> > > > But I didn't get deep enough into it today to confirm it - and 
> > > > disappointed
> > > > that you've found it necessary to play with pt_frag_refcount in 
> > > > addition to
> > > > _refcount and HH bits.  No real problem with that, but my instinct says 
> > > > it
> > > > should be simpler.
> > 
> > Yes, I also found it a bit awkward, but it seemed "good and simple enough",
> > to have something to go forward with, while my instinct was in line with 
> > yours.
> >   
> > > 
> > > Is there any reason it should be any different at all from what PPC is
> > > doing?
> > > 
> > > I still think the right thing to do here is make the PPC code common
> > > (with Hugh's proposed RCU modification) and just use it in both
> > > arches  
> > 
> > With the current approach, we would not add back fragments _only_ for
> > the new pte_free_defer() path, while keeping our cleverness for the other
> > paths. Not having a good overview of the negative impact wrt potential
> > memory waste, I would rather take small steps, if possible.
> > 
> > If we later switch to never adding back fragments, of course we should
> > try to be in line with PPC implementation.  
> 
> I find myself half-agreeing with everyone.
> 
> I agree with Gerald that s390 should keep close to what it is already
> doing (except for adding pte_free_defer()): that changing its strategy
> and implementation to be much more like powerpc, is a job for some other
> occasion (and would depend on gathering data about how well each does).
> 
> But I agree with Jason that the powerpc solution we ended up with cut
> out a lot of unnecessary complication: it shifts the RCU delay from
> when pte_free_defer() is called, to when the shared page comes to be
> freed; which may be a lot later, and might not be welcome in a common
> path, but is quite okay for the uncommon pte_free_defer().

Ok, I guess I must admit that I completely ignored the latest progress in
the powerpc thread, and therefore was not up-to-date. Still had the older
approach in mind, where you also checked for pt_frag_refcount to avoid
double call_rcu().

The new approach sounds very reasonable, and I also like your latest
s390 patch from a first glance. Need to get more up-to-date with PageActive
and maybe also powerpc approach, and give this some proper review tomorrow.

> 
> And I agree with Alexander that pte_free_lower() and pte_free_upper()
> are better names than pte_free_now0() and pte_free_now1(): I was going
> to make that change, except all those functions disappear if we follow
> Jason's advice and switch the call_rcu() to when freeing the page.
> 
> (Lower and upper seem unambiguous to me: Gerald, does your confusion
> come just from the way they are shown the wrong way round in the PP AA
> diagram?  I corrected that in my patch, but you reverted it in yours.)

Ah yes, that could well be, and unfortunately I did not notice that you
fixed that in the comment. I only saw that you "fixed" the bit numbering
from 01234567 to 76543210, which I think is wrong on big-endian s390,
and therefore I simply removed that complete hunk.

But thanks a lot for pointing to that! We will certainly want to fix that
comment in a later patch, to reduce some or maybe all of the (at least
my) upper/lower confusion.


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Gerald Schaefer
On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
Hugh Dickins  wrote:

> On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > 
> > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > extent that I've not even tried to verify it; but I think I do get the
> > point now, that we need further info than just PPHHAA to know whether
> > the page is on the list or not.  But I think that if we move where the
> > call_rcu() is done, then the page can stay on or off the list by same
> > rules as before (but need to check HH bits along with PP when deciding
> > whether to allocate, and whether to list_add_tail() when freeing).  
> 
> No, not quite the same rules as before: I came to realize that using
> list_add_tail() for the HH pages would be liable to put a page on the
> list which forever blocked reuse of PP list_add_tail() pages after it
> (could be solved by a list_move() somewhere, but we have agreed to
> prefer simplicity).
> 
> I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> I've dropped most of the pte_free_*() helpers, and list_del_init() is
> an easier way of dealing with those "is it on the list" questions.
> I expect that we shall be close to reaching agreement on...

This looks really nice, almost too good and easy to be true. I did not
find any obvious flaw, just some comments below. It also survived LTP
without any visible havoc, so I guess this approach is the best so far.

> 
> [PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page
> 
> Add s390-specific pte_free_defer(), to free table page via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because s390 fits two 2K
> page tables into one 4K page (so page->rcu_head must be shared between
> both halves), and already uses page->lru (which page->rcu_head overlays)
> to list any free halves; with clever management by page->_refcount bits.
> 
> Build upon the existing management, adjusted to follow a new rule: that
> a page is never on the free list if pte_free_defer() was used on either
> half (marked by PageActive).  And for simplicity, delay calling RCU until
> both halves are freed.
> 
> Not adding back unallocated fragments to the list in pte_free_defer()
> can result in wasting some amount of memory for pagetables, depending
> on how long the allocated fragment will stay in use. In practice, this
> effect is expected to be insignificant, and not justify a far more
> complex approach, which might allow to add the fragments back later
> in __tlb_remove_table(), where we might not have a stable mm any more.
> 
> Signed-off-by: Hugh Dickins 
> ---
>  arch/s390/include/asm/pgalloc.h |  4 ++
>  arch/s390/mm/pgalloc.c  | 75 +++--
>  2 files changed, 67 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 17eb618f1348..89a9d5ef94f8 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
>  #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
>  
> +/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  void vmem_map_init(void);
>  void *vmem_crst_alloc(unsigned long val);
>  pte_t *vmem_pte_alloc(void);
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 66ab68db9842..fd0c4312da16 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
>   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
>   * while the PP bits are never used, nor such a page is added to or removed
>   * from mm_context_t::pgtable_list.
> + *
> + * pte_free_defer() overrides those rules: it takes the page off 
> pgtable_list,
> + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> + * guarantee that its pgtable cannot be reused before the RCU grace period
> + * has elapsed (which page_table_free_rcu() does not actually guarantee).

Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
allow reuse before grace period elapsed. And I hope that it does so, by
setting the PP bits, which would be noticed in page_table_alloc(), in
case the page would be seen there.

Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
end of the list, and so they could be seen in page_table_alloc(), but they
should not be reused before grace period elapsed and __tlb_remove_table()
cleared the PP bits, as far a

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-05 Thread Gerald Schaefer
On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
Hugh Dickins  wrote:

> On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > Hugh Dickins  wrote:  
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > > > 
> > > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > > extent that I've not even tried to verify it; but I think I do get the
> > > > point now, that we need further info than just PPHHAA to know whether
> > > > the page is on the list or not.  But I think that if we move where the
> > > > call_rcu() is done, then the page can stay on or off the list by same
> > > > rules as before (but need to check HH bits along with PP when deciding
> > > > whether to allocate, and whether to list_add_tail() when freeing).
> > > 
> > > No, not quite the same rules as before: I came to realize that using
> > > list_add_tail() for the HH pages would be liable to put a page on the
> > > list which forever blocked reuse of PP list_add_tail() pages after it
> > > (could be solved by a list_move() somewhere, but we have agreed to
> > > prefer simplicity).
> > > 
> > > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > > an easier way of dealing with those "is it on the list" questions.
> > > I expect that we shall be close to reaching agreement on...  
> > 
> > This looks really nice, almost too good and easy to be true. I did not
> > find any obvious flaw, just some comments below. It also survived LTP
> > without any visible havoc, so I guess this approach is the best so far.  
> 
> Phew! I'm of course glad to hear this: thanks for your efforts on it.
> 
> ...
> > > --- a/arch/s390/mm/pgalloc.c
> > > +++ b/arch/s390/mm/pgalloc.c
> > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > >   * logic described above. Both AA bits are set to 1 to denote a 
> > > 4KB-pgtable
> > >   * while the PP bits are never used, nor such a page is added to or 
> > > removed
> > >   * from mm_context_t::pgtable_list.
> > > + *
> > > + * pte_free_defer() overrides those rules: it takes the page off 
> > > pgtable_list,
> > > + * and prevents both 2K fragments from being reused. pte_free_defer() 
> > > has to
> > > + * guarantee that its pgtable cannot be reused before the RCU grace 
> > > period
> > > + * has elapsed (which page_table_free_rcu() does not actually 
> > > guarantee).  
> > 
> > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > allow reuse before grace period elapsed. And I hope that it does so, by
> > setting the PP bits, which would be noticed in page_table_alloc(), in
> > case the page would be seen there.
> > 
> > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > end of the list, and so they could be seen in page_table_alloc(), but they
> > should not be reused before grace period elapsed and __tlb_remove_table()
> > cleared the PP bits, as far as I understand.
> > 
> > So what exactly do you mean with "which page_table_free_rcu() does not 
> > actually
> > guarantee"?  
> 
> I'll answer without locating and re-reading what Jason explained earlier,
> perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> he may have explained it better.  And without working out again all the
> MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> 
> The detail that sticks in my mind is the fallback in tlb_remove_table()

Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
but that is rather a generic issue, and not s390-specific. I thought you
meant some s390-oddity here, of which we have a lot, unfortunately...
Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
I guess you could say that page_table_free_rcu() cannot guarantee what
tlb_remove_table() cannot guarantee.

Maybe change to "which page_table_free_rcu() does not actually guarantee,
by calling tlb_remove_table()", to make it clear that this is not a problem
of page_table_free_rcu() itself.

> in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> batch the tables for freeing by RCU, and resorts instead to an immediate 
> TLB flush (I think: that again involves chasing definitions) followed by
> tlb_remove_table_sync_one() - which just delive

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-06 Thread Gerald Schaefer
On Wed, 5 Jul 2023 18:20:21 -0700 (PDT)
Hugh Dickins  wrote:

> On Wed, 5 Jul 2023, Gerald Schaefer wrote:
> > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
> > Hugh Dickins  wrote:  
> > > On Tue, 4 Jul 2023, Gerald Schaefer wrote:  
> > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > > > Hugh Dickins  wrote:
> > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > ...  
> > > > > --- a/arch/s390/mm/pgalloc.c
> > > > > +++ b/arch/s390/mm/pgalloc.c
> > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > > >   * logic described above. Both AA bits are set to 1 to denote a 
> > > > > 4KB-pgtable
> > > > >   * while the PP bits are never used, nor such a page is added to or 
> > > > > removed
> > > > >   * from mm_context_t::pgtable_list.
> > > > > + *
> > > > > + * pte_free_defer() overrides those rules: it takes the page off 
> > > > > pgtable_list,
> > > > > + * and prevents both 2K fragments from being reused. 
> > > > > pte_free_defer() has to
> > > > > + * guarantee that its pgtable cannot be reused before the RCU grace 
> > > > > period
> > > > > + * has elapsed (which page_table_free_rcu() does not actually 
> > > > > guarantee).
> > > > 
> > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > > > allow reuse before grace period elapsed. And I hope that it does so, by
> > > > setting the PP bits, which would be noticed in page_table_alloc(), in
> > > > case the page would be seen there.
> > > > 
> > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to 
> > > > the
> > > > end of the list, and so they could be seen in page_table_alloc(), but 
> > > > they
> > > > should not be reused before grace period elapsed and 
> > > > __tlb_remove_table()
> > > > cleared the PP bits, as far as I understand.
> > > > 
> > > > So what exactly do you mean with "which page_table_free_rcu() does not 
> > > > actually
> > > > guarantee"?
> > > 
> > > I'll answer without locating and re-reading what Jason explained earlier,
> > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> > > he may have explained it better.  And without working out again all the
> > > MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> > > 
> > > The detail that sticks in my mind is the fallback in tlb_remove_table()  
> > 
> > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> > but that is rather a generic issue, and not s390-specific.  
> 
> Yes.
> 
> > I thought you
> > meant some s390-oddity here, of which we have a lot, unfortunately...
> > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> > I guess you could say that page_table_free_rcu() cannot guarantee what
> > tlb_remove_table() cannot guarantee.
> > 
> > Maybe change to "which page_table_free_rcu() does not actually guarantee,
> > by calling tlb_remove_table()", to make it clear that this is not a problem
> > of page_table_free_rcu() itself.  
> 
> Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-)
> 
> >   
> > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> > > batch the tables for freeing by RCU, and resorts instead to an immediate 
> > > TLB flush (I think: that again involves chasing definitions) followed by
> > > tlb_remove_table_sync_one() - which just delivers an interrupt to each 
> > > CPU,
> > > and is commented: 
> > > /*
> > >  * This isn't an RCU grace period and hence the page-tables cannot be
> > >  * assumed to be actually RCU-freed.
> > >  *
> > >  * It is however sufficient for software page-table walkers that rely on
> > >  * IRQ disabling.
> > >  */
> > > 
> > > Whether that's good for your PP pages or not, I've given no thought:
> > > I've just taken it on trust that what s390 has working today is good.  
> > 
> > Yes, we should be fine with that, current code can be trusted :-)  
> 
> Glad to hear it :-)  Yes, I think it's not actually relying on the "rcu"
> implied by the function name.

Ah ok, no

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-07 Thread Gerald Schaefer
On Wed, 5 Jul 2023 17:52:40 -0700 (PDT)
Hugh Dickins  wrote:

> On Wed, 5 Jul 2023, Alexander Gordeev wrote:
> > On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:  
> > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > 
> > Hi Hugh,
> > 
> > ...
> >   
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;  
> > 
> > If I got your and Claudio conversation right, you were going to add
> > here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?  

Good point, thanks Alexander for noticing!

> 
> Well, Claudio approved, so I would have put it in, if we had stuck with
> that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer();
> but once that went away, it became somewhat irrelevant... to me anyway.
> 
> But I don't mind adding it here, in the v3 I'll post when -rc1 is out,
> if it might help you guys - there is some point, since pte_free_defer()
> is a route which can usefully check for such a case, without confusion
> from harmless traffic from immediate frees of just-in-case allocations.
> 
> But don't expect it to catch all such cases (if they exist): another
> category of s390 page_table_free()s comes from the PageAnon
> zap_deposited_table() in zap_huge_pmd(): those tables might or might
> not have been exposed to userspace at some time in the past.

Right, after THP collapse, the previously active PTE table would be
deposited in this case, and then later freed in zap_deposited_table().
I guess we need to be very careful, if THP was ever enabled for KVM
guests.

> 
> I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that
> WARN_ON_ONCE is the one we want - I get confused by all the different
> flavours of WARN, and have to check the header file each time to be
> sure of the syntax and semantics): but be aware that it won't be
> checking all potential cases.

Thanks, looks good.


Re: [PATCH v3 03/34] s390: Use pt_frag_refcount for pagetables

2023-06-01 Thread Gerald Schaefer
 On Wed, 31 May 2023 14:30:01 -0700
"Vishal Moola (Oracle)"  wrote:

> s390 currently uses _refcount to identify fragmented page tables.
> The page table struct already has a member pt_frag_refcount used by
> powerpc, so have s390 use that instead of the _refcount field as well.
> This improves the safety for _refcount and the page table tracking.
> 
> This also allows us to simplify the tracking since we can once again use
> the lower byte of pt_frag_refcount instead of the upper byte of _refcount.

This would conflict with s390 impact of pte_free_defer() work from Hugh Dickins
https://lore.kernel.org/lkml/35e983f5-7ed3-b310-d949-9ae8b130c...@google.com/
https://lore.kernel.org/lkml/6dd63b39-e71f-2e8b-7e0-83e02f3bc...@google.com/

There he uses pt_frag_refcount, or rather pt_mm in the same union, to save
the mm_struct for deferred pte_free().

I still need to look closer into both of your patch series, but so far it
seems that you have no hard functional requirement to switch from _refcount
to pt_frag_refcount here, for s390.

If this is correct, and you do not e.g. need this to make some other use
of _refcount, I would suggest to drop this patch.


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-01 Thread Gerald Schaefer
On Mon, 29 May 2023 07:36:40 -0700 (PDT)
Hugh Dickins  wrote:

> On Mon, 29 May 2023, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:  
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;
> > > +
> > > + page = virt_to_page(pgtable);
> > > + call_rcu(&page->rcu_head, pte_free_now);
> > > +}  
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.  
> 
> Oh yes, of course, thanks for catching that so quickly.
> So my s390 and sparc implementations will be equally broken.
> 
> > 
> > I have no idea how to solve this problem.  
> 
> I do: I'll have to go back to the more complicated implementation we
> actually ran with on powerpc - I was thinking those complications just
> related to deposit/withdraw matters, forgetting the one-rcu_head issue.
> 
> It uses large (0x1) increments of the page refcount, avoiding
> call_rcu() when already active.
> 
> It's not a complication I had wanted to explain or test for now,
> but we shall have to.  Should apply equally well to sparc, but s390
> more of a problem, since s390 already has its own refcount cleverness.

Yes, we have 2 pagetables in one 4K page, which could result in same
rcu_head reuse. It might be possible to use the cleverness from our
page_table_free() function, e.g. to only do the call_rcu() once, for
the case where both 2K pagetable fragments become unused, similar to
how we decide when to actually call __free_page().

However, it might be much worse, and page->rcu_head from a pagetable
page cannot be used at all for s390, because we also use page->lru
to keep our list of free 2K pagetable fragments. I always get confused
by struct page unions, so not completely sure, but it seems to me that
page->rcu_head would overlay with page->lru, right?


Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-06 Thread Gerald Schaefer
On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
Hugh Dickins  wrote:

> On Sun, 28 May 2023, Hugh Dickins wrote:
> 
> > Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> > 
> > This version is more complicated than others: because page_table_free()
> > needs to know which fragment is being freed, and which mm to link it to.
> > 
> > page_table_free()'s fragment handling is clever, but I could too easily
> > break it: what's done here in pte_free_defer() and pte_free_now() might
> > be better integrated with page_table_free()'s cleverness, but not by me!
> > 
> > By the time that page_table_free() gets called via RCU, it's conceivable
> > that mm would already have been freed: so mmgrab() in pte_free_defer()
> > and mmdrop() in pte_free_now().  No, that is not a good context to call
> > mmdrop() from, so make mmdrop_async() public and use that.  
> 
> But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
> between multiple page tables is tricky: something I knew but had lost
> sight of.  So the powerpc and s390 patches were broken: powerpc fairly
> easily fixed, but s390 more painful.
> 
> In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
> On Thu, 1 Jun 2023 15:57:51 +0200
> Gerald Schaefer  wrote:
> > 
> > Yes, we have 2 pagetables in one 4K page, which could result in same
> > rcu_head reuse. It might be possible to use the cleverness from our
> > page_table_free() function, e.g. to only do the call_rcu() once, for
> > the case where both 2K pagetable fragments become unused, similar to
> > how we decide when to actually call __free_page().
> > 
> > However, it might be much worse, and page->rcu_head from a pagetable
> > page cannot be used at all for s390, because we also use page->lru
> > to keep our list of free 2K pagetable fragments. I always get confused
> > by struct page unions, so not completely sure, but it seems to me that
> > page->rcu_head would overlay with page->lru, right?  
> 
> Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> I'm wrong) I think that s390 could use exactly the same technique for
> its list of free 2K pagetable fragments as it uses for its list of THP
> "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> the first two longs of the page table itself for threading the list.

Nice idea, I think that could actually work, since we only need the empty
2K halves on the list. So it should be possible to store the list_head
inside those.

> 
> And while it could use third and fourth longs instead, I don't see any
> need for that: a deposited pagetable has been allocated, so would not
> be on the list of free fragments.

Correct, that should not interfere.

> 
> Below is one of the grossest patches I've ever posted: gross because
> it's a rushed attempt to see whether that is viable, while it would take
> me longer to understand all the s390 cleverness there (even though the
> PP AA commentary above page_table_alloc() is excellent).

Sounds fair, this is also one of the grossest code we have, which is also
why Alexander added the comment. I guess we could need even more comments
inside the code, as it still confuses me more than it should.

Considering that, you did remarkably well. Your patch seems to work fine,
at least it survived some LTP mm tests. I will also add it to our CI runs,
to give it some more testing. Will report tomorrow when it broke something.
See also below for some patch comments.

> 
> I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

cmma_init_nodat() should be disjoint, not only because it is __init,
but also because it explicitly skips pagetable pages, so it should
never touch page->lru of those.

Not very familiar with the gmap code, it does look disjoint, and we should
also use complete 4K pages for pagetables instead of 2K fragments there,
but Christian or Claudio should also have a look.

> 
> Gerald, s390 folk: would it be possible for you to give this
> a try, suggest corrections and improvements, and then I can make it
> a separate patch of the series; and work on avoiding concurrent use
> of the rcu_head by pagetable fragment buddies (ideally fit in with
> the scheme already there, maybe DD bits to go along with the PP AA).

It feels like it could be possible to not only avo

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-08 Thread Gerald Schaefer
On Wed, 7 Jun 2023 20:35:05 -0700 (PDT)
Hugh Dickins  wrote:

> On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> > On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> > Hugh Dickins  wrote:  
> > > On Thu, 1 Jun 2023 15:57:51 +0200
> > > Gerald Schaefer  wrote:  
> > > > 
> > > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > > rcu_head reuse. It might be possible to use the cleverness from our
> > > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > > the case where both 2K pagetable fragments become unused, similar to
> > > > how we decide when to actually call __free_page().
> > > > 
> > > > However, it might be much worse, and page->rcu_head from a pagetable
> > > > page cannot be used at all for s390, because we also use page->lru
> > > > to keep our list of free 2K pagetable fragments. I always get confused
> > > > by struct page unions, so not completely sure, but it seems to me that
> > > > page->rcu_head would overlay with page->lru, right?
> > > 
> > > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > > I'm wrong) I think that s390 could use exactly the same technique for
> > > its list of free 2K pagetable fragments as it uses for its list of THP
> > > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > > the first two longs of the page table itself for threading the list.  
> > 
> > Nice idea, I think that could actually work, since we only need the empty
> > 2K halves on the list. So it should be possible to store the list_head
> > inside those.  
> 
> Jason quickly pointed out the flaw in my thinking there.

Yes, while I had the right concerns about "the to-be-freed pagetables would
still be accessible, but not really valid, if we added them back to the list,
with list_heads inside them", when suggesting the approach w/o passing over
the mm, I missed that we would have the very same issue already with the
existing page_table_free_rcu().

Thankfully Jason was watching out!

> 
> >   
> > > 
> > > And while it could use third and fourth longs instead, I don't see any
> > > need for that: a deposited pagetable has been allocated, so would not
> > > be on the list of free fragments.  
> > 
> > Correct, that should not interfere.
> >   
> > > 
> > > Below is one of the grossest patches I've ever posted: gross because
> > > it's a rushed attempt to see whether that is viable, while it would take
> > > me longer to understand all the s390 cleverness there (even though the
> > > PP AA commentary above page_table_alloc() is excellent).  
> > 
> > Sounds fair, this is also one of the grossest code we have, which is also
> > why Alexander added the comment. I guess we could need even more comments
> > inside the code, as it still confuses me more than it should.
> > 
> > Considering that, you did remarkably well. Your patch seems to work fine,
> > at least it survived some LTP mm tests. I will also add it to our CI runs,
> > to give it some more testing. Will report tomorrow when it broke something.
> > See also below for some patch comments.  
> 
> Many thanks for your effort on this patch.  I don't expect the testing
> of it to catch Jason's point, that I'm corrupting the page table while
> it's on its way through RCU to being freed, but he's right nonetheless.

Right, tests ran fine, but we would have introduced subtle issues with
racing gup_fast, I guess.

> 
> I'll integrate your fixes below into what I have here, but probably
> just archive it as something to refer to later in case it might play
> a part; but probably it will not - sorry for wasting your time.

No worries, looking at that s390 code can never be amiss. It seems I need
regular refresh, at least I'm sure I already understood it better in the
past.

And who knows, with Jasons recent thoughts, that "list_head inside
pagetable" idea might not be dead yet.

> 
> >   
> > > 
> > > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.  
> > 
> > cmma_init_nodat() should be disjoint, not only because it is __init,
> > but also because it explicitly skips pagetable pages, so it should
> > never touch page->lru of those.
> > 
> > Not very familiar with the gmap code, it does look disjoint, and we should
> > also use complete 4K pages fo

[PATCH] mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()

2023-03-06 Thread Gerald Schaefer
s390 can do more fine-grained handling of spurious TLB protection faults,
when there also is the PTE pointer available.

Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as
an additional parameter.

This will add no functional change to other architectures, but those with
private flush_tlb_fix_spurious_fault() implementations need to be made
aware of the new parameter.

Reviewed-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 arch/arm64/include/asm/pgtable.h  |  2 +-
 arch/mips/include/asm/pgtable.h   |  3 ++-
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  3 ++-
 arch/s390/include/asm/pgtable.h   | 12 +++-
 arch/x86/include/asm/pgtable.h|  2 +-
 include/linux/pgtable.h   |  2 +-
 mm/memory.c   |  3 ++-
 mm/pgtable-generic.c  |  2 +-
 8 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b6ba466e2e8a..0bd18de9fd97 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
  * fault on one CPU which has been handled concurrently by another CPU
  * does not need to perform additional invalidation.
  */
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
+#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
 
 /*
  * ZERO_PAGE is a global shared page that is always zero: used
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 791389bf3c12..574fa14ac8b2 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -469,7 +469,8 @@ static inline pgprot_t pgprot_writecombine(pgprot_t _prot)
 }
 
 static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
-   unsigned long address)
+   unsigned long address,
+   pte_t *ptep)
 {
 }
 
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 2bbc0fcce04a..ff7f0ee179e5 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -121,7 +121,8 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma,
 
 #define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
 static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
-   unsigned long address)
+   unsigned long address,
+   pte_t *ptep)
 {
/*
 * Book3S 64 does not require spurious fault flushes because the PTE
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c70b4d1263d..c1f6b46ec555 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1239,7 +1239,8 @@ static inline int pte_allow_rdp(pte_t old, pte_t new)
 }
 
 static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
-   unsigned long address)
+   unsigned long address,
+   pte_t *ptep)
 {
/*
 * RDP might not have propagated the PTE protection reset to all CPUs,
@@ -1247,11 +1248,12 @@ static inline void flush_tlb_fix_spurious_fault(struct 
vm_area_struct *vma,
 * NOTE: This will also be called when a racing pagetable update on
 * another thread already installed the correct PTE. Both cases cannot
 * really be distinguished.
-* Therefore, only do the local TLB flush when RDP can be used, to avoid
-* unnecessary overhead.
+* Therefore, only do the local TLB flush when RDP can be used, and the
+* PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
+* A local RDP can be used to do the flush.
 */
-   if (MACHINE_HAS_RDP)
-   asm volatile("ptlb" : : : "memory");
+   if (MACHINE_HAS_RDP && !(pte_val(*ptep) & _PAGE_PROTECT))
+   __ptep_rdp(address, ptep, 0, 0, 1);
 }
 #define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7425f32e5293..15ae4d6ba476 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1097,7 +1097,7 @@ static inline void ptep_set_wrprotect(struct mm_struct 
*mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
+#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
 

Re: [PATCH] mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()

2023-03-06 Thread Gerald Schaefer
On Mon, 6 Mar 2023 17:06:44 +
Catalin Marinas  wrote:

> On Mon, Mar 06, 2023 at 05:15:48PM +0100, Gerald Schaefer wrote:
> > diff --git a/arch/arm64/include/asm/pgtable.h 
> > b/arch/arm64/include/asm/pgtable.h
> > index b6ba466e2e8a..0bd18de9fd97 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -57,7 +57,7 @@ static inline bool arch_thp_swp_supported(void)
> >   * fault on one CPU which has been handled concurrently by another CPU
> >   * does not need to perform additional invalidation.
> >   */
> > -#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
> > +#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
> 
> For arm64:
> 
> Acked-by: Catalin Marinas 
> 
> > diff --git a/arch/s390/include/asm/pgtable.h 
> > b/arch/s390/include/asm/pgtable.h
> > index 2c70b4d1263d..c1f6b46ec555 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1239,7 +1239,8 @@ static inline int pte_allow_rdp(pte_t old, pte_t new)
> >  }
> >  
> >  static inline void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> > -   unsigned long address)
> > +   unsigned long address,
> > +   pte_t *ptep)
> >  {
> > /*
> >  * RDP might not have propagated the PTE protection reset to all CPUs,
> > @@ -1247,11 +1248,12 @@ static inline void 
> > flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
> >  * NOTE: This will also be called when a racing pagetable update on
> >  * another thread already installed the correct PTE. Both cases cannot
> >  * really be distinguished.
> > -* Therefore, only do the local TLB flush when RDP can be used, to avoid
> > -* unnecessary overhead.
> > +* Therefore, only do the local TLB flush when RDP can be used, and the
> > +* PTE does not have _PAGE_PROTECT set, to avoid unnecessary overhead.
> > +* A local RDP can be used to do the flush.
> >  */
> > -   if (MACHINE_HAS_RDP)
> > -   asm volatile("ptlb" : : : "memory");
> > +   if (MACHINE_HAS_RDP && !(pte_val(*ptep) & _PAGE_PROTECT))
> > +   __ptep_rdp(address, ptep, 0, 0, 1);
> 
> I wonder whether passing the actual entry is somewhat quicker as it
> avoids another memory access (though it might already be in the cache).

The RDP instruction itself only requires the PTE pointer as input, or more
precisely a pointer to the pagetable origin. We calculate that from the PTE
pointer, by masking out some bits, w/o actual memory access to the PTE entry
value.

Of course, there is the pte_val(*ptep) & _PAGE_PROTECT check here, with
memory access, but this might get removed in the future. TBH, I simply
wasn't sure (enough) yet, if we could technically ever end up here with
_PAGE_PROTECT set at all. For "real" spurious protection faults, it should
never be set, not so sure about racing pagetable updates though.

So this might actually be an unnecessary / overly cautious check, that
gets removed in the future, and not worth passing along the PTE value
in addition to the pointer.


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-05 Thread Gerald Schaefer
On Fri, 4 Sep 2020 12:18:05 +0530
Anshuman Khandual  wrote:

> 
> 
> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> > This patch series includes fixes for debug_vm_pgtable test code so that
> > they follow page table updates rules correctly. The first two patches 
> > introduce
> > changes w.r.t ppc64. The patches are included in this series for 
> > completeness. We can
> > merge them via ppc64 tree if required.
> > 
> > Hugetlb test is disabled on ppc64 because that needs larger change to 
> > satisfy
> > page table update rules.
> > 
> > These tests are broken w.r.t page table update rules and results in kernel
> > crash as below. 
> > 
> > [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> > pc: c009a5ec: assert_pte_locked+0x14c/0x380
> > lr: c05c: pte_update+0x11c/0x190
> > sp: c00c6d1e7950
> >msr: 82029033
> >   current = 0xc00c6d172c80
> >   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > [link register   ] c05c pte_update+0x11c/0x190
> > [c00c6d1e7950] 0001 (unreliable)
> > [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> > [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> > [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> > [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> > [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> > [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> > [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > 
> > With DEBUG_VM disabled
> > 
> > [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> > [   20.530183] Faulting instruction address: 0xc00df330
> > cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> > pc: c00df330: memset+0x68/0x104
> > lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > sp: c00c6d19f990
> >msr: 82009033
> >dar: 0
> >   current = 0xc00c6d177480
> >   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> > pid   = 1, comm = swapper/0
> > [link register   ] c009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > [c00c6d19f990] c009f748 
> > hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> > [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> > [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> > [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> > [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> > [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> > [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > 
> > Changes from v3:
> > * Address review feedback
> > * Move page table depost and withdraw patch after adding pmdlock to avoid 
> > bisect failure.
> 
> This version
> 
> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> DEBUG_VM_PGTABLE)
> - Runs on arm64 and x86 without any regression, atleast nothing that I have 
> noticed
> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms 
> as well

When I quickly tested v3, it worked fine, but now it turned out to
only work fine "sometimes", both v3 and v4. I need to look into it
further, but so far it seems related to the hugetlb_advanced_tests().

I guess there was already some discussion on this test, but we did
not receive all of the thread(s). Please always add at least
linux-s...@vger.kernel.org and maybe myself and Vasily Gorbik 

for further discussions.

That being said, sorry for duplications, this might already have been
discussed. Preliminary analysis showed that it only seems to go wrong
for certain random vaddr values. I cannot make any sense of that yet,
but what seems strange to me is that the hugetlb_advanced_tests()
take a (real) pte_t pointer as input, and also use that for all
kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.).

Although all the hugetlb code in the kernel is (mis)using pte_t
pointers instead of the correct pmd/pud_t pointers like THP, that
is just for historic reasons. The pointers will actually never point
to a real pte_t (i.e. page table entry), but of course to a pmd
or pud entry, depending on hugepage size.

What is passed in as ptep to hugetlb_advanced_tests() seems to be
the result from the previous ptep = pte_alloc_map(mm, pmdp, vaddr),
so I would expect that it points to a real page table entry. Need
to investigate further, but IIUC, using such a pointer for adding
large pte entries (i.e. pmd/pud entries) at least feels very wrong
to me, and I assume it is related to the issues we see on s390.

We actually see different issues, e.g. once a panic directly in
hugetlb_advanced_tests() -> hug

Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-05 Thread Gerald Schaefer
On Fri, 4 Sep 2020 17:26:47 +0200
Gerald Schaefer  wrote:

> On Fri, 4 Sep 2020 12:18:05 +0530
> Anshuman Khandual  wrote:
> 
> > 
> > 
> > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> > > This patch series includes fixes for debug_vm_pgtable test code so that
> > > they follow page table updates rules correctly. The first two patches 
> > > introduce
> > > changes w.r.t ppc64. The patches are included in this series for 
> > > completeness. We can
> > > merge them via ppc64 tree if required.
> > > 
> > > Hugetlb test is disabled on ppc64 because that needs larger change to 
> > > satisfy
> > > page table update rules.
> > > 
> > > These tests are broken w.r.t page table update rules and results in kernel
> > > crash as below. 
> > > 
> > > [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> > > pc: c009a5ec: assert_pte_locked+0x14c/0x380
> > > lr: c05c: pte_update+0x11c/0x190
> > > sp: c00c6d1e7950
> > >msr: 82029033
> > >   current = 0xc00c6d172c80
> > >   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> > > pid   = 1, comm = swapper/0
> > > kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > [link register   ] c05c pte_update+0x11c/0x190
> > > [c00c6d1e7950] 0001 (unreliable)
> > > [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> > > [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> > > [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> > > [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> > > [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> > > [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> > > [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > 
> > > With DEBUG_VM disabled
> > > 
> > > [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> > > [   20.530183] Faulting instruction address: 0xc00df330
> > > cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> > > pc: c00df330: memset+0x68/0x104
> > > lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > sp: c00c6d19f990
> > >msr: 82009033
> > >dar: 0
> > >   current = 0xc00c6d177480
> > >   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> > > pid   = 1, comm = swapper/0
> > > [link register   ] c009f6d8 
> > > hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > [c00c6d19f990] c009f748 
> > > hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> > > [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> > > [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> > > [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> > > [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> > > [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> > > [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > 
> > > Changes from v3:
> > > * Address review feedback
> > > * Move page table depost and withdraw patch after adding pmdlock to avoid 
> > > bisect failure.
> > 
> > This version
> > 
> > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> > DEBUG_VM_PGTABLE)
> > - Runs on arm64 and x86 without any regression, atleast nothing that I have 
> > noticed
> > - Will be great if this could get tested on s390, arc, riscv, ppc32 
> > platforms as well
> 
> When I quickly tested v3, it worked fine, but now it turned out to
> only work fine "sometimes", both v3 and v4. I need to look into it
> further, but so far it seems related to the hugetlb_advanced_tests().
> 
> I guess there was already some discussion on this test, but we did
> not receive all of the thread(s). Please always add at least
> linux-s...@vger.kernel.org and maybe myself and Vasily Gorbik 
> 
> for further discussions.

BTW, with myself I mean the new address gerald.schae...@linux.ibm.com.
The old gerald.schae...@de.ibm.com seems to work (again), but is not
very reliable.

BTW2, a quick test with this change (so far) made the issues on s390
go away:

@@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
spin_unlock(ptl);
 
 #ifndef CONFIG_PPC_BOOK3S_64
-   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+   hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, 
prot);
 #endif
 
spin_lock(&mm->page_table_lock);

That would more match the "pte_t pointer" usage for hugetlb code,
i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
but I think the root cause is the pte_t pointer.

Not entirely sure though if that would really be the correct fix.
I somehow lost whatever little track I had about what these tests
really want to check, and if that would still be valid with that
change.


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-05 Thread Gerald Schaefer
On Fri, 4 Sep 2020 18:01:15 +0200
Gerald Schaefer  wrote:

> On Fri, 4 Sep 2020 17:26:47 +0200
> Gerald Schaefer  wrote:
> 
> > On Fri, 4 Sep 2020 12:18:05 +0530
> > Anshuman Khandual  wrote:
> > 
> > > 
> > > 
> > > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> > > > This patch series includes fixes for debug_vm_pgtable test code so that
> > > > they follow page table updates rules correctly. The first two patches 
> > > > introduce
> > > > changes w.r.t ppc64. The patches are included in this series for 
> > > > completeness. We can
> > > > merge them via ppc64 tree if required.
> > > > 
> > > > Hugetlb test is disabled on ppc64 because that needs larger change to 
> > > > satisfy
> > > > page table update rules.
> > > > 
> > > > These tests are broken w.r.t page table update rules and results in 
> > > > kernel
> > > > crash as below. 
> > > > 
> > > > [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > > cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> > > > pc: c009a5ec: assert_pte_locked+0x14c/0x380
> > > > lr: c05c: pte_update+0x11c/0x190
> > > > sp: c00c6d1e7950
> > > >msr: 82029033
> > > >   current = 0xc00c6d172c80
> > > >   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> > > > pid   = 1, comm = swapper/0
> > > > kernel BUG at arch/powerpc/mm/pgtable.c:304!
> > > > [link register   ] c05c pte_update+0x11c/0x190
> > > > [c00c6d1e7950] 0001 (unreliable)
> > > > [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> > > > [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> > > > [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> > > > [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> > > > [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> > > > [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> > > > [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > > 
> > > > With DEBUG_VM disabled
> > > > 
> > > > [   20.530152] BUG: Kernel NULL pointer dereference on read at 
> > > > 0x
> > > > [   20.530183] Faulting instruction address: 0xc00df330
> > > > cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> > > > pc: c00df330: memset+0x68/0x104
> > > > lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > > sp: c00c6d19f990
> > > >msr: 82009033
> > > >dar: 0
> > > >   current = 0xc00c6d177480
> > > >   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> > > > pid   = 1, comm = swapper/0
> > > > [link register   ] c009f6d8 
> > > > hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> > > > [c00c6d19f990] c009f748 
> > > > hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> > > > [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> > > > [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> > > > [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> > > > [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> > > > [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> > > > [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> > > > 
> > > > Changes from v3:
> > > > * Address review feedback
> > > > * Move page table depost and withdraw patch after adding pmdlock to 
> > > > avoid bisect failure.
> > > 
> > > This version
> > > 
> > > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> > > DEBUG_VM_PGTABLE)
> > > - Runs on arm64 and x86 without any regression, atleast nothing that I 
> > > have noticed
> > > - Will be great if this could get tested on s390, arc, riscv, ppc32 
> > > platforms as well
> > 
> > When I quickly tested v3, it worked fine, but now it turned out to
> > only work fine "sometimes", both v3 and v4. I need to look into it
> > further, but so far it seems related to the hugetlb_advanced_tests().
> > 
> >

[RFC PATCH v2 3/3] mm: make generic pXd_addr_end() macros inline functions

2020-09-07 Thread Gerald Schaefer
From: Alexander Gordeev 

Since pXd_addr_end() macros take pXd page-table entry as a
parameter it makes sense to check the entry type on compile.
Even though most archs do not make use of page-table entries
in pXd_addr_end() calls, checking the type in traversal code
paths could help to avoid subtle bugs.

Signed-off-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 include/linux/pgtable.h | 36 
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 67ebc22cf83d..d9e7d16c2263 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,31 +656,35 @@ static inline int arch_unmap_one(struct mm_struct *mm,
  */
 
 #ifndef pgd_addr_end
-#define pgd_addr_end(pgd, addr, end)   \
-({ unsigned long __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define pgd_addr_end pgd_addr_end
+static inline unsigned long pgd_addr_end(pgd_t pgd, unsigned long addr, 
unsigned long end)
+{  unsigned long __boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef p4d_addr_end
-#define p4d_addr_end(p4d, addr, end)   \
-({ unsigned long __boundary = ((addr) + P4D_SIZE) & P4D_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define p4d_addr_end p4d_addr_end
+static inline unsigned long p4d_addr_end(p4d_t p4d, unsigned long addr, 
unsigned long end)
+{  unsigned long __boundary = (addr + P4D_SIZE) & P4D_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pud_addr_end
-#define pud_addr_end(pud, addr, end)   \
-({ unsigned long __boundary = ((addr) + PUD_SIZE) & PUD_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define pud_addr_end pud_addr_end
+static inline unsigned long pud_addr_end(pud_t pud, unsigned long addr, 
unsigned long end)
+{  unsigned long __boundary = (addr + PUD_SIZE) & PUD_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 #ifndef pmd_addr_end
-#define pmd_addr_end(pmd, addr, end)   \
-({ unsigned long __boundary = ((addr) + PMD_SIZE) & PMD_MASK;  \
-   (__boundary - 1 < (end) - 1)? __boundary: (end);\
-})
+#define pmd_addr_end pmd_addr_end
+static inline unsigned long pmd_addr_end(pmd_t pmd, unsigned long addr, 
unsigned long end)
+{  unsigned long __boundary = (addr + PMD_SIZE) & PMD_MASK;
+   return (__boundary - 1 < end - 1) ? __boundary : end;
+}
 #endif
 
 /*
-- 
2.17.1



[RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-07 Thread Gerald Schaefer
From: Alexander Gordeev 

Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
code") introduced a subtle but severe bug on s390 with gup_fast, due to
dynamic page table folding.

The question "What would it require for the generic code to work for s390"
has already been discussed here
https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
and ended with a promising approach here
https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
which in the end unfortunately didn't quite work completely.

We tried to mimic static level folding by changing pgd_offset to always
calculate top level page table offset, and do nothing in folded pXd_offset.
What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
not reflect this dynamic behaviour, and still act like static 5-level
page tables.

Here is an example of what happens with gup_fast on s390, for a task with
3-levels paging, crossing a 2 GB pud boundary:

// addr = 0x1007000, end = 0x10080001000
static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
 unsigned int flags, struct page **pages, int *nr)
{
unsigned long next;
pud_t *pudp;

// pud_offset returns &p4d itself (a pointer to a value on stack)
pudp = pud_offset(&p4d, addr);
do {
// on second iteratation reading "random" stack value
pud_t pud = READ_ONCE(*pudp);

// next = 0x1008000, due to PUD_SIZE/MASK != 
PGDIR_SIZE/MASK on s390
next = pud_addr_end(addr, end);
...
} while (pudp++, addr = next, addr != end); // pudp++ iterating over 
stack

return 1;
}

pud_addr_end = 0x1008000 is correct, but the previous pgd/p4d_addr_end
should also have returned that limit, instead of the 5-level static
pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end"
parameter for gup_pud_range would also have been 0x1008000, and we
would not iterate further in gup_pud_range, but rather go back and
(correctly) do it in gup_pgd_range.

So, for the second iteration in gup_pud_range, we will increase pudp,
which pointed to a stack value and not the real pud table. This new pudp
will then point to whatever lies behind the p4d stack value. In general,
this happens to be the previously read pgd, but it probably could also
be something different, depending on compiler decisions.

Most unfortunately, if it happens to be the pgd value, which is the
same as the p4d / pud due to folding, it is a valid and present entry.
So after the increment, we would still point to the same pud entry.
The addr however has been increased in the second iteration, so that we
now have different pmd/pte_index values, which will result in very wrong
behaviour for the remaining gup_pmd/pte_range calls. We will effectively
operate on an address minus 2 GB, due to missing pudp increase.

In the "good case", if nothing is mapped there, we will fall back to
the slow gup path. But if something is mapped there, and valid
for gup_fast, we will end up (silently) getting references on the wrong
pages and also add the wrong pages to the **pages result array. This
can cause data corruption.

Fix this by introducing new pXd_addr_end_folded helpers, which take an
additional pXd entry value parameter, that can be used on s390
to determine the correct page table level and return corresponding
end / boundary. With that, the pointer iteration will always
happen in gup_pgd_range for s390. No change for other architectures
introduced.

Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
Cc:  # 5.2+
Reviewed-by: Gerald Schaefer 
Signed-off-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 arch/s390/include/asm/pgtable.h | 42 +
 include/linux/pgtable.h | 16 +
 mm/gup.c|  8 +++
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 7eb01a5459cd..027206e4959d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
 }
 #define mm_pmd_folded(mm) mm_pmd_folded(mm)
 
+/*
+ * With dynamic page table levels on s390, the static pXd_addr_end() functions
+ * will not return corresponding dynamic boundaries. This is no problem as long
+ * as only pXd pointers are passed down during page table walk, because
+ * pXd_offset() will simply return the given pointer for folded levels, and the
+ * pointer iteration over a range simply happens at the correct page table
+ * level.
+ * It is however a problem with gup_fast, or other places walking the page
+ * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
+ * of pointers. In this case, the pointer gi

[RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware

2020-09-07 Thread Gerald Schaefer
From: Alexander Gordeev 

Unlike all other page-table abstractions pXd_addr_end() do not take
into account a particular table entry in which context the functions
are called. On architectures with dynamic page-tables folding that
might lead to lack of necessary information that is difficult to
obtain other than from the table entry itself. That already led to
a subtle memory corruption issue on s390.

By letting pXd_addr_end() functions know about the page-table entry
we allow archs not only make extra checks, but also optimizations.

As result of this change the pXd_addr_end_folded() functions used
in gup_fast traversal code become unnecessary and get replaced with
universal pXd_addr_end() variants.

The arch-specific updates not only add dereferencing of page-table
entry pointers, but also small changes to the code flow to make those
dereferences possible, at least for x86 and powerpc. Also for arm64,
but in way that should not have any impact.

So, even though the dereferenced page-table entries are not used on
archs other than s390, and are optimized out by the compiler, there
is a small change in kernel size and this is what bloat-o-meter reports:

x86:
add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
Function old new   delta
vmemmap_populate 587 592  +5
munlock_vma_pages_range  556 561  +5
Total: Before=15534694, After=15534704, chg +0.00%

powerpc:
add/remove: 0/0 grow/shrink: 1/0 up/down: 4/0 (4)
Function old new   delta
.remove_pagetable   16481652  +4
Total: Before=21478240, After=21478244, chg +0.00%

arm64:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function old new   delta
Total: Before=20240851, After=20240851, chg +0.00%

sparc:
add/remove: 0/0 grow/shrink: 0/0 up/down: 0/0 (0)
Function old new   delta
Total: Before=4907262, After=4907262, chg +0.00%

Signed-off-by: Alexander Gordeev 
Signed-off-by: Gerald Schaefer 
---
 arch/arm/include/asm/pgtable-2level.h|  2 +-
 arch/arm/mm/idmap.c  |  6 ++--
 arch/arm/mm/mmu.c|  8 ++---
 arch/arm64/kernel/hibernate.c| 16 ++
 arch/arm64/kvm/mmu.c | 16 +-
 arch/arm64/mm/kasan_init.c   |  8 ++---
 arch/arm64/mm/mmu.c  | 25 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++---
 arch/powerpc/mm/hugetlbpage.c|  6 ++--
 arch/s390/include/asm/pgtable.h  |  8 ++---
 arch/s390/mm/page-states.c   |  8 ++---
 arch/s390/mm/pageattr.c  |  8 ++---
 arch/s390/mm/vmem.c  |  8 ++---
 arch/sparc/mm/hugetlbpage.c  |  6 ++--
 arch/um/kernel/tlb.c |  8 ++---
 arch/x86/mm/init_64.c| 15 -
 arch/x86/mm/kasan_init_64.c  | 16 +-
 include/asm-generic/pgtable-nop4d.h  |  2 +-
 include/asm-generic/pgtable-nopmd.h  |  2 +-
 include/asm-generic/pgtable-nopud.h  |  2 +-
 include/linux/pgtable.h  | 26 ---
 mm/gup.c |  8 ++---
 mm/ioremap.c |  8 ++---
 mm/kasan/init.c  | 17 +-
 mm/madvise.c |  4 +--
 mm/memory.c  | 40 
 mm/mlock.c   | 18 ---
 mm/mprotect.c|  8 ++---
 mm/pagewalk.c|  8 ++---
 mm/swapfile.c|  8 ++---
 mm/vmalloc.c | 16 +-
 31 files changed, 165 insertions(+), 173 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-2level.h 
b/arch/arm/include/asm/pgtable-2level.h
index 3502c2f746ca..5e6416b339f4 100644
--- a/arch/arm/include/asm/pgtable-2level.h
+++ b/arch/arm/include/asm/pgtable-2level.h
@@ -209,7 +209,7 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long 
addr)
} while (0)
 
 /* we don't need complex calculations here as the pmd is folded into the pgd */
-#define pmd_addr_end(addr,end) (end)
+#define pmd_addr_end(pmd,addr,end) (end)
 
 #define set_pte_ext(ptep,pte,ext) cpu_set_pte_ext(ptep,pte,ext)
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 448e57c6f653..5437f943ca8b 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -46,7 +46,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, 
unsigned long end,
pmd = pmd_offset(pud, addr);
 
do {
-   next = pmd_addr_end(addr, end);
+   next = pmd_addr_end(*pmd, addr, end);
*pmd = __pmd((addr & PMD_MASK) | prot);
flush_pmd_entry(pmd);
} while (pmd++, addr = next, ad

[RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-07 Thread Gerald Schaefer
This is v2 of an RFC previously discussed here:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/

Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
to common gup_fast code. It will introduce special helper functions
pXd_addr_end_folded(), which have to be used in places where pagetable walk
is done w/o lock and with READ_ONCE, so currently only in gup_fast.

Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
themselves by adding an extra pXd value parameter. That was suggested by
Jason during v1 discussion, because he is already thinking of some other
places where he might want to switch to the READ_ONCE logic for pagetable
walks. In general, that would be the cleanest / safest solution, but there
is some impact on other architectures and common code, hence the new and
greatly enlarged recipient list.

Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
functions instead of #defines, so that we get some type checking for the
new pXd value parameter.

Not sure about Fixes/stable tags for the generic solution. Only patch 1
fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
still be nice to have in stable, to ease future backports, but I guess
"nice to have" does not really qualify for stable backports.

Changes in v2:
- Pick option 2 from v1 discussion (pXd_addr_end_folded helpers)
- Add patch 2 + 3 for more generic approach

Alexander Gordeev (3):
  mm/gup: fix gup_fast with dynamic page table folding
  mm: make pXd_addr_end() functions page-table entry aware
  mm: make generic pXd_addr_end() macros inline functions

 arch/arm/include/asm/pgtable-2level.h|  2 +-
 arch/arm/mm/idmap.c  |  6 ++--
 arch/arm/mm/mmu.c|  8 ++---
 arch/arm64/kernel/hibernate.c| 16 +
 arch/arm64/kvm/mmu.c | 16 -
 arch/arm64/mm/kasan_init.c   |  8 ++---
 arch/arm64/mm/mmu.c  | 25 +++---
 arch/powerpc/mm/book3s64/radix_pgtable.c |  7 ++--
 arch/powerpc/mm/hugetlbpage.c|  6 ++--
 arch/s390/include/asm/pgtable.h  | 42 
 arch/s390/mm/page-states.c   |  8 ++---
 arch/s390/mm/pageattr.c  |  8 ++---
 arch/s390/mm/vmem.c  |  8 ++---
 arch/sparc/mm/hugetlbpage.c  |  6 ++--
 arch/um/kernel/tlb.c |  8 ++---
 arch/x86/mm/init_64.c| 15 -
 arch/x86/mm/kasan_init_64.c  | 16 -
 include/asm-generic/pgtable-nop4d.h  |  2 +-
 include/asm-generic/pgtable-nopmd.h  |  2 +-
 include/asm-generic/pgtable-nopud.h  |  2 +-
 include/linux/pgtable.h  | 38 -
 mm/gup.c |  8 ++---
 mm/ioremap.c |  8 ++---
 mm/kasan/init.c  | 17 +-
 mm/madvise.c |  4 +--
 mm/memory.c  | 40 +++---
 mm/mlock.c   | 18 +++---
 mm/mprotect.c|  8 ++---
 mm/pagewalk.c|  8 ++---
 mm/swapfile.c|  8 ++---
 mm/vmalloc.c | 16 -
 31 files changed, 219 insertions(+), 165 deletions(-)

-- 
2.17.1



Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy  wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev 
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded 
> >>> pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough 
> >> I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in 
> >> the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you 
> >> should switch patch 1 and patch 2 to avoid the step through 
> >> pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go 
> > forward
> > with this patch ASAP and then handle the other patches on top with all the 
> > time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from performance impact when re-doing that what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-08 Thread Gerald Schaefer
On Fri, 4 Sep 2020 18:01:15 +0200
Gerald Schaefer  wrote:

[...]
> 
> BTW2, a quick test with this change (so far) made the issues on s390
> go away:
> 
> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
> spin_unlock(ptl);
> 
>  #ifndef CONFIG_PPC_BOOK3S_64
> -   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +   hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, 
> prot);
>  #endif
> 
> spin_lock(&mm->page_table_lock);
> 
> That would more match the "pte_t pointer" usage for hugetlb code,
> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> but I think the root cause is the pte_t pointer.
> 
> Not entirely sure though if that would really be the correct fix.
> I somehow lost whatever little track I had about what these tests
> really want to check, and if that would still be valid with that
> change.

Uh oh, wasn't aware that this (or some predecessor) already went
upstream, and broke our debug kernel today.

I found out now what goes (horribly) wrong on s390, see below for
more details. In short, using hugetlb primitives with ptep pointers
that do _not_ point to a pmd or pud entry will not work on s390.
It also seems to make no sense to verify / test such a thing in general,
as it would also be a severe bug if any kernel code would do that.
After all, with hugepages, there are no pte tables, only pmd etc.
tables.

My change above would fix the issue for s390, but I can still not
completely judge if that would not break other things for your
tests. In general, for normal kernel code, much of what you do would
be very broken, but I guess your tests are doing such "special" things
because they can. E.g. because they operate on some "sandbox" mm
and page tables, and you also do not need properly populated page
tables for some exit / free cleanup, you just throw them away
explicitly with pXd_free at the end. So it might just be "the right
thing" to pass a casted pmd pointer to hugetlb_advanced_tests(),
to simulate and test (proper) usage of the hugetlb primitives.
I also see no other way to make this work for s390, than using a
proper pmd/pud pointer. If not possible, please add us to the
#ifndef.

So, for all those interested, here is what goes wrong on s390.
huge_ptep_get_and_clear() uses the "idte" instruction for the
clearing (and TLB invalidation) part. That instruction expects
a "region or segment table" origin, which is a pmd/pud/p4d/pgd,
but not a pte table. Even worse, when we calculate the table
origin from the given ptep (which *should* not point to a pte),
due to different table sizes for pte / pXd tables, we end up
at some place before the given pte table.

The "idte" instruction also gets the virtual address, and does
corresponding index addition to the given table origin. Depending
on the pmd_index we now end up either within the pte table again,
in which case we see a panic because idte complains about seeing
a pte value. If we are unlucky, then we end up outside the pte
table, and depending on the content of that memory location, idte
might succeed, effectively corrupting that memory.

That explains why we only see the panic sometimes, depending on
random vaddr, other symptoms other times, and probably completely
silent memory corruption for the rest...


Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:22:39 +0200
Christophe Leroy  wrote:

> 
> 
> Le 07/09/2020 à 22:12, Mike Rapoport a écrit :
> > On Mon, Sep 07, 2020 at 08:00:55PM +0200, Gerald Schaefer wrote:
> >> This is v2 of an RFC previously discussed here:
> >> https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/
> >>
> >> Patch 1 is a fix for a regression in gup_fast on s390, after our conversion
> >> to common gup_fast code. It will introduce special helper functions
> >> pXd_addr_end_folded(), which have to be used in places where pagetable walk
> >> is done w/o lock and with READ_ONCE, so currently only in gup_fast.
> >>
> >> Patch 2 is an attempt to make that more generic, i.e. change pXd_addr_end()
> >> themselves by adding an extra pXd value parameter. That was suggested by
> >> Jason during v1 discussion, because he is already thinking of some other
> >> places where he might want to switch to the READ_ONCE logic for pagetable
> >> walks. In general, that would be the cleanest / safest solution, but there
> >> is some impact on other architectures and common code, hence the new and
> >> greatly enlarged recipient list.
> >>
> >> Patch 3 is a "nice to have" add-on, which makes pXd_addr_end() inline
> >> functions instead of #defines, so that we get some type checking for the
> >> new pXd value parameter.
> >>
> >> Not sure about Fixes/stable tags for the generic solution. Only patch 1
> >> fixes a real bug on s390, and has Fixes/stable tags. Patches 2 + 3 might
> >> still be nice to have in stable, to ease future backports, but I guess
> >> "nice to have" does not really qualify for stable backports.
> > 
> > I also think that adding pXd parameter to pXd_addr_end() is a cleaner
> > way and with this patch 1 is not really required. I would even merge
> > patches 2 and 3 into a single patch and use only it as the fix.
> 
> Why not merging patches 2 and 3, but I would keep patch 1 separate but 
> after the generic changes, so that we first do the generic changes, then 
> we do the specific S390 use of it.

Yes, we thought about that approach too. It would at least allow to
get all into stable, more or less nicely, as prerequisite for the s390
fix.

Two concerns kept us from going that way. For once, it might not be
the nicest way to get it all in stable, and we would not want to risk
further objections due to the imminent and rather scary data corruption
issue that we want to fix asap.

For the same reason, we thought that the generalization part might
need more time and agreement from various people, so that we could at
least get the first patch as short-term solution.

It seems now that the generalization is very well accepted so far,
apart from some apparent issues on arm. Also, merging 2 + 3 and
putting them first seems to be acceptable, so we could do that for
v3, if there are no objections.

Of course, we first need to address the few remaining issues for
arm(32?), which do look quite confusing to me so far. BTW, sorry for
the compile error with patch 3, I guess we did the cross-compile only
for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
patch 3 already proved its usefulness by that :-)


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-08 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen  wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

We do not really allocate anything on the stack, it is the generic logic
from gup_fast that passes over pXd values (read once before), and using
pointers to such (stack) variables instead of real pXd pointers.
That, combined with the fact that we just return the passed in pointer in
pXd_offset() for folded levels.

That works similar on x86 IIUC, but with static folding, and thus also
proper pXd_addr_end() results because of statically (and correspondingly)
defined Pxd_INDEX/SHIFT. We always have static 5-level PxD_INDEX/SHIFT, and
that cannot really be made dynamic, so we just make pXd_addr_end()
dynamic instead, and that requires the pXd value to determine the correct
pagetable level.

Still makes my head spin when trying to explain, sorry. It is a very
special s390 oddity, or lets call it "feature", because I don't think any
other architecture has "dynamic pagetable folding" capability, depending
on process requirements, for whatever it is worth...

> 
> > @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long 
> > addr, unsigned long end,
> > do {
> > pmd_t pmd = READ_ONCE(*pmdp);
> >  
> > -   next = pmd_addr_end(addr, end);
> > +   next = pmd_addr_end_folded(pmd, addr, end);
> > if (!pmd_present(pmd))
> > return 0;
> 
> It looks like you fix this up later, but this would be a problem if left
> this way.  There's no documentation for whether I use
> pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.

Yes, that is very unfortunate. We did have some lengthy comment in
include/linux/pgtable.h where the pXd_addr_end(_folded) were defined.
But that was moved to arch/s390/include/asm/pgtable.h in this version,
probably because we already had the generalization in mind, where we
would not need such explanation in common header any more.

So, it might help better understand the issue that we have with
dynamic page table folding and READ_ONCE-style pagetable walkers
when looking at that comment.

Thanks for pointing out, that comment should definitely go into
include/linux/pgtable.h again. At least if we would still go for
that "s390 fix first, generalization second" approach, but it
seems we have other / better options now.


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-09 Thread Gerald Schaefer
On Wed, 9 Sep 2020 13:45:48 +0530
Anshuman Khandual  wrote:

[...]
> > 
> > That would more match the "pte_t pointer" usage for hugetlb code,
> > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> > but I think the root cause is the pte_t pointer.
> 
> Ideally, the pte_t pointer used here should be from huge_pte_alloc()
> not from pte_alloc_map_lock() as the case currently.

Ah, good point. I assumed that this would also always return casted
pmd etc. pointers, and never pte pointers. Unfortunately, that doesn't
seem to be true for all architectures, e.g. ia64, parisc, (some) powerpc,
where they really do a pte_alloc_map() for some reason.

I guess that means you cannot simply cast the pmd pointer, as suggested,
although I really do not understand how any architecture can work with
real ptes for hugepages. But that's fair, s390 also does some things that
nobody would expect or understand for other architectures...

So, for using huge_pte_alloc() you'd also need some size, maybe
iterating over hstates with for_each_hstate() could be an option,
if they are already initialized at that point. Then you have the
size(s) with huge_page_size(hstate) and can actually call the
hugetlb tests for all supported sizes, and with proper pointer
from huge_pte_alloc().


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-09 Thread Gerald Schaefer
On Wed, 09 Sep 2020 11:38:39 +0530
"Aneesh Kumar K.V"  wrote:

> Gerald Schaefer  writes:
> 
> > On Fri, 4 Sep 2020 18:01:15 +0200
> > Gerald Schaefer  wrote:
> >
> > [...]
> >> 
> >> BTW2, a quick test with this change (so far) made the issues on s390
> >> go away:
> >> 
> >> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
> >> spin_unlock(ptl);
> >> 
> >>  #ifndef CONFIG_PPC_BOOK3S_64
> >> -   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> >> +   hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, 
> >> vaddr, prot);
> >>  #endif
> >> 
> >> spin_lock(&mm->page_table_lock);
> >> 
> >> That would more match the "pte_t pointer" usage for hugetlb code,
> >> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> >> but I think the root cause is the pte_t pointer.
> >> 
> >> Not entirely sure though if that would really be the correct fix.
> >> I somehow lost whatever little track I had about what these tests
> >> really want to check, and if that would still be valid with that
> >> change.
> >
> > Uh oh, wasn't aware that this (or some predecessor) already went
> > upstream, and broke our debug kernel today.
> 
> Not sure i followed the above. Are you finding that s390 kernel crash
> after this patch series or the original patchset? As noted in my patch
> the hugetlb test is broken and we should fix that. A quick fix is to
> comment out that test for s390 too as i have done for PPC64.

We see it with both, it basically is broken since there is a hugetlb
test using real pte pointers. It doesn't always show, depending on
random vaddr, so it slipped through earlier testing.

I guess we also would have had one or the other chance to notice
that earlier, through better review, or better reading of previous
mails. I must admit that I neglected this a bit.


Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-09 Thread Gerald Schaefer
On Wed, 9 Sep 2020 13:38:25 +0530
Anshuman Khandual  wrote:

> 
> 
> On 09/04/2020 08:56 PM, Gerald Schaefer wrote:
> > On Fri, 4 Sep 2020 12:18:05 +0530
> > Anshuman Khandual  wrote:
> > 
> >>
> >>
> >> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote:
> >>> This patch series includes fixes for debug_vm_pgtable test code so that
> >>> they follow page table updates rules correctly. The first two patches 
> >>> introduce
> >>> changes w.r.t ppc64. The patches are included in this series for 
> >>> completeness. We can
> >>> merge them via ppc64 tree if required.
> >>>
> >>> Hugetlb test is disabled on ppc64 because that needs larger change to 
> >>> satisfy
> >>> page table update rules.
> >>>
> >>> These tests are broken w.r.t page table update rules and results in kernel
> >>> crash as below. 
> >>>
> >>> [   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
> >>> cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
> >>> pc: c009a5ec: assert_pte_locked+0x14c/0x380
> >>> lr: c05c: pte_update+0x11c/0x190
> >>> sp: c00c6d1e7950
> >>>msr: 82029033
> >>>   current = 0xc00c6d172c80
> >>>   paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
> >>> pid   = 1, comm = swapper/0
> >>> kernel BUG at arch/powerpc/mm/pgtable.c:304!
> >>> [link register   ] c05c pte_update+0x11c/0x190
> >>> [c00c6d1e7950] 0001 (unreliable)
> >>> [c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
> >>> [c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
> >>> [c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
> >>> [c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
> >>> [c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
> >>> [c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
> >>> [c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> >>>
> >>> With DEBUG_VM disabled
> >>>
> >>> [   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
> >>> [   20.530183] Faulting instruction address: 0xc00df330
> >>> cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
> >>> pc: c00df330: memset+0x68/0x104
> >>> lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> >>> sp: c00c6d19f990
> >>>msr: 82009033
> >>>dar: 0
> >>>   current = 0xc00c6d177480
> >>>   paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
> >>> pid   = 1, comm = swapper/0
> >>> [link register   ] c009f6d8 
> >>> hash__pmdp_huge_get_and_clear+0xe8/0x1b0
> >>> [c00c6d19f990] c009f748 
> >>> hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable)
> >>> [c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
> >>> [c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
> >>> [c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
> >>> [c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
> >>> [c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
> >>> [c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c
> >>>
> >>> Changes from v3:
> >>> * Address review feedback
> >>> * Move page table depost and withdraw patch after adding pmdlock to avoid 
> >>> bisect failure.
> >>
> >> This version
> >>
> >> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with 
> >> DEBUG_VM_PGTABLE)
> >> - Runs on arm64 and x86 without any regression, atleast nothing that I 
> >> have noticed
> >> - Will be great if this could get tested on s390, arc, riscv, ppc32 
> >> platforms as well
> > 
> > When I quickly tested v3, it worked fine, but now it turned out to
> > only work fine "sometimes", both v3 and v4. I need to look into it
> > further, but so far it seems related to the hugetlb_advanced_tests().
> > 
> > I guess there was already some discussion on this test, but we did
> > not receive all of the thread(s). Please always add at least
> > linux-s...@vger.kernel.org and maybe myself an

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Gerald Schaefer
On Tue, 8 Sep 2020 07:30:50 -0700
Dave Hansen  wrote:

> On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> > Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> > code") introduced a subtle but severe bug on s390 with gup_fast, due to
> > dynamic page table folding.
> 
> Would it be fair to say that the "fake" page table entries s390
> allocates on the stack are what's causing the trouble here?  That might
> be a nice thing to open up with here.  "Dynamic page table folding"
> really means nothing to me.

Sorry, I guess my previous reply does not really explain "what the heck
is dynamic page table folding?".

On s390, we can have different number of page table levels for different
processes / mms. We always start with 3 levels, and update dynamically
on process demand to 4 or 5 levels, hence the dynamic folding. Still,
the PxD_SIZE/SHIFT is defined statically, so that e.g. pXd_addr_end() will
not reflect this dynamic behavior.

For the various pagetable walkers using pXd_addr_end() (w/o READ_ONCE
logic) this is no problem. With static folding, iteration over the folded
levels will always happen at pgd level (top-level folding). For s390,
we stay at the respective level and iterate there (dynamic middle-level
folding), only return to pgd level if there really were 5 levels.

This only works well as long there are real pagetable pointers involved,
that can also be used for iteration. For gup_fast, or any other future
pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
There are pointers involved to local pXd values on the stack, because of
the READ_ONCE logic, and our middle-level iteration will suddenly iterate
over such stack pointers instead of pagetable pointers.

This will be addressed by making the pXd_addr_end() dynamic, for which
we need to see the pXd value in order to determine its level / type.


Re: [RFC PATCH v2 0/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Gerald Schaefer
On Tue, 8 Sep 2020 19:36:50 +0200
Gerald Schaefer  wrote:

[..]
> 
> It seems now that the generalization is very well accepted so far,
> apart from some apparent issues on arm. Also, merging 2 + 3 and
> putting them first seems to be acceptable, so we could do that for
> v3, if there are no objections.
> 
> Of course, we first need to address the few remaining issues for
> arm(32?), which do look quite confusing to me so far. BTW, sorry for
> the compile error with patch 3, I guess we did the cross-compile only
> for 1 + 2 applied, to see the bloat-o-meter changes. But I guess
> patch 3 already proved its usefulness by that :-)

Umm, replace "arm" with "power", sorry. No issues on arm so far, but
also no ack I think.

Thanks to Christophe for the power change, and to Mike for volunteering
for some cross compilation and cross-arch testing. Will send v3 with
merged and re-ordered patches after some more testing.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-09 Thread Gerald Schaefer
On Wed, 9 Sep 2020 09:18:46 -0700
Dave Hansen  wrote:

> On 9/9/20 5:29 AM, Gerald Schaefer wrote:
> > This only works well as long there are real pagetable pointers involved,
> > that can also be used for iteration. For gup_fast, or any other future
> > pagetable walkers using the READ_ONCE logic w/o lock, that is not true.
> > There are pointers involved to local pXd values on the stack, because of
> > the READ_ONCE logic, and our middle-level iteration will suddenly iterate
> > over such stack pointers instead of pagetable pointers.
> 
> By "There are pointers involved to local pXd values on the stack", did
> you mean "locate" instead of "local"?  That sentence confused me.
> 
> Which code is it, exactly that allocates these troublesome on-stack pXd
> values, btw?

It is the gup_pXd_range() call sequence in mm/gup.c. It starts in
gup_pgd_range() with "pgdp = pgd_offset(current->mm, addr)" and then
the "pgd_t pgd = READ_ONCE(*pgdp)" which creates the first local
stack variable "pgd".

The next-level call to gup_p4d_range() gets this "pgd" value as
input, but not the original pgdp pointer where it was read from.
This is already the essential difference to other pagetable walkers
like e.g. walk_pXd_range() in mm/pagewalk.c, where the original
pointer is passed through. With READ_ONCE, that pointer must not
be further de-referenced, so instead the value is passed over.

In gup_p4d_range() we then have "p4dp = p4d_offset(&pgd, addr)",
with &pgd being a pointer to the passed over pgd value, so that's
the first pXd pointer that does not point directly to the pXd in
the page table, but a local stack variable.

With folded p4d, p4d_offset(&pgd, addr) will simply return
the passed-in &pgd pointer, so we now also have p4dp point to that.
That continues with "p4d_t p4d = READ_ONCE(*p4dp)", and that second
stack variable passed to gup_huge_pud() and so on. Due to inlining,
all those variables will not really be passed anywhere, but simply
sit on the stack.

So far, IIUC, that would also happen on x86 (or everywhere else
actually) for folded levels, i.e. some pXd_offset() calls would
simply return the passed in (stack) value pointer. This works
as designed, and it will not lead to the "iteration over stack
pointer" for anybody but s390, because the pXd_addr_end()
boundaries usually take care that you always return to pgd
level for iteration, and that is the only level with a real
pagetable pointer. For s390, we stay at the first non-folded
level and do the iteration there, which is fine for other
pagetable walkers using the original pointers, but not for
the READ_ONCE-style gup_fast.

I actually had to draw myself a picture to get some hold of
this, or rather a walk-through with a certain pud-crossing
range in a folded 3-level scenario. Not sure if I would have
understood my explanation above w/o that, but I hope you can
make some sense out of it. Or draw yourself a picture :-)


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Wed, 9 Sep 2020 15:03:24 -0300
Jason Gunthorpe  wrote:

> On Wed, Sep 09, 2020 at 07:25:34PM +0200, Gerald Schaefer wrote:
> > I actually had to draw myself a picture to get some hold of
> > this, or rather a walk-through with a certain pud-crossing
> > range in a folded 3-level scenario. Not sure if I would have
> > understood my explanation above w/o that, but I hope you can
> > make some sense out of it. Or draw yourself a picture :-)
> 
> What I don't understand is how does anything work with S390 today?

That is totally comprehensible :-)

> If the fix is only to change pxx_addr_end() then than generic code
> like mm/pagewalk.c will iterate over a *different list* of page table
> entries. 
> 
> It's choice of entries to look at is entirely driven by pxx_addr_end().
> 
> Which suggest to me that mm/pagewalk.c also doesn't work properly
> today on S390 and this issue is not really about stack variables?

I guess you are confused by the fact that the generic change will indeed
change the logic for _all_ pagetable walkers on s390, not just for
the gup_fast case. But that doesn't mean that they were doing it wrong
before, we simply can do it both ways. However, we probably should
make that (in theory useless) change more explicit.

Let's compare before and after for mm/pagewalk.c on s390, with 3-level
pagetables, range crossing 2 GB pud boundary.

* Before (with pXd_addr_end always using static 5-level PxD_SIZE):

walk_pgd_range()
-> pgd_addr_end() will use static 2^53 PGDIR_SIZE, range is not cropped,
  no iterations needed, passed over to next level

walk_p4d_range()
-> p4d_addr_end() will use static 2^42 P4D_SIZE, range still not cropped

walk_pud_range()
-> pud_addr_end() now we're cropping, with 2^31 PUD_SIZE, need two
  iterations for range crossing pud boundary, doing
  that right here on a pudp which is actually the
  previously passed-through pgdp/p4dp (pointing to
  correct pagetable entry)

* After (with dynamic pXd_addr_end using "correct" PxD_SIZE boundaries,
 should be similar to other archs static "top-level folding"):

walk_pgd_range()
-> pgd_addr_end() will now determine "correct" boundary based on pgd
  value, i.e. 2^31 PUD_SIZE, do cropping now, iteration
  will now happen here

walk_p4d/pud_range()
->  operate on cropped range, will not iterate, instead return to pgd level,
which will then use the same pointer for iteration as in the "Before"
case, but not on the same level.

IMHO, our "Before" logic is more efficient, and also feels more natural.
After all, it is not really necessary to return to pgd level, and it will
surely cost some extra instructions. We are willing to take that cost
for the sake of doing it in a more generic way, hoping that will reduce
future issues. E.g. you already mentioned that you have plans for using
the READ_ONCE logic also in other places, and that would be such a
"future issue".

> Fundamentally if pXX_offset() and pXX_addr_end() must be consistent
> together, if pXX_offset() is folded then pXX_addr_end() must cause a
> single iteration of that level.

well, that sounds correct in theory, but I guess it depends on "how
you fold it". E.g. what does "if pXX_offset() is folded" mean?
Take pgd_offset() for the 3-level case above. From our previous
"middle-level folding/iteration" perspective, I would say that
pgd/p4d are folded into pud, so if you say "if pgd_offset() is folded
then pgd_addr_end() must cause a single iteration of that level",
we were doing it all correctly, i.e only having single iteration
on pgd/p4d level. You could even say that all others are doing /
using it wrong :-)

Now take pgd_offset() from the "top-level folding/iteration".
Here you would say that p4d/pud are folded into pgd, which again
does not sound like the natural / most efficient way to me,
but IIUC this has to be how it works for all other archs with
(static) pagetable folding. Now you'd say "if pud/p4d_offset()
is folded then pud/p4d_addr_end() must cause a single iteration
of that level", and that would sound correct. At least until
you look more closely, because e.g. p4d_addr_end() in
include/asm-generic/pgtable-nop4d.h is simply this:
#define p4d_addr_end(addr, end) (end)

How can that cause a single iteration? It clearly won't, it only
works because the previous pgd_addr_end already cropped the range
so that there will be only single iterations for p4d/pud.

The more I think of it, the more it sounds like s390 "middle-level
folding/iteration" was doing it "the right way", and everybody else
was wrong, or at least not in an optimally efficient way :-) Might
also 

Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe  wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense ot of it.  
> 
> I would say the page table API requires this invariant:
> 
> pud = pud_offset(p4d, addr);
> do {
>   WARN_ON(pud != pud_offset(p4d, addr);
> next = pud_addr_end(addr, end);
> } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 
> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.
> 
> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

It only is unexpected in a "top-level folding" world, see my other reply.
Consider it an optimization, which was possible because of how our dynamic
folding works, and e.g. because we can determine the correct pagetable
level from a pXd value in pXd_offset.

> This suggests another fix, which is to say that pud++ is undefined and
> pud_offset() must always be called, but I think that would cause worse
> codegen on all other archs.

There really is nothing to fix for s390 outside of gup_fast, or other
potential future READ_ONCE pagetable walkers. We do take the side-effect
of the generic change on all other pagetable walkers for s390, but it
really is rather a slight degradation than a fix.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 12:10:26 -0300
Jason Gunthorpe  wrote:

> On Thu, Sep 10, 2020 at 03:28:03PM +0200, Gerald Schaefer wrote:
> > On Thu, 10 Sep 2020 10:02:33 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> > >   
> > > > As Gerald mentioned, it is very difficult to explain in a clear way.
> > > > Hopefully, one could make sense ot of it.
> > > 
> > > I would say the page table API requires this invariant:
> > > 
> > > pud = pud_offset(p4d, addr);
> > > do {
> > >   WARN_ON(pud != pud_offset(p4d, addr);
> > > next = pud_addr_end(addr, end);
> > > } while (pud++, addr = next, addr != end);
> > > 
> > > ie pud++ is supposed to be a shortcut for 
> > >   pud_offset(p4d, next)
> > > 
> > > While S390 does not follow this. Fixing addr_end brings it into
> > > alignment by preventing pud++ from happening.
> > > 
> > > The only currently known side effect is that gup_fast crashes, but it
> > > sure is an unexpected thing.  
> > 
> > It only is unexpected in a "top-level folding" world, see my other reply.
> > Consider it an optimization, which was possible because of how our dynamic
> > folding works, and e.g. because we can determine the correct pagetable
> > level from a pXd value in pXd_offset.  
> 
> No, I disagree. The page walker API the arch presents has to have well
> defined semantics. For instance, there is an effort to define tests
> and invarients for the page table accesses to bring this understanding
> and uniformity:
> 
>  mm/debug_vm_pgtable.c
> 
> If we fix S390 using the pX_addr_end() change then the above should be
> updated with an invariant to check it. I've added Anshuman for some
> thoughts..

We are very aware of those tests, and actually a big supporter of the
idea. Also part of the supported architectures already, and it has
already helped us find / fix some s390 oddities.

However, we did not see any issues wrt to our pagetable walking,
neither with the current version, nor with the new generic approach.
We do currently see other issues, Anshuman will know what I mean :-)

> For better or worse, that invariant does exclude arches from using
> other folding techniques.
> 
> The other solution would be to address the other side of != and adjust
> the pud++
> 
> eg replcae pud++ with something like:
>   pud = pud_next_entry(p4d, pud, next)
> 
> Such that:
>   pud_next_entry(p4d, pud, next) === pud_offset(p4d, next)
> 
> In which case the invarient changes to 'callers can never do pointer
> arithmetic on the result of pXX_offset()' which is a bit harder to
> enforce.

I might have lost track a bit. Are we still talking about possible
functional impacts of either our current pagetable walking with s390
(apart from gup_fast), or the proposed generic change (for s390, or
others?)?

Or is this rather some (other) generic issue / idea that you have,
in order to put "some more structure / enforcement" to generic
pagetable walkers?


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 10:02:33 -0300
Jason Gunthorpe  wrote:

> On Thu, Sep 10, 2020 at 11:39:25AM +0200, Alexander Gordeev wrote:
> 
> > As Gerald mentioned, it is very difficult to explain in a clear way.
> > Hopefully, one could make sense ot of it.
> 
> I would say the page table API requires this invariant:
> 
> pud = pud_offset(p4d, addr);
> do {
>   WARN_ON(pud != pud_offset(p4d, addr);
> next = pud_addr_end(addr, end);
> } while (pud++, addr = next, addr != end);
> 
> ie pud++ is supposed to be a shortcut for 
>   pud_offset(p4d, next)
> 

Hmm, IIUC, all architectures with static folding will simply return
the passed-in p4d pointer for pud_offset(p4d, addr), for 3-level
pagetables. There is no difference for s390. For gup_fast, that p4d
pointer is not really a pointer to a value in a pagetable, but
to some local copy of such a value, and not just for s390.

So, pud = p4d = pointer to copy, and increasing that pud pointer
cannot be the same as pud_offset(p4d, next). I do see your point
however, at last I think :-) My problem is that I do not see where
we would have an s390-specific issue here. Maybe my understanding
of how it works for others with static folding is wrong. That
would explain my difficulties in getting your point...

> While S390 does not follow this. Fixing addr_end brings it into
> alignment by preventing pud++ from happening.

Exactly, only that nobody seems to follow it, IIUC. Fixing it up
with pXd_addr_end was my impression of what we need to do, in order to
have it work the same way as for others.

> The only currently known side effect is that gup_fast crashes, but it
> sure is an unexpected thing.

Well, from my understanding it feels more unexpected that something
that is supposed to be a pointer to an entry in a page table, really is
just a pointer to some copy somewhere.


Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding

2020-09-10 Thread Gerald Schaefer
On Thu, 10 Sep 2020 11:33:17 -0700
Linus Torvalds  wrote:

> On Thu, Sep 10, 2020 at 11:13 AM Jason Gunthorpe  wrote:
> >
> > So.. To change away from the stack option I think we'd have to pass
> > the READ_ONCE value to pXX_offset() as an extra argument instead of it
> > derefing the pointer internally.
> 
> Yeah, but I think that would actually be the better model than passing
> an address to a random stack location.
> 
> It's also effectively what we do in some other places, eg the whole
> logic with "orig" in the regular pte fault handling is basically doing
> unlocked loads of the pte, various decisions on that, and then doing a
> final "is this still the same pte" after it has gotten the page table
> lock.

That sounds a lot like the pXd_offset_orig() from Martins first approach
in this thread:
https://lore.kernel.org/linuxppc-dev/20190418100218.0a4afd51@mschwideX1/

It is also the "Patch 1" option from the start of this thread:
https://lore.kernel.org/lkml/20200828140314.8556-1-gerald.schae...@linux.ibm.com/

I guess I chose wrongly there, should have had more trust in Martins
approach, and not try so hard to do it like others...

So, maybe we can start over again, from that patch option. It would of
course also initially introduce some gup-specific helpers, like with
the other approach. It seemed harder to generalize when I thought
about it back then, but I guess it should not be a lot harder than
the _addr_end stuff.

Or, maybe this time, just not to risk Christian getting a heart attack,
we could go for the gup-specific helper first, so that we would at
least have a fix for the possible s390 data corruption. Jason, would
you agree that we send a new RFC, this time with pXd_offset_orig()
approach, and have that accepted as short-term fix?

Or would you rather also wait for some proper generic change? Have
lost that option from my radar, so cannot really judge how much more
effort it would be. I'm on vacation next week anyway, but Alexander
or Vasily (who did the option 1 patch) could look into this further.


Re: Freeing page tables through RCU

2021-02-26 Thread Gerald Schaefer
On Thu, 25 Feb 2021 20:58:20 +
Matthew Wilcox  wrote:

> In order to walk the page tables without the mmap semaphore, it must
> be possible to prevent them from being freed and reused (eg if munmap()
> races with viewing /proc/$pid/smaps).
> 
> There is various commentary within the mm on how to prevent this.  One way
> is to disable interrupts, relying on that to block rcu_sched or IPIs.
> I don't think the RT people are terribly happy about reading a proc file
> disabling interrupts, and it doesn't work for architectures that free
> page tables directly instead of batching them into an rcu_sched (because
> the IPI may not be sent to this CPU if the task has never run on it).
> 
> See "Fast GUP" in mm/gup.c
> 
> Ideally, I'd like rcu_read_lock() to delay page table reuse.  This is
> close to trivial for architectures which use entire pages or multiple
> pages for levels of their page tables as we can use the rcu_head embedded
> in struct page to queue the page for RCU.
> 
> s390 and powerpc are the only two architectures I know of that have
> levels of their page table that are smaller than their PAGE_SIZE.
> I'd like to discuss options.  There may be a complicated scheme that
> allows partial pages to be freed via RCU, but I have something simpler
> in mind.  For powerpc in particular, it can have a PAGE_SIZE of 64kB
> and then the MMU wants to see 4kB entries in the PMD.  I suggest that
> instead of allocating each 4kB entry individually, we allocate a 64kB
> page and fill in 16 consecutive PMDs.  This could cost a bit more memory
> (although if you've asked for a CONFIG_PAGE_SIZE of 64kB, you presumably
> don't care too much about it), but it'll make future page faults cheaper
> (as the PMDs will already be present, assuming you have good locality
> of reference).
> 
> I'd like to hear better ideas than this.

Some background on the situation for s390: The architecture defines
an 8 bit pagetable index, so we have 256 entries in a 2 KB pagetable,
but PAGE_SIZE is 4 KB. pte_alloc(_one) will use alloc_page() to allocate
a full 4 KB page, and then do some housekeeping to maintain a per-mm list
of such 4 KB pages, which will contain either one or two 2 KB pagetable
fragments.

This is also the reason why pgtable_t on s390 is not pointing to the
struct page of the (4 KB) page containing a 2 KB pagetable fragment, but
rather to the 2 KB pagetable itself.

I see at least two issues here, with using rcu_head embedded in the
struct page (for a 4 KB page):

1) There might be two 2 KB pagetables present in that 4 KB page,
and the rcu_head would affect both. Not sure if this would really be
a problem, because we already have a similar situation with the split
ptlock embedded in struct page, which also might lock two 2 KB pagetables,
i.e. more than necessary. It still is far less "over-locking" than
using mm->page_table_lock, and the move_pte() code e.g. takes care
to avoid a deadlock if src and dst ptlocks happen to be on the
same page.

So, a similar "over-locking" might also be possible and acceptable
for the rcu_head approach, but I do not really understand if that could
have some deadlock or other unwanted side-effects.

2) The "housekeeping" of our 2 KB pagetable fragments uses page->lru
to maintain the per-mm list. It also (mis)uses page->_refcount to mark
which 2 KB half is used/free, but that should not be an issue I guess.
Using page->lru will be an issue though. IIUC, then page->rcu_head will
overlay with page->lru, so using page->rcu_head for pagetable pages
on s390 would conflict with our page->lru usage for such pagetable pages.

I do not really see how that could be fixed, maybe we could find and
re-use other struct page members for our 2 KB fragment list. Also, for
kvm, there seem to be even more users of page->lru for pagetable pages,
in arch/s390/mm/gmap.c. Not sure though if those would actually also
affect "regular" pagetable walks, or if they are somehow independent.
But if we'd find some new list home for the 2 KB fragments, then that
could probably also be used for the gmap stuff.


Re: [PATCH v1 5/7] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-15 Thread Gerald Schaefer
On Tue, 15 Mar 2022 15:18:35 +0100
David Hildenbrand  wrote:

> Let's steal one bit from the offset. While at it, document the meaning
> of bit 62 for swap ptes.

You define _PAGE_SWP_EXCLUSIVE as _PAGE_LARGE, which is bit 52, and
this is not part of the swap pte offset IIUC. So stealing any bit might
actually not be necessary, see below.

Also, bit 62 should be the soft dirty bit for normal PTEs, and this
doesn't seem to be used for swap PTEs at all. But I might be missing
some use case where softdirty also needs to be preserved in swap PTEs.

> 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/s390/include/asm/pgtable.h | 37 ++---
>  1 file changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 008a6c856fa4..c182212a2b44 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -181,6 +181,8 @@ static inline int is_module_addr(void *addr)
>  #define _PAGE_SOFT_DIRTY 0x000
>  #endif
>  
> +#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE  /* SW pte exclusive swap bit */
> +
>  /* Set of bits not changed in pte_modify */
>  #define _PAGE_CHG_MASK   (PAGE_MASK | _PAGE_SPECIAL | 
> _PAGE_DIRTY | \
>_PAGE_YOUNG | _PAGE_SOFT_DIRTY)
> @@ -796,6 +798,24 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif
>  
> +#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> +static inline pte_t pte_swp_mkexclusive(pte_t pte)
> +{
> + pte_val(pte) |= _PAGE_SWP_EXCLUSIVE;
> + return pte;
> +}
> +
> +static inline int pte_swp_exclusive(pte_t pte)
> +{
> + return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
> +}
> +
> +static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> +{
> + pte_val(pte) &= ~_PAGE_SWP_EXCLUSIVE;
> + return pte;
> +}
> +
>  static inline int pte_soft_dirty(pte_t pte)
>  {
>   return pte_val(pte) & _PAGE_SOFT_DIRTY;
> @@ -1675,16 +1695,19 @@ static inline int has_transparent_hugepage(void)
>   * information in the lowcore.
>   * Bits 54 and 63 are used to indicate the page type.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> + * This leaves the bits 0-50 and bits 56-61 to store type and offset.
> + * We use the 5 bits from 57-61 for the type and the 51 bits from 0-50
>   * for the offset.
> - * |   offset|01100|type |00|
> - * |001122334455|5|55566|66|
> - * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + * |   offset   |E|01100|type |S0|
> + * |00112233445|5|5|55566|66|
> + * |012345678901234567890123456789012345678901234567890|1|23456|78901|23|
> + *
> + * S (bit 62) is used for softdirty tracking.

Unless there is some use for softdirty tracking in swap PTEs, I think
this description does not belong here, to the swap PTE layout.

> + * E (bit 51) is used to remember PG_anon_exclusive.

It is bit 52, at least with this patch, so I guess this could all be
done w/o stealing anything. That is, of course, only if it is allowed
to use bit 52 in this case. The POP says bit 52 has to be 0, or else
a "translation-specification exception" is recognized. However, I think
it could be OK for PTEs marked as invalid, like it is the case for swap
PTEs.

The comment here says at the beginning:
/*
 * 64 bit swap entry format:
 * A page-table entry has some bits we have to treat in a special way.
 * Bits 52 and bit 55 have to be zero, otherwise a specification
 * exception will occur instead of a page translation exception. The
 * specification exception has the bad habit not to store necessary
 * information in the lowcore.

This would mean that it is not OK to have bit 52 not zero for swap PTEs.
But if I read the POP correctly, all bits except for the DAT-protection
would be ignored for invalid PTEs, so maybe this comment needs some update
(for both bits 52 and also 55).

Heiko might also have some more insight.

Anyway, stealing bit 51 might still be an option, but then
_PAGE_SWP_EXCLUSIVE would need to be defined appropriately.


Re: [PATCH v1 5/7] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-16 Thread Gerald Schaefer
On Tue, 15 Mar 2022 18:12:16 +0100
David Hildenbrand  wrote:

> On 15.03.22 17:58, David Hildenbrand wrote:
> > 
> >>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
> >>> But if I read the POP correctly, all bits except for the DAT-protection
> >>> would be ignored for invalid PTEs, so maybe this comment needs some update
> >>> (for both bits 52 and also 55).
> >>>
> >>> Heiko might also have some more insight.
> >>
> >> Indeed, I wonder why we should get a specification exception when the
> >> PTE is invalid. I'll dig a bit into the PoP.
> > 
> > SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
> > 
> > "The page-table entry used for the translation is
> > valid, and bit position 52 does not contain zero."
> > 
> > "The page-table entry used for the translation is
> > valid, EDAT-1 does not apply, the instruction-exe-
> > cution-protection facility is not installed, and bit
> > position 55 does not contain zero. It is model
> > dependent whether this condition is recognized."
> > 
> 
> I wonder if the following matches reality:
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 008a6c856fa4..6a227a8c3712 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
>  /*
>   * 64 bit swap entry format:
>   * A page-table entry has some bits we have to treat in a special way.
> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> - * exception will occur instead of a page translation exception. The
> - * specification exception has the bad habit not to store necessary
> - * information in the lowcore.
>   * Bits 54 and 63 are used to indicate the page type.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> - * for the offset.
> - * | offset|01100|type |00|
> + * | offset|XX1XX|type |S0|
>   * |001122334455|5|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + *
> + * Bits 0-51 store the offset.
> + * Bits 57-62 store the type.
> + * Bit 62 (S) is used for softdirty tracking.
> + * Bits 52, 53, 55 and 56 (X) are unused.
>   */
>  
>  #define __SWP_OFFSET_MASK  ((1UL << 52) - 1)
> 
> 
> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
> "0". At least for 52 and 55 there was a clear description.

Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition
to protection bit 54. Bit 55, along with bit 52, has to be zero according
to the (potentially deprecated) comment.

It is interesting that bit 56 seems to be unused, at least according
to the comment, but that would also mention bit 62 as unused, so that
clearly needs some update.

If bit 56 could be used for _PAGE_SWP_EXCLUSIVE, that would be better
than stealing a bit from the offset, or using potentially dangerous
bit 52. It is defined as _PAGE_UNUSED and only used for kvm, not sure
if this is also relevant for swap ptes, similar to bit 62.

Adding Christian on cc, maybe he has some insight on _PAGE_UNUSED
bit 56 and swap ptes.


Re: [PATCH v1 5/7] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-16 Thread Gerald Schaefer
On Wed, 16 Mar 2022 14:01:07 +0100
Christian Borntraeger  wrote:

> 
> 
> Am 16.03.22 um 11:56 schrieb Gerald Schaefer:
> > On Tue, 15 Mar 2022 18:12:16 +0100
> > David Hildenbrand  wrote:
> > 
> >> On 15.03.22 17:58, David Hildenbrand wrote:
> >>>
> >>>>> This would mean that it is not OK to have bit 52 not zero for swap PTEs.
> >>>>> But if I read the POP correctly, all bits except for the DAT-protection
> >>>>> would be ignored for invalid PTEs, so maybe this comment needs some 
> >>>>> update
> >>>>> (for both bits 52 and also 55).
> >>>>>
> >>>>> Heiko might also have some more insight.
> >>>>
> >>>> Indeed, I wonder why we should get a specification exception when the
> >>>> PTE is invalid. I'll dig a bit into the PoP.
> >>>
> >>> SA22-7832-12 6-46 ("Translation-Specification Exception") is clearer
> >>>
> >>> "The page-table entry used for the translation is
> >>> valid, and bit position 52 does not contain zero."
> >>>
> >>> "The page-table entry used for the translation is
> >>> valid, EDAT-1 does not apply, the instruction-exe-
> >>> cution-protection facility is not installed, and bit
> >>> position 55 does not contain zero. It is model
> >>> dependent whether this condition is recognized."
> >>>
> >>
> >> I wonder if the following matches reality:
> >>
> >> diff --git a/arch/s390/include/asm/pgtable.h 
> >> b/arch/s390/include/asm/pgtable.h
> >> index 008a6c856fa4..6a227a8c3712 100644
> >> --- a/arch/s390/include/asm/pgtable.h
> >> +++ b/arch/s390/include/asm/pgtable.h
> >> @@ -1669,18 +1669,16 @@ static inline int has_transparent_hugepage(void)
> >>   /*
> >>* 64 bit swap entry format:
> >>* A page-table entry has some bits we have to treat in a special way.
> >> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> >> - * exception will occur instead of a page translation exception. The
> >> - * specification exception has the bad habit not to store necessary
> >> - * information in the lowcore.
> >>* Bits 54 and 63 are used to indicate the page type.
> >>* A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> >> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> >> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> >> - * for the offset.
> >> - * | offset|01100|type |00|
> >> + * | offset|XX1XX|type |S0|
> >>* |001122334455|5|55566|66|
> >>* |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> >> + *
> >> + * Bits 0-51 store the offset.
> >> + * Bits 57-62 store the type.
> >> + * Bit 62 (S) is used for softdirty tracking.
> >> + * Bits 52, 53, 55 and 56 (X) are unused.
> >>*/
> >>   
> >>   #define __SWP_OFFSET_MASK  ((1UL << 52) - 1)
> >>
> >>
> >> I'm not sure why bit 53 was indicated as "1" and bit 55 was indicated as
> >> "0". At least for 52 and 55 there was a clear description.
> > 
> > Bit 53 is the invalid bit, and that is always 1 for swap ptes, in addition
> > to protection bit 54. Bit 55, along with bit 52, has to be zero according
> > to the (potentially deprecated) comment.
> > 
> > It is interesting that bit 56 seems to be unused, at least according
> > to the comment, but that would also mention bit 62 as unused, so that
> > clearly needs some update.
> > 
> > If bit 56 could be used for _PAGE_SWP_EXCLUSIVE, that would be better
> > than stealing a bit from the offset, or using potentially dangerous
> > bit 52. It is defined as _PAGE_UNUSED and only used for kvm, not sure
> > if this is also relevant for swap ptes, similar to bit 62.
> > 
> > Adding Christian on cc, maybe he has some insight on _PAGE_UNUSED
> > bit 56 and swap ptes.
> 
> I think _PAGE_UNUSED is not used for swap ptes. It is used _before_ swapping
> to decide whether we swap or discard the page.
> 
> Regarding bit 52, the POP says in chapter 3 for the page table entry
> 
> [..]
> Page-Invalid Bit (I): Bit 53 controls whether the
> page associated with the page-table entry is avail-
> ab

Re: [PATCH v2 5/8] s390/pgtable: cleanup description of swp pte layout

2022-03-30 Thread Gerald Schaefer
On Tue, 29 Mar 2022 18:43:26 +0200
David Hildenbrand  wrote:

> Bit 52 and bit 55 don't have to be zero: they only trigger a
> translation-specifiation exception if the PTE is marked as valid, which
> is not the case for swap ptes.
> 
> Document which bits are used for what, and which ones are unused.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/s390/include/asm/pgtable.h | 17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 9df679152620..3982575bb586 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1712,18 +1712,17 @@ static inline int has_transparent_hugepage(void)
>  /*
>   * 64 bit swap entry format:
>   * A page-table entry has some bits we have to treat in a special way.
> - * Bits 52 and bit 55 have to be zero, otherwise a specification
> - * exception will occur instead of a page translation exception. The
> - * specification exception has the bad habit not to store necessary
> - * information in the lowcore.
> - * Bits 54 and 63 are used to indicate the page type.
> + * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
> + * as invalid.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * This leaves the bits 0-51 and bits 56-62 to store type and offset.
> - * We use the 5 bits from 57-61 for the type and the 52 bits from 0-51
> - * for the offset.
> - * |   offset|01100|type |00|
> + * |   offset|X11XX|type |S0|
>   * |001122334455|5|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
> + *
> + * Bits 0-51 store the offset.
> + * Bits 57-61 store the type.
> + * Bit 62 (S) is used for softdirty tracking.
> + * Bits 52, 55 and 56 (X) are unused.
>   */
>  
>  #define __SWP_OFFSET_MASK((1UL << 52) - 1)

Thanks David!

Reviewed-by: Gerald Schaefer 


Re: [PATCH v2 6/8] s390/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-03-30 Thread Gerald Schaefer
On Tue, 29 Mar 2022 18:43:27 +0200
David Hildenbrand  wrote:

> Let's use bit 52, which is unused.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  arch/s390/include/asm/pgtable.h | 23 +--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 3982575bb586..a397b072a580 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -181,6 +181,8 @@ static inline int is_module_addr(void *addr)
>  #define _PAGE_SOFT_DIRTY 0x000
>  #endif
>  
> +#define _PAGE_SWP_EXCLUSIVE _PAGE_LARGE  /* SW pte exclusive swap bit */
> +
>  /* Set of bits not changed in pte_modify */
>  #define _PAGE_CHG_MASK   (PAGE_MASK | _PAGE_SPECIAL | 
> _PAGE_DIRTY | \
>_PAGE_YOUNG | _PAGE_SOFT_DIRTY)
> @@ -826,6 +828,22 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif
>  
> +#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
> +static inline int pte_swp_exclusive(pte_t pte)
> +{
> + return pte_val(pte) & _PAGE_SWP_EXCLUSIVE;
> +}
> +
> +static inline pte_t pte_swp_mkexclusive(pte_t pte)
> +{
> + return set_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
> +}
> +
> +static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> +{
> + return clear_pte_bit(pte, __pgprot(_PAGE_SWP_EXCLUSIVE));
> +}
> +
>  static inline int pte_soft_dirty(pte_t pte)
>  {
>   return pte_val(pte) & _PAGE_SOFT_DIRTY;
> @@ -1715,14 +1733,15 @@ static inline int has_transparent_hugepage(void)
>   * Bits 54 and 63 are used to indicate the page type. Bit 53 marks the pte
>   * as invalid.
>   * A swap pte is indicated by bit pattern (pte & 0x201) == 0x200
> - * |   offset|X11XX|type |S0|
> + * |   offset|E11XX|type |S0|
>   * |001122334455|5|55566|66|
>   * |0123456789012345678901234567890123456789012345678901|23456|78901|23|
>   *
>   * Bits 0-51 store the offset.
> + * Bit 52 (E) is used to remember PG_anon_exclusive.
>   * Bits 57-61 store the type.
>   * Bit 62 (S) is used for softdirty tracking.
> - * Bits 52, 55 and 56 (X) are unused.
> + * Bits 55 and 56 (X) are unused.
>   */
>  
>  #define __SWP_OFFSET_MASK((1UL << 52) - 1)

Thanks David!

Reviewed-by: Gerald Schaefer 


Re: [PATCH 3/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

2022-04-29 Thread Gerald Schaefer
On Fri, 29 Apr 2022 16:14:43 +0800
Baolin Wang  wrote:

> On some architectures (like ARM64), it can support CONT-PTE/PMD size
> hugetlb, which means it can support not only PMD/PUD size hugetlb:
> 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page
> size specified.
> 
> When unmapping a hugetlb page, we will get the relevant page table
> entry by huge_pte_offset() only once to nuke it. This is correct
> for PMD or PUD size hugetlb, since they always contain only one
> pmd entry or pud entry in the page table.
> 
> However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
> since they can contain several continuous pte or pmd entry with
> same page table attributes, so we will nuke only one pte or pmd
> entry for this CONT-PTE/PMD size hugetlb page.
> 
> And now we only use try_to_unmap() to unmap a poisoned hugetlb page,
> which means now we will unmap only one pte entry for a CONT-PTE or
> CONT-PMD size poisoned hugetlb page, and we can still access other
> subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page,
> which will cause serious issues possibly.
> 
> So we should change to use huge_ptep_clear_flush() to nuke the
> hugetlb page table to fix this issue, which already considered
> CONT-PTE and CONT-PMD size hugetlb.
> 
> Note we've already used set_huge_swap_pte_at() to set a poisoned
> swap entry for a poisoned hugetlb page.
> 
> Signed-off-by: Baolin Wang 
> ---
>  mm/rmap.c | 34 +-
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7cf2408..1e168d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1564,28 +1564,28 @@ static bool try_to_unmap_one(struct folio *folio, 
> struct vm_area_struct *vma,
>   break;
>   }
>   }
> + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);

Unlike in your patch 2/3, I do not see that this (huge) pteval would later
be used again with set_huge_pte_at() instead of set_pte_at(). Not sure if
this (huge) pteval could end up at a set_pte_at() later, but if yes, then
this would be broken on s390, and you'd need to use set_huge_pte_at()
instead of set_pte_at() like in your patch 2/3.

Please note that huge_ptep_get functions do not return valid PTEs on s390,
and such PTEs must never be set directly with set_pte_at(), but only with
set_huge_pte_at().

Background is that, for hugetlb pages, we are of course not really dealing
with PTEs at this level, but rather PMDs or PUDs, depending on hugetlb size.
On s390, the layout is quite different for PTEs and PMDs / PUDs, and
unfortunately the hugetlb code is not properly reflecting this by using
PMD or PUD types, like the THP code does.

So, as work-around, on s390, the huge_ptep_xxx functions will return
only fake PTEs, which must be converted again to a proper PMD or PUD,
before writing them to the page table, which is what happens in
set_huge_pte_at(), but not in set_pte_at().


Re: [PATCH 3/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

2022-05-02 Thread Gerald Schaefer
On Sat, 30 Apr 2022 11:22:33 +0800
Baolin Wang  wrote:

> 
> 
> On 4/30/2022 4:02 AM, Gerald Schaefer wrote:
> > On Fri, 29 Apr 2022 16:14:43 +0800
> > Baolin Wang  wrote:
> > 
> >> On some architectures (like ARM64), it can support CONT-PTE/PMD size
> >> hugetlb, which means it can support not only PMD/PUD size hugetlb:
> >> 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page
> >> size specified.
> >>
> >> When unmapping a hugetlb page, we will get the relevant page table
> >> entry by huge_pte_offset() only once to nuke it. This is correct
> >> for PMD or PUD size hugetlb, since they always contain only one
> >> pmd entry or pud entry in the page table.
> >>
> >> However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
> >> since they can contain several continuous pte or pmd entry with
> >> same page table attributes, so we will nuke only one pte or pmd
> >> entry for this CONT-PTE/PMD size hugetlb page.
> >>
> >> And now we only use try_to_unmap() to unmap a poisoned hugetlb page,
> >> which means now we will unmap only one pte entry for a CONT-PTE or
> >> CONT-PMD size poisoned hugetlb page, and we can still access other
> >> subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page,
> >> which will cause serious issues possibly.
> >>
> >> So we should change to use huge_ptep_clear_flush() to nuke the
> >> hugetlb page table to fix this issue, which already considered
> >> CONT-PTE and CONT-PMD size hugetlb.
> >>
> >> Note we've already used set_huge_swap_pte_at() to set a poisoned
> >> swap entry for a poisoned hugetlb page.
> >>
> >> Signed-off-by: Baolin Wang 
> >> ---
> >>   mm/rmap.c | 34 +-
> >>   1 file changed, 17 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 7cf2408..1e168d7 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1564,28 +1564,28 @@ static bool try_to_unmap_one(struct folio *folio, 
> >> struct vm_area_struct *vma,
> >>break;
> >>}
> >>}
> >> +  pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > 
> > Unlike in your patch 2/3, I do not see that this (huge) pteval would later
> > be used again with set_huge_pte_at() instead of set_pte_at(). Not sure if
> > this (huge) pteval could end up at a set_pte_at() later, but if yes, then
> > this would be broken on s390, and you'd need to use set_huge_pte_at()
> > instead of set_pte_at() like in your patch 2/3.
> 
> IIUC, As I said in the commit message, we will only unmap a poisoned 
> hugetlb page by try_to_unmap(), and the poisoned hugetlb page will be 
> remapped with a poisoned entry by set_huge_swap_pte_at() in 
> try_to_unmap_one(). So I think no need change to use set_huge_pte_at() 
> instead of set_pte_at() for other cases, since the hugetlb page will not 
> hit other cases.
> 
> if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
>   pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
>   if (folio_test_hugetlb(folio)) {
>   hugetlb_count_sub(folio_nr_pages(folio), mm);
>   set_huge_swap_pte_at(mm, address, pvmw.pte, pteval,
>vma_mmu_pagesize(vma));
>   } else {
>   dec_mm_counter(mm, mm_counter(&folio->page));
>   set_pte_at(mm, address, pvmw.pte, pteval);
>   }
> 
> }

OK, but wouldn't the pteval be overwritten here with
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage))?
IOW, what sense does it make to save the returned pteval from
huge_ptep_clear_flush(), when it is never being used anywhere?


Re: [PATCH 3/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

2022-05-03 Thread Gerald Schaefer
On Tue, 3 May 2022 10:19:46 +0800
Baolin Wang  wrote:

> 
> 
> On 5/2/2022 10:02 PM, Gerald Schaefer wrote:
> > On Sat, 30 Apr 2022 11:22:33 +0800
> > Baolin Wang  wrote:
> > 
> >>
> >>
> >> On 4/30/2022 4:02 AM, Gerald Schaefer wrote:
> >>> On Fri, 29 Apr 2022 16:14:43 +0800
> >>> Baolin Wang  wrote:
> >>>
> >>>> On some architectures (like ARM64), it can support CONT-PTE/PMD size
> >>>> hugetlb, which means it can support not only PMD/PUD size hugetlb:
> >>>> 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page
> >>>> size specified.
> >>>>
> >>>> When unmapping a hugetlb page, we will get the relevant page table
> >>>> entry by huge_pte_offset() only once to nuke it. This is correct
> >>>> for PMD or PUD size hugetlb, since they always contain only one
> >>>> pmd entry or pud entry in the page table.
> >>>>
> >>>> However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
> >>>> since they can contain several continuous pte or pmd entry with
> >>>> same page table attributes, so we will nuke only one pte or pmd
> >>>> entry for this CONT-PTE/PMD size hugetlb page.
> >>>>
> >>>> And now we only use try_to_unmap() to unmap a poisoned hugetlb page,
> >>>> which means now we will unmap only one pte entry for a CONT-PTE or
> >>>> CONT-PMD size poisoned hugetlb page, and we can still access other
> >>>> subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page,
> >>>> which will cause serious issues possibly.
> >>>>
> >>>> So we should change to use huge_ptep_clear_flush() to nuke the
> >>>> hugetlb page table to fix this issue, which already considered
> >>>> CONT-PTE and CONT-PMD size hugetlb.
> >>>>
> >>>> Note we've already used set_huge_swap_pte_at() to set a poisoned
> >>>> swap entry for a poisoned hugetlb page.
> >>>>
> >>>> Signed-off-by: Baolin Wang 
> >>>> ---
> >>>>mm/rmap.c | 34 +-
> >>>>1 file changed, 17 insertions(+), 17 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 7cf2408..1e168d7 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1564,28 +1564,28 @@ static bool try_to_unmap_one(struct folio 
> >>>> *folio, struct vm_area_struct *vma,
> >>>>  break;
> >>>>  }
> >>>>  }
> >>>> +pteval = huge_ptep_clear_flush(vma, address, 
> >>>> pvmw.pte);
> >>>
> >>> Unlike in your patch 2/3, I do not see that this (huge) pteval would later
> >>> be used again with set_huge_pte_at() instead of set_pte_at(). Not sure if
> >>> this (huge) pteval could end up at a set_pte_at() later, but if yes, then
> >>> this would be broken on s390, and you'd need to use set_huge_pte_at()
> >>> instead of set_pte_at() like in your patch 2/3.
> >>
> >> IIUC, As I said in the commit message, we will only unmap a poisoned
> >> hugetlb page by try_to_unmap(), and the poisoned hugetlb page will be
> >> remapped with a poisoned entry by set_huge_swap_pte_at() in
> >> try_to_unmap_one(). So I think no need change to use set_huge_pte_at()
> >> instead of set_pte_at() for other cases, since the hugetlb page will not
> >> hit other cases.
> >>
> >> if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
> >>pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> >>if (folio_test_hugetlb(folio)) {
> >>hugetlb_count_sub(folio_nr_pages(folio), mm);
> >>set_huge_swap_pte_at(mm, address, pvmw.pte, pteval,
> >> vma_mmu_pagesize(vma));
> >>} else {
> >>dec_mm_counter(mm, mm_counter(&folio->page));
> >>set_pte_at(mm, address, pvmw.pte, pteval);
> >>}
> >>
> >> }
> > 
> > OK, but wouldn't the pteval be overwritten here with
> > pteval = swp_entry_to_pte(make_hwpoison_entry(subpage))?
> > IOW, what sense does it make to save the returned pteval from
> > huge_ptep_clear_flu

Re: [PATCH V3 0/4] mm/debug_vm_pgtable: Add some more tests

2020-06-24 Thread Gerald Schaefer
On Wed, 24 Jun 2020 13:05:39 +0200
Alexander Gordeev  wrote:

> On Wed, Jun 24, 2020 at 08:43:10AM +0530, Anshuman Khandual wrote:
> 
> [...]
> 
> > Hello Gerald/Christophe/Vineet,
> > 
> > It would be really great if you could give this series a quick test
> > on s390/ppc/arc platforms respectively. Thank you.
> 
> That worked for me with the default and debug s390 configurations.
> Would you like to try with some particular options or combinations
> of the options?

It will be enabled automatically on all archs that set
ARCH_HAS_DEBUG_VM_PGTABLE, which we do for s390 unconditionally.
Also, DEBUG_VM has to be set, which we have only in the debug config.
So only the s390 debug config will have it enabled, you can check
dmesg for "debug_vm_pgtable" to see when / where it was run, and if it
triggered any warnings.

I also checked with the v3 series, and it works fine for s390.


Re: [PATCH 19/44] s390: implement ->mapping_error

2017-06-08 Thread Gerald Schaefer
On Thu,  8 Jun 2017 15:25:44 +0200
Christoph Hellwig  wrote:

> s390 can also use noop_dma_ops, and while that currently does not return
> errors it will so in the future.  Implementing the mapping_error method
> is the proper way to have per-ops error conditions.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Gerald Schaefer 



Re: [RFC PATCH 0/9] Enable THP migration for all possible architectures

2018-04-27 Thread Gerald Schaefer
On Thu, 26 Apr 2018 10:27:55 -0400
Zi Yan  wrote:

> From: Zi Yan 
> 
> Hi all,
> 
> THP migration is only enabled on x86_64 with a special
> ARCH_ENABLE_THP_MIGRATION macro. This patchset enables THP migration for
> all architectures that uses transparent hugepage, so that special macro can
> be dropped. Instead, THP migration is enabled/disabled via
> /sys/kernel/mm/transparent_hugepage/enable_thp_migration.
> 
> I grepped for TRANSPARENT_HUGEPAGE in arch folder and got 9 architectures that
> are supporting transparent hugepage. I mechanically add __pmd_to_swp_entry() 
> and
> __swp_entry_to_pmd() based on existing __pte_to_swp_entry() and
> __swp_entry_to_pte() for all these architectures, except tile which is going 
> to
> be dropped.

This will not work on s390, the pmd layout is very different from the pte
layout. Using __swp_entry/type/offset() on a pmd will go horribly wrong.
I currently don't see a chance to make this work for us, so please make/keep
this configurable, and do not configure it for s390.

Regards,
Gerald



Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn

2024-10-01 Thread Gerald Schaefer
On Sun, 22 Sep 2024 03:41:57 +0200
Dan Williams  wrote:

> [ add s390 folks to comment on CONFIG_FS_DAX_LIMITED ]

[...]

> > @@ -2516,6 +2545,44 @@ static vm_fault_t __vm_insert_mixed(struct 
> > vm_area_struct *vma,
> > return VM_FAULT_NOPAGE;
> >  }
> >  
> > +vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn_t, bool write)
> > +{
> > +   struct vm_area_struct *vma = vmf->vma;
> > +   pgprot_t pgprot = vma->vm_page_prot;
> > +   unsigned long pfn = pfn_t_to_pfn(pfn_t);
> > +   struct page *page = pfn_to_page(pfn);  
> 
> The problem here is that we stubbornly have __dcssblk_direct_access() to
> worry about. That is the only dax driver that does not return
> pfn_valid() pfns.
> 
> In fact, it looks like __dcssblk_direct_access() is the only thing
> standing in the way of the removal of pfn_t.
> 
> It turns out it has been 3 years since the last time the question of
> bringing s390 fully into the ZONE_DEVICE regime was raised:
> 
> https://lore.kernel.org/all/20210820210318.187742e8@thinkpad/
> 
> Given that this series removes PTE_DEVMAP which was a stumbling block,
> would it be feasible to remove CONFIG_FS_DAX_LIMITED for a few kernel
> cycles until someone from the s390 side can circle back to add full
> ZONE_DEVICE support?

Yes, see also my reply to your "dcssblk: Mark DAX broken" patch.
Thanks Alistair for your effort, making ZONE_DEVICE usable w/o extra
PTE bit!



Re: [PATCH] mm/hugetlb: bring gigantic page allocation under hugepages_supported()

2025-01-22 Thread Gerald Schaefer
On Tue, 21 Jan 2025 20:34:19 +0530
Sourabh Jain  wrote:

> Despite having kernel arguments to enable gigantic hugepages, this
> provides a way for the architecture to disable gigantic hugepages on the
> fly, similar to what we do for hugepages.
> 
> Components like fadump (PowerPC-specific) need this functionality to
> disable gigantic hugepages when the kernel is booted solely to collect
> the kernel core dump.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Heiko Carstens 
> Cc: Vasily Gorbik 
> Cc: Muchun Song 
> Cc: Madhavan Srinivasan 
> Cc: Michael Ellerman 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Sourabh Jain 
> ---
> 
> To evaluate the impact of this change on architectures other than
> PowerPC, I did the following analysis:
> 
> For architectures where hugepages_supported() is not redefined, it
> depends on HPAGE_SHIFT, which is found to be a constant. It is mostly
> initialized to PMD_SHIFT.
> 
> Architecture : HPAGE_SHIFT initialized with
> 
> ARC: PMD_SHIFT (constant)
> ARM: PMD_SHIFT (constant)
> ARM64: PMD_SHIFT (constant)
> Hexagon: 22 (constant)
> LoongArch: (PAGE_SHIFT + PAGE_SHIFT - 3) (appears to be constant)
> MIPS: (PAGE_SHIFT + PAGE_SHIFT - 3) (appears to be constant)
> PARISC: PMD_SHIFT (appears to be constant)
> RISC-V: PMD_SHIFT (constant)
> SH: 16 | 18 | 20 | 22 | 26 (constant)
> SPARC: 23 (constant)
> 
> So seems like this change shouldn't have any impact on above
> architectures.
> 
> On the S390 and X86 architectures, hugepages_supported() is redefined,
> and I am uncertain at what point it is safe to call
> hugepages_supported().

For s390, hugepages_supported() checks EDAT1 machine flag, which is
initialized long before any initcalls. So it is safe to be called
here.

My common code hugetlb skills got a little rusty, but shouldn't
arch_hugetlb_valid_size() already prevent getting here for gigantic
hugepages, in case they are not supported? And could you not use
that for your purpose?

BTW, please also add arch mailing list for such questions.

> 
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index cec4b121193f..48b42b8d26b4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4629,7 +4629,7 @@ static int __init hugepages_setup(char *s)
>* But we need to allocate gigantic hstates here early to still
>* use the bootmem allocator.
>*/
> - if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate))
> + if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate) && 
> hugepages_supported())
>   hugetlb_hstate_alloc_pages(parsed_hstate);
>  
>   last_mhp = mhp;




Re: [PATCH v8 20/20] device/dax: Properly refcount device dax pages when mapping

2025-02-20 Thread Gerald Schaefer
On Tue, 18 Feb 2025 14:55:36 +1100
Alistair Popple  wrote:

[...]
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 9a8879b..532a52a 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -460,11 +460,10 @@ void free_zone_device_folio(struct folio *folio)
>  {
>   struct dev_pagemap *pgmap = folio->pgmap;
>  
> - if (WARN_ON_ONCE(!pgmap->ops))
> - return;
> -
> - if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
> -  !pgmap->ops->page_free))
> + if (WARN_ON_ONCE((!pgmap->ops &&
> +   pgmap->type != MEMORY_DEVICE_GENERIC) ||
> +  (pgmap->ops && !pgmap->ops->page_free &&
> +   pgmap->type != MEMORY_DEVICE_FS_DAX)))

Playing around with dcssblk, adding devm_memremap_pages() and
pgmap.type = MEMORY_DEVICE_FS_DAX, similar to the other two existing
FS_DAX drivers drivers/nvdimm/pmem.c and fs/fuse/virtio_fs.c, I hit
this warning when executing binaries from DAX-mounted fs.

I do not set up pgmap->ops, similar to fs/fuse/virtio_fs.c, and I don't see
why they would be needed here anyway, at least for MEMORY_DEVICE_FS_DAX.
drivers/nvdimm/pmem.c does set up pgmap->ops, but only ->memory_failure,
which is still good enough to not trigger the warning here, probably just
by chance.

Now I wonder:
1) What is this check / warning good for, when this function only ever
   calls pgmap->ops->page_free(), but not for MEMORY_DEVICE_FS_DAX and
   not for MEMORY_DEVICE_GENERIC (the latter only after this patch)?
2) Is the warning also seen for virtio DAX mappings (added Vivek and
   Stefan on CC)? No pgmap->ops set up there, so I would guess "yes",
   and already before this series, with the old check / warning.
3) Could this be changed to only check / warn if pgmap->ops (or maybe
   rather pgmap->ops->page_free) is not set up, but not for
   MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX where this is not
   being called?
4) Or is there any reason why pgmap->ops would be required for
   MEMORY_DEVICE_FS_DAX?

Apart from the warning, we would also miss out on the
wake_up_var(&folio->page) in the MEMORY_DEVICE_FS_DAX case, when no
pgmap->ops was set up. IIUC, even before this change / series (i.e.
for virtio DAX only, since dcssblk was not using ZONE_DEVICE before,
and pmem seems to work by chance because they have ops->memory_failure).

>   return;
>  
>   mem_cgroup_uncharge(folio);
> @@ -494,7 +493,8 @@ void free_zone_device_folio(struct folio *folio)
>* zero which indicating the page has been removed from the file
>* system mapping.
>*/
> - if (pgmap->type != MEMORY_DEVICE_FS_DAX)
> + if (pgmap->type != MEMORY_DEVICE_FS_DAX &&
> + pgmap->type != MEMORY_DEVICE_GENERIC)
>   folio->mapping = NULL;
>  
>   switch (pgmap->type) {
> @@ -509,7 +509,6 @@ void free_zone_device_folio(struct folio *folio)
>* Reset the refcount to 1 to prepare for handing out the page
>* again.
>*/
> - pgmap->ops->page_free(folio_page(folio, 0));

Ok, this is probably the reason why you adjusted the check above, since
no more pgmap->ops needed for MEMORY_DEVICE_GENERIC.
Still, the MEMORY_DEVICE_FS_DAX case also does not seem to need
pgmap->ops, and at least the existing virtio DAX should already be
affected, and of course future dcssblk DAX.

But maybe that should be addressed in a separate patch, since your changes
here seem consistent, and not change or worsen anything wrt !pgmap->ops
and MEMORY_DEVICE_FS_DAX.

>   folio_set_count(folio, 1);
>   break;
>  

For reference, this is call trace I see when I hit the warning:

[  283.567945] [ cut here ]
[  283.567947] WARNING: CPU: 12 PID: 878 at mm/memremap.c:436 
free_zone_device_folio+0x6e/0x140
[  283.567959] Modules linked in:
[  283.567963] CPU: 12 UID: 0 PID: 878 Comm: ls Not tainted 
6.14.0-rc3-next-20250220-00012-gd072dabf62e8-dirty #44
[  283.567968] Hardware name: IBM 3931 A01 704 (z/VM 7.4.0)
[  283.567971] Krnl PSW : 0704d0018000 01ec0548b44a 
(free_zone_device_folio+0x72/0x140)
[  283.567978]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 
RI:0 EA:3
[  283.567982] Krnl GPRS: 0038  0003 
01ec06cc42e8
[  283.567986]0004cc38e400  0003 
93eacd00
[  283.567990]9a68413f 016614010940 9553a640 
016614010940
[  283.567994]  01ec0548b416 
016c05da3bf8
[  283.568004] Krnl Code: 01ec0548b43e: a70e0003chi %r0,3
  01ec0548b442: a7840006brc 
8,01ec0548b44e
 #01ec0548b446: af00mc  0,0
 >01ec0548b44a: a7f4005fbrc 
15,01ec0548b508