Re: [kvm-unit-tests PATCH v2 03/18] scripts: Refuse to run the tests if not configured for qemu

2025-02-17 Thread Al Dunsmuir
Hello Alexandru,

On Monday, February 10, 2025, 1:04:29 PM, you wrote:

> Hi Drew,

> On Mon, Feb 10, 2025 at 02:56:25PM +0100, Andrew Jones wrote:
>> On Mon, Feb 10, 2025 at 10:41:53AM +, Alexandru Elisei wrote:
>> > Hi Drew,
>> > 
>> > On Tue, Jan 21, 2025 at 03:48:55PM +0100, Andrew Jones wrote:
>> > > On Mon, Jan 20, 2025 at 04:43:01PM +, Alexandru Elisei wrote:
>> > 
>> > > > ---
>> > > >  arm/efi/run | 8 
>> > > >  arm/run | 9 +
>> > > >  run_tests.sh| 8 
>> > > >  scripts/mkstandalone.sh | 8 
>> > > >  4 files changed, 33 insertions(+)
>> > 
>> > > > +case "$TARGET" in
>> > > > +qemu)
>> > > > +;;
>> > > > +*)
>> > > > +echo "'$TARGET' not supported for standlone tests"
>> > > > +exit 2
>> > > > +esac
>> > > 
>> > > I think we could put the check in a function in scripts/arch-run.bash and
>> > > just use the same error message for all cases.
>> > 
>> > Coming back to the series.
>> > 
>> > arm/efi/run and arm/run source scripts/arch-run.bash; run_tests.sh and
>> > scripts/mkstandalone.sh don't source scripts/arch-run.bash. There doesn't
>> > seem to be a common file that is sourced by all of them.
>> 
>> scripts/mkstandalone.sh uses arch-run.bash, see generate_test().

> Are you referring to this bit:

> generate_test ()
> {
> 
> (echo "#!/usr/bin/env bash"
>  cat scripts/arch-run.bash "$TEST_DIR/run")

> I think scripts/arch-run.bash would need to be sourced for any functions
> defined there to be usable in mkstandalone.sh.

> What I was thinking is something like this:

> if ! vmm_supported $TARGET; then
> echo "$0 does not support '$TARGET'"
> exit 2
> fi

> Were you thinking of something else?

> I think mkstandalone should error at the top level (when you do make
> standalone), and not rely on the individual scripts to error if the VMM is
> not supported. That's because I think creating the test files, booting a
> machine and copying the files only to find out that kvm-unit-tests was
> misconfigured is a pretty suboptimal experience.

>> run_tests.sh doesn't, but I'm not sure it needs to validate TARGET
>> since it can leave that to the lower-level scripts.

> I put the check in arm/run, and removed it from run_tests.sh, and this is
> what I got:

> $ ./run_tests.sh selftest-setup
> SKIP selftest-setup (./arm/run does not supported 'kvmtool')

> which looks good to me.

Grammar nit:  This should be
SKIP selftest-setup (./arm/run does not support 'kvmtool')

Al

>> 
>> > 
>> > How about creating a new file in scripts (vmm.bash?) with only this
>> > function?
>> 
>> If we need a new file, then we can add one, but I'd try using
>> arch-run.bash or common.bash first.

> common.bash seems to work (and the name fits), so I'll give that a go.

> Thanks,
> Alex





Re: [PATCH v13 4/5] arm64: support copy_mc_[user]_highpage()

2025-02-17 Thread Catalin Marinas
On Mon, Feb 17, 2025 at 04:07:49PM +0800, Tong Tiangen wrote:
> > On 2025/2/15 1:24, Catalin Marinas wrote:
> > On Fri, Feb 14, 2025 at 10:49:01AM +0800, Tong Tiangen wrote:
> > > > On 2025/2/13 1:11, Catalin Marinas wrote:
> > > > On Mon, Dec 09, 2024 at 10:42:56AM +0800, Tong Tiangen wrote:
> > > > > Currently, many scenarios that can tolerate memory errors when copying a
> > > > > page have been supported in the kernel[1~5], all of which are implemented
> > > > > by copy_mc_[user]_highpage(). arm64 should also support this mechanism.
> > > > > 
> > > > > Due to mte, arm64 needs to have its own copy_mc_[user]_highpage()
> > > > > architecture implementation, macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
> > > > > __HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.
> > > > > 
> > > > > Add a new helper copy_mc_page() which provides a page copy implementation
> > > > > that is hardware memory error safe. The code logic of copy_mc_page() is
> > > > > the same as copy_page(); the main difference is that the ldp insn of
> > > > > copy_mc_page() contains the fixup type EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR,
> > > > > therefore the main logic is extracted to copy_page_template.S. In
> > > > > addition, the fixup of the MOPS insn is not considered at present.
> > > > 
> > > > Could we not add the exception table entry permanently but ignore the
> > > > exception table entry if it's not on the do_sea() path? That would save
> > > > some code duplication.
> > > 
> > > I'm sorry, I didn't catch your point. Do you mean that the do_sea() and
> > > non-do_sea() paths use different exception tables?
> > 
> > No, they would have the same exception table, only that we'd interpret
> > it differently depending on whether it's a SEA error or not. Or rather
> > ignore the exception table altogether for non-SEA errors.
> 
> You mean to use the same exception type (EX_TYPE_KACCESS_ERR_ZERO) and
> then do different processing on SEA errors and non-SEA errors, right?

Right.

> If so, some copy_page() instructions that were previously not in the
> exception table will now be added to it, and the original logic will
> be affected.
> 
> For example, if an instruction is not in the exception table, the
> kernel will panic when that instruction triggers a non-SEA error. If
> this instruction is added to the exception table because of SEA
> processing, and then a non-SEA error is triggered, should we fix it?

No, we shouldn't fix it. The exception table entries have a type
associated. For a non-SEA error, we preserve the original behaviour even
if we find a SEA-specific entry in the exception table. You already need
such logic even if you duplicate the code for configurations where you
have MC enabled.
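
For illustration, the interpretation described above could look roughly
like the following sketch; fixup_memory_error() and
do_kaccess_err_zero_fixup() are placeholder names, not the actual arm64
fixup code:

    /* Sketch only: keep one exception table, interpret entries per fault type. */
    static bool do_kaccess_err_zero_fixup(struct pt_regs *regs,
                                          const struct exception_table_entry *ex); /* placeholder */

    static bool fixup_memory_error(struct pt_regs *regs, bool is_sea)
    {
        const struct exception_table_entry *ex;

        ex = search_exception_tables(instruction_pointer(regs));
        if (!ex)
            return false;   /* no entry: keep the original behaviour (panic) */

        switch (ex->type) {
        case EX_TYPE_KACCESS_ERR_ZERO:
            /* Ordinary kernel-access fixup, valid for any fault. */
            return do_kaccess_err_zero_fixup(regs, ex);
        case EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR:
            /*
             * SEA-only entry: honour it on the do_sea() path, but for a
             * non-SEA error act as if no entry existed so the original
             * behaviour is preserved.
             */
            if (!is_sea)
                return false;
            return do_kaccess_err_zero_fixup(regs, ex);
        }

        return false;
    }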

-- 
Catalin



[PATCH 5/7] clk: mvebu: Embed syscore_ops in clock context

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

This enables the syscore callbacks to obtain the clock context without
relying on a separate global variable.

Signed-off-by: Thierry Reding 
---
 drivers/clk/mvebu/common.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/clk/mvebu/common.c b/drivers/clk/mvebu/common.c
index ee77d307efe0..53712c3e5087 100644
--- a/drivers/clk/mvebu/common.c
+++ b/drivers/clk/mvebu/common.c
@@ -189,6 +189,7 @@ void __init mvebu_coreclk_setup(struct device_node *np,
 DEFINE_SPINLOCK(ctrl_gating_lock);
 
 struct clk_gating_ctrl {
+   struct syscore_ops syscore;
spinlock_t *lock;
struct clk **gates;
int num_gates;
@@ -196,11 +197,15 @@ struct clk_gating_ctrl {
u32 saved_reg;
 };
 
-static struct clk_gating_ctrl *ctrl;
+static inline struct clk_gating_ctrl *from_syscore(struct syscore_ops *ops)
+{
+   return container_of(ops, struct clk_gating_ctrl, syscore);
+}
 
 static struct clk *clk_gating_get_src(
struct of_phandle_args *clkspec, void *data)
 {
+   struct clk_gating_ctrl *ctrl = data;
int n;
 
if (clkspec->args_count < 1)
@@ -217,23 +222,23 @@ static struct clk *clk_gating_get_src(
 
 static int mvebu_clk_gating_suspend(struct syscore_ops *ops)
 {
+   struct clk_gating_ctrl *ctrl = from_syscore(ops);
+
ctrl->saved_reg = readl(ctrl->base);
return 0;
 }
 
 static void mvebu_clk_gating_resume(struct syscore_ops *ops)
 {
+   struct clk_gating_ctrl *ctrl = from_syscore(ops);
+
writel(ctrl->saved_reg, ctrl->base);
 }
 
-static struct syscore_ops clk_gate_syscore_ops = {
-   .suspend = mvebu_clk_gating_suspend,
-   .resume = mvebu_clk_gating_resume,
-};
-
 void __init mvebu_clk_gating_setup(struct device_node *np,
   const struct clk_gating_soc_desc *desc)
 {
+   static struct clk_gating_ctrl *ctrl;
struct clk *clk;
void __iomem *base;
const char *default_parent = NULL;
@@ -284,7 +289,9 @@ void __init mvebu_clk_gating_setup(struct device_node *np,
 
of_clk_add_provider(np, clk_gating_get_src, ctrl);
 
-   register_syscore_ops(&clk_gate_syscore_ops);
+   ctrl->syscore.suspend = mvebu_clk_gating_suspend;
+   ctrl->syscore.resume = mvebu_clk_gating_resume;
+   register_syscore_ops(&ctrl->syscore);
 
return;
 gates_out:
-- 
2.48.1




[PATCH 1/7] syscore: Pass context data to callbacks

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

Pass a pointer to the syscore_ops structure that was registered to the
callbacks. This enables callbacks to act on instance data (syscore_ops
can be embedded into other structures, and driver-specific data can be
obtained using container_of()) rather than the current practice of
relying on global variables.
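
For reference, the callback prototypes after this change would look
roughly as follows; this is a sketch inferred from the conversions in
the rest of the series, the actual header diff is not shown in this
excerpt:

    /* include/linux/syscore_ops.h (sketch of the new signatures) */
    struct syscore_ops {
        struct list_head node;
        int (*suspend)(struct syscore_ops *ops);
        void (*resume)(struct syscore_ops *ops);
        void (*shutdown)(struct syscore_ops *ops); /* shown for symmetry; conversion not visible here */
    };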

Signed-off-by: Thierry Reding 
---
 arch/arm/mach-exynos/mcpm-exynos.c|  4 ++--
 arch/arm/mach-exynos/suspend.c| 14 +++---
 arch/arm/mach-pxa/irq.c   |  4 ++--
 arch/arm/mach-pxa/mfp-pxa2xx.c|  4 ++--
 arch/arm/mach-pxa/mfp-pxa3xx.c|  4 ++--
 arch/arm/mach-pxa/smemc.c |  4 ++--
 arch/arm/mach-s3c/irq-pm-s3c64xx.c|  4 ++--
 arch/arm/mach-s5pv210/pm.c|  2 +-
 arch/arm/mach-versatile/integrator_ap.c   |  4 ++--
 arch/arm/mm/cache-b15-rac.c   |  4 ++--
 arch/loongarch/kernel/smp.c   |  4 ++--
 arch/mips/alchemy/common/dbdma.c  |  4 ++--
 arch/mips/alchemy/common/irq.c|  8 
 arch/mips/alchemy/common/usb.c|  4 ++--
 arch/mips/pci/pci-alchemy.c   |  4 ++--
 arch/powerpc/platforms/cell/spu_base.c|  2 +-
 arch/powerpc/platforms/powermac/pic.c |  4 ++--
 arch/powerpc/sysdev/fsl_lbc.c |  4 ++--
 arch/powerpc/sysdev/fsl_pci.c |  4 ++--
 arch/powerpc/sysdev/ipic.c|  4 ++--
 arch/powerpc/sysdev/mpic.c|  4 ++--
 arch/powerpc/sysdev/mpic_timer.c  |  2 +-
 arch/sh/mm/pmb.c  |  2 +-
 arch/x86/events/amd/ibs.c |  4 ++--
 arch/x86/hyperv/hv_init.c |  4 ++--
 arch/x86/kernel/amd_gart_64.c |  2 +-
 arch/x86/kernel/apic/apic.c   |  4 ++--
 arch/x86/kernel/apic/io_apic.c|  9 +++--
 arch/x86/kernel/cpu/aperfmperf.c  |  6 +++---
 arch/x86/kernel/cpu/intel_epb.c   |  8 
 arch/x86/kernel/cpu/mce/core.c|  6 +++---
 arch/x86/kernel/cpu/microcode/core.c  |  7 ++-
 arch/x86/kernel/cpu/mtrr/legacy.c |  4 ++--
 arch/x86/kernel/cpu/umwait.c  |  2 +-
 arch/x86/kernel/i8237.c   |  2 +-
 arch/x86/kernel/i8259.c   |  6 +++---
 arch/x86/kernel/kvm.c |  4 ++--
 drivers/acpi/pci_link.c   |  2 +-
 drivers/acpi/sleep.c  |  4 ++--
 drivers/base/firmware_loader/main.c   |  2 +-
 drivers/base/syscore.c|  8 
 drivers/bus/mvebu-mbus.c  |  4 ++--
 drivers/clk/at91/pmc.c|  4 ++--
 drivers/clk/imx/clk-vf610.c   |  4 ++--
 drivers/clk/ingenic/pm.c  |  4 ++--
 drivers/clk/ingenic/tcu.c |  4 ++--
 drivers/clk/mvebu/common.c|  4 ++--
 drivers/clk/rockchip/clk-rk3288.c |  4 ++--
 drivers/clk/samsung/clk-s5pv210-audss.c   |  4 ++--
 drivers/clk/samsung/clk.c |  4 ++--
 drivers/clk/tegra/clk-tegra210.c  |  4 ++--
 drivers/clocksource/timer-armada-370-xp.c |  4 ++--
 drivers/cpuidle/cpuidle-psci.c|  4 ++--
 drivers/gpio/gpio-mxc.c   |  4 ++--
 drivers/gpio/gpio-pxa.c   |  4 ++--
 drivers/hv/vmbus_drv.c|  4 ++--
 drivers/iommu/amd/init.c  |  4 ++--
 drivers/iommu/intel/iommu.c   |  4 ++--
 drivers/irqchip/exynos-combiner.c |  6 --
 drivers/irqchip/irq-armada-370-xp.c   |  4 ++--
 drivers/irqchip/irq-bcm7038-l1.c  |  4 ++--
 drivers/irqchip/irq-gic-v3-its.c  |  4 ++--
 drivers/irqchip/irq-i8259.c   |  4 ++--
 drivers/irqchip/irq-imx-gpcv2.c   |  4 ++--
 drivers/irqchip/irq-loongson-eiointc.c|  4 ++--
 drivers/irqchip/irq-loongson-htpic.c  |  2 +-
 drivers/irqchip/irq-loongson-htvec.c  |  4 ++--
 drivers/irqchip/irq-loongson-pch-lpc.c|  4 ++--
 drivers/irqchip/irq-loongson-pch-pic.c|  4 ++--
 drivers/irqchip/irq-mchp-eic.c|  4 ++--
 drivers/irqchip/irq-mst-intc.c|  4 ++--
 drivers/irqchip/irq-mtk-cirq.c|  4 ++--
 drivers/irqchip/irq-renesas-rzg2l.c   |  4 ++--
 drivers/irqchip/irq-sa11x0.c  |  4 ++--
 drivers/irqchip/irq-sifive-plic.c |  4 ++--
 drivers/irqchip/irq-sun6i-r.c | 10 +-
 drivers/irqchip/irq-tegra.c   |  4 ++--
 drivers/irqchip/irq-vic.c |  4 ++--
 drivers/leds/trigger/ledtrig-cpu.c|  6 +++---
 drivers/macintosh/via-pmu.c   |  4 ++--
 drivers/power/reset/sc27xx-poweroff.c |  2 +-
 drivers/sh/clk/core.c |  2 +-
 drivers/sh/intc/core.c|  4 ++--
 drivers/soc/bcm/brcmstb/biuctrl.c |  4 ++--
 drivers/soc/tegra/pmc.c   |  4 ++--
 drivers/thermal/intel/intel_hfi.c |  4 ++--
 drivers/xen/xen-acpi-processor.c  |  2 +-

[PATCH 0/7] syscore: Pass context data to callbacks

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

Hi,

Something that's been bugging me over the years is how some drivers have
had to adopt file-scoped variables to pass data into something like the
syscore operations. This is often harmless, but usually leads to drivers
not being able to deal with multiple instances, or additional frameworks
or data structures needing to be created to handle multiple instances.

This series proposes to "objectify" struct syscore_ops by passing a
pointer to struct syscore_ops to the syscore callbacks. Implementations
of these callbacks can then make use of container_of() to get access to
contextual data that struct syscore_ops was embedded in. This elegantly
avoids the need for file-scoped, singleton variables, by tying syscore
to individual instances.
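
As a rough sketch of the resulting pattern (the driver name, fields and
registers below are invented for illustration, not taken from the
series):

    struct foo_chip {
        void __iomem *regs;
        u32 saved;
        struct syscore_ops syscore;    /* embedded, replaces a file-scoped ops struct */
    };

    static int foo_syscore_suspend(struct syscore_ops *ops)
    {
        struct foo_chip *foo = container_of(ops, struct foo_chip, syscore);

        foo->saved = readl(foo->regs);
        return 0;
    }

    static void foo_syscore_resume(struct syscore_ops *ops)
    {
        struct foo_chip *foo = container_of(ops, struct foo_chip, syscore);

        writel(foo->saved, foo->regs);
    }

    static void foo_register(struct foo_chip *foo)
    {
        foo->syscore.suspend = foo_syscore_suspend;
        foo->syscore.resume = foo_syscore_resume;
        register_syscore_ops(&foo->syscore);
    }

Each instance carries its own syscore_ops, so a second foo_chip simply
registers a second ops structure instead of sharing a global.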

Patch 1 contains the bulk of these changes. It's fairly intrusive
because it does the conversion of the function signature all in one
patch. An alternative would've been to introduce new callbacks such that
these changes could be staged in. However, the amount of changes here
are not quite numerous enough to justify that, in my opinion, and
syscore isn't very frequently used, so the risk of another user getting
added while this is merged is rather small. All in all I think merging
this in one go is the simplest way.

Patches 2-7 are conversions of some existing drivers to take advantage
of this new parameter and tie the code to per-instance data.

Given that the recipient list for this is huge, I'm limiting this to
Greg (because it's at the core a... core change) and a set of larger
lists for architectures and subsystems that are impacted.

Thanks,
Thierry

Thierry Reding (7):
  syscore: Pass context data to callbacks
  MIPS: Embed syscore_ops in PCI context
  bus: mvebu-mbus: Embed syscore_ops in mbus context
  clk: ingenic: tcu: Embed syscore_ops in TCU context
  clk: mvebu: Embed syscore_ops in clock context
  irqchip/irq-imx-gpcv2: Embed syscore_ops in chip context
  soc/tegra: pmc: Derive PMC context from syscore ops

 arch/arm/mach-exynos/mcpm-exynos.c|  4 +-
 arch/arm/mach-exynos/suspend.c| 14 +++---
 arch/arm/mach-pxa/irq.c   |  4 +-
 arch/arm/mach-pxa/mfp-pxa2xx.c|  4 +-
 arch/arm/mach-pxa/mfp-pxa3xx.c|  4 +-
 arch/arm/mach-pxa/smemc.c |  4 +-
 arch/arm/mach-s3c/irq-pm-s3c64xx.c|  4 +-
 arch/arm/mach-s5pv210/pm.c|  2 +-
 arch/arm/mach-versatile/integrator_ap.c   |  4 +-
 arch/arm/mm/cache-b15-rac.c   |  4 +-
 arch/loongarch/kernel/smp.c   |  4 +-
 arch/mips/alchemy/common/dbdma.c  |  4 +-
 arch/mips/alchemy/common/irq.c|  8 ++--
 arch/mips/alchemy/common/usb.c|  4 +-
 arch/mips/pci/pci-alchemy.c   | 28 ++--
 arch/powerpc/platforms/cell/spu_base.c|  2 +-
 arch/powerpc/platforms/powermac/pic.c |  4 +-
 arch/powerpc/sysdev/fsl_lbc.c |  4 +-
 arch/powerpc/sysdev/fsl_pci.c |  4 +-
 arch/powerpc/sysdev/ipic.c|  4 +-
 arch/powerpc/sysdev/mpic.c|  4 +-
 arch/powerpc/sysdev/mpic_timer.c  |  2 +-
 arch/sh/mm/pmb.c  |  2 +-
 arch/x86/events/amd/ibs.c |  4 +-
 arch/x86/hyperv/hv_init.c |  4 +-
 arch/x86/kernel/amd_gart_64.c |  2 +-
 arch/x86/kernel/apic/apic.c   |  4 +-
 arch/x86/kernel/apic/io_apic.c|  9 +++-
 arch/x86/kernel/cpu/aperfmperf.c  |  6 +--
 arch/x86/kernel/cpu/intel_epb.c   |  8 ++--
 arch/x86/kernel/cpu/mce/core.c|  6 +--
 arch/x86/kernel/cpu/microcode/core.c  |  7 ++-
 arch/x86/kernel/cpu/mtrr/legacy.c |  4 +-
 arch/x86/kernel/cpu/umwait.c  |  2 +-
 arch/x86/kernel/i8237.c   |  2 +-
 arch/x86/kernel/i8259.c   |  6 +--
 arch/x86/kernel/kvm.c |  4 +-
 drivers/acpi/pci_link.c   |  2 +-
 drivers/acpi/sleep.c  |  4 +-
 drivers/base/firmware_loader/main.c   |  2 +-
 drivers/base/syscore.c|  8 ++--
 drivers/bus/mvebu-mbus.c  | 24 +-
 drivers/clk/at91/pmc.c|  4 +-
 drivers/clk/imx/clk-vf610.c   |  4 +-
 drivers/clk/ingenic/pm.c  |  4 +-
 drivers/clk/ingenic/tcu.c | 54 +++
 drivers/clk/mvebu/common.c| 25 +++
 drivers/clk/rockchip/clk-rk3288.c |  4 +-
 drivers/clk/samsung/clk-s5pv210-audss.c   |  4 +-
 drivers/clk/samsung/clk.c |  4 +-
 drivers/clk/tegra/clk-tegra210.c  |  4 +-
 drivers/clocksource/timer-armada-370-xp.c |  4 +-
 drivers/cpuidle/cpuidle-psci.c|  4 +-
 drivers/gpio/gpio-mxc.c   |  4 +-
 drivers/gpio/gpio-pxa.c   |  4 +-
 drivers/hv/vmbus_drv.c|  4 +-
 drivers/iommu/amd/init.c  |  4 +-
 drivers/iomm

[PATCH 4/7] clk: ingenic: tcu: Embed syscore_ops in TCU context

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

This enables the syscore callbacks to obtain the TCU context without
relying on a separate global variable.

Signed-off-by: Thierry Reding 
---
 drivers/clk/ingenic/tcu.c | 54 ++-
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/drivers/clk/ingenic/tcu.c b/drivers/clk/ingenic/tcu.c
index 85bd4bc73c1b..503a58d08224 100644
--- a/drivers/clk/ingenic/tcu.c
+++ b/drivers/clk/ingenic/tcu.c
@@ -53,9 +53,9 @@ struct ingenic_tcu {
struct clk *clk;
 
struct clk_hw_onecell_data *clocks;
-};
 
-static struct ingenic_tcu *ingenic_tcu;
+   struct syscore_ops syscore;
+};
 
 static inline struct ingenic_tcu_clk *to_tcu_clk(struct clk_hw *hw)
 {
@@ -332,6 +332,24 @@ static const struct of_device_id __maybe_unused 
ingenic_tcu_of_match[] __initcon
{ /* sentinel */ }
 };
 
+static int __maybe_unused tcu_pm_suspend(struct syscore_ops *ops)
+{
+   struct ingenic_tcu *tcu = container_of(ops, typeof(*tcu), syscore);
+
+   if (tcu->clk)
+   clk_disable(tcu->clk);
+
+   return 0;
+}
+
+static void __maybe_unused tcu_pm_resume(struct syscore_ops *ops)
+{
+   struct ingenic_tcu *tcu = container_of(ops, typeof(*tcu), syscore);
+
+   if (tcu->clk)
+   clk_enable(tcu->clk);
+}
+
 static int __init ingenic_tcu_probe(struct device_node *np)
 {
const struct of_device_id *id = of_match_node(ingenic_tcu_of_match, np);
@@ -430,7 +448,11 @@ static int __init ingenic_tcu_probe(struct device_node *np)
goto err_unregister_ost_clock;
}
 
-   ingenic_tcu = tcu;
+   if (IS_ENABLED(CONFIG_PM_SLEEP)) {
+   tcu->syscore.suspend = tcu_pm_suspend;
+   tcu->syscore.resume = tcu_pm_resume;
+   register_syscore_ops(&tcu->syscore);
+   }
 
return 0;
 
@@ -455,38 +477,12 @@ static int __init ingenic_tcu_probe(struct device_node 
*np)
return ret;
 }
 
-static int __maybe_unused tcu_pm_suspend(struct syscore_ops *ops)
-{
-   struct ingenic_tcu *tcu = ingenic_tcu;
-
-   if (tcu->clk)
-   clk_disable(tcu->clk);
-
-   return 0;
-}
-
-static void __maybe_unused tcu_pm_resume(struct syscore_ops *ops)
-{
-   struct ingenic_tcu *tcu = ingenic_tcu;
-
-   if (tcu->clk)
-   clk_enable(tcu->clk);
-}
-
-static struct syscore_ops __maybe_unused tcu_pm_ops = {
-   .suspend = tcu_pm_suspend,
-   .resume = tcu_pm_resume,
-};
-
 static void __init ingenic_tcu_init(struct device_node *np)
 {
int ret = ingenic_tcu_probe(np);
 
if (ret)
pr_crit("Failed to initialize TCU clocks: %d\n", ret);
-
-   if (IS_ENABLED(CONFIG_PM_SLEEP))
-   register_syscore_ops(&tcu_pm_ops);
 }
 
 CLK_OF_DECLARE_DRIVER(jz4740_cgu, "ingenic,jz4740-tcu", ingenic_tcu_init);
-- 
2.48.1




[PATCH 6/7] irqchip/irq-imx-gpcv2: Embed syscore_ops in chip context

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

This enables the syscore callbacks to obtain the IRQ chip context
without relying on a separate global variable.

Signed-off-by: Thierry Reding 
---
 drivers/irqchip/irq-imx-gpcv2.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/drivers/irqchip/irq-imx-gpcv2.c b/drivers/irqchip/irq-imx-gpcv2.c
index 83b009881e2a..61ba06a28fc4 100644
--- a/drivers/irqchip/irq-imx-gpcv2.c
+++ b/drivers/irqchip/irq-imx-gpcv2.c
@@ -19,6 +19,7 @@
 
 
 struct gpcv2_irqchip_data {
+   struct syscore_ops  syscore;
struct raw_spinlock rlock;
void __iomem*gpc_base;
u32 wakeup_sources[IMR_NUM];
@@ -26,7 +27,11 @@ struct gpcv2_irqchip_data {
u32 cpu2wakeup;
 };
 
-static struct gpcv2_irqchip_data *imx_gpcv2_instance __ro_after_init;
+static inline struct gpcv2_irqchip_data *
+from_syscore(struct syscore_ops *ops)
+{
+   return container_of(ops, struct gpcv2_irqchip_data, syscore);
+}
 
 static void __iomem *gpcv2_idx_to_reg(struct gpcv2_irqchip_data *cd, int i)
 {
@@ -35,14 +40,10 @@ static void __iomem *gpcv2_idx_to_reg(struct 
gpcv2_irqchip_data *cd, int i)
 
 static int gpcv2_wakeup_source_save(struct syscore_ops *ops)
 {
-   struct gpcv2_irqchip_data *cd;
+   struct gpcv2_irqchip_data *cd = from_syscore(ops);
void __iomem *reg;
int i;
 
-   cd = imx_gpcv2_instance;
-   if (!cd)
-   return 0;
-
for (i = 0; i < IMR_NUM; i++) {
reg = gpcv2_idx_to_reg(cd, i);
cd->saved_irq_mask[i] = readl_relaxed(reg);
@@ -54,22 +55,13 @@ static int gpcv2_wakeup_source_save(struct syscore_ops *ops)
 
 static void gpcv2_wakeup_source_restore(struct syscore_ops *ops)
 {
-   struct gpcv2_irqchip_data *cd;
+   struct gpcv2_irqchip_data *cd = from_syscore(ops);
int i;
 
-   cd = imx_gpcv2_instance;
-   if (!cd)
-   return;
-
for (i = 0; i < IMR_NUM; i++)
writel_relaxed(cd->saved_irq_mask[i], gpcv2_idx_to_reg(cd, i));
 }
 
-static struct syscore_ops imx_gpcv2_syscore_ops = {
-   .suspend= gpcv2_wakeup_source_save,
-   .resume = gpcv2_wakeup_source_restore,
-};
-
 static int imx_gpcv2_irq_set_wake(struct irq_data *d, unsigned int on)
 {
struct gpcv2_irqchip_data *cd = d->chip_data;
@@ -275,8 +267,9 @@ static int __init imx_gpcv2_irqchip_init(struct device_node 
*node,
 */
writel_relaxed(~0x1, cd->gpc_base + cd->cpu2wakeup);
 
-   imx_gpcv2_instance = cd;
-   register_syscore_ops(&imx_gpcv2_syscore_ops);
+   cd->syscore.suspend = gpcv2_wakeup_source_save;
+   cd->syscore.resume = gpcv2_wakeup_source_restore;
+   register_syscore_ops(&cd->syscore);
 
/*
 * Clear the OF_POPULATED flag set in of_irq_init so that
-- 
2.48.1




[PATCH 3/7] bus: mvebu-mbus: Embed syscore_ops in mbus context

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

This enables the syscore callbacks to obtain the mbus context without
relying on a separate global variable.

Signed-off-by: Thierry Reding 
---
 drivers/bus/mvebu-mbus.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/bus/mvebu-mbus.c b/drivers/bus/mvebu-mbus.c
index 92daa45cc844..1f22bff0773c 100644
--- a/drivers/bus/mvebu-mbus.c
+++ b/drivers/bus/mvebu-mbus.c
@@ -130,6 +130,7 @@ struct mvebu_mbus_win_data {
 };
 
 struct mvebu_mbus_state {
+   struct syscore_ops syscore;
void __iomem *mbuswins_base;
void __iomem *sdramwins_base;
void __iomem *mbusbridge_base;
@@ -148,6 +149,12 @@ struct mvebu_mbus_state {
struct mvebu_mbus_win_data wins[MBUS_WINS_MAX];
 };
 
+static inline struct mvebu_mbus_state *
+syscore_to_mbus(struct syscore_ops *ops)
+{
+   return container_of(ops, struct mvebu_mbus_state, syscore);
+}
+
 static struct mvebu_mbus_state mbus_state;
 
 /*
@@ -1008,7 +1015,7 @@ fs_initcall(mvebu_mbus_debugfs_init);
 
 static int mvebu_mbus_suspend(struct syscore_ops *ops)
 {
-   struct mvebu_mbus_state *s = &mbus_state;
+   struct mvebu_mbus_state *s = syscore_to_mbus(ops);
int win;
 
if (!s->mbusbridge_base)
@@ -1042,7 +1049,7 @@ static int mvebu_mbus_suspend(struct syscore_ops *ops)
 
 static void mvebu_mbus_resume(struct syscore_ops *ops)
 {
-   struct mvebu_mbus_state *s = &mbus_state;
+   struct mvebu_mbus_state *s = syscore_to_mbus(ops);
int win;
 
writel(s->mbus_bridge_ctrl,
@@ -1069,11 +1076,6 @@ static void mvebu_mbus_resume(struct syscore_ops *ops)
}
 }
 
-static struct syscore_ops mvebu_mbus_syscore_ops = {
-   .suspend= mvebu_mbus_suspend,
-   .resume = mvebu_mbus_resume,
-};
-
 static int __init mvebu_mbus_common_init(struct mvebu_mbus_state *mbus,
 phys_addr_t mbuswins_phys_base,
 size_t mbuswins_size,
@@ -1118,7 +1120,9 @@ static int __init mvebu_mbus_common_init(struct 
mvebu_mbus_state *mbus,
writel(UNIT_SYNC_BARRIER_ALL,
   mbus->mbuswins_base + UNIT_SYNC_BARRIER_OFF);
 
-   register_syscore_ops(&mvebu_mbus_syscore_ops);
+   mbus->syscore.suspend = mvebu_mbus_suspend;
+   mbus->syscore.resume = mvebu_mbus_resume;
+   register_syscore_ops(&mbus->syscore);
 
return 0;
 }
-- 
2.48.1




[PATCH RESEND] powerpc: Fix compiler warning by guarding with '__powerpc64__'

2025-02-17 Thread Yu-Chun Lin
As reported by the kernel test robot, the following error occurs:

   arch/powerpc/lib/sstep.c: In function 'analyse_instr':
>> arch/powerpc/lib/sstep.c:1172:28: warning: variable 'suffix' set but not used [-Wunused-but-set-variable]
    1172 |         unsigned int word, suffix;
         |                            ^~~~~~
   arch/powerpc/lib/sstep.c:1168:38: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
    1168 |         unsigned int opcode, ra, rb, rc, rd, spr, u;
         |                                      ^~

These variables are now conditionally defined with the '__powerpc64__'
macro to ensure they are only used when applicable.

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202501100247.gemkqu8j-...@intel.com/
Signed-off-by: Yu-Chun Lin 
---
 arch/powerpc/lib/sstep.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index ac3ee19531d8..eea8653464e7 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1354,15 +1354,21 @@ int analyse_instr(struct instruction_op *op, const 
struct pt_regs *regs,
 #ifdef CONFIG_PPC64
unsigned int suffixopcode, prefixtype, prefix_r;
 #endif
-   unsigned int opcode, ra, rb, rc, rd, spr, u;
+   unsigned int opcode, ra, rb, rd, spr, u;
unsigned long int imm;
unsigned long int val, val2;
unsigned int mb, me, sh;
-   unsigned int word, suffix;
+   unsigned int word;
+#ifdef __powerpc64__
+   unsigned int suffix;
+   unsigned int rc;
+#endif
long ival;
 
word = ppc_inst_val(instr);
+#ifdef __powerpc64__
suffix = ppc_inst_suffix(instr);
+#endif
 
op->type = COMPUTE;
 
@@ -1480,7 +1486,9 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
rd = (word >> 21) & 0x1f;
ra = (word >> 16) & 0x1f;
rb = (word >> 11) & 0x1f;
+#ifdef __powerpc64__
rc = (word >> 6) & 0x1f;
+#endif
 
switch (opcode) {
 #ifdef __powerpc64__
-- 
2.43.0




[PATCH 2/7] MIPS: Embed syscore_ops in PCI context

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

This enables the syscore callbacks to obtain the PCI context without
relying on a separate global variable.

Signed-off-by: Thierry Reding 
---
 arch/mips/pci/pci-alchemy.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/mips/pci/pci-alchemy.c b/arch/mips/pci/pci-alchemy.c
index a20de7160b6b..02f0616518e1 100644
--- a/arch/mips/pci/pci-alchemy.c
+++ b/arch/mips/pci/pci-alchemy.c
@@ -33,6 +33,7 @@
 
 struct alchemy_pci_context {
struct pci_controller alchemy_pci_ctrl; /* leave as first member! */
+   struct syscore_ops pmops;
void __iomem *regs; /* ctrl base */
/* tools for wired entry for config space access */
unsigned long last_elo0;
@@ -46,6 +47,12 @@ struct alchemy_pci_context {
int (*board_pci_idsel)(unsigned int devsel, int assert);
 };
 
+static inline struct alchemy_pci_context *
+syscore_to_pci_context(struct syscore_ops *ops)
+{
+   return container_of(ops, struct alchemy_pci_context, pmops);
+}
+
 /* for syscore_ops. There's only one PCI controller on Alchemy chips, so this
  * should suffice for now.
  */
@@ -306,9 +313,7 @@ static int alchemy_pci_def_idsel(unsigned int devsel, int 
assert)
 /* save PCI controller register contents. */
 static int alchemy_pci_suspend(struct syscore_ops *ops)
 {
-   struct alchemy_pci_context *ctx = __alchemy_pci_ctx;
-   if (!ctx)
-   return 0;
+   struct alchemy_pci_context *ctx = syscore_to_pci_context(ops);
 
ctx->pm[0]  = __raw_readl(ctx->regs + PCI_REG_CMEM);
ctx->pm[1]  = __raw_readl(ctx->regs + PCI_REG_CONFIG) & 0x0009;
@@ -328,9 +333,7 @@ static int alchemy_pci_suspend(struct syscore_ops *ops)
 
 static void alchemy_pci_resume(struct syscore_ops *ops)
 {
-   struct alchemy_pci_context *ctx = __alchemy_pci_ctx;
-   if (!ctx)
-   return;
+   struct alchemy_pci_context *ctx = syscore_to_pci_context(ops);
 
__raw_writel(ctx->pm[0],  ctx->regs + PCI_REG_CMEM);
__raw_writel(ctx->pm[2],  ctx->regs + PCI_REG_B2BMASK_CCH);
@@ -354,11 +357,6 @@ static void alchemy_pci_resume(struct syscore_ops *ops)
alchemy_pci_wired_entry(ctx);   /* install it */
 }
 
-static struct syscore_ops alchemy_pci_pmops = {
-   .suspend= alchemy_pci_suspend,
-   .resume = alchemy_pci_resume,
-};
-
 static int alchemy_pci_probe(struct platform_device *pdev)
 {
struct alchemy_pci_platdata *pd = pdev->dev.platform_data;
@@ -478,7 +476,9 @@ static int alchemy_pci_probe(struct platform_device *pdev)
 
__alchemy_pci_ctx = ctx;
platform_set_drvdata(pdev, ctx);
-   register_syscore_ops(&alchemy_pci_pmops);
+   ctx->pmops.suspend = alchemy_pci_suspend;
+   ctx->pmops.resume = alchemy_pci_resume;
+   register_syscore_ops(&ctx->pmops);
register_pci_controller(&ctx->alchemy_pci_ctrl);
 
dev_info(&pdev->dev, "PCI controller at %ld MHz\n",
-- 
2.48.1




[PATCH 7/7] soc/tegra: pmc: Derive PMC context from syscore ops

2025-02-17 Thread Thierry Reding
From: Thierry Reding 

Rather than relying on a global variable, make use of the fact that the
syscore ops are embedded in the PMC context and can be obtained via
container_of().

Signed-off-by: Thierry Reding 
---
 drivers/soc/tegra/pmc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/soc/tegra/pmc.c b/drivers/soc/tegra/pmc.c
index 6a3923e1c792..ea26c2651497 100644
--- a/drivers/soc/tegra/pmc.c
+++ b/drivers/soc/tegra/pmc.c
@@ -3143,6 +3143,7 @@ static void tegra186_pmc_process_wake_events(struct 
tegra_pmc *pmc, unsigned int
 
 static void tegra186_pmc_wake_syscore_resume(struct syscore_ops *ops)
 {
+   struct tegra_pmc *pmc = container_of(ops, struct tegra_pmc, syscore);
u32 status, mask;
unsigned int i;
 
@@ -3156,6 +3157,8 @@ static void tegra186_pmc_wake_syscore_resume(struct 
syscore_ops *ops)
 
 static int tegra186_pmc_wake_syscore_suspend(struct syscore_ops *ops)
 {
+   struct tegra_pmc *pmc = container_of(ops, struct tegra_pmc, syscore);
+
wke_read_sw_wake_status(pmc);
 
/* flip the wakeup trigger for dual-edge triggered pads
-- 
2.48.1




Re: [GIT PULL] Please pull powerpc/linux.git powerpc-6.14-3 tag

2025-02-17 Thread pr-tracker-bot
The pull request you sent on Mon, 17 Feb 2025 15:40:00 +0530:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> tags/powerpc-6.14-3

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/6186bdd120eccf4ca44fcba8967fc59ea50b11b8

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html



[PATCH v8 00/20] fs/dax: Fix ZONE_DEVICE page reference counts

2025-02-17 Thread Alistair Popple
Main updates since v6:

 - Clean ups and fixes based on feedback from David and Dan.

 - Rebased from next-20241216 to v6.14-rc1. No conflicts.

 - Dropped the PTE bit removals and clean-ups - will post this as a
   separate series to be merged after this one as Dan wanted it split
   up more and this series is already too big.

Main updates since v5:

 - Reworked patch 1 based on Dan's feedback.

 - Fixed build issues on PPC and when CONFIG_PGTABLE_HAS_HUGE_LEAVES
   is not defined.

 - Minor comment formatting and documentation fixes.

 - Remove PTE_DEVMAP definitions from Loongarch which were added since
   this series was initially written.

Main updates since v4:

 - Removed most of the devdax/fsdax checks in fs/proc/task_mmu.c. This
   means smaps/pagemap may contain DAX pages.

 - Fixed rmap accounting of PUD mapped pages.

 - Minor code clean-ups.

Main updates since v3:

 - Rebased onto next-20241216. The rebase wasn't too difficult, but in
   the interests of getting this out sooner for Andrew to look at as
   requested by him I have yet to extensively build/run test this
   version of the series.

 - Fixed a bunch of build breakages reported by John Hubbard and the
   kernel test robot due to various combinations of CONFIG options.

 - Split the rmap changes into a separate patch as suggested by David H.

 - Reworded the description for the P2PDMA change.

Main updates since v2:

 - Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
   and have them pass the vmf struct.

 - Separate out the device DAX changes.

 - Restore the page share mapping counting and associated warnings.

 - Rework truncate to require file-systems to have previously called
   dax_break_layout() to remove the address space mapping for a
   page. This found several bugs which are fixed by the first half of
   the series. The motivation for this was initially to allow the FS
   DAX page-cache mappings to hold a reference on the page.

   However that turned out to be a dead-end (see the comments on patch
   21), but it found several bugs and I think overall it is an
   improvement so I have left it here.

Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular pages are considered free when the refcount
hits one rather than zero and refcounts are not added when mapping the
page.

Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.

By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable SW define PTE bit on architectures
that have devmap PTE bits defined.

It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvement. It also
enables support for compound ZONE_DEVICE pages which is one of my
primary motivators for doing this work.

Signed-off-by: Alistair Popple 
Tested-by: Alison Schofield 

---

Cc: l...@asahilina.net
Cc: zhang.l...@gmail.com
Cc: gerald.schae...@linux.ibm.com
Cc: dan.j.willi...@intel.com
Cc: vishal.l.ve...@intel.com
Cc: dave.ji...@intel.com
Cc: log...@deltatee.com
Cc: bhelg...@google.com
Cc: j...@suse.cz
Cc: j...@ziepe.ca
Cc: catalin.mari...@arm.com
Cc: w...@kernel.org
Cc: m...@ellerman.id.au
Cc: npig...@gmail.com
Cc: dave.han...@linux.intel.com
Cc: ira.we...@intel.com
Cc: wi...@infradead.org
Cc: djw...@kernel.org
Cc: ty...@mit.edu
Cc: linmia...@huawei.com
Cc: da...@redhat.com
Cc: pet...@redhat.com
Cc: linux-...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-e...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: jhubb...@nvidia.com
Cc: h...@lst.de
Cc: da...@fromorbit.com
Cc: chenhua...@kernel.org
Cc: ker...@xen0n.name
Cc: loonga...@lists.linux.dev

Alistair Popple (19):
  fuse: Fix dax truncate/punch_hole fault path
  fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
  fs/dax: Don't skip locked entries when scanning entries
  fs/dax: Refactor wait for dax idle page
  fs/dax: Create a common implementation to break DAX layouts
  fs/dax: Always remove DAX page-cache entries when breaking layouts
  fs/dax: Ensure all pages are idle prior to filesystem unmount
  fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
  mm/gup: Remove redundant check for PCI P2PDMA page
  mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
  mm: Allow compound zone device pages
  mm/memory: Enhance insert_page_into_pte_locked() to create writable map

[PATCH v8 01/20] fuse: Fix dax truncate/punch_hole fault path

2025-02-17 Thread Alistair Popple
FS DAX requires file systems to call into the DAX layout prior to unlinking
inodes to ensure there is no ongoing DMA or other remote access to the
direct mapped page. The fuse file system implements
fuse_dax_break_layouts() to do this which includes a comment indicating
that passing dmap_end == 0 leads to unmapping of the whole file.

However this is not true - passing dmap_end == 0 will not unmap anything
before dmap_start, and furthermore dax_layout_busy_page_range() will not
scan any of the range to see if there maybe ongoing DMA access to the
range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
which will invalidate the entire file range to
dax_layout_busy_page_range().

Signed-off-by: Alistair Popple 
Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Reviewed-by: Balbir Singh 
Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault 
path")
Cc: Vivek Goyal 

---

Changes for v6:

 - Original patch had a misplaced hunk due to a bad rebase.
 - Reworked fix based on Dan's comments.
---
 fs/fuse/dax.c  | 1 -
 fs/fuse/dir.c  | 2 +-
 fs/fuse/file.c | 4 ++--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 0b6ee6d..b7f805d 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -682,7 +682,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, 
bool *retry,
0, 0, fuse_wait_dax_page(inode));
 }
 
-/* dmap_end == 0 leads to unmapping of whole file */
 int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
  u64 dmap_end)
 {
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 198862b..6c5d441 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1940,7 +1940,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct 
dentry *dentry,
if (FUSE_IS_DAX(inode) && is_truncate) {
filemap_invalidate_lock(mapping);
fault_blocked = true;
-   err = fuse_dax_break_layouts(inode, 0, 0);
+   err = fuse_dax_break_layouts(inode, 0, -1);
if (err) {
filemap_invalidate_unlock(mapping);
return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7d92a54..dc90613 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file)
 
if (dax_truncate) {
filemap_invalidate_lock(inode->i_mapping);
-   err = fuse_dax_break_layouts(inode, 0, 0);
+   err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
goto out_inode_unlock;
}
@@ -3196,7 +3196,7 @@ static long fuse_file_fallocate(struct file *file, int 
mode, loff_t offset,
inode_lock(inode);
if (block_faults) {
filemap_invalidate_lock(inode->i_mapping);
-   err = fuse_dax_break_layouts(inode, 0, 0);
+   err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
goto out;
}
-- 
git-series 0.9.1



[PATCH v8 02/20] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()

2025-02-17 Thread Alistair Popple
dax_layout_busy_page_range() is used by file systems to scan the DAX
page-cache to unmap mapping pages from user-space and to determine if
any pages in the given range are busy, either due to ongoing DMA or
other get_user_pages() usage.

Currently it checks to see whether the file mapping is mapped into user-space
with mapping_mapped() and returns early if not, skipping the check for
DMA busy pages. This is wrong as pages may still be undergoing DMA
access even if they have subsequently been unmapped from
user-space. Fix this by dropping the check for mapping_mapped().

Signed-off-by: Alistair Popple 
Suggested-by: Dan Williams 
Reviewed-by: Dan Williams 
Reviewed-by: Balbir Singh 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 972febc..b35f538 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -691,7 +691,7 @@ struct page *dax_layout_busy_page_range(struct 
address_space *mapping,
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return NULL;
 
-   if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+   if (!dax_mapping(mapping))
return NULL;
 
/* If end == LLONG_MAX, all pages from start to till end of file */
-- 
git-series 0.9.1



[PATCH 1/1] powerpc: use __clang__ instead of CONFIG_CC_IS_CLANG

2025-02-17 Thread Shung-Hsi Yu
Due to include chain (below), powerpc's asm-compat.h is part of UAPI,
thus it should use the __clang__ macro to directly detect whether Clang
is used rather than relying on the kernel config setting. The latter is
unreliable because the userspace tools that use UAPI may be compiled
with a different compiler than the one used for the kernel, leading to
incorrect constraint selection (see link for an example of such).

  include/uapi/linux/ptrace.h
  arch/powerpc/include/asm/ptrace.h
  arch/powerpc/include/asm/paca.h
  arch/powerpc/include/asm/atomic.h
  arch/powerpc/include/asm/asm-compat.h

Link: https://github.com/iovisor/bcc/issues/5172
Signed-off-by: Shung-Hsi Yu 
---
 arch/powerpc/include/asm/asm-compat.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/asm-compat.h 
b/arch/powerpc/include/asm/asm-compat.h
index f48e644900a2..34f8740909a9 100644
--- a/arch/powerpc/include/asm/asm-compat.h
+++ b/arch/powerpc/include/asm/asm-compat.h
@@ -37,7 +37,7 @@
 #define STDX_BEstringify_in_c(stdbrx)
 #endif
 
-#ifdef CONFIG_CC_IS_CLANG
+#ifdef __clang__
 #define DS_FORM_CONSTRAINT "Z<>"
 #else
 #define DS_FORM_CONSTRAINT "YZ<>"
-- 
2.48.1




[PATCH v8 08/20] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag

2025-02-17 Thread Alistair Popple
The page ->mapping pointer can have magic values like
PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON for page owner specific
usage. Currently PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON alias to the
same value. This isn't a problem because FS DAX pages are never seen by the
anonymous mapping code and vice versa.

However a future change will make FS DAX pages more like normal pages, so
folio_test_anon() must not return true for a FS DAX page.

We could explicitly test for a FS DAX page in folio_test_anon(),
etc. however the PAGE_MAPPING_DAX_SHARED flag isn't actually
needed. Instead we can use the page->mapping field to implicitly track the
first mapping of a page. If page->mapping is non-NULL it implies the page
is associated with a single mapping at page->index. If the page is
associated with a second mapping clear page->mapping and set page->share to
1.

This is possible because a shared mapping implies the file-system
implements dax_holder_operations which makes the ->mapping and ->index,
which is a union with ->share, unused.

The page is considered shared when page->mapping == NULL and
page->share > 0 or page->mapping != NULL, implying it is present in at
least one address space. This also makes it easier for a future change to
detect when a page is first mapped into an address space which requires
special handling.

Signed-off-by: Alistair Popple 

---

Changes for v8:

 - Rebased on mm-unstable which includes Matthew Wilcox's "dax: use
   folios more widely within DAX"

Changes for v7:

 - Fix for checking when creating a shared mapping in dax_associate_entry.
 - Remove dax_page_share_get().
 - Add dax_page_make_shared().
---
 fs/dax.c   | 55 +++
 include/linux/page-flags.h |  6 +
 2 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index bc538ba..6674540 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -351,27 +351,40 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
 
+/*
+ * A DAX folio is considered shared if it has no mapping set and ->share (which
+ * shares the ->index field) is non-zero. Note this may return false even if 
the
+ * page is shared between multiple files but has not yet actually been mapped
+ * into multiple address spaces.
+ */
 static inline bool dax_folio_is_shared(struct folio *folio)
 {
-   return folio->mapping == PAGE_MAPPING_DAX_SHARED;
+   return !folio->mapping && folio->page.share;
 }
 
 /*
- * Set the folio->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
- * refcount.
+ * When it is called by dax_insert_entry(), the shared flag will indicate
+ * whether this entry is shared by multiple files. If the page has not
+ * previously been associated with any mappings the ->mapping and ->index
+ * fields will be set. If it has already been associated with a mapping
+ * the mapping will be cleared and the share count set. It's then up to
+ * reverse map users like memory_failure() to call back into the filesystem to
+ * recover ->mapping and ->index information. For example by implementing
+ * dax_holder_operations.
  */
-static inline void dax_folio_share_get(struct folio *folio)
+static void dax_folio_make_shared(struct folio *folio)
 {
-   if (folio->mapping != PAGE_MAPPING_DAX_SHARED) {
-   /*
-* Reset the index if the page was already mapped
-* regularly before.
-*/
-   if (folio->mapping)
-   folio->page.share = 1;
-   folio->mapping = PAGE_MAPPING_DAX_SHARED;
-   }
-   folio->page.share++;
+   /*
+* folio is not currently shared so mark it as shared by clearing
+* folio->mapping.
+*/
+   folio->mapping = NULL;
+
+   /*
+* folio has previously been mapped into one address space so set the
+* share count.
+*/
+   folio->page.share = 1;
 }
 
 static inline unsigned long dax_folio_share_put(struct folio *folio)
@@ -379,12 +392,6 @@ static inline unsigned long dax_folio_share_put(struct 
folio *folio)
return --folio->page.share;
 }
 
-/*
- * When it is called in dax_insert_entry(), the shared flag will indicate
- * that whether this entry is shared by multiple files.  If so, set
- * the folio->mapping PAGE_MAPPING_DAX_SHARED, and use page->share
- * as refcount.
- */
 static void dax_associate_entry(void *entry, struct address_space *mapping,
struct vm_area_struct *vma, unsigned long address, bool shared)
 {
@@ -398,8 +405,12 @@ static void dax_associate_entry(void *entry, struct 
address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct folio *folio = pfn_folio(pfn);
 
-   if (shared) {
-   dax_folio_share_get(folio);
+   if (shared && (folio->mapping || folio->page.share)) {
+   if (fol

[PATCH v8 09/20] mm/gup: Remove redundant check for PCI P2PDMA page

2025-02-17 Thread Alistair Popple
PCI P2PDMA pages are not mapped with pXX_devmap PTEs, therefore the
check in __gup_device_huge() is redundant. Remove it.

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Wiliams 
Acked-by: David Hildenbrand 
---
 mm/gup.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e42e4fd..e5d6454 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3013,11 +3013,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, 
unsigned long addr,
break;
}
 
-   if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-   gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-   break;
-   }
-
folio = try_grab_folio_fast(page, 1, flags);
if (!folio) {
gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
-- 
git-series 0.9.1



Re: [RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked

2025-02-17 Thread Shrikanth Hegde




On 2/17/25 17:02, Tobias Huschle wrote:

Changes to v1

parked vs idle
- parked CPUs are now never considered to be idle
- a scheduler group is now considered parked iff there are parked CPUs
   and there are no idle CPUs, i.e. all non parked CPUs are busy or there
   are only parked CPUs. A scheduler group with parked tasks can be
   considered to not be parked, if it has idle CPUs which can pick up
   the parked tasks.
- idle_cpu_without always returns that the CPU will not be idle if the
   CPU is parked

active balance, no_hz, queuing
- should_we_balance always returns true if a scheduler group contains
   a parked CPU and that CPU has a running task
- stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
   if a task is running
- tasks are being prevented to be queued on parked CPUs in ttwu_queue_cond

cleanup
- removed duplicate checks for parked CPUs

CPU capacity
- added a patch which removes parked cpus and their capacity from
   scheduler statistics


Original description:

Adding a new scheduler group type which allows to remove all tasks
from certain CPUs through load balancing can help in scenarios where
such CPUs are currently unfavorable to use, for example in a
virtualized environment.

Functionally, this works as intended. The question is whether this
could be considered for inclusion and would be worth going forward
with. If so, which areas would need additional attention?
Some cases are referenced below.

The underlying concept and the approach of adding a new scheduler
group type were presented in the Sched MC of the 2024 LPC.
A short summary:

Some architectures (e.g. s390) provide virtualization on a firmware
level. This implies, that Linux kernels running on such architectures
run on virtualized CPUs.

Like in other virtualized environments, the CPUs are most likely shared
with other guests on the hardware level. This implies, that Linux
kernels running in such an environment may encounter 'steal time'. In
other words, instead of being able to use all available time on a
physical CPU, some of said available time is 'stolen' by other guests.

This can cause side effects if a guest is interrupted at an unfavorable
point in time or if the guest is waiting for one of its other virtual
CPUs to perform certain actions while those are suspended in favour of
another guest.

Architectures, like arch/s390, address this issue by providing an
alternative classification for the CPUs seen by the Linux kernel.

The following example is arch/s390 specific:
In the default mode (horizontal CPU polarization), all CPUs are treated
equally and can be subject to steal time equally.
In the alternate mode (vertical CPU polarization), the underlying
firmware hypervisor assigns the CPUs, visible to the guest, different
types, depending on how many CPUs the guest is entitled to use. Said
entitlement is configured by assigning weights to all active guests.
The three CPU types are:
 - vertical high   : On these CPUs, the guest always has the highest
                     priority over other guests. This means
                     especially that if the guest executes tasks on
                     these CPUs, it will encounter no steal time.
 - vertical medium : These CPUs are meant to cover fractions of
                     entitlement.
 - vertical low    : These CPUs will have no priority when being
                     scheduled. This implies especially that while
                     all other guests are using their full
                     entitlement, these CPUs might not be run for a
                     significant amount of time.

As a consequence, using vertical lows while the underlying hypervisor
experiences a high load, driven by all defined guests, is to be avoided.

In order to consequently move tasks off of vertical lows, introduce a
new type of scheduler groups: group_parked.
Parked implies, that processes should be evacuated as fast as possible
from these CPUs. This implies that other CPUs should start pulling tasks
immediately, while the parked CPUs should refuse to pull any tasks
themselves.
Adding a group type beyond group_overloaded achieves the expected
behavior. By making its selection architecture dependent, it has
no effect on architectures which will not make use of that group type.
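
For illustration only, an architecture hook wiring vertical-low CPUs to
the parked state might look like the sketch below; cpu_is_vertical_low()
is a placeholder, and the actual arch-side patch is not part of this
excerpt:

    /* In an arch-specific topology header (sketch). */
    #define arch_cpu_parked arch_cpu_parked
    static __always_inline bool arch_cpu_parked(int cpu)
    {
        /* Placeholder predicate: treat "vertical low" CPUs as parked. */
        return cpu_is_vertical_low(cpu);
    }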

This approach works very well for many kinds of workloads. Tasks are
getting migrated back and forth in line with changing the parked
state of the involved CPUs.

There are a couple of issues and corner cases which need further
considerations:
- rt & dl:  Realtime and deadline scheduling require some additional
 attention.


I think we need to address at least rt; there would be some non-percpu
kworker threads which need to move out of parked CPUs.



- ext:  Probably affected as well. Needs some conceptual
 thoughts first.
- raciness: Right now, there are no synchronization efforts. It needs
 

Re: [RFC PATCH v2 1/3] sched/fair: introduce new scheduler group type group_parked

2025-02-17 Thread Shrikanth Hegde

Hi Tobias.

On 2/17/25 17:02, Tobias Huschle wrote:

A parked CPU is flagged as unsuitable to process workload at the
moment, but might become usable at any time, depending on the need for
additional computation power and/or the available capacity of the
underlying hardware.

A scheduler group is considered to be parked, if there are tasks queued
on parked CPUs and there are no idle CPUs, i.e. all non parked CPUs are
busy or there are only parked CPUs. A scheduler group with parked tasks
can be considered to not be parked, if it has idle CPUs which can pick
up the parked tasks. A parked scheduler group is considered to be busier
than another if it runs more tasks on parked CPUs than another parked
scheduler group.

A parked CPU must keep its scheduler tick (or have it re-enabled if
necessary) in order to make sure that a parked CPU which only runs a
single task that does not give up its runtime voluntarily is still
evacuated, as it would otherwise go into NO_HZ.

The status of the underlying hardware must be considered to be
architecture dependent. Therefore the check whether a CPU is parked is
architecture specific. For architectures not relying on this feature,
the check is mostly a NOP.

This is more efficient and non-disruptive compared to CPU hotplug in
environments where such changes can be necessary on a frequent basis.

Signed-off-by: Tobias Huschle 
---
  include/linux/sched/topology.h | 19 
  kernel/sched/core.c| 13 -
  kernel/sched/fair.c| 86 +-
  kernel/sched/syscalls.c|  3 ++
  4 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f3dbafe1817..2a4730729988 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -265,6 +265,25 @@ unsigned long arch_scale_cpu_capacity(int cpu)
  }
  #endif
  
+#ifndef arch_cpu_parked

+/**
+ * arch_cpu_parked - Check if a given CPU is currently parked.
+ *
+ * A parked CPU cannot run any kind of workload since underlying
+ * physical CPU should not be used at the moment .
+ *
+ * @cpu: the CPU in question.
+ *
+ * By default assume CPU is not parked
+ *
+ * Return: Parked state of CPU
+ */
+static __always_inline bool arch_cpu_parked(int cpu)
+{
+   return false;
+}
+#endif
+
  #ifndef arch_scale_hw_pressure
  static __always_inline
  unsigned long arch_scale_hw_pressure(int cpu)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..9ed15911ec60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1352,6 +1352,9 @@ bool sched_can_stop_tick(struct rq *rq)
if (rq->cfs.h_nr_queued > 1)
return false;
  
+	if (rq->cfs.nr_running > 0 && arch_cpu_parked(cpu_of(rq)))

+   return false;
+


Do you mean rq->cfs.h_nr_queued or rq->nr_running?


/*
 * If there is one task and it has CFS runtime bandwidth constraints
 * and it's on the cpu now we don't want to stop the tick.
@@ -2443,7 +2446,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
  
  	/* Non kernel threads are not allowed during either online or offline. */

if (!(p->flags & PF_KTHREAD))
-   return cpu_active(cpu);
+   return !arch_cpu_parked(cpu) && cpu_active(cpu);
  
  	/* KTHREAD_IS_PER_CPU is always allowed. */

if (kthread_is_per_cpu(p))
@@ -2453,6 +2456,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
if (cpu_dying(cpu))
return false;
  
+	/* CPU should be avoided at the moment */

+   if (arch_cpu_parked(cpu))
+   return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
  }
@@ -3930,6 +3937,10 @@ static inline bool ttwu_queue_cond(struct task_struct 
*p, int cpu)
if (task_on_scx(p))
return false;
  
+	/* The task should not be queued onto a parked CPU. */

+   if (arch_cpu_parked(cpu))
+   return false;
+
/*
 * Do not complicate things with the async wake_list while the CPU is
 * in hotplug state.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..5eb1a3113704 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6871,6 +6871,8 @@ static int sched_idle_rq(struct rq *rq)
  #ifdef CONFIG_SMP
  static int sched_idle_cpu(int cpu)
  {
+   if (arch_cpu_parked(cpu))
+   return 0;
return sched_idle_rq(cpu_rq(cpu));
  }
  #endif
@@ -7399,6 +7401,9 @@ static int wake_affine(struct sched_domain *sd, struct 
task_struct *p,
  {
int target = nr_cpumask_bits;
  
+	if (arch_cpu_parked(target))

+   return prev_cpu;
+
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
  
@@ -9182,7 +9187,12 @@ enum group_type {

 * The CPU is overloaded and can't provide expected CPU cycles t

Re: [PATCH 1/1] powerpc: use __clang__ instead of CONFIG_CC_IS_CLANG

2025-02-17 Thread Michal Suchánek
Hello,

how does this happen?

On Tue, Feb 18, 2025 at 12:48:01PM +0800, Shung-Hsi Yu wrote:
> Due to include chain (below), powerpc's asm-compat.h is part of UAPI,
> thus it should use the __clang__ macro to directly detect whether Clang
> is used rather than relying on the kernel config setting. The latter is
> unreliable because the userspace tools that use UAPI may be compiled
> with a different compiler than the one used for the kernel, leading to
> incorrect constraint selection (see link for an example of such).
> 
>   include/uapi/linux/ptrace.h
>   arch/powerpc/include/asm/ptrace.h

There is arch/powerpc/include/uapi/asm/ptrace.h

and if the installed header is used this is what should be included.

That does not include other asm headers.

Thanks

Michal

>   arch/powerpc/include/asm/paca.h
>   arch/powerpc/include/asm/atomic.h
>   arch/powerpc/include/asm/asm-compat.h
> 
> Link: https://github.com/iovisor/bcc/issues/5172
> Signed-off-by: Shung-Hsi Yu 
> ---
>  arch/powerpc/include/asm/asm-compat.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-compat.h 
> b/arch/powerpc/include/asm/asm-compat.h
> index f48e644900a2..34f8740909a9 100644
> --- a/arch/powerpc/include/asm/asm-compat.h
> +++ b/arch/powerpc/include/asm/asm-compat.h
> @@ -37,7 +37,7 @@
>  #define STDX_BE  stringify_in_c(stdbrx)
>  #endif
>  
> -#ifdef CONFIG_CC_IS_CLANG
> +#ifdef __clang__
>  #define DS_FORM_CONSTRAINT "Z<>"
>  #else
>  #define DS_FORM_CONSTRAINT "YZ<>"
> -- 
> 2.48.1
> 



Re: [PATCH] tools/perf: Add check to tool pmu tests to ensure if the event is valid

2025-02-17 Thread Athira Rajeev


> On 13 Feb 2025, at 9:04 AM, Namhyung Kim  wrote:
> 
> On Thu, Feb 13, 2025 at 12:24:38AM +0530, Athira Rajeev wrote:
>> The "Tool PMU" tests fail on powerpc as below:
>> 
>>   12.1: Parsing without PMU name:
>>   --- start ---
>>   test child forked, pid 48492
>>   Using CPUID 0x00800200
>>   Attempt to add: tool/duration_time/
>>   ..after resolving event: tool/config=0x1/
>>   duration_time -> tool/duration_time/
>>   Attempt to add: tool/user_time/
>>   ..after resolving event: tool/config=0x2/
>>   user_time -> tool/user_time/
>>   Attempt to add: tool/system_time/
>>   ..after resolving event: tool/config=0x3/
>>   system_time -> tool/system_time/
>>   Attempt to add: tool/has_pmem/
>>   ..after resolving event: tool/config=0x4/
>>   has_pmem -> tool/has_pmem/
>>   Attempt to add: tool/num_cores/
>>   ..after resolving event: tool/config=0x5/
>>   num_cores -> tool/num_cores/
>>   Attempt to add: tool/num_cpus/
>>   ..after resolving event: tool/config=0x6/
>>   num_cpus -> tool/num_cpus/
>>   Attempt to add: tool/num_cpus_online/
>>   ..after resolving event: tool/config=0x7/
>>   num_cpus_online -> tool/num_cpus_online/
>>   Attempt to add: tool/num_dies/
>>   ..after resolving event: tool/config=0x8/
>>   num_dies -> tool/num_dies/
>>   Attempt to add: tool/num_packages/
>>   ..after resolving event: tool/config=0x9/
>>   num_packages -> tool/num_packages/
>> 
>>    unexpected signal (11) 
>>   12.1: Parsing without PMU name  : 
>> FAILED!
>> 
>> Same fail is observed for "Parsing with PMU name" as well.
>> 
>> The testcase loops through events in tool_pmu__for_each_event()
>> and accesses the event name using "tool_pmu__event_to_str()".
>> Here tool_pmu__event_to_str() returns null for the "slots" event
>> and the "system_tsc_freq" event. These two events are only applicable
>> to arm64 and x86 respectively, so the function tool_pmu__event_to_str()
>> skips these unsupported events and returns null. This null value is
>> what causes the testcase to fail.
>> 
>> To address this in the "Tool PMU" testcase, add a helper function
>> tool_pmu__all_event_to_str() which returns the name for all
>> events mapping to the tool_pmu_event index, including the
>> skipped ones. That way, even if it is a skipped event, the
>> helper function still resolves the tool_pmu_event index to
>> its mapped event name. Update the testcase to check for null event
>> names before proceeding with the test.
>> 
>> Signed-off-by: Athira Rajeev 
> 
> Please take a look at:
> https://lore.kernel.org/r/20250212163859.1489916-1-james.cl...@linaro.org
> 
> Thanks,
> Namhyung
Hi,

Sure, thanks for the fix, James.

Thomas,
Thanks for testing this patch. But James already fixed this with a different
patch and it is part of perf-tools-next:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/?h=perf-tools-next&id=615ec00b06f78912c370b372426190768402a5b9

Please test with the latest perf-tools-next.

Thanks
Athira

> 
>> ---
>> tools/perf/tests/tool_pmu.c | 12 
>> tools/perf/util/tool_pmu.c  | 17 +
>> tools/perf/util/tool_pmu.h  |  1 +
>> 3 files changed, 30 insertions(+)
>> 
>> diff --git a/tools/perf/tests/tool_pmu.c b/tools/perf/tests/tool_pmu.c
>> index 187942b749b7..e468e5fb3c73 100644
>> --- a/tools/perf/tests/tool_pmu.c
>> +++ b/tools/perf/tests/tool_pmu.c
>> @@ -19,6 +19,18 @@ static int do_test(enum tool_pmu_event ev, bool with_pmu)
>>  return TEST_FAIL;
>>  }
>> 
>> +/*
>> + * if tool_pmu__event_to_str returns NULL, Check if the event is
>> + * valid for the platform.
>> + * Example:
>> + * slots event is only on arm64.
>> + * system_tsc_freq event is only on x86.
>> + */
>> +if (!tool_pmu__event_to_str(ev) && 
>> tool_pmu__skip_event(tool_pmu__all_event_to_str(ev))) {
>> +ret = TEST_OK;
>> +goto out;
>> +}
>> +
>>  if (with_pmu)
>>  snprintf(str, sizeof(str), "tool/%s/", 
>> tool_pmu__event_to_str(ev));
>>  else
>> diff --git a/tools/perf/util/tool_pmu.c b/tools/perf/util/tool_pmu.c
>> index 3a68debe7143..572422797f6e 100644
>> --- a/tools/perf/util/tool_pmu.c
>> +++ b/tools/perf/util/tool_pmu.c
>> @@ -60,6 +60,15 @@ int tool_pmu__num_skip_events(void)
>>  return num;
>> }
>> 
>> +/*
>> + * tool_pmu__event_to_str returns only supported event names.
>> + * For events which are supposed to be skipped in the platform,
>> + * return NULL
>> + *
>> + * tool_pmu__all_event_to_str returns the name for all
>> + * events mapping to the tool_pmu_event index including the
>> + * skipped ones.
>> + */
>> const char *tool_pmu__event_to_str(enum tool_pmu_event ev)
>> {
>>  if ((ev > TOOL_PMU__EVENT_NONE && ev < TOOL_PMU__EVENT_MAX) &&
>> @@ -69,6 +78,14 @@ const char *tool_pmu__event_to_str(enum tool_pmu_event ev)
>>  return NULL;
>> }
>> 
>> +const char *tool_pmu__all_event_to_str(enum tool_pmu_event ev)
>> +{
>> +if (ev > TOOL_PMU__EVENT_NONE && ev < TOOL_PMU__EVENT_MAX)
>

Re: [PATCH] tools/perf: Pick the correct dwarf die while adding probe point for a function

2025-02-17 Thread James Clark




On 12/02/2025 1:19 pm, Athira Rajeev wrote:

Perf probe on vfs_fstatat fails as below on a powerpc system

./perf probe -nf --max-probes=512 -a 'vfs_fstatat $params'
Segmentation fault (core dumped)

This is observed while running perftool-testsuite_probe testcase.

While running with verbose, it is observed that the segfault happens
at:

synthesize_probe_trace_arg ()
synthesize_probe_trace_command ()
probe_file.add_event ()
apply_perf_probe_events ()
__cmd_probe ()
cmd_probe ()
run_builtin ()
handle_internal_command ()
main ()

Code in synthesize_probe_trace_arg() accesses a null value and results in
a segfault. The data structure which is null:
struct probe_trace_arg arg->value

We are hitting a case where arg->value is null for the probe point
"vfs_fstatat $params". This has been happening since commit e896474fe485
("getname_maybe_null() - the third variant of pathname copy-in").
Before the commit, the probe point for vfs_fstatat was added at only
one location:

Writing event: p:probe/vfs_fstatat _text+6345404 dfd=%gpr3:s32 
filename=%gpr4:x64 stat=%gpr5:x64 flags=%gpr6:s32

With this change, the vfs_fstatat code is inlined at other locations as
well:
Probe point found: __do_sys_lstat64+48
Probe point found: __do_sys_stat64+48
Probe point found: __do_sys_newlstat+48
Probe point found: __do_sys_newstat+48
Probe point found: vfs_fstatat+0

When trying to find the matching DWARF information entry (DIE)
from the debuginfo, the code incorrectly picks a DIE which does
not refer to vfs_fstatat. Snippet from the DWARF entry in the vmlinux
debuginfo file:

The main abstract die is:
  <1><4214883>: Abbrev Number: 147 (DW_TAG_subprogram)
 <4214885>   DW_AT_external: 1
 <4214885>   DW_AT_name: (indirect string, offset: 0x17b9f3): 
vfs_fstatat

With formal parameters:
  <2><4214896>: Abbrev Number: 51 (DW_TAG_formal_parameter)
 <4214897>   DW_AT_name: dfd
  <2><42148a3>: Abbrev Number: 23 (DW_TAG_formal_parameter)
 <42148a4>   DW_AT_name: (indirect string, offset: 0x8fda9): 
filename
  <2><42148b0>: Abbrev Number: 23 (DW_TAG_formal_parameter)
 <42148b1>   DW_AT_name: (indirect string, offset: 0x16bd9c): stat
  <2><42148bd>: Abbrev Number: 23 (DW_TAG_formal_parameter)
 <42148be>   DW_AT_name: (indirect string, offset: 0x39832b): flags

While collecting variables/parameters for a probe point, the function
copy_variables_cb() also looks at DWARF debug entries based on the
instruction address. Snippet:

 if (dwarf_haspc(die_mem, vf->pf->addr))
 return DIE_FIND_CB_CONTINUE;
 else
 return DIE_FIND_CB_SIBLING;

But in the case of the inlined function instance of vfs_fstatat, there are
two entries which have the same instruction address as their entry point.

Instance 1: which is for vfs_fstatat and DW_AT_abstract_origin points to
0x4214883 (reference above for main abstract die)

<3><42131fa>: Abbrev Number: 59 (DW_TAG_inlined_subroutine)
 <42131fb>   DW_AT_abstract_origin: <0x4214883>
 <42131ff>   DW_AT_entry_pc: 0xc062b1e0

Instance 2: which is not for vfs_fstatat but for getname

  <5><4213270>: Abbrev Number: 39 (DW_TAG_inlined_subroutine)
 <4213271>   DW_AT_abstract_origin: <0x4215b6b>
 <4213275>   DW_AT_entry_pc: 0xc062b1e0

But copy_variables_cb() continues to add parameters from the second
instance as well, based only on the dwarf_haspc() check. This results in
the formal parameters for getname also being appended to params. While
filling in args->value for these parameters, the value ends up null
because these args are not part of the DIE at offset "42131fa". These
incorrect args result in a segfault when the value field is accessed.

Save the Dwarf_Die which is the actual DW_TAG_subprogram as part of
"struct probe_finder". In copy_variables_cb(), include a check to make
sure the DW_AT_abstract_origin points to the correct entry when
dwarf_haspc() matches the instruction address.
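
For illustration, the added check amounts to roughly the following
(a simplified sketch of the hunk below; pf->abstract_die is the field
this patch introduces):

        Dwarf_Attribute attr;
        Dwarf_Die origin;

        /* Skip inlined instances whose abstract origin is not the probed
         * subprogram saved earlier in probe_point_search_cb(). */
        if (dwarf_tag(die_mem) == DW_TAG_inlined_subroutine &&
            dwarf_attr(die_mem, DW_AT_abstract_origin, &attr) &&
            dwarf_formref_die(&attr, &origin) &&
            dwarf_dieoffset(&origin) != dwarf_dieoffset(&pf->abstract_die))
                return DIE_FIND_CB_SIBLING;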

Signed-off-by: Athira Rajeev 
---
  tools/perf/util/probe-finder.c | 21 ++---
  tools/perf/util/probe-finder.h |  1 +
  2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index 1e769b68da37..361086a7adae 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -973,6 +973,7 @@ static int probe_point_search_cb(Dwarf_Die *sp_die, void 
*data)
pr_debug("Matched function: %s [%lx]\n", dwarf_diename(sp_die),
 (unsigned long)dwarf_dieoffset(sp_die));
pf->fname = fname;
+   memcpy(&pf->abstract_die, sp_die, sizeof(Dwarf_Die));
if (pp->line) { /* Function relative line */
dwarf_decl_line(sp_die, &pf->lno);
pf->lno += pp->line;
@@ -1179,6 +1180,8 @@ static int copy_variables_cb(Dwarf_Die *die_mem, void 
*data)
struct local_vars_finder *vf = data;
struct probe_finder *pf = vf->pf;
int tag;
+   D

[PATCH v8 05/20] fs/dax: Create a common implementation to break DAX layouts

2025-02-17 Thread Alistair Popple
Prior to freeing a block, file systems supporting FS DAX must check
that the associated pages are both unmapped from user-space and not
undergoing DMA or other access from e.g. get_user_pages(). This is
achieved by unmapping the file range and scanning the FS DAX
page-cache to see if any pages within the mapping have an elevated
refcount.

This is done using dax_layout_busy_page_range(), which returns a page
to wait on for the refcount to become idle, combined with a wait loop
that each filesystem currently open-codes. Rather than open-coding
this, introduce a common implementation to both unmap and wait for the
page to become idle.
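
As a minimal sketch of the resulting filesystem-side usage (the myfs_*
names are hypothetical; the pattern mirrors the ext4/fuse/xfs conversions
in the diff below):

        /* Callback that drops and re-takes the mapping lock so the waiter
         * can make forward progress, as ext4_wait_dax_page() does. */
        static void myfs_wait_dax_page(struct inode *inode)
        {
                filemap_invalidate_unlock(inode->i_mapping);
                schedule();
                filemap_invalidate_lock(inode->i_mapping);
        }

        static int myfs_break_layouts(struct inode *inode)
        {
                /* Returns 0 once every DAX page in the range is unmapped
                 * and idle, or a negative error such as -ERESTARTSYS. */
                return dax_break_layout(inode, 0, LLONG_MAX,
                                        myfs_wait_dax_page);
        }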

Signed-off-by: Alistair Popple 
Reviewed-by: Dan Williams 

---

Changes for v7:

 - Fix smatch warning, also reported by Dan and Darrick
 - Make sure xfs_break_layouts() can return -ERESTARTSYS, reported by
   Darrick
 - Use common definition of dax_page_is_idle()
 - Removed misplaced hunk changing madvise
 - Renamed dax_break_mapping() to dax_break_layout() suggested by Dan
 - Fix now unused variables in ext4

Changes for v5:

 - Don't wait for idle pages on non-DAX mappings

Changes for v4:

 - Fixed some build breakage due to missing symbol exports reported by
   John Hubbard (thanks!).
---
 fs/dax.c| 33 +
 fs/ext4/inode.c | 13 +
 fs/fuse/dax.c   | 27 +++
 fs/xfs/xfs_inode.c  | 26 +++---
 fs/xfs/xfs_inode.h  |  2 +-
 include/linux/dax.h | 23 ++-
 6 files changed, 63 insertions(+), 61 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f5fdb43..f1945aa 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -846,6 +846,39 @@ int dax_delete_mapping_entry(struct address_space 
*mapping, pgoff_t index)
return ret;
 }
 
+static int wait_page_idle(struct page *page,
+   void (cb)(struct inode *),
+   struct inode *inode)
+{
+   return ___wait_var_event(page, dax_page_is_idle(page),
+   TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
+/*
+ * Unmaps the inode and waits for any DMA to complete prior to deleting the
+ * DAX mapping entries for the range.
+ */
+int dax_break_layout(struct inode *inode, loff_t start, loff_t end,
+   void (cb)(struct inode *))
+{
+   struct page *page;
+   int error = 0;
+
+   if (!dax_mapping(inode->i_mapping))
+   return 0;
+
+   do {
+   page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+   if (!page)
+   break;
+
+   error = wait_page_idle(page, cb, inode);
+   } while (error == 0);
+
+   return error;
+}
+EXPORT_SYMBOL_GPL(dax_break_layout);
+
 /*
  * Invalidate DAX entry if it is clean.
  */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cc1acb1..2342bac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3911,21 +3911,10 @@ static void ext4_wait_dax_page(struct inode *inode)
 
 int ext4_break_layouts(struct inode *inode)
 {
-   struct page *page;
-   int error;
-
if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
return -EINVAL;
 
-   do {
-   page = dax_layout_busy_page(inode->i_mapping);
-   if (!page)
-   return 0;
-
-   error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
-   } while (error == 0);
-
-   return error;
+   return dax_break_layout_inode(inode, ext4_wait_dax_page);
 }
 
 /*
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index bf6faa3..0502bf3 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -666,33 +666,12 @@ static void fuse_wait_dax_page(struct inode *inode)
filemap_invalidate_lock(inode->i_mapping);
 }
 
-/* Should be called with mapping->invalidate_lock held exclusively */
-static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
-   loff_t start, loff_t end)
-{
-   struct page *page;
-
-   page = dax_layout_busy_page_range(inode->i_mapping, start, end);
-   if (!page)
-   return 0;
-
-   *retry = true;
-   return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
-}
-
+/* Should be called with mapping->invalidate_lock held exclusively. */
 int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
  u64 dmap_end)
 {
-   boolretry;
-   int ret;
-
-   do {
-   retry = false;
-   ret = __fuse_dax_break_layouts(inode, &retry, dmap_start,
-  dmap_end);
-   } while (ret == 0 && retry);
-
-   return ret;
+   return dax_break_layout(inode, dmap_start, dmap_end,
+   fuse_wait_dax_page);
 }
 
 ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1b5613d..d4f07e0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2735,21 +273

[PATCH v8 19/20] fs/dax: Properly refcount fs dax pages

2025-02-17 Thread Alistair Popple
Currently fs dax pages are considered free when the refcount drops to
one and their refcounts are not increased when mapped via PTEs or
decreased when unmapped. This requires special logic in mm paths to
detect that these pages should not be properly refcounted, and to
detect when the refcount drops to one instead of zero.

On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is
unpinned.

Tracking this special behaviour requires extra PTE bits
(eg. pte_devmap) and introduces rules that are potentially confusing
and specific to FS DAX pages. To fix this, and to possibly allow
removal of the special PTE bits in future, convert the fs dax page
refcounts to be zero based and instead take a reference on the page
each time it is mapped as is currently the case for normal pages.

This may also allow a future clean-up to remove the pgmap refcounting
that is currently done in mm/gup.c.
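
In short, the "is this page idle" test changes from the special-cased
refcount of one to the normal zero-based rule used for other pages
(a sketch of the invariant, not code from this patch):

        /* Before this series: a "free" FS DAX page still carries one ref. */
        idle = (page_ref_count(page) == 1);

        /* After this series: FS DAX pages follow the usual rule. */
        idle = (page_ref_count(page) == 0);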

Signed-off-by: Alistair Popple 
Reviewed-by: Dan Williams 

---

Changes for v8:

 - Rebased on mm-unstable - conflicts with Matthew's earlier changes.
 - Made dax_folio_put() easier to read thanks to David's suggestions.
 - Removed a useless WARN_ON_ONCE()

Changes for v7:
 - s/dax_device_folio_init/dax_folio_init/ as suggested by Dan
 - s/dax_folio_share_put/dax_folio_put/

Changes since v2:

Based on some questions from Dan I attempted to have the FS DAX page
cache (ie. address space) hold a reference to the folio whilst it was
mapped. However I came to the strong conclusion that this was not the
right thing to do.

If the page refcount == 0 it means the page is:

1. not mapped into user-space
2. not subject to other access via DMA/GUP/etc.

Ie. From the core MM perspective the page is not in use.

The fact a page may or may not be present in one or more address space
mappings is irrelevant for core MM. It just means the page is still in
use or valid from the file system perspective, and it is the
responsibility of the file system to remove these mappings if the pfn
mapping becomes invalid (along with first making sure the MM state,
ie. page->refcount, is idle). So we shouldn't be trying to track that
lifetime with MM refcounts.

Doing so just makes DMA-idle tracking more complex because there is
now another thing (one or more address spaces) which can hold
references on a page. And FS DAX can't even keep track of all the
address spaces which might contain a reference to the page in the
XFS/reflink case anyway.

We could do this if we made file systems invalidate all address space
mappings prior to calling dax_break_layouts(), but that isn't
currently necessary and would lead to increased faults just so we
could do some superfluous refcounting which the file system already
does.

I have however put the page sharing checks and WARN_ON's back, which
also turned out to be useful for figuring out when to re-initialise
a folio.
---
 drivers/nvdimm/pmem.c|   4 +-
 fs/dax.c | 186 
 fs/fuse/virtio_fs.c  |   3 +-
 include/linux/dax.h  |   2 +-
 include/linux/mm.h   |  27 +--
 include/linux/mm_types.h |   7 +-
 mm/gup.c |   9 +--
 mm/huge_memory.c |   6 +-
 mm/internal.h|   2 +-
 mm/memory-failure.c  |   6 +-
 mm/memory.c  |   6 +-
 mm/memremap.c|  47 --
 mm/mm_init.c |   9 +--
 mm/swap.c|   2 +-
 14 files changed, 165 insertions(+), 151 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d81faa9..785b2d2 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
 
pmem->disk = disk;
pmem->pgmap.owner = pmem;
-   pmem->pfn_flags = PFN_DEV;
+   pmem->pfn_flags = 0;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
@@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
range_len(&pmem->pgmap.range);
-   pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
@@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
-   pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
} else {
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/fs/dax.c b/fs/dax.c
index 6674540..cf96f3d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(voi

[PATCH v8 14/20] mm/rmap: Add support for PUD sized mappings to rmap

2025-02-17 Thread Alistair Popple
The rmap doesn't currently support adding a PUD mapping of a
folio. This patch adds support for entire PUD mappings of folios,
primarily to allow for more standard refcounting of device DAX
folios. Currently DAX is the only user of this and it doesn't require
support for partially mapped PUD-sized folios, so we don't support
that for now.
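
The intended call pattern looks roughly like the following (a sketch based
on the vmf_insert_folio_pud() caller added later in this series, not code
from this patch):

        /* Map an entire PUD-sized folio: take a reference, account it, and
         * record the "entire" mapping in the rmap. */
        folio_get(folio);
        folio_add_file_rmap_pud(folio, &folio->page, vma);
        add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PUD_NR);

        /* ... and the inverse when the mapping is torn down: */
        folio_remove_rmap_pud(folio, &folio->page, vma);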

Signed-off-by: Alistair Popple 
Acked-by: David Hildenbrand 
Reviewed-by: Dan Williams 

---

Changes for v8:

 - Rebase on mm-unstable, only a minor conflict due to code addition
   at the same place.

Changes for v6:

 - Minor comment formatting fix
 - Add an additional check for CONFIG_TRANSPARENT_HUGEPAGE to fix a
   build breakage when CONFIG_PGTABLE_HAS_HUGE_LEAVES is not defined.

Changes for v5:

 - Fixed accounting as suggested by David.

Changes for v4:

 - New for v4, split out rmap changes as suggested by David.
---
 include/linux/rmap.h | 15 ++-
 mm/rmap.c| 67 ++---
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 69e9a43..6abf796 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
 enum rmap_level {
RMAP_LEVEL_PTE = 0,
RMAP_LEVEL_PMD,
+   RMAP_LEVEL_PUD,
 };
 
 static inline void __folio_rmap_sanity_checks(const struct folio *folio,
@@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(const struct 
folio *folio,
VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
break;
+   case RMAP_LEVEL_PUD:
+   /*
+* Assume that we are creating a single "entire" mapping of the
+* folio.
+*/
+   VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
+   VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
+   break;
default:
VM_WARN_ON_ONCE(true);
}
@@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page 
*, int nr_pages,
folio_add_file_rmap_ptes(folio, page, 1, vma)
 void folio_add_file_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *);
+void folio_add_file_rmap_pud(struct folio *, struct page *,
+   struct vm_area_struct *);
 void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
struct vm_area_struct *);
 #define folio_remove_rmap_pte(folio, page, vma) \
folio_remove_rmap_ptes(folio, page, 1, vma)
 void folio_remove_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *);
+void folio_remove_rmap_pud(struct folio *, struct page *,
+   struct vm_area_struct *);
 
 void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
@@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct 
folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+   case RMAP_LEVEL_PUD:
atomic_inc(&folio->_entire_mapcount);
atomic_inc(&folio->_large_mapcount);
break;
@@ -437,6 +451,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct 
folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+   case RMAP_LEVEL_PUD:
if (PageAnonExclusive(page)) {
if (unlikely(maybe_pinned))
return -EBUSY;
diff --git a/mm/rmap.c b/mm/rmap.c
index 333ecac..bcec867 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1269,12 +1269,19 @@ static __always_inline unsigned int 
__folio_add_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+   case RMAP_LEVEL_PUD:
first = atomic_inc_and_test(&folio->_entire_mapcount);
if (first) {
nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
if (likely(nr < ENTIRELY_MAPPED + ENTIRELY_MAPPED)) {
-   *nr_pmdmapped = folio_nr_pages(folio);
-   nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+   nr_pages = folio_nr_pages(folio);
+   /*
+* We only track PMD mappings of PMD-sized
+* folios separately.
+*/
+   if (level == RMAP_LEVEL_PMD)
+   *nr_pmdmapped = nr_pages;
+   nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
   

[PATCH v8 06/20] fs/dax: Always remove DAX page-cache entries when breaking layouts

2025-02-17 Thread Alistair Popple
Prior to any truncation operation, file systems call
dax_break_mapping() to ensure pages in the range are not undergoing
DMA. Later, DAX page-cache entries will be removed by
truncate_folio_batch_exceptionals() in the generic page-cache code.

However, this makes it possible for folios to be removed from the
page-cache even though they are still DMA-busy if the file-system
hasn't called dax_break_mapping(). It also means they can never be
waited on in the future because FS DAX will lose track of them once the
page-cache entry has been deleted.

Instead it is better to delete the FS DAX entry when the file-system
calls dax_break_mapping() as part of its truncate operation. This
ensures only idle pages can be removed from the FS DAX page-cache and
makes it easy to detect if a file-system hasn't called
dax_break_mapping() prior to a truncate operation.
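
After this change dax_break_layout() has two calling modes (a sketch of
the callers converted below; myfs_wait_dax_page() is a hypothetical
callback):

        /* Blocking mode: wait for DMA to finish, after which the DAX
         * page-cache entries for the range are deleted before returning. */
        error = dax_break_layout(inode, start, end, myfs_wait_dax_page);

        /* NOWAIT mode: pass a NULL callback; the first busy page makes the
         * helper bail out with -ERESTARTSYS instead of sleeping, as in the
         * xfs_mmaplock_two_inodes_and_break_dax_layout() hunk below. */
        error = dax_break_layout(inode, 0, -1, NULL);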

Signed-off-by: Alistair Popple 
Reviewed-by: Dan Williams 

---

Changes for v7:

 - s/dax_break_mapping/dax_break_layout/ suggested by Dan.
 - Rework dax_break_mapping() to take a NULL callback for NOWAIT
   behaviour as suggested by Dan.
---
 fs/dax.c| 40 
 fs/xfs/xfs_inode.c  |  5 ++---
 include/linux/dax.h |  2 ++
 mm/truncate.c   | 16 +++-
 4 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index f1945aa..14fbe51 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -846,6 +846,36 @@ int dax_delete_mapping_entry(struct address_space 
*mapping, pgoff_t index)
return ret;
 }
 
+void dax_delete_mapping_range(struct address_space *mapping,
+   loff_t start, loff_t end)
+{
+   void *entry;
+   pgoff_t start_idx = start >> PAGE_SHIFT;
+   pgoff_t end_idx;
+   XA_STATE(xas, &mapping->i_pages, start_idx);
+
+   /* If end == LLONG_MAX, all pages from start to till end of file */
+   if (end == LLONG_MAX)
+   end_idx = ULONG_MAX;
+   else
+   end_idx = end >> PAGE_SHIFT;
+
+   xas_lock_irq(&xas);
+   xas_for_each(&xas, entry, end_idx) {
+   if (!xa_is_value(entry))
+   continue;
+   entry = wait_entry_unlocked_exclusive(&xas, entry);
+   if (!entry)
+   continue;
+   dax_disassociate_entry(entry, mapping, true);
+   xas_store(&xas, NULL);
+   mapping->nrpages -= 1UL << dax_entry_order(entry);
+   put_unlocked_entry(&xas, entry, WAKE_ALL);
+   }
+   xas_unlock_irq(&xas);
+}
+EXPORT_SYMBOL_GPL(dax_delete_mapping_range);
+
 static int wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
@@ -857,6 +887,9 @@ static int wait_page_idle(struct page *page,
 /*
  * Unmaps the inode and waits for any DMA to complete prior to deleting the
  * DAX mapping entries for the range.
+ *
+ * For NOWAIT behavior, pass @cb as NULL to early-exit on first found
+ * busy page
  */
 int dax_break_layout(struct inode *inode, loff_t start, loff_t end,
void (cb)(struct inode *))
@@ -871,10 +904,17 @@ int dax_break_layout(struct inode *inode, loff_t start, 
loff_t end,
page = dax_layout_busy_page_range(inode->i_mapping, start, end);
if (!page)
break;
+   if (!cb) {
+   error = -ERESTARTSYS;
+   break;
+   }
 
error = wait_page_idle(page, cb, inode);
} while (error == 0);
 
+   if (!page)
+   dax_delete_mapping_range(inode->i_mapping, start, end);
+
return error;
 }
 EXPORT_SYMBOL_GPL(dax_break_layout);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d4f07e0..8008337 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2735,7 +2735,6 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
struct xfs_inode*ip2)
 {
int error;
-   struct page *page;
 
if (ip1->i_ino > ip2->i_ino)
swap(ip1, ip2);
@@ -2759,8 +2758,8 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
 * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
 * for this nested lock case.
 */
-   page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
-   if (!dax_page_is_idle(page)) {
+   error = dax_break_layout(VFS_I(ip2), 0, -1, NULL);
+   if (error) {
xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
goto again;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a6b277f..2fbb262 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -255,6 +255,8 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned 
int order,
 vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
unsigned int order, pfn_t pfn);
 int dax_delete_mapping_

[PATCH v8 03/20] fs/dax: Don't skip locked entries when scanning entries

2025-02-17 Thread Alistair Popple
Several functions internal to FS DAX use the following pattern when
trying to obtain an unlocked entry:

xas_for_each(&xas, entry, end_idx) {
if (dax_is_locked(entry))
entry = get_unlocked_entry(&xas, 0);

This is problematic because get_unlocked_entry() will return the next
present unlocked entry in the range rather than the current one.
Therefore any processing of the original locked entry will be
skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy
pages in the range, leading file systems to free blocks whilst DMA
operations are ongoing, which can lead to file system corruption.

Instead, callers within a xas_for_each() loop should wait for the
current entry to be unlocked without advancing the XArray state, so a
new function is introduced to do that.

Also, while we are here, rename get_unlocked_entry() to
get_next_unlocked_entry() to make it clear that it may advance the
iterator state.
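
The corrected loop shape (condensed from the dax_layout_busy_page_range()
hunk below) waits on the current entry in place instead of advancing:

        xas_for_each(&xas, entry, end_idx) {
                if (!xa_is_value(entry))
                        continue;
                entry = wait_entry_unlocked_exclusive(&xas, entry);
                if (!entry)
                        continue;
                /* ... process the now-unlocked entry ... */
                put_unlocked_entry(&xas, entry, WAKE_NEXT);
        }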

Signed-off-by: Alistair Popple 
Reviewed-by: Dan Williams 
---
 fs/dax.c | 50 +-
 1 file changed, 41 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b35f538..f5fdb43 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -206,7 +206,7 @@ static void dax_wake_entry(struct xa_state *xas, void 
*entry,
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
+static void *get_next_unlocked_entry(struct xa_state *xas, unsigned int order)
 {
void *entry;
struct wait_exceptional_entry_queue ewait;
@@ -236,6 +236,37 @@ static void *get_unlocked_entry(struct xa_state *xas, 
unsigned int order)
 }
 
 /*
+ * Wait for the given entry to become unlocked. Caller must hold the i_pages
+ * lock and call either put_unlocked_entry() if it did not lock the entry or
+ * dax_unlock_entry() if it did. Returns an unlocked entry if still present.
+ */
+static void *wait_entry_unlocked_exclusive(struct xa_state *xas, void *entry)
+{
+   struct wait_exceptional_entry_queue ewait;
+   wait_queue_head_t *wq;
+
+   init_wait(&ewait.wait);
+   ewait.wait.func = wake_exceptional_entry_func;
+
+   while (unlikely(dax_is_locked(entry))) {
+   wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+   prepare_to_wait_exclusive(wq, &ewait.wait,
+   TASK_UNINTERRUPTIBLE);
+   xas_pause(xas);
+   xas_unlock_irq(xas);
+   schedule();
+   finish_wait(wq, &ewait.wait);
+   xas_lock_irq(xas);
+   entry = xas_load(xas);
+   }
+
+   if (xa_is_internal(entry))
+   return NULL;
+
+   return entry;
+}
+
+/*
  * The only thing keeping the address space around is the i_pages lock
  * (it's cycled in clear_inode() after removing the entries from i_pages)
  * After we call xas_unlock_irq(), we cannot touch xas->xa.
@@ -250,7 +281,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void 
*entry)
 
wq = dax_entry_waitqueue(xas, entry, &ewait.key);
/*
-* Unlike get_unlocked_entry() there is no guarantee that this
+* Unlike get_next_unlocked_entry() there is no guarantee that this
 * path ever successfully retrieves an unlocked entry before an
 * inode dies. Perform a non-exclusive wait in case this path
 * never successfully performs its own wake up.
@@ -581,7 +612,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
 retry:
pmd_downgrade = false;
xas_lock_irq(xas);
-   entry = get_unlocked_entry(xas, order);
+   entry = get_next_unlocked_entry(xas, order);
 
if (entry) {
if (dax_is_conflict(entry))
@@ -717,8 +748,7 @@ struct page *dax_layout_busy_page_range(struct 
address_space *mapping,
xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
-   if (unlikely(dax_is_locked(entry)))
-   entry = get_unlocked_entry(&xas, 0);
+   entry = wait_entry_unlocked_exclusive(&xas, entry);
if (entry)
page = dax_busy_page(entry);
put_unlocked_entry(&xas, entry, WAKE_NEXT);
@@ -751,7 +781,7 @@ static int __dax_invalidate_entry(struct address_space 
*mapping,
void *entry;
 
xas_lock_irq(&xas);
-   entry = get_unlocked_entry(&xas, 0);
+   entry = get_next_unlocked_entry(&xas, 0);
if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
goto out;
if (!trunc &&
@@ -777,7 +807,9 @@ static int __dax_clear_dirty_range(struct address_space 
*mapping,
 
xas_lock_irq(&xas);
xas_for_each(&xas, entry, end) {
-   entry = get_unlocked_entry(&xas, 0);
+   entry = wait_entry_unlocked_exclusive(&xas, entry);
+   if (!entry)
+   continue;
  

[PATCH v8 07/20] fs/dax: Ensure all pages are idle prior to filesystem unmount

2025-02-17 Thread Alistair Popple
File systems call dax_break_mapping() prior to reallocating file system
blocks to ensure the page is not undergoing any DMA or other
accesses. Generally this is needed when a file is truncated to ensure that
if a block is reallocated nothing is writing to it. However filesystems
currently don't call this when an FS DAX inode is evicted.

This can cause problems when the file system is unmounted, as a page can
continue to be undergoing DMA or other remote access after unmount. This
means that if the file system is remounted, any truncate or other operation
which requires the underlying file system block to be freed will not wait
for the remote access to complete. Therefore a busy block may be reallocated
to a new file, leading to corruption.
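
A minimal sketch of what a filesystem's ->evict_inode() ends up doing
(myfs_evict_inode() is hypothetical; it mirrors the xfs_fs_evict_inode()
hunk below):

        static void myfs_evict_inode(struct inode *inode)
        {
                /* Drain any remaining DAX references before the backing
                 * blocks can be reused. */
                if (IS_DAX(inode))
                        dax_break_layout_final(inode);

                truncate_inode_pages_final(&inode->i_data);
                clear_inode(inode);
        }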

Signed-off-by: Alistair Popple 

---

Changes for v7:

 - Don't take locks during inode eviction as suggested by Darrick and
   therefore remove the callback for dax_break_mapping_uninterruptible().
 - Use common definition of dax_page_is_idle().
 - Fixed smatch suggestion in dax_break_mapping_uninterruptible().
 - Rename dax_break_mapping_uninterruptible() to dax_break_layout_final()
   as suggested by Dan.

Changes for v5:

 - Don't wait for pages to be idle in non-DAX mappings
---
 fs/dax.c| 27 +++
 fs/ext4/inode.c |  2 ++
 fs/xfs/xfs_super.c  | 12 
 include/linux/dax.h |  5 +
 4 files changed, 46 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 14fbe51..bc538ba 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -884,6 +884,13 @@ static int wait_page_idle(struct page *page,
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
 }
 
+static void wait_page_idle_uninterruptible(struct page *page,
+   struct inode *inode)
+{
+   ___wait_var_event(page, dax_page_is_idle(page),
+   TASK_UNINTERRUPTIBLE, 0, 0, schedule());
+}
+
 /*
  * Unmaps the inode and waits for any DMA to complete prior to deleting the
  * DAX mapping entries for the range.
@@ -919,6 +926,26 @@ int dax_break_layout(struct inode *inode, loff_t start, 
loff_t end,
 }
 EXPORT_SYMBOL_GPL(dax_break_layout);
 
+void dax_break_layout_final(struct inode *inode)
+{
+   struct page *page;
+
+   if (!dax_mapping(inode->i_mapping))
+   return;
+
+   do {
+   page = dax_layout_busy_page_range(inode->i_mapping, 0,
+   LLONG_MAX);
+   if (!page)
+   break;
+
+   wait_page_idle_uninterruptible(page, inode);
+   } while (true);
+
+   dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
+}
+EXPORT_SYMBOL_GPL(dax_break_layout_final);
+
 /*
  * Invalidate DAX entry if it is clean.
  */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2342bac..3cc8da6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -181,6 +181,8 @@ void ext4_evict_inode(struct inode *inode)
 
trace_ext4_evict_inode(inode);
 
+   dax_break_layout_final(inode);
+
if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
ext4_evict_ea_inode(inode);
if (inode->i_nlink) {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d92d7a0..22abe0e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -751,6 +751,17 @@ xfs_fs_drop_inode(
return generic_drop_inode(inode);
 }
 
+STATIC void
+xfs_fs_evict_inode(
+   struct inode*inode)
+{
+   if (IS_DAX(inode))
+   dax_break_layout_final(inode);
+
+   truncate_inode_pages_final(&inode->i_data);
+   clear_inode(inode);
+}
+
 static void
 xfs_mount_free(
struct xfs_mount*mp)
@@ -1215,6 +1226,7 @@ static const struct super_operations xfs_super_operations 
= {
.destroy_inode  = xfs_fs_destroy_inode,
.dirty_inode= xfs_fs_dirty_inode,
.drop_inode = xfs_fs_drop_inode,
+   .evict_inode= xfs_fs_evict_inode,
.put_super  = xfs_fs_put_super,
.sync_fs= xfs_fs_sync_fs,
.freeze_fs  = xfs_fs_freeze,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 2fbb262..2333c30 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -232,6 +232,10 @@ static inline int __must_check dax_break_layout(struct 
inode *inode,
 {
return 0;
 }
+
+static inline void dax_break_layout_final(struct inode *inode)
+{
+}
 #endif
 
 bool dax_alive(struct dax_device *dax_dev);
@@ -266,6 +270,7 @@ static inline int __must_check 
dax_break_layout_inode(struct inode *inode,
 {
return dax_break_layout(inode, 0, LLONG_MAX, cb);
 }
+void dax_break_layout_final(struct inode *inode);
 int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
  struct inode *dest, loff_t destoff,
  loff_t len, bool *is_same,
-- 
git-series 0.9.1



[PATCH v8 04/20] fs/dax: Refactor wait for dax idle page

2025-02-17 Thread Alistair Popple
An FS DAX page is considered idle when its refcount drops to one. This
is currently open-coded in all file systems supporting FS DAX. Move
the idle detection to a common function to make future changes easier.

Signed-off-by: Alistair Popple 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
Acked-by: Theodore Ts'o 
---
 fs/ext4/inode.c | 5 +
 fs/fuse/dax.c   | 4 +---
 fs/xfs/xfs_inode.c  | 4 +---
 include/linux/dax.h | 8 
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7c54ae5..cc1acb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3922,10 +3922,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
 
-   error = ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1,
-   TASK_INTERRUPTIBLE, 0, 0,
-   ext4_wait_dax_page(inode));
+   error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
} while (error == 0);
 
return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index b7f805d..bf6faa3 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -677,9 +677,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, 
bool *retry,
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, fuse_wait_dax_page(inode));
+   return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
 }
 
 int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b1f9f15..1b5613d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3020,9 +3020,7 @@ xfs_break_dax_layouts(
return 0;
 
*retry = true;
-   return ___wait_var_event(&page->_refcount,
-   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-   0, 0, xfs_wait_dax_page(inode));
+   return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index df41a00..9b1ce98 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -207,6 +207,14 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t 
len, bool *did_zero,
 int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);
 
+static inline int dax_wait_page_idle(struct page *page,
+   void (cb)(struct inode *),
+   struct inode *inode)
+{
+   return ___wait_var_event(page, page_ref_count(page) == 1,
+   TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
 void dax_read_unlock(int id);
-- 
git-series 0.9.1



[PATCH v8 12/20] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings

2025-02-17 Thread Alistair Popple
In preparation for using insert_page() for DAX, enhance
insert_page_into_pte_locked() to handle establishing writable
mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a
PTE, which bypasses the typical set_pte_range() in finish_fault().

Signed-off-by: Alistair Popple 
Suggested-by: Dan Williams 
Reviewed-by: Dan Williams 
Acked-by: David Hildenbrand 

---

Changes for v7:
 - Drop entry and reuse pteval as suggested by David.

Changes for v5:
 - Minor comment/formatting fixes suggested by David Hildenbrand

Changes since v2:
 - New patch split out from "mm/memory: Add dax_insert_pfn"
---
 mm/memory.c | 39 ++-
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 905ed2f..becfaf4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2126,19 +2126,39 @@ static int validate_page_before_insert(struct 
vm_area_struct *vma,
 }
 
 static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
-   unsigned long addr, struct page *page, pgprot_t prot)
+   unsigned long addr, struct page *page,
+   pgprot_t prot, bool mkwrite)
 {
struct folio *folio = page_folio(page);
-   pte_t pteval;
+   pte_t pteval = ptep_get(pte);
+
+   if (!pte_none(pteval)) {
+   if (!mkwrite)
+   return -EBUSY;
+
+   /* see insert_pfn(). */
+   if (pte_pfn(pteval) != page_to_pfn(page)) {
+   WARN_ON_ONCE(!is_zero_pfn(pte_pfn(pteval)));
+   return -EFAULT;
+   }
+   pteval = maybe_mkwrite(pteval, vma);
+   pteval = pte_mkyoung(pteval);
+   if (ptep_set_access_flags(vma, addr, pte, pteval, 1))
+   update_mmu_cache(vma, addr, pte);
+   return 0;
+   }
 
-   if (!pte_none(ptep_get(pte)))
-   return -EBUSY;
/* Ok, finally just insert the thing.. */
pteval = mk_pte(page, prot);
if (unlikely(is_zero_folio(folio))) {
pteval = pte_mkspecial(pteval);
} else {
folio_get(folio);
+   pteval = mk_pte(page, prot);
+   if (mkwrite) {
+   pteval = pte_mkyoung(pteval);
+   pteval = maybe_mkwrite(pte_mkdirty(pteval), vma);
+   }
inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
folio_add_file_rmap_pte(folio, page, vma);
}
@@ -2147,7 +2167,7 @@ static int insert_page_into_pte_locked(struct 
vm_area_struct *vma, pte_t *pte,
 }
 
 static int insert_page(struct vm_area_struct *vma, unsigned long addr,
-   struct page *page, pgprot_t prot)
+   struct page *page, pgprot_t prot, bool mkwrite)
 {
int retval;
pte_t *pte;
@@ -2160,7 +2180,8 @@ static int insert_page(struct vm_area_struct *vma, 
unsigned long addr,
pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
-   retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
+   retval = insert_page_into_pte_locked(vma, pte, addr, page, prot,
+   mkwrite);
pte_unmap_unlock(pte, ptl);
 out:
return retval;
@@ -2174,7 +2195,7 @@ static int insert_page_in_batch_locked(struct 
vm_area_struct *vma, pte_t *pte,
err = validate_page_before_insert(vma, page);
if (err)
return err;
-   return insert_page_into_pte_locked(vma, pte, addr, page, prot);
+   return insert_page_into_pte_locked(vma, pte, addr, page, prot, false);
 }
 
 /* insert_pages() amortizes the cost of spinlock operations
@@ -2310,7 +2331,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned 
long addr,
BUG_ON(vma->vm_flags & VM_PFNMAP);
vm_flags_set(vma, VM_MIXEDMAP);
}
-   return insert_page(vma, addr, page, vma->vm_page_prot);
+   return insert_page(vma, addr, page, vma->vm_page_prot, false);
 }
 EXPORT_SYMBOL(vm_insert_page);
 
@@ -2590,7 +2611,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct 
*vma,
 * result in pfn_t_has_page() == false.
 */
page = pfn_to_page(pfn_t_to_pfn(pfn));
-   err = insert_page(vma, addr, page, pgprot);
+   err = insert_page(vma, addr, page, pgprot, mkwrite);
} else {
return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
}
-- 
git-series 0.9.1



[PATCH v8 11/20] mm: Allow compound zone device pages

2025-02-17 Thread Alistair Popple
Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.

A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.

Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.

A tail page is distinguished by having bit zero set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages, page->compound_head is shared with
page->pgmap.

The page->pgmap field must be common to all pages within a folio, even
if the folio spans memory sections. Therefore pgmap is the same for
both head and tail pages, can be moved into the folio, and we can
use the standard scheme to find compound_head from a tail page.
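
Conceptually, the new accessor used throughout the diff below has roughly
the following shape (a sketch; the real definition is in the mm_types.h
and mmzone.h hunks, which are truncated in this digest):

        static inline struct dev_pagemap *page_pgmap(const struct page *page)
        {
                /* pgmap now lives in the folio, so tail pages reach it via
                 * the ordinary compound_head() encoding (bit zero of
                 * page->compound_head set means "tail page"). */
                return page_folio(page)->pgmap;
        }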

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: David Hildenbrand 

---

Changes for v7:
 - Skip ZONE_DEVICE PMDs during mlock which was previously a separate
   patch.

Changes for v4:
 - Fix build breakages reported by kernel test robot

Changes since v2:

 - Indentation fix
 - Rename page_dev_pagemap() to page_pgmap()
 - Rename folio _unused field to _unused_pgmap_compound_head
 - s/WARN_ON/VM_WARN_ON_ONCE_PAGE/

Changes since v1:

 - Move pgmap to the folio as suggested by Matthew Wilcox
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  3 ++-
 drivers/pci/p2pdma.c   |  6 +++---
 include/linux/memremap.h   |  6 +++---
 include/linux/migrate.h|  4 ++--
 include/linux/mm_types.h   |  9 +++--
 include/linux/mmzone.h | 12 +++-
 lib/test_hmm.c |  3 ++-
 mm/hmm.c   |  2 +-
 mm/memory.c|  4 +++-
 mm/memremap.c  | 14 +++---
 mm/migrate_device.c|  7 +--
 mm/mlock.c |  2 ++
 mm/mm_init.c   |  2 +-
 13 files changed, 49 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a07256..61d0f41 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,8 @@ struct nouveau_dmem {
 
 static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
 {
-   return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+   return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
+   pagemap);
 }
 
 static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 04773a8..19214ec 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -202,7 +202,7 @@ static const struct attribute_group p2pmem_group = {
 
 static void p2pdma_page_free(struct page *page)
 {
-   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
@@ -1025,8 +1025,8 @@ enum pci_p2pdma_map_type
 pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
   struct scatterlist *sg)
 {
-   if (state->pgmap != sg_page(sg)->pgmap) {
-   state->pgmap = sg_page(sg)->pgmap;
+   if (state->pgmap != page_pgmap(sg_page(sg))) {
+   state->pgmap = page_pgmap(sg_page(sg));
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..0256a42 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page 
*page)
 {
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+   page_pgmap(page)->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page 
*page)
 {
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+   page_pgmap(page)

[PATCH v8 15/20] mm/huge_memory: Add vmf_insert_folio_pud()

2025-02-17 Thread Alistair Popple
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages
introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio
and take references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pud, which
simply inserts a special devmap PUD entry into the page table without
holding a reference to the page for the mapping.

Signed-off-by: Alistair Popple 
Reviewed-by: Dan Williams 
Acked-by: David Hildenbrand 

---

Changes for v7:
 - Added a comment clarifying why we can insert without a reference.

Changes for v5:
 - Removed is_huge_zero_pud() as it's unlikely to ever be implemented.
 - Minor code clean-up suggested by David.
---
 include/linux/huge_mm.h |   2 +-
 mm/huge_memory.c|  99 -
 2 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2bd1811..b60e2d4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
+   bool write);
 
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3159ae0..1da6047 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1482,19 +1482,17 @@ static void insert_pfn_pud(struct vm_area_struct *vma, 
unsigned long addr,
struct mm_struct *mm = vma->vm_mm;
pgprot_t prot = vma->vm_page_prot;
pud_t entry;
-   spinlock_t *ptl;
 
-   ptl = pud_lock(mm, pud);
if (!pud_none(*pud)) {
if (write) {
if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn)))
-   goto out_unlock;
+   return;
entry = pud_mkyoung(*pud);
entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
if (pudp_set_access_flags(vma, addr, pud, entry, 1))
update_mmu_cache_pud(vma, addr, pud);
}
-   goto out_unlock;
+   return;
}
 
entry = pud_mkhuge(pfn_t_pud(pfn, prot));
@@ -1508,9 +1506,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, 
unsigned long addr,
}
set_pud_at(mm, addr, pud, entry);
update_mmu_cache_pud(vma, addr, pud);
-
-out_unlock:
-   spin_unlock(ptl);
 }
 
 /**
@@ -1528,6 +1523,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t 
pfn, bool write)
unsigned long addr = vmf->address & PUD_MASK;
struct vm_area_struct *vma = vmf->vma;
pgprot_t pgprot = vma->vm_page_prot;
+   spinlock_t *ptl;
 
/*
 * If we had pud_special, we could avoid all these restrictions,
@@ -1545,10 +1541,57 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, 
pfn_t pfn, bool write)
 
track_pfn_insert(vma, &pgprot, pfn);
 
+   ptl = pud_lock(vma->vm_mm, vmf->pud);
insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
+   spin_unlock(ptl);
+
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
+
+/**
+ * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry
+ * @vmf: Structure describing the fault
+ * @folio: folio to insert
+ * @write: whether it's a write fault
+ *
+ * Return: vm_fault_t value.
+ */
+vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
+   bool write)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long addr = vmf->address & PUD_MASK;
+   pud_t *pud = vmf->pud;
+   struct mm_struct *mm = vma->vm_mm;
+   spinlock_t *ptl;
+
+   if (addr < vma->vm_start || addr >= vma->vm_end)
+   return VM_FAULT_SIGBUS;
+
+   if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
+   return VM_FAULT_SIGBUS;
+
+   ptl = pud_lock(mm, pud);
+
+   /*
+* If there is already an entry present we assume the folio is
+* already mapped, hence no need to take another reference. We
+* still call insert_pfn_pud() though in case the mapping needs
+* upgrading to writeable.
+*/
+   if (pud_none(*vmf->pud)) {
+   folio_get(folio);
+   folio_add_file_rmap_pud(folio, &folio->page, vma);
+   add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
+   }
+   insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)),
+   write);
+   spin_unlock(ptl);
+
+   return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_folio

[PATCH v8 10/20] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma

2025-02-17 Thread Alistair Popple
Currently ZONE_DEVICE page reference counts are initialised by core
memory management code in __init_zone_device_page() as part of the
memremap() call which driver modules make to obtain ZONE_DEVICE
pages. This initialises page refcounts to 1 before returning them to
the driver.

This was presumably done because drivers had a reference of sorts
on the page. It also ensured the page could always be mapped with
vm_insert_page(), for example, and would never get freed (ie. have a
zero refcount), freeing drivers from manipulating page reference counts.

However it complicates figuring out whether or not a page is free from
the mm perspective because it is no longer possible to just look at
the refcount. Instead the page type must be known and if GUP is used a
secondary pgmap reference is also sometimes needed.

To simplify this it is desirable to remove the page reference count
for the driver, so core mm can just use the refcount without having to
account for page type or do other types of tracking. This is possible
because drivers can always assume the page is valid as core kernel
will never offline or remove the struct page.

This means it is now up to drivers to initialise the page refcount as
required. P2PDMA uses vm_insert_page() to map the page, and that
requires a non-zero reference count when initialising the page so set
that when the page is first mapped.

Signed-off-by: Alistair Popple 
Reviewed-by: Dan Williams 
Acked-by: David Hildenbrand 

---

Changes since v2:

 - Initialise the page refcount for all pages covered by the kaddr
---
 drivers/pci/p2pdma.c | 13 +++--
 mm/memremap.c| 17 +
 mm/mm_init.c | 22 ++
 3 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0cb7e0a..04773a8 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -140,13 +140,22 @@ static int p2pmem_alloc_mmap(struct file *filp, struct 
kobject *kobj,
rcu_read_unlock();
 
for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
-   ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+   struct page *page = virt_to_page(kaddr);
+
+   /*
+* Initialise the refcount for the freshly allocated page. As
+* we have just allocated the page no one else should be
+* using it.
+*/
+   VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
+   set_page_count(page, 1);
+   ret = vm_insert_page(vma, vaddr, page);
if (ret) {
gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
return ret;
}
percpu_ref_get(ref);
-   put_page(virt_to_page(kaddr));
+   put_page(page);
kaddr += PAGE_SIZE;
len -= PAGE_SIZE;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547..07bbe0e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
folio->mapping = NULL;
folio->page.pgmap->ops->page_free(folio_page(folio, 0));
 
-   if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
-   folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+   switch (folio->page.pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_COHERENT:
+   put_dev_pagemap(folio->page.pgmap);
+   break;
+
+   case MEMORY_DEVICE_FS_DAX:
+   case MEMORY_DEVICE_GENERIC:
/*
 * Reset the refcount to 1 to prepare for handing out the page
 * again.
 */
folio_set_count(folio, 1);
-   else
-   put_dev_pagemap(folio->page.pgmap);
+   break;
+
+   case MEMORY_DEVICE_PCI_P2PDMA:
+   break;
+   }
 }
 
 void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c767946..6be9796 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1017,12 +1017,26 @@ static void __ref __init_zone_device_page(struct page 
*page, unsigned long pfn,
}
 
/*
-* ZONE_DEVICE pages are released directly to the driver page allocator
-* which will set the page count to 1 when allocating the page.
+* ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
+* MEMORY_TYPE_FS_DAX pages are released directly to the driver page
+* allocator which will set the page count to 1 when allocating the
+* page.
+*
+* MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
+* their refcount reset to one whenever they are freed (ie. after
+* their refcount drops to 0).
 */
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
-   pgmap->type == MEMORY_DEVICE_COHERENT)
+   switch (pgma

[PATCH v8 16/20] mm/huge_memory: Add vmf_insert_folio_pmd()

2025-02-17 Thread Alistair Popple
Currently DAX folio/page reference counts are managed differently to normal
pages. To allow these to be managed the same as normal pages introduce
vmf_insert_folio_pmd. This will map the entire PMD-sized folio and take
references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.

It is not currently useful to implement a more generic vmf_insert_folio()
which selects the correct behaviour based on folio_order(). This is because
PTE faults require only a subpage of the folio to be PTE mapped rather than
the entire folio. It would be possible to add this context somewhere but
callers already need to handle PTE faults and PMD faults separately so a
more generic function is not useful.

Signed-off-by: Alistair Popple 
Acked-by: David Hildenbrand 

---

Changes for v8:

 - Cleanup useless and confusing pgtable assignment.
 - Fix line lengths

Changes for v7:

 - Fix bad pgtable handling for PPC64 (Thanks Dan and Dave)
 - Add lockdep_assert() to document locking requirements for insert_pfn_pmd()

Changes for v5:

 - Minor code cleanup suggested by David
---
 include/linux/huge_mm.h |  2 +-
 mm/huge_memory.c| 65 ++
 2 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b60e2d4..e893d54 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
+   bool write);
 vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
bool write);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1da6047..d189826 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1375,20 +1375,20 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault 
*vmf)
return __do_huge_pmd_anonymous_page(vmf);
 }
 
-static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
pgtable_t pgtable)
 {
struct mm_struct *mm = vma->vm_mm;
pmd_t entry;
-   spinlock_t *ptl;
 
-   ptl = pmd_lock(mm, pmd);
+   lockdep_assert_held(pmd_lockptr(mm, pmd));
+
if (!pmd_none(*pmd)) {
if (write) {
if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
-   goto out_unlock;
+   return -EEXIST;
}
entry = pmd_mkyoung(*pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1396,7 +1396,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
 
-   goto out_unlock;
+   return -EEXIST;
}
 
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
@@ -1412,16 +1412,11 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
unsigned long addr,
if (pgtable) {
pgtable_trans_huge_deposit(mm, pmd, pgtable);
mm_inc_nr_ptes(mm);
-   pgtable = NULL;
}
 
set_pmd_at(mm, addr, pmd, entry);
update_mmu_cache_pmd(vma, addr, pmd);
-
-out_unlock:
-   spin_unlock(ptl);
-   if (pgtable)
-   pte_free(mm, pgtable);
+   return 0;
 }
 
 /**
@@ -1440,6 +1435,8 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t 
pfn, bool write)
struct vm_area_struct *vma = vmf->vma;
pgprot_t pgprot = vma->vm_page_prot;
pgtable_t pgtable = NULL;
+   spinlock_t *ptl;
+   int error;
 
/*
 * If we had pmd_special, we could avoid all these restrictions,
@@ -1462,12 +1459,56 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, 
pfn_t pfn, bool write)
}
 
track_pfn_insert(vma, &pgprot, pfn);
+   ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+   error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write,
+   pgtable);
+   spin_unlock(ptl);
+   if (error && pgtable)
+   pte_free(vma->vm_mm, pgtable);
 
-   insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
+vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
+   

[PATCH v8 18/20] dcssblk: Mark DAX broken, remove FS_DAX_LIMITED support

2025-02-17 Thread Alistair Popple
From: Dan Williams 

The dcssblk driver has long needed special case support to enable
limited dax operation, the so-called CONFIG_FS_DAX_LIMITED. This mode
works around the incomplete support for ZONE_DEVICE on s390 by forgoing
the ability of dax-mapped pages to support GUP.

Now, pending cleanups to fsdax that fix its reference counting [1] depend on
the ability of all dax drivers to supply ZONE_DEVICE pages.

To allow that work to move forward, dax support needs to be paused for
dcssblk until ZONE_DEVICE support arrives. That work has been known for
a few years [2], and the removal of "pte_devmap" requirements [3] makes the
conversion easier.

For now, place the support behind CONFIG_BROKEN, and remove PFN_SPECIAL
(dcssblk was the only user).

Link: 
http://lore.kernel.org/cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apop...@nvidia.com
 [1]
Link: http://lore.kernel.org/20210820210318.187742e8@thinkpad/ [2]
Link: 
http://lore.kernel.org/4511465a4f8429f45e2ac70d2e65dc5e1df1eb47.1725941415.git-series.apop...@nvidia.com
 [3]
Reviewed-by: Gerald Schaefer 
Tested-by: Alexander Gordeev 
Acked-by: David Hildenbrand 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Sven Schnelle 
Cc: Jan Kara 
Cc: Matthew Wilcox 
Cc: Christoph Hellwig 
Cc: Alistair Popple 
Signed-off-by: Dan Williams 
---
 Documentation/filesystems/dax.rst |  1 -
 drivers/s390/block/Kconfig| 12 ++--
 drivers/s390/block/dcssblk.c  | 27 +--
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/Documentation/filesystems/dax.rst 
b/Documentation/filesystems/dax.rst
index 719e90f..08dd5e2 100644
--- a/Documentation/filesystems/dax.rst
+++ b/Documentation/filesystems/dax.rst
@@ -207,7 +207,6 @@ implement direct_access.
 
 These block devices may be used for inspiration:
 - brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
 - pmem: NVDIMM persistent memory driver
 
 
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index e3710a7..4bfe469 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -4,13 +4,21 @@ comment "S/390 block device drivers"
 
 config DCSSBLK
def_tristate m
-   select FS_DAX_LIMITED
-   select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
  Support for dcss block device
 
+config DCSSBLK_DAX
+   def_bool y
+   depends on DCSSBLK
+   # requires S390 ZONE_DEVICE support
+   depends on BROKEN
+   select DAX
+   prompt "DCSSBLK DAX support"
+   help
+ Enable DAX operation for the dcss block device
+
 config DASD
def_tristate y
prompt "Support for DASD devices"
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f14d27..7248e54 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -534,6 +534,21 @@ static const struct attribute_group 
*dcssblk_dev_attr_groups[] = {
NULL,
 };
 
+static int dcssblk_setup_dax(struct dcssblk_dev_info *dev_info)
+{
+   struct dax_device *dax_dev;
+
+   if (!IS_ENABLED(CONFIG_DCSSBLK_DAX))
+   return 0;
+
+   dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+   if (IS_ERR(dax_dev))
+   return PTR_ERR(dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
+   return dax_add_host(dev_info->dax_dev, dev_info->gd);
+}
+
 /*
  * device attribute for adding devices
  */
@@ -547,7 +562,6 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char 
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
-   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -674,14 +688,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char 
if (rc)
goto put_dev;
 
-   dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR(dax_dev)) {
-   rc = PTR_ERR(dax_dev);
-   goto put_dev;
-   }
-   set_dax_synchronous(dax_dev);
-   dev_info->dax_dev = dax_dev;
-   rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
+   rc = dcssblk_setup_dax(dev_info);
if (rc)
goto out_dax;
 
@@ -917,7 +924,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, 
pgoff_t pgoff,
*kaddr = __va(dev_info->start + offset);
if (pfn)
*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset),
-   PFN_DEV|PFN_SPECIAL);
+ PFN_DEV);
 
return (dev_sz - offset) / PAGE_SIZE;
 }
-- 
git-series 0.9.1



[PATCH v8 13/20] mm/memory: Add vmf_insert_page_mkwrite()

2025-02-17 Thread Alistair Popple
Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is
because DAX page refcounts are treated specially, as indicated by the
presence of a devmap entry.

To allow DAX page refcounts to be managed the same as normal page
refcounts introduce vmf_insert_page_mkwrite(). This will take a
reference on the underlying page much the same as vmf_insert_page,
except it also permits upgrading an existing mapping to be writable if
requested/possible.

Signed-off-by: Alistair Popple 
Acked-by: David Hildenbrand 

---

Changes for v8:
 - Remove temp suggested by David.

Changes for v7:
 - Fix vmf_insert_page_mkwrite by removing pfn gunk as suggested by
   David.

Updates from v2:

 - Rename function to make not DAX specific

 - Split the insert_page_into_pte_locked() change into a separate
   patch.

Updates from v1:

 - Re-arrange code in insert_page_into_pte_locked() based on comments
   from Jan Kara.

 - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
---
 include/linux/mm.h |  2 ++
 mm/memory.c| 20 
 2 files changed, 22 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fabd537..d1f260d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3638,6 +3638,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page 
**pages,
unsigned long num);
 int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
+vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
+   bool write);
 vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index becfaf4..a978b77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2624,6 +2624,26 @@ static vm_fault_t __vm_insert_mixed(struct 
vm_area_struct *vma,
return VM_FAULT_NOPAGE;
 }
 
+vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
+   bool write)
+{
+   pgprot_t pgprot = vmf->vma->vm_page_prot;
+   unsigned long addr = vmf->address;
+   int err;
+
+   if (addr < vmf->vma->vm_start || addr >= vmf->vma->vm_end)
+   return VM_FAULT_SIGBUS;
+
+   err = insert_page(vmf->vma, addr, page, pgprot, write);
+   if (err == -ENOMEM)
+   return VM_FAULT_OOM;
+   if (err < 0 && err != -EBUSY)
+   return VM_FAULT_SIGBUS;
+
+   return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
+
 vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
pfn_t pfn)
 {
-- 
git-series 0.9.1



[PATCH v8 17/20] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages

2025-02-17 Thread Alistair Popple
Longterm pinning of FS DAX pages should already be disallowed by
various pXX_devmap checks. However a future change will cause these
checks to be invalid for FS DAX pages so make
folio_is_longterm_pinnable() return false for FS DAX pages.

Signed-off-by: Alistair Popple 
Reviewed-by: John Hubbard 
Reviewed-by: Dan Williams 
Acked-by: David Hildenbrand 
---
 include/linux/memremap.h | 11 +++
 include/linux/mm.h   |  7 +++
 2 files changed, 18 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0256a42..4aa1519 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -187,6 +187,17 @@ static inline bool folio_is_device_coherent(const struct 
folio *folio)
return is_device_coherent_page(&folio->page);
 }
 
+static inline bool is_fsdax_page(const struct page *page)
+{
+   return is_zone_device_page(page) &&
+   page_pgmap(page)->type == MEMORY_DEVICE_FS_DAX;
+}
+
+static inline bool folio_is_fsdax(const struct folio *folio)
+{
+   return is_fsdax_page(&folio->page);
+}
+
 #ifdef CONFIG_ZONE_DEVICE
 void zone_device_page_init(struct page *page);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d1f260d..066aebd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2109,6 +2109,13 @@ static inline bool folio_is_longterm_pinnable(struct 
folio *folio)
if (folio_is_device_coherent(folio))
return false;
 
+   /*
+* Filesystems can only tolerate transient delays to truncate and
+* hole-punch operations
+*/
+   if (folio_is_fsdax(folio))
+   return false;
+
/* Otherwise, non-movable zone folios can be pinned. */
return !folio_is_zone_movable(folio);
 
-- 
git-series 0.9.1



[PATCH v8 20/20] device/dax: Properly refcount device dax pages when mapping

2025-02-17 Thread Alistair Popple
Device DAX pages are currently not reference counted when mapped,
instead relying on the devmap PTE bit to ensure mapping code will not
get/put references. This requires special handling in various page
table walkers, particularly GUP, to manage references on the
underlying pgmap to ensure the pages remain valid.

However there is no reason these pages can't be refcounted properly at
map time. Doing so eliminates the need for the devmap PTE bit,
freeing up a precious PTE bit. It also simplifies GUP as it no longer
needs to manage the special pgmap references and can instead just
treat the pages normally as defined by vm_normal_page().

Signed-off-by: Alistair Popple 
---
 drivers/dax/device.c | 15 +--
 mm/memremap.c| 13 ++---
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index bc871a3..328231c 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -125,11 +125,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+   pfn = phys_to_pfn_t(phys, 0);
 
dax_set_mapping(vmf, pfn, fault_size);
 
-   return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn),
+   vmf->flags & FAULT_FLAG_WRITE);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -168,11 +169,12 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+   pfn = phys_to_pfn_t(phys, 0);
 
dax_set_mapping(vmf, pfn, fault_size);
 
-   return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+   return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)),
+   vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -213,11 +215,12 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax 
*dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+   pfn = phys_to_pfn_t(phys, 0);
 
dax_set_mapping(vmf, pfn, fault_size);
 
-   return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+   return vmf_insert_folio_pud(vmf, page_folio(pfn_t_to_page(pfn)),
+   vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/mm/memremap.c b/mm/memremap.c
index 9a8879b..532a52a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -460,11 +460,10 @@ void free_zone_device_folio(struct folio *folio)
 {
struct dev_pagemap *pgmap = folio->pgmap;
 
-   if (WARN_ON_ONCE(!pgmap->ops))
-   return;
-
-   if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
-!pgmap->ops->page_free))
+   if (WARN_ON_ONCE((!pgmap->ops &&
+ pgmap->type != MEMORY_DEVICE_GENERIC) ||
+(pgmap->ops && !pgmap->ops->page_free &&
+ pgmap->type != MEMORY_DEVICE_FS_DAX)))
return;
 
mem_cgroup_uncharge(folio);
@@ -494,7 +493,8 @@ void free_zone_device_folio(struct folio *folio)
 * zero which indicating the page has been removed from the file
 * system mapping.
 */
-   if (pgmap->type != MEMORY_DEVICE_FS_DAX)
+   if (pgmap->type != MEMORY_DEVICE_FS_DAX &&
+   pgmap->type != MEMORY_DEVICE_GENERIC)
folio->mapping = NULL;
 
switch (pgmap->type) {
@@ -509,7 +509,6 @@ void free_zone_device_folio(struct folio *folio)
 * Reset the refcount to 1 to prepare for handing out the page
 * again.
 */
-   pgmap->ops->page_free(folio_page(folio, 0));
folio_set_count(folio, 1);
break;
 
-- 
git-series 0.9.1



Re: [PATCH] tools/perf: Add check to tool pmu tests to ensure if the event is valid

2025-02-17 Thread Athira Rajeev



> On 13 Feb 2025, at 9:04 AM, Namhyung Kim  wrote:
> 
> On Thu, Feb 13, 2025 at 12:24:38AM +0530, Athira Rajeev wrote:
>> "Tool PMU" tests fails on powerpc as below:
>> 
>>   12.1: Parsing without PMU name:
>>   --- start ---
>>   test child forked, pid 48492
>>   Using CPUID 0x00800200
>>   Attempt to add: tool/duration_time/
>>   ..after resolving event: tool/config=0x1/
>>   duration_time -> tool/duration_time/
>>   Attempt to add: tool/user_time/
>>   ..after resolving event: tool/config=0x2/
>>   user_time -> tool/user_time/
>>   Attempt to add: tool/system_time/
>>   ..after resolving event: tool/config=0x3/
>>   system_time -> tool/system_time/
>>   Attempt to add: tool/has_pmem/
>>   ..after resolving event: tool/config=0x4/
>>   has_pmem -> tool/has_pmem/
>>   Attempt to add: tool/num_cores/
>>   ..after resolving event: tool/config=0x5/
>>   num_cores -> tool/num_cores/
>>   Attempt to add: tool/num_cpus/
>>   ..after resolving event: tool/config=0x6/
>>   num_cpus -> tool/num_cpus/
>>   Attempt to add: tool/num_cpus_online/
>>   ..after resolving event: tool/config=0x7/
>>   num_cpus_online -> tool/num_cpus_online/
>>   Attempt to add: tool/num_dies/
>>   ..after resolving event: tool/config=0x8/
>>   num_dies -> tool/num_dies/
>>   Attempt to add: tool/num_packages/
>>   ..after resolving event: tool/config=0x9/
>>   num_packages -> tool/num_packages/
>> 
>>    unexpected signal (11) 
>>   12.1: Parsing without PMU name  : 
>> FAILED!
>> 
>> Same fail is observed for "Parsing with PMU name" as well.
>> 
>> The testcase loops through events in tool_pmu__for_each_event()
>> and access event name using "tool_pmu__event_to_str()".
>> Here tool_pmu__event_to_str returns null for "slots" event
>> and "system_tsc_freq" event. These two events are only applicable
>> for arm64 and x86 respectively. So the function tool_pmu__event_to_str()
>> skips for unsupported events and returns null. This null value is
>> causing testcase fail.
>> 
>> To address this in "Tool PMU" testcase, add a helper function
>> tool_pmu__all_event_to_str() which returns the name for all
>> events mapping to the tool_pmu_event index including the
>> skipped ones. So that even if its a skipped event, the
>> helper function helps to resolve the tool_pmu_event index to
>> its mapping event name. Update the testcase to check for null event
>> names before proceeding the test.
>> 
>> Signed-off-by: Athira Rajeev 
> 
> Please take a look at:
> https://lore.kernel.org/r/20250212163859.1489916-1-james.cl...@linaro.org
> 
> Thanks,
> Namhyung

Hi,

Sure, thanks for the fix, James.

Thomas,
Thanks for testing this patch.  But James already fixed this with a different 
patch and it is part of perf-tools-next
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/?h=perf-tools-next&id=615ec00b06f78912c370b372426190768402a5b9

Please test with latest perf-tools-next 

Thanks
Athira

> 
>> ---
>> tools/perf/tests/tool_pmu.c | 12 
>> tools/perf/util/tool_pmu.c  | 17 +
>> tools/perf/util/tool_pmu.h  |  1 +
>> 3 files changed, 30 insertions(+)
>> 
>> diff --git a/tools/perf/tests/tool_pmu.c b/tools/perf/tests/tool_pmu.c
>> index 187942b749b7..e468e5fb3c73 100644
>> --- a/tools/perf/tests/tool_pmu.c
>> +++ b/tools/perf/tests/tool_pmu.c
>> @@ -19,6 +19,18 @@ static int do_test(enum tool_pmu_event ev, bool with_pmu)
>> return TEST_FAIL;
>> }
>> 
>> + /*
>> +  * if tool_pmu__event_to_str returns NULL, Check if the event is
>> +  * valid for the platform.
>> +  * Example:
>> +  * slots event is only on arm64.
>> +  * system_tsc_freq event is only on x86.
>> +  */
>> + if (!tool_pmu__event_to_str(ev) && 
>> tool_pmu__skip_event(tool_pmu__all_event_to_str(ev))) {
>> + ret = TEST_OK;
>> + goto out;
>> + }
>> +
>> if (with_pmu)
>> snprintf(str, sizeof(str), "tool/%s/", tool_pmu__event_to_str(ev));
>> else
>> diff --git a/tools/perf/util/tool_pmu.c b/tools/perf/util/tool_pmu.c
>> index 3a68debe7143..572422797f6e 100644
>> --- a/tools/perf/util/tool_pmu.c
>> +++ b/tools/perf/util/tool_pmu.c
>> @@ -60,6 +60,15 @@ int tool_pmu__num_skip_events(void)
>> return num;
>> }
>> 
>> +/*
>> + * tool_pmu__event_to_str returns only supported event names.
>> + * For events which are supposed to be skipped in the platform,
>> + * return NULL
>> + *
>> + * tool_pmu__all_event_to_str returns the name for all
>> + * events mapping to the tool_pmu_event index including the
>> + * skipped ones.
>> + */
>> const char *tool_pmu__event_to_str(enum tool_pmu_event ev)
>> {
>> if ((ev > TOOL_PMU__EVENT_NONE && ev < TOOL_PMU__EVENT_MAX) &&
>> @@ -69,6 +78,14 @@ const char *tool_pmu__event_to_str(enum tool_pmu_event ev)
>> return NULL;
>> }
>> 
>> +const char *tool_pmu__all_event_to_str(enum tool_pmu_event ev)
>> +{
>> + if (ev > TOOL_PMU__EVENT_NONE && ev < TOOL_PMU__EVENT_MAX)
>> + return tool_pmu__event_names[ev];
>> +
>> + return NULL;
>> +}
>> +
>> enum tool_pmu_event tool_pmu__str_t

Re: [RESEND v4 0/3] mm/pkey: Add PKEY_UNRESTRICTED macro

2025-02-17 Thread Catalin Marinas
On Mon, 13 Jan 2025 17:06:16 +, Yury Khrustalev wrote:
> Add PKEY_UNRESTRICTED macro to mman.h and use it in selftests.
> 
> For context, this change will also allow for more consistent update of the
> Glibc manual which in turn will help with introducing memory protection
> keys on AArch64 targets.
> 
> Applies to 5bc55a333a2f (tag: v6.13-rc7).
> 
> [...]

Applied to arm64 (for-next/pkey_unrestricted), thanks!

[1/3] mm/pkey: Add PKEY_UNRESTRICTED macro
  https://git.kernel.org/arm64/c/6d61527d931b
[2/3] selftests/mm: Use PKEY_UNRESTRICTED macro
  https://git.kernel.org/arm64/c/3809cefe93f6
[3/3] selftests/powerpc: Use PKEY_UNRESTRICTED macro
  https://git.kernel.org/arm64/c/00894c3fc917

-- 
Catalin




Re: panic in cpufreq_online() in 6.14-rc1 on PowerNV

2025-02-17 Thread Nicholas Piggin
On Thu Feb 6, 2025 at 6:41 PM AEST, Dan Horák wrote:
> Hi,
>
> I am getting a kernel panic on my Raptor Talos Power9 system after
> updating to the 6.14-rc1 kernel from 6.13. Seems reproducable every
> time, but I haven't start bisecting yet. Does it sound familiar to
> anyone?

No, but it's possible it could be skiboot changes in PM code.

Thanks,
Nick



Re: [PATCH v7 16/20] huge_memory: Add vmf_insert_folio_pmd()

2025-02-17 Thread David Hildenbrand

On 17.02.25 05:29, Alistair Popple wrote:

On Mon, Feb 10, 2025 at 07:45:09PM +0100, David Hildenbrand wrote:

On 04.02.25 23:48, Alistair Popple wrote:

Currently DAX folio/page reference counts are managed differently to normal
pages. To allow these to be managed the same as normal pages introduce
vmf_insert_folio_pmd. This will map the entire PMD-sized folio and take
references as it would for a normally mapped page.

This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.

It is not currently useful to implement a more generic vmf_insert_folio()
which selects the correct behaviour based on folio_order(). This is because
PTE faults require only a subpage of the folio to be PTE mapped rather than
the entire folio. It would be possible to add this context somewhere but
callers already need to handle PTE faults and PMD faults separately so a
more generic function is not useful.

Signed-off-by: Alistair Popple 


Nit: patch subject ;)



---

Changes for v7:

   - Fix bad pgtable handling for PPC64 (Thanks Dan and Dave)


Is it? ;) insert_pfn_pmd() still doesn't consume a "pgtable_t *"

But maybe I am missing something ...


At a high level, all I'm trying to do (perhaps badly) is pull the ptl locking one
level up the call stack.

As far as I can tell the pgtable is consumed here:

static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
pgtable_t pgtable)

[...]

if (pgtable) {
pgtable_trans_huge_deposit(mm, pmd, pgtable);
mm_inc_nr_ptes(mm);
pgtable = NULL;
}

[...]

return 0;

Now I can see I failed to clean up the useless pgtable = NULL assignment, which
is confusing because I'm not trying to look at pgtable in the caller (ie.
vmf_insert_pfn_pmd()/vmf_insert_folio_pmd()) to determine if it needs freeing.
So I will remove this assignment.


Ahhh, yes, the "pgtable = NULL" confused me, so I was looking for a 
"pgtable_t *pgtable" being passed instead, that we could manipulate.




Instead, callers just look at the return code from insert_pfn_pmd(): if there
was an error, pgtable_trans_huge_deposit(pgtable) wasn't called, and if the caller
passed a pgtable it should be freed. Otherwise, if insert_pfn_pmd() succeeded,
then callers can assume the pgtable was consumed by pgtable_trans_huge_deposit()
and therefore should not be freed.

Hopefully that all makes sense, but maybe I've missed something obvious too...


Yes, you assume that if insert_pfn_pmd() succeeds, the table was 
consumed, otherwise it must be freed.


Thanks!

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v13 4/5] arm64: support copy_mc_[user]_highpage()

2025-02-17 Thread Tong Tiangen




在 2025/2/15 1:24, Catalin Marinas 写道:

On Fri, Feb 14, 2025 at 10:49:01AM +0800, Tong Tiangen wrote:

在 2025/2/13 1:11, Catalin Marinas 写道:

On Mon, Dec 09, 2024 at 10:42:56AM +0800, Tong Tiangen wrote:

Currently, many scenarios that can tolerate memory errors when copying page
have been supported in the kernel[1~5], all of which are implemented by
copy_mc_[user]_highpage(). arm64 should also support this mechanism.

Due to mte, arm64 needs to have its own copy_mc_[user]_highpage()
architecture implementation, macros __HAVE_ARCH_COPY_MC_HIGHPAGE and
__HAVE_ARCH_COPY_MC_USER_HIGHPAGE have been added to control it.

Add new helper copy_mc_page() which provide a page copy implementation with
hardware memory error safe. The code logic of copy_mc_page() is the same as
copy_page(), the main difference is that the ldp insn of copy_mc_page()
contains the fixup type EX_TYPE_KACCESS_ERR_ZERO_MEM_ERR, therefore, the
main logic is extracted to copy_page_template.S. In addition, the fixup of
MOPS insn is not considered at present.


Could we not add the exception table entry permanently but ignore the
exception table entry if it's not on the do_sea() path? That would save
some code duplication.


I'm sorry, I didn't catch your point, that the do_sea() and non do_sea()
paths use different exception tables?


No, they would have the same exception table, only that we'd interpret
it differently depending on whether it's a SEA error or not. Or rather
ignore the exception table altogether for non-SEA errors.


You mean to use the same exception type (EX_TYPE_KACCESS_ERR_ZERO) and
then do different processing on SEA errors and non-SEA errors, right?

If so, some instructions of copy_page() that were previously not in the
exception table will be added to it, and the original logic will
be affected.

For example, if an instruction is not in the exception table, the kernel
will panic when that instruction triggers a non-SEA error. If this
instruction is added to the exception table because of SEA processing,
and then a non-SEA error is triggered, should we fix it?

Thanks,
Tong.




My understanding is that the
exception table entry problem is fine. After all, the search is
performed only after a fault trigger. Code duplication can be solved by
extracting repeated logic to a public file.


If the new exception table entries are only taken into account for SEA
errors, why do we need a duplicate copy_mc_page() function generated?
Isn't the copy_page() and copy_mc_page() code identical (except for the
additional labels to jump to for the exception)?





[PATCH v6 0/6] ptrace: introduce PTRACE_SET_SYSCALL_INFO API

2025-02-17 Thread Dmitry V. Levin
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of
system calls the tracee is blocked in.

This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.

As in case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also
does not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments
for such system calls as pread64 and preadv.

The current implementation supports changing only those bits of system call
information that are used by strace system call tampering, namely, syscall
number, syscall arguments, and syscall return value.

Support of changing additional details returned by PTRACE_GET_SYSCALL_INFO,
such as instruction pointer and stack pointer, could be added later if
needed, by using struct ptrace_syscall_info.flags to specify the additional
details that should be set.  Currently, "flags" and "reserved" fields of
struct ptrace_syscall_info must be initialized with zeroes; "arch",
"instruction_pointer", and "stack_pointer" fields are currently ignored.

PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.

Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen.  The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was the apparent failure
to provide an API for changing the first system call argument on the riscv
architecture [1].

ptrace(2) man page:

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
   Modify information about the system call that caused the stop.
   The "data" argument is a pointer to struct ptrace_syscall_info
   that specifies the system call information to be set.
   The "addr" argument should be set to sizeof(struct ptrace_syscall_info)).

[1] https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055ee...@gmail.com/
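
As an illustration (not part of the series), a minimal tracer sketch of how
the new request could be combined with PTRACE_GET_SYSCALL_INFO at a
syscall-entry stop is shown below. It assumes uapi headers from a kernel
with these patches applied (PTRACE_SET_SYSCALL_INFO and the new "flags" and
"reserved" fields are not in released headers), that the tracee is already
stopped at a syscall-entry stop, and that "new_nr" is a valid syscall number
for the tracee's architecture; rewrite_syscall_nr() is a made-up helper and
error handling is minimal.

/*
 * Hypothetical tracer snippet: change the number of the system call the
 * tracee is about to enter.  Header include order may need care; in this
 * sketch <sys/ptrace.h> (for the ptrace() prototype) comes before
 * <linux/ptrace.h> (for struct ptrace_syscall_info and the request values).
 */
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ptrace.h>
#include <linux/ptrace.h>

static int rewrite_syscall_nr(pid_t pid, __u64 new_nr)
{
	struct ptrace_syscall_info info;

	memset(&info, 0, sizeof(info));

	/* Fetch the details of the current syscall-entry stop. */
	if (ptrace(PTRACE_GET_SYSCALL_INFO, pid, (void *) sizeof(info), &info) < 0)
		return -errno;
	if (info.op != PTRACE_SYSCALL_INFO_ENTRY)
		return -EINVAL;

	/*
	 * Change only the syscall number; the arguments fetched above are
	 * written back unchanged, and "flags"/"reserved" stay zero.
	 */
	info.entry.nr = new_nr;

	if (ptrace(PTRACE_SET_SYSCALL_INFO, pid, (void *) sizeof(info), &info) < 0)
		return -errno;
	return 0;
}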

Notes:
v6:
* mips: Submit mips_get_syscall_arg() o32 fix via mips tree
  to get it merged into v6.14-rc3
* Rebase to v6.14-rc3
* v5: https://lore.kernel.org/all/20250210113336.ga...@strace.io/

v5:
* ptrace: Extend the commit message to say that the new API does not aim
  to address numerous architecture-specific syscall ABI peculiarities
* selftests: Add a workaround for s390 16-bit syscall numbers
* Add more Acked-by
* v4: https://lore.kernel.org/all/20250203065849.ga14...@strace.io/

v4:
* Split out syscall_set_return_value() for hexagon into a separate patch
* s390: Change the style of syscall_set_arguments() implementation as
  requested
* Add more Reviewed-by
* v3: https://lore.kernel.org/all/20250128091445.ga8...@strace.io/

v3:
* powerpc: Submit syscall_set_return_value() fix for "sc" case separately
* mips: Do not introduce erroneous argument truncation on mips n32,
  add a detailed description to the commit message of the
  mips_get_syscall_arg() change
* ptrace: Add explicit padding to the end of struct ptrace_syscall_info,
  simplify obtaining of user ptrace_syscall_info,
  do not introduce PTRACE_SYSCALL_INFO_SIZE_VER0
* ptrace: Change the return type of ptrace_set_syscall_info_* functions
  from "unsigned long" to "int"
* ptrace: Add -ERANGE check to ptrace_set_syscall_info_exit(),
  add comments to -ERANGE checks
* ptrace: Update comments about supported syscall stops
* selftests: Extend set_syscall_info test, fix for mips n32
* Add Tested-by and Reviewed-by

v2:
* Add patch to fix syscall_set_return_value() on powerpc
* Add patch to fix mips_get_syscall_arg() on mips
* Add syscall_set_return_value() implementation on hexagon
* Add syscall_set_return_value() invocation to syscall_set_nr()
  on arm and arm64.
* Fix syscall_set_nr() and mips_set_syscall_arg() on mips
* Add a comment to syscall_set_nr() on arc, powerpc, s390, sh,
  and sparc
* Remove redundant ptrace_syscall_info.op assignments in
  ptrace_get_syscall_info_*
* Minor style tweaks in ptrace_get_syscall_info_op()
* Remove syscall_set_return_value() invocation from
  ptrace_set_syscall_info_entry()
* Skip syscall_set_arguments() invocation in case of syscall number -1
  in ptrace_set_syscall_info_entry() 
* Split ptrace_syscall_info.reserved into ptrace_syscall_info.reserved
  and ptrace_syscall_info.flags
* Use __kernel_ulong_t instead of unsigned long in set_syscall_info test

Dmitry V. Levin (6):
  hexagon: add syscall_set_return_value()
  syscall.h: add syscall_set_arguments()
  syscall.h: intro

[RFC PATCH v2 3/3] s390/topology: Add initial implementation for selection of parked CPUs

2025-02-17 Thread Tobias Huschle
At first, vertical low CPUs will always be parked. This will later
be adjusted by making the parked state dependent on the overall
utilization of the underlying hypervisor.

Vertical lows are always bound to the highest CPU IDs. This implies that
the three types of vertically polarized CPUs are always clustered by ID.
This has the following implications:
- There might be scheduler domains consisting of only vertical highs
- There might be scheduler domains consisting of only vertical lows

Signed-off-by: Tobias Huschle 
---
 arch/s390/include/asm/smp.h | 2 ++
 arch/s390/kernel/smp.c  | 5 +
 2 files changed, 7 insertions(+)

diff --git a/arch/s390/include/asm/smp.h b/arch/s390/include/asm/smp.h
index 7feca96c48c6..d4b65c5cebdc 100644
--- a/arch/s390/include/asm/smp.h
+++ b/arch/s390/include/asm/smp.h
@@ -13,6 +13,7 @@
 
 #define raw_smp_processor_id() (get_lowcore()->cpu_nr)
 #define arch_scale_cpu_capacity smp_cpu_get_capacity
+#define arch_cpu_parked smp_cpu_parked
 
 extern struct mutex smp_cpu_state_mutex;
 extern unsigned int smp_cpu_mt_shift;
@@ -38,6 +39,7 @@ extern int smp_cpu_get_polarization(int cpu);
 extern void smp_cpu_set_capacity(int cpu, unsigned long val);
 extern void smp_set_core_capacity(int cpu, unsigned long val);
 extern unsigned long smp_cpu_get_capacity(int cpu);
+extern bool smp_cpu_parked(int cpu);
 extern int smp_cpu_get_cpu_address(int cpu);
 extern void smp_fill_possible_mask(void);
 extern void smp_detect_cpus(void);
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 7b08399b0846..e65850cac02b 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -686,6 +686,11 @@ void smp_set_core_capacity(int cpu, unsigned long val)
smp_cpu_set_capacity(i, val);
 }
 
+bool smp_cpu_parked(int cpu)
+{
+   return smp_cpu_get_polarization(cpu) == POLARIZATION_VL;
+}
+
 int smp_cpu_get_cpu_address(int cpu)
 {
return per_cpu(pcpu_devices, cpu).address;
-- 
2.34.1




[RFC PATCH v2 1/3] sched/fair: introduce new scheduler group type group_parked

2025-02-17 Thread Tobias Huschle
A parked CPU is considered to be flagged as unsuitable to process
workload at the moment, but might become usable at any time, depending on
the need for additional computation power and/or the available capacity
of the underlying hardware.

A scheduler group is considered to be parked if there are tasks queued
on parked CPUs and there are no idle CPUs, i.e. all non-parked CPUs are
busy or there are only parked CPUs. A scheduler group with parked tasks
can be considered to not be parked if it has idle CPUs which can pick
up the parked tasks. One parked scheduler group is considered to be
busier than another if it runs more tasks on parked CPUs.

A parked CPU must keep its scheduler tick (or have it re-enabled if
necessary) in order to make sure that a parked CPU which only runs a
single task that does not give up its runtime voluntarily is still
evacuated, as it would otherwise go into NO_HZ.

The status of the underlying hardware must be considered to be
architecture dependent. Therefore the check whether a CPU is parked is
architecture specific. For architectures not relying on this feature,
the check is mostly a NOP.

This is more efficient and non-disruptive compared to CPU hotplug in
environments where such changes can be necessary on a frequent basis.

Signed-off-by: Tobias Huschle 
---
 include/linux/sched/topology.h | 19 
 kernel/sched/core.c| 13 -
 kernel/sched/fair.c| 86 +-
 kernel/sched/syscalls.c|  3 ++
 4 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f3dbafe1817..2a4730729988 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -265,6 +265,25 @@ unsigned long arch_scale_cpu_capacity(int cpu)
 }
 #endif
 
+#ifndef arch_cpu_parked
+/**
+ * arch_cpu_parked - Check if a given CPU is currently parked.
+ *
+ * A parked CPU cannot run any kind of workload since underlying
+ * physical CPU should not be used at the moment .
+ *
+ * @cpu: the CPU in question.
+ *
+ * By default assume CPU is not parked
+ *
+ * Return: Parked state of CPU
+ */
+static __always_inline bool arch_cpu_parked(int cpu)
+{
+   return false;
+}
+#endif
+
 #ifndef arch_scale_hw_pressure
 static __always_inline
 unsigned long arch_scale_hw_pressure(int cpu)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 165c90ba64ea..9ed15911ec60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1352,6 +1352,9 @@ bool sched_can_stop_tick(struct rq *rq)
if (rq->cfs.h_nr_queued > 1)
return false;
 
+   if (rq->cfs.nr_running > 0 && arch_cpu_parked(cpu_of(rq)))
+   return false;
+
/*
 * If there is one task and it has CFS runtime bandwidth constraints
 * and it's on the cpu now we don't want to stop the tick.
@@ -2443,7 +2446,7 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
 
/* Non kernel threads are not allowed during either online or offline. 
*/
if (!(p->flags & PF_KTHREAD))
-   return cpu_active(cpu);
+   return !arch_cpu_parked(cpu) && cpu_active(cpu);
 
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2453,6 +2456,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, 
int cpu)
if (cpu_dying(cpu))
return false;
 
+   /* CPU should be avoided at the moment */
+   if (arch_cpu_parked(cpu))
+   return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
 }
@@ -3930,6 +3937,10 @@ static inline bool ttwu_queue_cond(struct task_struct 
*p, int cpu)
if (task_on_scx(p))
return false;
 
+   /* The task should not be queued onto a parked CPU. */
+   if (arch_cpu_parked(cpu))
+   return false;
+
/*
 * Do not complicate things with the async wake_list while the CPU is
 * in hotplug state.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c0ef435a7aa..5eb1a3113704 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6871,6 +6871,8 @@ static int sched_idle_rq(struct rq *rq)
 #ifdef CONFIG_SMP
 static int sched_idle_cpu(int cpu)
 {
+   if (arch_cpu_parked(cpu))
+   return 0;
return sched_idle_rq(cpu_rq(cpu));
 }
 #endif
@@ -7399,6 +7401,9 @@ static int wake_affine(struct sched_domain *sd, struct 
task_struct *p,
 {
int target = nr_cpumask_bits;
 
+   if (arch_cpu_parked(target))
+   return prev_cpu;
+
if (sched_feat(WA_IDLE))
target = wake_affine_idle(this_cpu, prev_cpu, sync);
 
@@ -9182,7 +9187,12 @@ enum group_type {
 * The CPU is overloaded and can't provide expected CPU cycles to all
 * tasks.
 */
-   group_overloaded
+   group_overloaded,
+   /*
+  

[RFC PATCH v2 2/3] sched/fair: adapt scheduler group weight and capacity for parked CPUs

2025-02-17 Thread Tobias Huschle
Parked CPUs should not be considered to be available for computation.
This implies that they should also not contribute to the overall weight
of scheduler groups, as a large group of parked CPUs should not attempt
to process any tasks; hence, a small group of non-parked CPUs should be
considered to have a larger weight.
The same consideration holds true for the CPU capacities of such groups.
A group of parked CPUs should not be considered to have any capacity.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eb1a3113704..287c6648a41d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9913,6 +9913,8 @@ struct sg_lb_stats {
unsigned int sum_nr_running;/* Nr of all tasks running in 
the group */
unsigned int sum_h_nr_running;  /* Nr of CFS tasks running in 
the group */
unsigned int sum_nr_parked;
+   unsigned int parked_cpus;
+   unsigned int parked_capacity;
unsigned int idle_cpus; /* Nr of idle CPUs in 
the group */
unsigned int group_weight;
enum group_type group_type;
@@ -10369,6 +10371,8 @@ static inline void update_sg_lb_stats(struct lb_env 
*env,
*sg_overutilized = 1;
 
sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
+   sgs->parked_capacity += arch_cpu_parked(i) * capacity_of(i);
+   sgs->parked_cpus += arch_cpu_parked(i);
 
/*
 * No need to call idle_cpu() if nr_running is not 0
@@ -10406,9 +10410,11 @@ static inline void update_sg_lb_stats(struct lb_env 
*env,
}
}
 
-   sgs->group_capacity = group->sgc->capacity;
+   sgs->group_capacity = group->sgc->capacity - sgs->parked_capacity;
+   if (!sgs->group_capacity)
+   sgs->group_capacity = 1;
 
-   sgs->group_weight = group->group_weight;
+   sgs->group_weight = group->group_weight - sgs->parked_cpus;
 
/* Check if dst CPU is idle and preferred to this group */
if (!local_group && env->idle && sgs->sum_h_nr_running &&
@@ -10692,6 +10698,8 @@ static inline void update_sg_wakeup_stats(struct 
sched_domain *sd,
sgs->sum_nr_running += nr_running;
 
sgs->sum_nr_parked += arch_cpu_parked(i) * rq->cfs.h_nr_queued;
+   sgs->parked_capacity += arch_cpu_parked(i) * capacity_of(i);
+   sgs->parked_cpus += arch_cpu_parked(i);
 
/*
 * No need to call idle_cpu_without() if nr_running is not 0
@@ -10707,9 +10715,11 @@ static inline void update_sg_wakeup_stats(struct 
sched_domain *sd,
 
}
 
-   sgs->group_capacity = group->sgc->capacity;
+   sgs->group_capacity = group->sgc->capacity - sgs->parked_capacity;
+   if (!sgs->group_capacity)
+   sgs->group_capacity = 1;
 
-   sgs->group_weight = group->group_weight;
+   sgs->group_weight = group->group_weight - sgs->parked_cpus;
 
sgs->group_type = group_classify(sd->imbalance_pct, group, sgs);
 
-- 
2.34.1




[RFC PATCH v2 0/3] sched/fair: introduce new scheduler group type group_parked

2025-02-17 Thread Tobias Huschle
Changes to v1

parked vs idle
- parked CPUs are now never considered to be idle
- a scheduler group is now considered parked iff there are parked CPUs 
  and there are no idle CPUs, i.e. all non parked CPUs are busy or there
  are only parked CPUs. A scheduler group with parked tasks can be
  considered to not be parked, if it has idle CPUs which can pick up
  the parked tasks.
- idle_cpu_without always returns that the CPU will not be idle if the 
  CPU is parked

active balance, no_hz, queuing
- should_we_balance always returns true if a scheduler groups contains 
  a parked CPU and that CPU has a running task
- stopping the tick on parked CPUs is now prevented in sched_can_stop_tick
  if a task is running
- tasks are being prevented to be queued on parked CPUs in ttwu_queue_cond

cleanup
- removed duplicate checks for parked CPUs

CPU capacity
- added a patch which removes parked cpus and their capacity from 
  scheduler statistics


Original description:

Adding a new scheduler group type which allows removing all tasks
from certain CPUs through load balancing can help in scenarios where
such CPUs are currently unfavorable to use, for example in a
virtualized environment.

Functionally, this works as intended. The question is whether this
could be considered for inclusion and would be worth going forward
with. If so, which areas would need additional attention?
Some cases are referenced below.

The underlying concept and the approach of adding a new scheduler 
group type were presented in the Sched MC of the 2024 LPC.
A short summary:

Some architectures (e.g. s390) provide virtualization on a firmware
level. This implies, that Linux kernels running on such architectures
run on virtualized CPUs.

Like in other virtualized environments, the CPUs are most likely shared
with other guests on the hardware level. This implies, that Linux
kernels running in such an environment may encounter 'steal time'. In
other words, instead of being able to use all available time on a
physical CPU, some of said available time is 'stolen' by other guests.

This can cause side effects if a guest is interrupted at an unfavorable
point in time or if the guest is waiting for one of its other virtual 
CPUs to perform certain actions while those are suspended in favour of 
another guest.

Architectures, like arch/s390, address this issue by providing an
alternative classification for the CPUs seen by the Linux kernel.

The following example is arch/s390 specific:
In the default mode (horizontal CPU polarization), all CPUs are treated
equally and can be subject to steal time equally. 
In the alternate mode (vertical CPU polarization), the underlying
firmware hypervisor assigns the CPUs, visible to the guest, different
types, depending on how many CPUs the guest is entitled to use. Said
entitlement is configured by assigning weights to all active guests.
The three CPU types are:
- vertical high   : On these CPUs, the guest always has the highest
priority over other guests. This means
especially that if the guest executes tasks on
these CPUs, it will encounter no steal time.
- vertical medium : These CPUs are meant to cover fractions of
entitlement.
- vertical low: These CPUs will have no priority when being
scheduled. This implies especially that while
all other guests are using their full
entitlement, these CPUs might not be run for a
significant amount of time.

As a consequence, using vertical lows while the underlying hypervisor
experiences a high load, driven by all defined guests, is to be avoided.

In order to consequently move tasks off of vertical lows, introduce a
new type of scheduler groups: group_parked.
Parked implies that processes should be evacuated as fast as possible
from these CPUs. This implies that other CPUs should start pulling tasks
immediately, while the parked CPUs should refuse to pull any tasks
themselves.
Adding a group type beyond group_overloaded achieves the expected
behavior. By making its selection architecture dependent, it has
no effect on architectures which will not make use of that group type.

This approach works very well for many kinds of workloads. Tasks are
getting migrated back and forth in line with changing the parked
state of the involved CPUs.

There are a couple of issues and corner cases which need further
considerations:
- rt & dl:  Realtime and deadline scheduling require some additional 
attention. 
- ext:  Probably affected as well. Needs some conceptional
thoughts first.
- raciness: Right now, there are no synchronization efforts. It needs
to be considered whether those might be necessary or if
it is alright that the parked-state of a CPU might change
during load-balancing.

[GIT PULL] Please pull powerpc/linux.git powerpc-6.14-3 tag

2025-02-17 Thread Madhavan Srinivasan
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Hi Linus,

Please pull a couple of powerpc fixes for 6.14:

The following changes since commit a64dcfb451e254085a7daee5fe51bf22959d52d3:

  Linux 6.14-rc2 (2025-02-09 12:45:03 -0800)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-6.14-3

for you to fetch changes up to d262a192d38e527faa5984629aabda2e0d1c4f54:

  powerpc/code-patching: Fix KASAN hit by not flagging text patching area as 
VM_ALLOC (2025-02-12 14:38:13 +0530)

- --
powerpc fixes for 6.14 #3

 - Couple of patches to fix KASAN fail during boot
 - Fix to avoid warnings/errors when building with 4k page size

Thanks to: Christophe Leroy, Ritesh Harjani (IBM), Erhard Furtner

- --
Christophe Leroy (3):
  powerpc/code-patching: Disable KASAN report during patching via temporary 
mm
  powerpc/64s: Rewrite __real_pte() and __rpte_to_hidx() as static inline
  powerpc/code-patching: Fix KASAN hit by not flagging text patching area 
as VM_ALLOC


 arch/powerpc/include/asm/book3s/64/hash-4k.h | 12 ++--
 arch/powerpc/lib/code-patching.c |  4 +++-
 2 files changed, 13 insertions(+), 3 deletions(-)
-BEGIN PGP SIGNATURE-

iQIzBAEBCAAdFiEEqX2DNAOgU8sBX3pRpnEsdPSHZJQFAmezCewACgkQpnEsdPSH
ZJR/SA//eL0vRKOOGIHZ1g/uvV4D0HbtsJUObG97tpZXoNirTypba1/qMRxyqghu
d4GKKesibMexUPEICxHyy6hJb7V5cfVTWCqOy1CZg2jVs3QVclxVifHJDcW4oW2D
yCoaT23cMjht47QKXSgmQTqUHgKhLzyb575iQfx8EMDUXMT8UEsXF7GekhISGPNq
JySzN2j4/1229gYni22ta24lzxWwfSZX8xNLrDQ8JuAa0+JqCA4Yh6PM2WMohmYj
Y+5GQIMz7UpuPkdfdcsjmg/pyyGI/dC0ZAof/x3nHkn1rZfJ+T/HSeV2Zt2Aq8mq
o/qb+KtA+H+8J97158pxQQ24loJ5AYmYy5qwV20DRJCJrU5VFMxow46ZBzl1c6sw
FZ/TO/Vjc8keAnJQgRaW6cr4a7ojTHxuj25h0etH+c2BjSu0XZhiLJYlqFW5pYIA
rCPiEk7IkAEz37jQpGREaxNxFiolHcKkl4A8+Tr0YC3Nr4lrSqcqsjpWA7pGhYTR
zikZUMPkqYJYJRXPeOrQreLiuZaZwRY9EpwKVMI8TCNhhSkFRSneQNVMcdaKM95T
R4xuOwgDfhVjyv+XMwqMHxuCyk7M8fZxK2ikslB23s2LmYWoB0rePG0MRuIcUk7L
hlboeCTGwmbt1E48APU3PtAoK5NedCoUcbA3Zb/gNRCXioeUFSg=
=cSvh
-END PGP SIGNATURE-



Re: [PATCH v2 12/12] dt-bindings: mtd: raw-nand-chip: Relax node name pattern

2025-02-17 Thread J . Neuschäfer
On Mon, Feb 17, 2025 at 10:31:08AM +0100, Miquel Raynal wrote:
> Hello,
> 
> >> > In some scenarios, such as under the Freescale eLBC bus, there are raw
> >> > NAND chips with a unit address that has a comma in it (cs,offset).
> >> > Relax the $nodename pattern in raw-nand-chip.yaml to allow such unit
> >> > addresses.
> >> 
> >> This is super specific to this controller, I'd rather avoid that in the
> >> main (shared) files. I believe you can force another node name in the
> >> controller's binding instead?
> >
> > It's a bit tricky. AFAICS, when I declare a node name pattern in my
> > specific binding in addition to the generic binding, the result is that
> > both of them apply, so I can't relax stricter requirements:
> >
> > # raw-nand-chip.yaml
> > properties:
> >   $nodename:
> > pattern: "^nand@[a-f0-9]$"
> >
> > # fsl,elbc-fcm-nand.yaml
> > properties:
> >   $nodename:
> > pattern: "^nand@[a-f0-9](,[0-9a-f]*)?$"
> 
> Well, I guess this is creating a second possible node name.
> 
> > # dtc
> > /.../fsl,elbc-fcm-nand.example.dtb:
> > nand@1,0: $nodename:0: 'nand@1,0' does not match '^nand@[a-f0-9]$'
> > from schema $id:
> > http://devicetree.org/schemas/mtd/fsl,elbc-fcm-nand.yaml#
> 
> What about fixing the DT instead?

In this particular context under the Freescale eLBC ("enhanced Local Bus
Controller"), nand@1,0 makes complete sense, because it refers to chip
select 1, offset 0. The eLBC binding (which has existed without YAML
formalization for a long time) specifies that each device address
includes a chip select and a base address under that CS.

The alternative of spelling it as nand@1 makes readability
strictly worse (IMO).

Due to the conflicting requirements of keeping compatibility with
historic device trees and complying with modern DT conventions,
I'm already ignoring a validation warning from dtc, which suggests to
use nand@1 instead of nand@1,0 because the eLBC bus has
historically been specified with compatible = ..., "simple-bus",
so I guess the fsl,elbc-fcm-nand binding can't be perfect anyway.

In any case, I'll drop this patch during further development.


Thank you for your inputs,

J. Neuschäfer



Re: [PATCH v2 12/12] dt-bindings: mtd: raw-nand-chip: Relax node name pattern

2025-02-17 Thread Miquel Raynal
Hello,

>> > In some scenarios, such as under the Freescale eLBC bus, there are raw
>> > NAND chips with a unit address that has a comma in it (cs,offset).
>> > Relax the $nodename pattern in raw-nand-chip.yaml to allow such unit
>> > addresses.
>> 
>> This is super specific to this controller, I'd rather avoid that in the
>> main (shared) files. I believe you can force another node name in the
>> controller's binding instead?
>
> It's a bit tricky. AFAICS, when I declare a node name pattern in my
> specific binding in addition to the generic binding, the result is that
> both of them apply, so I can't relax stricter requirements:
>
> # raw-nand-chip.yaml
> properties:
>   $nodename:
> pattern: "^nand@[a-f0-9]$"
>
> # fsl,elbc-fcm-nand.yaml
> properties:
>   $nodename:
> pattern: "^nand@[a-f0-9](,[0-9a-f]*)?$"

Well, I guess this is creating a second possible node name.

> # dtc
> /.../fsl,elbc-fcm-nand.example.dtb:
> nand@1,0: $nodename:0: 'nand@1,0' does not match '^nand@[a-f0-9]$'
> from schema $id:
>   http://devicetree.org/schemas/mtd/fsl,elbc-fcm-nand.yaml#

What about fixing the DT instead?

> (I changed the second pattern to nand-fail@... and dtc warned about it
>  mismatching too.)
>
> Perhaps I'm missing a DT-schema trick to override a value/pattern.
>
> Alternatively (pending discussion on patch 11/12), I might end up not
> referencing raw-nand-chip.yaml.

Ok.

Thanks,
Miquèl



[PATCH v6 2/6] syscall.h: add syscall_set_arguments()

2025-02-17 Thread Dmitry V. Levin
This function is going to be needed on all HAVE_ARCH_TRACEHOOK
architectures to implement PTRACE_SET_SYSCALL_INFO API.

This partially reverts commit 7962c2eddbfe ("arch: remove unused
function syscall_set_arguments()") by reusing some of old
syscall_set_arguments() implementations.

Signed-off-by: Dmitry V. Levin 
Tested-by: Charlie Jenkins 
Reviewed-by: Charlie Jenkins 
Acked-by: Helge Deller  # parisc
---
 arch/arc/include/asm/syscall.h| 14 +++
 arch/arm/include/asm/syscall.h| 13 ++
 arch/arm64/include/asm/syscall.h  | 13 ++
 arch/csky/include/asm/syscall.h   | 13 ++
 arch/hexagon/include/asm/syscall.h|  7 ++
 arch/loongarch/include/asm/syscall.h  |  8 ++
 arch/mips/include/asm/syscall.h   | 32 
 arch/nios2/include/asm/syscall.h  | 11 
 arch/openrisc/include/asm/syscall.h   |  7 ++
 arch/parisc/include/asm/syscall.h | 12 +
 arch/powerpc/include/asm/syscall.h| 10 
 arch/riscv/include/asm/syscall.h  |  9 +++
 arch/s390/include/asm/syscall.h   |  9 +++
 arch/sh/include/asm/syscall_32.h  | 12 +
 arch/sparc/include/asm/syscall.h  | 10 
 arch/um/include/asm/syscall-generic.h | 14 +++
 arch/x86/include/asm/syscall.h| 36 +++
 arch/xtensa/include/asm/syscall.h | 11 
 include/asm-generic/syscall.h | 16 
 19 files changed, 257 insertions(+)

diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 9709256e31c8..89c1e1736356 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -67,6 +67,20 @@ syscall_get_arguments(struct task_struct *task, struct 
pt_regs *regs,
}
 }
 
+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+ unsigned long *args)
+{
+   unsigned long *inside_ptregs = ®s->r0;
+   unsigned int n = 6;
+   unsigned int i = 0;
+
+   while (n--) {
+   *inside_ptregs = args[i++];
+   inside_ptregs--;
+   }
+}
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index fe4326d938c1..21927fa0ae2b 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -80,6 +80,19 @@ static inline void syscall_get_arguments(struct task_struct 
*task,
memcpy(args, ®s->ARM_r0 + 1, 5 * sizeof(args[0]));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+struct pt_regs *regs,
+const unsigned long *args)
+{
+   memcpy(®s->ARM_r0, args, 6 * sizeof(args[0]));
+   /*
+* Also copy the first argument into ARM_ORIG_r0
+* so that syscall_get_arguments() would return it
+* instead of the previous value.
+*/
+   regs->ARM_ORIG_r0 = regs->ARM_r0;
+}
+
 static inline int syscall_get_arch(struct task_struct *task)
 {
/* ARM tasks don't change audit architectures on the fly. */
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index ab8e14b96f68..76020b66286b 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -73,6 +73,19 @@ static inline void syscall_get_arguments(struct task_struct *task,
memcpy(args, &regs->regs[1], 5 * sizeof(args[0]));
 }
 
+static inline void syscall_set_arguments(struct task_struct *task,
+struct pt_regs *regs,
+const unsigned long *args)
+{
+   memcpy(&regs->regs[0], args, 6 * sizeof(args[0]));
+   /*
+* Also copy the first argument into orig_x0
+* so that syscall_get_arguments() would return it
+* instead of the previous value.
+*/
+   regs->orig_x0 = regs->regs[0];
+}
+
 /*
  * We don't care about endianness (__AUDIT_ARCH_LE bit) here because
  * AArch64 has the same system calls both on little- and big- endian.
diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
index 0de5734950bf..30403f7a0487 100644
--- a/arch/csky/include/asm/syscall.h
+++ b/arch/csky/include/asm/syscall.h
@@ -59,6 +59,19 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
memcpy(args, &regs->a1, 5 * sizeof(args[0]));
 }
 
+static inline void
+syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
+ const unsigned long *args)
+{
+   memcpy(&regs->a0, args, 6 * sizeof(regs->a0));
+   /*
+* Also copy the first argument into orig_x0
+* so that syscall_get_arguments() would return it
+* instead of the previous value.
+*/
+   regs->orig_a0 = regs->a0;
+}
+
 static inline int
 syscall_get_arch(struct task_struct *task)
 {
diff --git a/arch/hexagon/inc

[PATCH v6 3/6] syscall.h: introduce syscall_set_nr()

2025-02-17 Thread Dmitry V. Levin
Similar to syscall_set_arguments() that complements
syscall_get_arguments(), introduce syscall_set_nr()
that complements syscall_get_nr().

syscall_set_nr() is going to be needed along with
syscall_set_arguments() on all HAVE_ARCH_TRACEHOOK
architectures to implement the PTRACE_SET_SYSCALL_INFO API.

Signed-off-by: Dmitry V. Levin 
Tested-by: Charlie Jenkins 
Reviewed-by: Charlie Jenkins 
Acked-by: Helge Deller  # parisc
---
 arch/arc/include/asm/syscall.h| 11 +++
 arch/arm/include/asm/syscall.h| 24 
 arch/arm64/include/asm/syscall.h  | 16 
 arch/hexagon/include/asm/syscall.h|  7 +++
 arch/loongarch/include/asm/syscall.h  |  7 +++
 arch/m68k/include/asm/syscall.h   |  7 +++
 arch/microblaze/include/asm/syscall.h |  7 +++
 arch/mips/include/asm/syscall.h   | 14 ++
 arch/nios2/include/asm/syscall.h  |  5 +
 arch/openrisc/include/asm/syscall.h   |  6 ++
 arch/parisc/include/asm/syscall.h |  7 +++
 arch/powerpc/include/asm/syscall.h| 10 ++
 arch/riscv/include/asm/syscall.h  |  7 +++
 arch/s390/include/asm/syscall.h   | 12 
 arch/sh/include/asm/syscall_32.h  | 12 
 arch/sparc/include/asm/syscall.h  | 12 
 arch/um/include/asm/syscall-generic.h |  5 +
 arch/x86/include/asm/syscall.h|  7 +++
 arch/xtensa/include/asm/syscall.h |  7 +++
 include/asm-generic/syscall.h | 14 ++
 20 files changed, 197 insertions(+)

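Again as a sketch of mine rather than code from this series, the nr == -1
convention documented in the per-arch comments below could be used like
this to cancel a syscall the tracee is stopped at (the helper name
cancel_traced_syscall() is made up):

#include <linux/ptrace.h>
#include <linux/sched.h>
#include <asm/processor.h>
#include <asm/syscall.h>

/*
 * Hypothetical helper: make the stopped tracee skip the syscall it was
 * about to enter.  Setting the number to -1 skips the call, and the
 * arch implementations below also force an -ENOSYS return value where
 * that is not implicit.
 */
static void cancel_traced_syscall(struct task_struct *child)
{
	struct pt_regs *regs = task_pt_regs(child);

	syscall_set_nr(child, regs, -1);
}
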
diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 89c1e1736356..728d625a10f1 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -23,6 +23,17 @@ syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
return -1;
 }
 
+static inline void
+syscall_set_nr(struct task_struct *task, struct pt_regs *regs, int nr)
+{
+   /*
+* Unlike syscall_get_nr(), syscall_set_nr() can be called only when
+* the target task is stopped for tracing on entering syscall, so
+* there is no need to have the same check syscall_get_nr() has.
+*/
+   regs->r8 = nr;
+}
+
 static inline void
 syscall_rollback(struct task_struct *task, struct pt_regs *regs)
 {
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index 21927fa0ae2b..18b102a30741 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -68,6 +68,30 @@ static inline void syscall_set_return_value(struct task_struct *task,
regs->ARM_r0 = (long) error ? error : val;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+ struct pt_regs *regs,
+ int nr)
+{
+   if (nr == -1) {
+   task_thread_info(task)->abi_syscall = -1;
+   /*
+* When the syscall number is set to -1, the syscall will be
+* skipped.  In this case the syscall return value has to be
+* set explicitly, otherwise the first syscall argument is
+* returned as the syscall return value.
+*/
+   syscall_set_return_value(task, regs, -ENOSYS, 0);
+   return;
+   }
+   if ((IS_ENABLED(CONFIG_AEABI) && !IS_ENABLED(CONFIG_OABI_COMPAT))) {
+   task_thread_info(task)->abi_syscall = nr;
+   return;
+   }
+   task_thread_info(task)->abi_syscall =
+   (task_thread_info(task)->abi_syscall & ~__NR_SYSCALL_MASK) |
+   (nr & __NR_SYSCALL_MASK);
+}
+
 #define SYSCALL_MAX_ARGS 7
 
 static inline void syscall_get_arguments(struct task_struct *task,
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index 76020b66286b..712daa90e643 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -61,6 +61,22 @@ static inline void syscall_set_return_value(struct task_struct *task,
regs->regs[0] = val;
 }
 
+static inline void syscall_set_nr(struct task_struct *task,
+ struct pt_regs *regs,
+ int nr)
+{
+   regs->syscallno = nr;
+   if (nr == -1) {
+   /*
+* When the syscall number is set to -1, the syscall will be
+* skipped.  In this case the syscall return value has to be
+* set explicitly, otherwise the first syscall argument is
+* returned as the syscall return value.
+*/
+   syscall_set_return_value(task, regs, -ENOSYS, 0);
+   }
+}
+
 #define SYSCALL_MAX_ARGS 6
 
 static inline void syscall_get_arguments(struct task_struct *task,
diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index 1024a6548d78..70637261817a 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/in