date:20240610

[RFC PATCH v2 0/5] powerpc/ftrace: Move ftrace sequence out of line

2024-06-10 Thread Naveen N Rao

This is v2 of the series posted here:
http://lkml.kernel.org/r/cover.1702045299.git.nav...@kernel.org

Since v2, the primary change is that the entire ftrace sequence is moved 
out of line and this is now restricted to 64-bit powerpc by default.  
Patch 5 has the details.

I have dropped patches to enable DYNAMIC_FTRACE_WITH_CALL_OPS and ftrace 
direct support so that this approach can be finalized.

This series depends on Benjamin Gray's series adding support for 
patch_ulong():
http://lkml.kernel.org/r/20240515024445.236364-1-bg...@linux.ibm.com


Appreciate feedback on the approach.


Thanks,
Naveen



Naveen N Rao (5):
  powerpc/kprobes: Use ftrace to determine if a probe is at function
entry
  powerpc/ftrace: Remove pointer to struct module from dyn_arch_ftrace
  powerpc/ftrace: Unify 32-bit and 64-bit ftrace entry code
  kbuild: Add generic hook for architectures to use before the final
vmlinux link
  powerpc64/ftrace: Move ftrace sequence out of line

 arch/Kconfig |   3 +
 arch/powerpc/Kconfig |   4 +
 arch/powerpc/Makefile|   4 +
 arch/powerpc/include/asm/ftrace.h|  11 +-
 arch/powerpc/include/asm/module.h|   5 +
 arch/powerpc/kernel/asm-offsets.c|   4 +
 arch/powerpc/kernel/kprobes.c|  18 +--
 arch/powerpc/kernel/module_64.c  |  67 +++-
 arch/powerpc/kernel/trace/ftrace.c   | 196 ---
 arch/powerpc/kernel/trace/ftrace_64_pg.c |  73 -
 arch/powerpc/kernel/trace/ftrace_entry.S |  75 ++---
 arch/powerpc/kernel/vmlinux.lds.S|   3 +-
 arch/powerpc/tools/vmlinux_o.sh  |  47 ++
 scripts/link-vmlinux.sh  |  18 ++-
 14 files changed, 419 insertions(+), 109 deletions(-)
 create mode 100755 arch/powerpc/tools/vmlinux_o.sh


base-commit: 2c644f2847c188b4fa545e602e4a1d4db55e8c8d
prerequisite-patch-id: a1d50e589288239d5a8b1c1f354cd4737057c9d3
prerequisite-patch-id: da4142d56880861bd0ad7ad7087c9e2670a2ee54
prerequisite-patch-id: 609d292e054b2396b603890522a940fa0bdfb6d8
prerequisite-patch-id: 6f7213fb77b1260defbf43be0e47bff9c80054cc
prerequisite-patch-id: ad3b71bf071ae4ba1bee5b087e61a2055772a74f
-- 
2.45.2

[RFC PATCH v2 2/5] powerpc/ftrace: Remove pointer to struct module from dyn_arch_ftrace

2024-06-10 Thread Naveen N Rao

Pointer to struct module is only relevant for ftrace records belonging
to kernel modules. Having this field in dyn_arch_ftrace wastes memory
for all ftrace records belonging to the kernel. Remove the same in
favour of looking up the module from the ftrace record address, similar
to other architectures.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/include/asm/ftrace.h|  1 -
 arch/powerpc/kernel/trace/ftrace.c   | 47 ++-
 arch/powerpc/kernel/trace/ftrace_64_pg.c | 73 +++-
 3 files changed, 64 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 107fc5a48456..201f9d15430a 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -26,7 +26,6 @@ unsigned long prepare_ftrace_return(unsigned long parent, 
unsigned long ip,
 struct module;
 struct dyn_ftrace;
 struct dyn_arch_ftrace {
-   struct module *mod;
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index d8d6b4fd9a14..041be965485e 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -106,20 +106,36 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
return 0;
 }
 
+static struct module *ftrace_lookup_module(struct dyn_ftrace *rec)
+{
+   struct module *mod = NULL;
+
+#ifdef CONFIG_MODULES
+   /*
+* NOTE: __module_text_address() must be called with preemption
+* disabled, but we can rely on ftrace_lock to ensure that 'mod'
+* retains its validity throughout the remainder of this code.
+   */
+   preempt_disable();
+   mod = __module_text_address(rec->ip);
+   preempt_enable();
+
+   if (!mod)
+   pr_err("No module loaded at addr=%lx\n", rec->ip);
+#endif
+
+   return mod;
+}
+
 static int ftrace_get_call_inst(struct dyn_ftrace *rec, unsigned long addr, 
ppc_inst_t *call_inst)
 {
unsigned long ip = rec->ip;
unsigned long stub;
+   struct module *mod;
 
if (is_offset_in_branch_range(addr - ip)) {
/* Within range */
stub = addr;
-#ifdef CONFIG_MODULES
-   } else if (rec->arch.mod) {
-   /* Module code would be going to one of the module stubs */
-   stub = (addr == (unsigned long)ftrace_caller ? 
rec->arch.mod->arch.tramp :
-  
rec->arch.mod->arch.tramp_regs);
-#endif
} else if (core_kernel_text(ip)) {
/* We would be branching to one of our ftrace stubs */
stub = find_ftrace_tramp(ip);
@@ -128,7 +144,16 @@ static int ftrace_get_call_inst(struct dyn_ftrace *rec, 
unsigned long addr, ppc_
return -EINVAL;
}
} else {
-   return -EINVAL;
+   mod = ftrace_lookup_module(rec);
+   if (mod) {
+#ifdef CONFIG_MODULES
+   /* Module code would be going to one of the module 
stubs */
+   stub = (addr == (unsigned long)ftrace_caller ? 
mod->arch.tramp :
+  
mod->arch.tramp_regs);
+#endif
+   } else {
+   return -EINVAL;
+   }
}
 
*call_inst = ftrace_create_branch_inst(ip, stub, 1);
@@ -256,14 +281,6 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace 
*rec)
if (ret)
return ret;
 
-   if (!core_kernel_text(ip)) {
-   if (!mod) {
-   pr_err("0x%lx: No module provided for non-kernel 
address\n", ip);
-   return -EFAULT;
-   }
-   rec->arch.mod = mod;
-   }
-
/* Nop-out the ftrace location */
new = ppc_inst(PPC_RAW_NOP());
addr = MCOUNT_ADDR;
diff --git a/arch/powerpc/kernel/trace/ftrace_64_pg.c 
b/arch/powerpc/kernel/trace/ftrace_64_pg.c
index 12fab1803bcf..45bd8dcf9886 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_pg.c
+++ b/arch/powerpc/kernel/trace/ftrace_64_pg.c
@@ -116,6 +116,24 @@ static unsigned long find_bl_target(unsigned long ip, 
ppc_inst_t op)
 }
 
 #ifdef CONFIG_MODULES
+static struct module *ftrace_lookup_module(struct dyn_ftrace *rec)
+{
+   struct module *mod;
+   /*
+* NOTE: __module_text_address() must be called with preemption
+* disabled, but we can rely on ftrace_lock to ensure that 'mod'
+* retains its validity throughout the remainder of this code.
+   */
+   preempt_disable();
+   mod = __module_text_address(rec->ip);
+   preempt_enable();
+
+   if (!mod)
+   pr_err("No module loaded at addr=%lx\n", rec->ip);
+
+   return mod;
+}
+
 static int
 __ftrace_make_nop(struct module *mod,
  struct dyn_ftrace *rec, unsigned long addr)
@@ -124,6 +142,12 @@

[RFC PATCH v2 3/5] powerpc/ftrace: Unify 32-bit and 64-bit ftrace entry code

2024-06-10 Thread Naveen N Rao

On 32-bit powerpc, gcc generates a three instruction sequence for
function profiling:
mflrr0
stw r0, 4(r1)
bl  _mcount

On kernel boot, the call to _mcount() is nop-ed out, to be patched back
in when ftrace is actually enabled. The 'stw' instruction therefore is
not necessary unless ftrace is enabled. Nop it out during ftrace init.

When ftrace is enabled, we want the 'stw' so that stack unwinding works
properly. Perform the same within the ftrace handler, similar to 64-bit
powerpc.

For 64-bit powerpc, early versions of gcc used to emit a three
instruction sequence for function profiling (with -mprofile-kernel) with
a 'std' instruction to mimic the 'stw' above. Address that scenario also
by nop-ing out the 'std' instruction during ftrace init.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/trace/ftrace.c   | 6 --
 arch/powerpc/kernel/trace/ftrace_entry.S | 4 ++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 041be965485e..2e1667a578ff 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -266,13 +266,15 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace 
*rec)
/* Expected sequence: 'mflr r0', 'stw r0,4(r1)', 'bl _mcount' */
ret = ftrace_validate_inst(ip - 8, ppc_inst(PPC_RAW_MFLR(_R0)));
if (!ret)
-   ret = ftrace_validate_inst(ip - 4, 
ppc_inst(PPC_RAW_STW(_R0, _R1, 4)));
+   ret = ftrace_modify_code(ip - 4, 
ppc_inst(PPC_RAW_STW(_R0, _R1, 4)),
+ppc_inst(PPC_RAW_NOP()));
} else if (IS_ENABLED(CONFIG_MPROFILE_KERNEL)) {
/* Expected sequence: 'mflr r0', ['std r0,16(r1)'], 'bl 
_mcount' */
ret = ftrace_read_inst(ip - 4, &old);
if (!ret && !ppc_inst_equal(old, ppc_inst(PPC_RAW_MFLR(_R0 {
ret = ftrace_validate_inst(ip - 8, 
ppc_inst(PPC_RAW_MFLR(_R0)));
-   ret |= ftrace_validate_inst(ip - 4, 
ppc_inst(PPC_RAW_STD(_R0, _R1, 16)));
+   ret |= ftrace_modify_code(ip - 4, 
ppc_inst(PPC_RAW_STD(_R0, _R1, 16)),
+ ppc_inst(PPC_RAW_NOP()));
}
} else {
return -EINVAL;
diff --git a/arch/powerpc/kernel/trace/ftrace_entry.S 
b/arch/powerpc/kernel/trace/ftrace_entry.S
index 76dbe9fd2c0f..244a1c7bb1e8 100644
--- a/arch/powerpc/kernel/trace/ftrace_entry.S
+++ b/arch/powerpc/kernel/trace/ftrace_entry.S
@@ -33,6 +33,8 @@
  * and then arrange for the ftrace function to be called.
  */
 .macro ftrace_regs_entry allregs
+   /* Save the original return address in A's stack frame */
+   PPC_STL r0, LRSAVE(r1)
/* Create a minimal stack frame for representing B */
PPC_STLUr1, -STACK_FRAME_MIN_SIZE(r1)
 
@@ -44,8 +46,6 @@
SAVE_GPRS(3, 10, r1)
 
 #ifdef CONFIG_PPC64
-   /* Save the original return address in A's stack frame */
-   std r0, LRSAVE+SWITCH_FRAME_SIZE+STACK_FRAME_MIN_SIZE(r1)
/* Ok to continue? */
lbz r3, PACA_FTRACE_ENABLED(r13)
cmpdi   r3, 0
-- 
2.45.2

[RFC PATCH v2 4/5] kbuild: Add generic hook for architectures to use before the final vmlinux link

2024-06-10 Thread Naveen N Rao

On powerpc, we would like to be able to make a pass on vmlinux.o and
generate a new object file to be linked into vmlinux. Add a generic pass
in link-vmlinux.sh that architectures can use for this purpose.
Architectures need to select CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX and must
provide arch//tools/vmlinux_o.sh, which will be invoked prior to
the final vmlinux link step.

Signed-off-by: Naveen N Rao 
---
 arch/Kconfig|  3 +++
 scripts/link-vmlinux.sh | 18 +++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..649f0903e7ef 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1643,4 +1643,7 @@ config CC_HAS_SANE_FUNCTION_ALIGNMENT
 config ARCH_NEED_CMPXCHG_1_EMU
bool
 
+config ARCH_WANTS_PRE_LINK_VMLINUX
+   def_bool n
+
 endmenu
diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
index 46ce5d04dbeb..07f70e105d82 100755
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -122,7 +122,7 @@ gen_btf()
return 1
fi
 
-   vmlinux_link ${1}
+   vmlinux_link ${1} ${arch_vmlinux_o}
 
info "BTF" ${2}
LLVM_OBJCOPY="${OBJCOPY}" ${PAHOLE} -J ${PAHOLE_FLAGS} ${1}
@@ -178,7 +178,7 @@ kallsyms_step()
kallsymso=${kallsyms_vmlinux}.o
kallsyms_S=${kallsyms_vmlinux}.S
 
-   vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" 
${btf_vmlinux_bin_o}
+   vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" 
${btf_vmlinux_bin_o} ${arch_vmlinux_o}
mksysmap ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms
kallsyms ${kallsyms_vmlinux}.syms ${kallsyms_S}
 
@@ -203,6 +203,7 @@ sorttable()
 
 cleanup()
 {
+   rm -f .arch.vmlinux.*
rm -f .btf.*
rm -f System.map
rm -f vmlinux
@@ -223,6 +224,17 @@ fi
 
 ${MAKE} -f "${srctree}/scripts/Makefile.build" obj=init 
init/version-timestamp.o
 
+arch_vmlinux_o=""
+if is_enabled CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX; then
+   arch_vmlinux_o=.arch.vmlinux.o
+   info "ARCH" ${arch_vmlinux_o}
+   if ! ${srctree}/arch/${SRCARCH}/tools/vmlinux_o.sh ${arch_vmlinux_o} ; 
then
+   echo >&2 "Failed to generate ${arch_vmlinux_o}"
+   echo >&2 "Try to disable CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX"
+   exit 1
+   fi
+fi
+
 btf_vmlinux_bin_o=""
 if is_enabled CONFIG_DEBUG_INFO_BTF; then
btf_vmlinux_bin_o=.btf.vmlinux.bin.o
@@ -273,7 +285,7 @@ if is_enabled CONFIG_KALLSYMS; then
fi
 fi
 
-vmlinux_link vmlinux "${kallsymso}" ${btf_vmlinux_bin_o}
+vmlinux_link vmlinux "${kallsymso}" ${btf_vmlinux_bin_o} ${arch_vmlinux_o}
 
 # fill in BTF IDs
 if is_enabled CONFIG_DEBUG_INFO_BTF && is_enabled CONFIG_BPF; then
-- 
2.45.2

[RFC PATCH v2 1/5] powerpc/kprobes: Use ftrace to determine if a probe is at function entry

2024-06-10 Thread Naveen N Rao

Rather than hard-coding the offset into a function to be used to
determine if a kprobe is at function entry, use ftrace_location() to
determine the ftrace location within the function and categorize all
instructions till that offset to be function entry.

For functions that cannot be traced, we fall back to using a fixed
offset of 8 (two instructions) to categorize a probe as being at
function entry for 64-bit elfv2, unless we are using pcrel.

Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Naveen N Rao 
---
 arch/powerpc/kernel/kprobes.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 14c5ddec3056..ca204f4f21c1 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -105,24 +105,22 @@ kprobe_opcode_t *kprobe_lookup_name(const char *name, 
unsigned int offset)
return addr;
 }
 
-static bool arch_kprobe_on_func_entry(unsigned long offset)
+static bool arch_kprobe_on_func_entry(unsigned long addr, unsigned long offset)
 {
-#ifdef CONFIG_PPC64_ELF_ABI_V2
-#ifdef CONFIG_KPROBES_ON_FTRACE
-   return offset <= 16;
-#else
-   return offset <= 8;
-#endif
-#else
+   unsigned long ip = ftrace_location(addr);
+
+   if (ip)
+   return offset <= (ip - addr);
+   if (IS_ENABLED(CONFIG_PPC64_ELF_ABI_V2) && 
!IS_ENABLED(CONFIG_PPC_KERNEL_PCREL))
+   return offset <= 8;
return !offset;
-#endif
 }
 
 /* XXX try and fold the magic of kprobe_lookup_name() in this */
 kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
offset,
 bool *on_func_entry)
 {
-   *on_func_entry = arch_kprobe_on_func_entry(offset);
+   *on_func_entry = arch_kprobe_on_func_entry(addr, offset);
return (kprobe_opcode_t *)(addr + offset);
 }
 
-- 
2.45.2

[RFC PATCH v2 5/5] powerpc64/ftrace: Move ftrace sequence out of line

2024-06-10 Thread Naveen N Rao

Function profile sequence on powerpc includes two instructions at the
beginning of each function:
mflrr0
bl  ftrace_caller

The call to ftrace_caller() gets nop'ed out during kernel boot and is
patched in when ftrace is enabled.

Given the sequence, we cannot return from ftrace_caller with 'blr' as we
need to keep LR and r0 intact. This results in link stack imbalance when
ftrace is enabled. To address that, we would like to use a three
instruction sequence:
mflrr0
bl  ftrace_caller
mtlrr0

Further more, to support DYNAMIC_FTRACE_WITH_CALL_OPS, we need to
reserve two instruction slots before the function. This results in a
total of five instruction slots to be reserved for ftrace use on each
function that is traced.

Move the function profile sequence out-of-line to minimize its impact.
To do this, we reserve a single nop at function entry using
-fpatchable-function-entry=1 and add a pass on vmlinux.o to determine
the total number of functions that can be traced. This is then used to
generate a .S file reserving the appropriate amount of space for use as
ftrace stubs, which is built and linked into vmlinux.

On bootup, the stub space is split into separate stubs per function and
populated with the proper instruction sequence. A pointer to the
associated stub is maintained in dyn_arch_ftrace.

For modules, space for ftrace stubs is reserved from the generic module
stub space.

This is restricted to and enabled by default only on 64-bit powerpc.

Signed-off-by: Naveen N Rao 
---
 arch/powerpc/Kconfig |   4 +
 arch/powerpc/Makefile|   4 +
 arch/powerpc/include/asm/ftrace.h|  10 ++
 arch/powerpc/include/asm/module.h|   5 +
 arch/powerpc/kernel/asm-offsets.c|   4 +
 arch/powerpc/kernel/module_64.c  |  67 +--
 arch/powerpc/kernel/trace/ftrace.c   | 145 +--
 arch/powerpc/kernel/trace/ftrace_entry.S |  71 ---
 arch/powerpc/kernel/vmlinux.lds.S|   3 +-
 arch/powerpc/tools/vmlinux_o.sh  |  47 
 10 files changed, 324 insertions(+), 36 deletions(-)
 create mode 100755 arch/powerpc/tools/vmlinux_o.sh

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c88c6d46a5bc..c393daeaf643 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -568,6 +568,10 @@ config ARCH_USING_PATCHABLE_FUNCTION_ENTRY
def_bool 
$(success,$(srctree)/arch/powerpc/tools/gcc-check-fpatchable-function-entry.sh 
$(CC) -mlittle-endian) if PPC64 && CPU_LITTLE_ENDIAN
def_bool 
$(success,$(srctree)/arch/powerpc/tools/gcc-check-fpatchable-function-entry.sh 
$(CC) -mbig-endian) if PPC64 && CPU_BIG_ENDIAN
 
+config FTRACE_PFE_OUT_OF_LINE
+   def_bool PPC64 && ARCH_USING_PATCHABLE_FUNCTION_ENTRY
+   select ARCH_WANTS_PRE_LINK_VMLINUX
+
 config HOTPLUG_CPU
bool "Support for enabling/disabling CPUs"
depends on SMP && (PPC_PSERIES || \
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index a8479c881cac..bb920d48ec6e 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -155,7 +155,11 @@ CC_FLAGS_NO_FPU:= $(call 
cc-option,-msoft-float)
 ifdef CONFIG_FUNCTION_TRACER
 ifdef CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY
 KBUILD_CPPFLAGS+= -DCC_USING_PATCHABLE_FUNCTION_ENTRY
+ifdef CONFIG_FTRACE_PFE_OUT_OF_LINE
+CC_FLAGS_FTRACE := -fpatchable-function-entry=1
+else
 CC_FLAGS_FTRACE := -fpatchable-function-entry=2
+endif
 else
 CC_FLAGS_FTRACE := -pg
 ifdef CONFIG_MPROFILE_KERNEL
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 201f9d15430a..9da1da0f87b4 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -26,6 +26,9 @@ unsigned long prepare_ftrace_return(unsigned long parent, 
unsigned long ip,
 struct module;
 struct dyn_ftrace;
 struct dyn_arch_ftrace {
+#ifdef CONFIG_FTRACE_PFE_OUT_OF_LINE
+   unsigned long pfe_stub;
+#endif
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
@@ -132,6 +135,13 @@ static inline u8 this_cpu_get_ftrace_enabled(void) { 
return 1; }
 
 #ifdef CONFIG_FUNCTION_TRACER
 extern unsigned int ftrace_tramp_text[], ftrace_tramp_init[];
+#ifdef CONFIG_FTRACE_PFE_OUT_OF_LINE
+struct ftrace_pfe_stub {
+   u32 insn[4];
+};
+extern struct ftrace_pfe_stub ftrace_pfe_stub_text[], 
ftrace_pfe_stub_inittext[];
+extern unsigned long ftrace_pfe_stub_text_count, 
ftrace_pfe_stub_inittext_count;
+#endif
 void ftrace_free_init_tramp(void);
 unsigned long ftrace_call_adjust(unsigned long addr);
 #else
diff --git a/arch/powerpc/include/asm/module.h 
b/arch/powerpc/include/asm/module.h
index 300c777cc307..28dbd1ec5593 100644
--- a/arch/powerpc/include/asm/module.h
+++ b/arch/powerpc/include/asm/module.h
@@ -47,6 +47,11 @@ struct mod_arch_specific {
 #ifdef CONFIG_DYNAMIC_FTRACE
unsigned long tramp;
unsigned long tramp_regs;
+#ifdef CONFIG_FTRACE_PFE_OUT_OF_LINE
+   s

Re: [PATCH v1 1/1] treewide: Align match_string() with sysfs_match_string()

2024-06-10 Thread Linus Walleij

On Tue, Jun 4, 2024 at 9:46 AM Jani Nikula  wrote:

[Maybe slightly off-topic, ranty]

> Why do we think it's a good idea to increase and normalize the use of
> double-underscore function names across the kernel, like
> __match_string() in this case? It should mean "reserved for the
> implementation, not to be called directly".
>
> If it's to be used directly, it should be named accordingly, right?

It's a huge mess. "__" prefix is just so ambiguous I think it just
shouldn't be used or prolifierated, and it usually breaks Rusty Russells
API rules times over.

Consider __set_bit() from , used all over the place,
in contrast with set_bit() for example, what does "__" represent in
this context that makes __set_bit() different from set_bit()?

It means "non-atomic"...

How does a random contributor know this?

Yeah, you guess it. By the token of "everybody knows that".
(Grep, google, repeat for the number of contributors to the kernel.)

I was considering to send a script to Torvalds to just change all
this to set_bit_nonatomic() (etc) but was hesitating because that
makes the name unambiguous but long. I think I stayed off it
because changing stuff like that all over the place creates churn
and churn is bad.

Yours,
Linus Walleij

Re: [RFC PATCH v2 4/5] kbuild: Add generic hook for architectures to use before the final vmlinux link

2024-06-10 Thread Masahiro Yamada

On Mon, Jun 10, 2024 at 5:39 PM Naveen N Rao  wrote:
>
> On powerpc, we would like to be able to make a pass on vmlinux.o and
> generate a new object file to be linked into vmlinux. Add a generic pass
> in link-vmlinux.sh that architectures can use for this purpose.
> Architectures need to select CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX and must
> provide arch//tools/vmlinux_o.sh, which will be invoked prior to
> the final vmlinux link step.
>
> Signed-off-by: Naveen N Rao 
> ---
>  arch/Kconfig|  3 +++
>  scripts/link-vmlinux.sh | 18 +++---
>  2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..649f0903e7ef 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1643,4 +1643,7 @@ config CC_HAS_SANE_FUNCTION_ALIGNMENT
>  config ARCH_NEED_CMPXCHG_1_EMU
> bool
>
> +config ARCH_WANTS_PRE_LINK_VMLINUX
> +   def_bool n
> +
>  endmenu
> diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> index 46ce5d04dbeb..07f70e105d82 100755
> --- a/scripts/link-vmlinux.sh
> +++ b/scripts/link-vmlinux.sh
> @@ -122,7 +122,7 @@ gen_btf()
> return 1
> fi
>
> -   vmlinux_link ${1}
> +   vmlinux_link ${1} ${arch_vmlinux_o}
>
> info "BTF" ${2}
> LLVM_OBJCOPY="${OBJCOPY}" ${PAHOLE} -J ${PAHOLE_FLAGS} ${1}
> @@ -178,7 +178,7 @@ kallsyms_step()
> kallsymso=${kallsyms_vmlinux}.o
> kallsyms_S=${kallsyms_vmlinux}.S
>
> -   vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" 
> ${btf_vmlinux_bin_o}
> +   vmlinux_link ${kallsyms_vmlinux} "${kallsymso_prev}" 
> ${btf_vmlinux_bin_o} ${arch_vmlinux_o}
> mksysmap ${kallsyms_vmlinux} ${kallsyms_vmlinux}.syms
> kallsyms ${kallsyms_vmlinux}.syms ${kallsyms_S}
>
> @@ -203,6 +203,7 @@ sorttable()
>
>  cleanup()
>  {
> +   rm -f .arch.vmlinux.*
> rm -f .btf.*
> rm -f System.map
> rm -f vmlinux
> @@ -223,6 +224,17 @@ fi
>
>  ${MAKE} -f "${srctree}/scripts/Makefile.build" obj=init 
> init/version-timestamp.o
>
> +arch_vmlinux_o=""
> +if is_enabled CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX; then
> +   arch_vmlinux_o=.arch.vmlinux.o
> +   info "ARCH" ${arch_vmlinux_o}
> +   if ! ${srctree}/arch/${SRCARCH}/tools/vmlinux_o.sh ${arch_vmlinux_o} 
> ; then
> +   echo >&2 "Failed to generate ${arch_vmlinux_o}"
> +   echo >&2 "Try to disable CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX"
> +   exit 1
> +   fi
> +fi



This is wrong because scripts/link-vmlinux.sh is not triggered
even when source files under arch/powerpc/tools/ are changed.

Presumably, scripts/Makefile.vmlinux will be the right place.





> +
>  btf_vmlinux_bin_o=""
>  if is_enabled CONFIG_DEBUG_INFO_BTF; then
> btf_vmlinux_bin_o=.btf.vmlinux.bin.o
> @@ -273,7 +285,7 @@ if is_enabled CONFIG_KALLSYMS; then
> fi
>  fi
>
> -vmlinux_link vmlinux "${kallsymso}" ${btf_vmlinux_bin_o}
> +vmlinux_link vmlinux "${kallsymso}" ${btf_vmlinux_bin_o} ${arch_vmlinux_o}
>
>  # fill in BTF IDs
>  if is_enabled CONFIG_DEBUG_INFO_BTF && is_enabled CONFIG_BPF; then
> --
> 2.45.2
>


-- 
Best Regards
Masahiro Yamada

Re: [PATCH V3 11/14] tools/perf: Add support to use libcapstone in powerpc

2024-06-10 Thread Athira Rajeev




> On 3 Jun 2024, at 10:28 PM, Adrian Hunter  wrote:
> 
> On 3/06/24 19:30, Ian Rogers wrote:
>> On Fri, May 31, 2024 at 11:10 PM Athira Rajeev
>>  wrote:
>>> 
>>> Now perf uses the capstone library to disassemble the instructions in
>>> x86. capstone is used (if available) for perf annotate to speed up.
>>> Currently it only supports x86 architecture. Patch includes changes to
>>> enable this in powerpc. For now, only for data type sort keys, this
>>> method is used and only binary code (raw instruction) is read. This is
>>> because powerpc approach to understand instructions and reg fields uses
>>> raw instruction. The "cs_disasm" is currently not enabled. While
>>> attempting to do cs_disasm, observation is that some of the instructions
>>> were not identified (ex: extswsli, maddld) and it had to fallback to use
>>> objdump. Hence enabling "cs_disasm" is added in comment section as a
>>> TODO for powerpc.
>>> 
>>> Signed-off-by: Athira Rajeev 
>>> ---
>>> tools/perf/util/disasm.c | 148 ++-
>>> 1 file changed, 146 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/tools/perf/util/disasm.c b/tools/perf/util/disasm.c
>>> index d8b357055302..915508d2e197 100644
>>> --- a/tools/perf/util/disasm.c
>>> +++ b/tools/perf/util/disasm.c
>>> @@ -1540,12 +1540,18 @@ static int open_capstone_handle(struct 
>>> annotate_args *args, bool is_64bit,
>>> {
>>>struct annotation_options *opt = args->options;
>>>cs_mode mode = is_64bit ? CS_MODE_64 : CS_MODE_32;
>>> +   int ret;
>>> 
>>>/* TODO: support more architectures */
>>> -   if (!arch__is(args->arch, "x86"))
>>> +   if ((!arch__is(args->arch, "x86")) && (!arch__is(args->arch, 
>>> "powerpc")))
>>>return -1;
>>> 
>>> -   if (cs_open(CS_ARCH_X86, mode, handle) != CS_ERR_OK)
>>> +   if (arch__is(args->arch, "x86"))
>>> +   ret = cs_open(CS_ARCH_X86, mode, handle);
>>> +   else
>>> +   ret = cs_open(CS_ARCH_PPC, mode, handle);
>>> +
>>> +   if (ret != CS_ERR_OK)
>>>return -1;
>> 
>> There looks to be a pretty/more robust capstone_init function in
>> print_insn.c, should we factor this code out and recycle:
>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print_insn.c?h=perf-tools-next#n40
> 
> On a slightly related note, there is a compile error
> been around for a while in util/disasm.c on Ubuntu 22.04
> 
> In file included from /usr/include/capstone/capstone.h:279,
> from util/disasm.c:1354:
> /usr/include/capstone/bpf.h:94:14: error: ‘bpf_insn’ defined as wrong
> kind of tag
>   94 | typedef enum bpf_insn {
>  |  ^~~~
> 

Hi Adrian

I tried compilation on Ubuntu 22.04, but didn’t face this issue.
The libcapstone version I have is libcapstone4 which doesn’t have the include 
for “bpf.h”
What is the version of libcapstone in the setup where you are seeing this issue 
?

Thanks
Athira
>> 
>> Thanks,
>> Ian
>> 
>>>if (!opt->disassembler_style ||
>>> @@ -1635,6 +1641,139 @@ static void print_capstone_detail(cs_insn *insn, 
>>> char *buf, size_t len,
>>>}
>>> }
>>> 
>>> +static int symbol__disassemble_capstone_powerpc(char *filename, struct 
>>> symbol *sym,
>>> +   struct annotate_args *args)
>>> +{
>>> +   struct annotation *notes = symbol__annotation(sym);
>>> +   struct map *map = args->ms.map;
>>> +   struct dso *dso = map__dso(map);
>>> +   struct nscookie nsc;
>>> +   u64 start = map__rip_2objdump(map, sym->start);
>>> +   u64 end = map__rip_2objdump(map, sym->end);
>>> +   u64 len = end - start;
>>> +   u64 offset;
>>> +   int i, fd, count;
>>> +   bool is_64bit = false;
>>> +   bool needs_cs_close = false;
>>> +   u8 *buf = NULL;
>>> +   struct find_file_offset_data data = {
>>> +   .ip = start,
>>> +   };
>>> +   csh handle;
>>> +   char disasm_buf[512];
>>> +   struct disasm_line *dl;
>>> +   u32 *line;
>>> +
>>> +   if (args->options->objdump_path)
>>> +   return -1;
>>> +
>>> +   nsinfo__mountns_enter(dso->nsinfo, &nsc);
>>> +   fd = open(filename, O_RDONLY);
>>> +   nsinfo__mountns_exit(&nsc);
>>> +   if (fd < 0)
>>> +   return -1;
>>> +
>>> +   if (file__read_maps(fd, /*exe=*/true, find_file_offset, &data,
>>> +   &is_64bit) == 0)
>>> +   goto err;
>>> +
>>> +   if (open_capstone_handle(args, is_64bit, &handle) < 0)
>>> +   goto err;
>>> +
>>> +   needs_cs_close = true;
>>> +
>>> +   buf = malloc(len);
>>> +   if (buf == NULL)
>>> +   goto err;
>>> +
>>> +   count = pread(fd, buf, len, data.offset);
>>> +   close(fd);
>>> +   fd = -1;
>>> +
>>> +   if ((u64)count != len)
>>> +   goto err;
>>> +
>>> +   line = (u32 *)buf;
>>> +
>>> +   /* add

Re: [PATCH 1/3] tools/perf: Fix the nrcpus in perf bench futex to enable the run when all CPU's are not online

2024-06-10 Thread Disha Goel


On 07/06/24 10:13 am, Athira Rajeev wrote:


Perf bench futex fails as below when attempted to run on
on a powerpc system:

  ./perf bench futex all
  Running futex/hash benchmark...
Run summary [PID 626307]: 80 threads, each operating on 1024 [private] futexes 
for 10 secs.

perf: pthread_create: No such file or directory

In the setup where this perf bench was ran, difference was that
partition had 640 CPU's, but not all CPUs were online. 80 CPUs
were online. While blocking the threads with futex_wait, code
sets the affinity using cpumask. The cpumask size used is 80
which is picked from "nrcpus = perf_cpu_map__nr(cpu)". Here the
benchmark reports fail while setting affinity for cpu number which
is greater than 80 or higher, because it attempts to set a bit
position which is not allocated on the cpumask. Fix this by changing
the size of cpumask to number of possible cpus and not the number
of online cpus.

Signed-off-by: Athira Rajeev 


Thanks for the fix patches, Athira.
I have tested all three patches on a power machine (both small and max config),
and the perf bench futex and epoll tests run fine.

For the series:
Tested-by: Disha Goel 


---
  tools/perf/bench/futex-hash.c  | 2 +-
  tools/perf/bench/futex-lock-pi.c   | 2 +-
  tools/perf/bench/futex-requeue.c   | 2 +-
  tools/perf/bench/futex-wake-parallel.c | 2 +-
  tools/perf/bench/futex-wake.c  | 2 +-
  5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index 0c69d20efa32..b472eded521b 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -174,7 +174,7 @@ int bench_futex_hash(int argc, const char **argv)
pthread_attr_init(&thread_attr);
gettimeofday(&bench__start, NULL);
  
-	nrcpus = perf_cpu_map__nr(cpu);

+   nrcpus = cpu__max_cpu().cpu;
cpuset = CPU_ALLOC(nrcpus);
BUG_ON(!cpuset);
size = CPU_ALLOC_SIZE(nrcpus);
diff --git a/tools/perf/bench/futex-lock-pi.c b/tools/perf/bench/futex-lock-pi.c
index 7a4973346180..0416120c091b 100644
--- a/tools/perf/bench/futex-lock-pi.c
+++ b/tools/perf/bench/futex-lock-pi.c
@@ -122,7 +122,7 @@ static void create_threads(struct worker *w, struct 
perf_cpu_map *cpu)
  {
cpu_set_t *cpuset;
unsigned int i;
-   int nrcpus =  perf_cpu_map__nr(cpu);
+   int nrcpus =  cpu__max_cpu().cpu;
size_t size;
  
  	threads_starting = params.nthreads;

diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c
index d9ad736c1a3e..aad5bfc4fe18 100644
--- a/tools/perf/bench/futex-requeue.c
+++ b/tools/perf/bench/futex-requeue.c
@@ -125,7 +125,7 @@ static void block_threads(pthread_t *w, struct perf_cpu_map 
*cpu)
  {
cpu_set_t *cpuset;
unsigned int i;
-   int nrcpus = perf_cpu_map__nr(cpu);
+   int nrcpus = cpu__max_cpu().cpu;
size_t size;
  
  	threads_starting = params.nthreads;

diff --git a/tools/perf/bench/futex-wake-parallel.c 
b/tools/perf/bench/futex-wake-parallel.c
index b66df553e561..90a5b91bf139 100644
--- a/tools/perf/bench/futex-wake-parallel.c
+++ b/tools/perf/bench/futex-wake-parallel.c
@@ -149,7 +149,7 @@ static void block_threads(pthread_t *w, struct perf_cpu_map 
*cpu)
  {
cpu_set_t *cpuset;
unsigned int i;
-   int nrcpus = perf_cpu_map__nr(cpu);
+   int nrcpus = cpu__max_cpu().cpu;
size_t size;
  
  	threads_starting = params.nthreads;

diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
index 690fd6d3da13..49b3c89b0b35 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
@@ -100,7 +100,7 @@ static void block_threads(pthread_t *w, struct perf_cpu_map 
*cpu)
cpu_set_t *cpuset;
unsigned int i;
size_t size;
-   int nrcpus = perf_cpu_map__nr(cpu);
+   int nrcpus = cpu__max_cpu().cpu;
threads_starting = params.nthreads;
  
  	cpuset = CPU_ALLOC(nrcpus);

[PATCH v2] PowerPC: Replace kretprobe with rethook

2024-06-10 Thread Abhishek Dubey

This is an adaptation of commit f3a112c0c40d ("x86,rethook,kprobes:
Replace kretprobe with rethook on x86") to PowerPC.

Replaces the kretprobe code with rethook on Power. With this patch,
kretprobe on Power uses the rethook instead of kretprobe specific
trampoline code.

Reference to other archs:
commit b57c2f124098 ("riscv: add riscv rethook implementation")
commit 7b0a096436c2 ("LoongArch: Replace kretprobe with rethook")

Signed-off-by: Abhishek Dubey 
---
 arch/powerpc/Kconfig |  1 +
 arch/powerpc/kernel/Makefile |  1 +
 arch/powerpc/kernel/kprobes.c| 65 +
 arch/powerpc/kernel/optprobes.c  |  2 +-
 arch/powerpc/kernel/rethook.c| 71 
 arch/powerpc/kernel/stacktrace.c | 10 +++--
 6 files changed, 81 insertions(+), 69 deletions(-)
 create mode 100644 arch/powerpc/kernel/rethook.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c88c6d46a5bc..fa0b1ab3f935 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -270,6 +270,7 @@ config PPC
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+   select HAVE_RETHOOK
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE
select HAVE_RSEQ
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 8585d03c02d3..7dd1b523b17f 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -140,6 +140,7 @@ obj-$(CONFIG_KPROBES)   += kprobes.o
 obj-$(CONFIG_OPTPROBES)+= optprobes.o optprobes_head.o
 obj-$(CONFIG_KPROBES_ON_FTRACE)+= kprobes-ftrace.o
 obj-$(CONFIG_UPROBES)  += uprobes.o
+obj-$(CONFIG_RETHOOK)   += rethook.o
 obj-$(CONFIG_PPC_UDBG_16550)   += legacy_serial.o udbg_16550.o
 obj-$(CONFIG_SWIOTLB)  += dma-swiotlb.o
 obj-$(CONFIG_ARCH_HAS_DMA_SET_MASK) += dma-mask.o
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 14c5ddec3056..f8aa91bc3b17 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -228,16 +228,6 @@ static nokprobe_inline void set_current_kprobe(struct 
kprobe *p, struct pt_regs
kcb->kprobe_saved_msr = regs->msr;
 }
 
-void arch_prepare_kretprobe(struct kretprobe_instance *ri, struct pt_regs 
*regs)
-{
-   ri->ret_addr = (kprobe_opcode_t *)regs->link;
-   ri->fp = NULL;
-
-   /* Replace the return addr with trampoline addr */
-   regs->link = (unsigned long)__kretprobe_trampoline;
-}
-NOKPROBE_SYMBOL(arch_prepare_kretprobe);
-
 static int try_to_emulate(struct kprobe *p, struct pt_regs *regs)
 {
int ret;
@@ -394,49 +384,6 @@ int kprobe_handler(struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(kprobe_handler);
 
-/*
- * Function return probe trampoline:
- * - init_kprobes() establishes a probepoint here
- * - When the probed function returns, this probe
- * causes the handlers to fire
- */
-asm(".global __kretprobe_trampoline\n"
-   ".type __kretprobe_trampoline, @function\n"
-   "__kretprobe_trampoline:\n"
-   "nop\n"
-   "blr\n"
-   ".size __kretprobe_trampoline, .-__kretprobe_trampoline\n");
-
-/*
- * Called when the probe at kretprobe trampoline is hit
- */
-static int trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs)
-{
-   unsigned long orig_ret_address;
-
-   orig_ret_address = __kretprobe_trampoline_handler(regs, NULL);
-   /*
-* We get here through one of two paths:
-* 1. by taking a trap -> kprobe_handler() -> here
-* 2. by optprobe branch -> optimized_callback() -> opt_pre_handler() 
-> here
-*
-* When going back through (1), we need regs->nip to be setup properly
-* as it is used to determine the return address from the trap.
-* For (2), since nip is not honoured with optprobes, we instead setup
-* the link register properly so that the subsequent 'blr' in
-* __kretprobe_trampoline jumps back to the right instruction.
-*
-* For nip, we should set the address to the previous instruction since
-* we end up emulating it in kprobe_handler(), which increments the nip
-* again.
-*/
-   regs_set_return_ip(regs, orig_ret_address - 4);
-   regs->link = orig_ret_address;
-
-   return 0;
-}
-NOKPROBE_SYMBOL(trampoline_probe_handler);
-
 /*
  * Called after single-stepping.  p->addr is the address of the
  * instruction whose first byte has been replaced by the "breakpoint"
@@ -539,19 +486,9 @@ int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
 }
 NOKPROBE_SYMBOL(kprobe_fault_handler);
 
-static struct kprobe trampoline_p = {
-   .addr = (kprobe_opcode_t *) &__kretprobe_trampoline,
-   .pre_handler = trampoline_probe_handler
-};
-
-int __init arch_init_kprobes(void)
-{
-   return register_kprobe(&trampoline_p);
-}
-
 int arch_trampoline_kp

Re: [RFC PATCH v2 4/5] kbuild: Add generic hook for architectures to use before the final vmlinux link

2024-06-10 Thread Naveen N Rao

On Mon, Jun 10, 2024 at 06:14:51PM GMT, Masahiro Yamada wrote:
> On Mon, Jun 10, 2024 at 5:39 PM Naveen N Rao  wrote:
> >
> > On powerpc, we would like to be able to make a pass on vmlinux.o and
> > generate a new object file to be linked into vmlinux. Add a generic pass
> > in link-vmlinux.sh that architectures can use for this purpose.
> > Architectures need to select CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX and must
> > provide arch//tools/vmlinux_o.sh, which will be invoked prior to
> > the final vmlinux link step.
> >
> > Signed-off-by: Naveen N Rao 
> > ---
> >  arch/Kconfig|  3 +++
> >  scripts/link-vmlinux.sh | 18 +++---
> >  2 files changed, 18 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index 975dd22a2dbd..649f0903e7ef 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -1643,4 +1643,7 @@ config CC_HAS_SANE_FUNCTION_ALIGNMENT
> >  config ARCH_NEED_CMPXCHG_1_EMU
> > bool
> >
> > +config ARCH_WANTS_PRE_LINK_VMLINUX
> > +   def_bool n
> > +
> >  endmenu
> > diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> > index 46ce5d04dbeb..07f70e105d82 100755
> > --- a/scripts/link-vmlinux.sh
> > +++ b/scripts/link-vmlinux.sh
...
> >
> > +arch_vmlinux_o=""
> > +if is_enabled CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX; then
> > +   arch_vmlinux_o=.arch.vmlinux.o
> > +   info "ARCH" ${arch_vmlinux_o}
> > +   if ! ${srctree}/arch/${SRCARCH}/tools/vmlinux_o.sh 
> > ${arch_vmlinux_o} ; then
> > +   echo >&2 "Failed to generate ${arch_vmlinux_o}"
> > +   echo >&2 "Try to disable CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX"
> > +   exit 1
> > +   fi
> > +fi
> 
> 
> 
> This is wrong because scripts/link-vmlinux.sh is not triggered
> even when source files under arch/powerpc/tools/ are changed.
> 
> Presumably, scripts/Makefile.vmlinux will be the right place.

Ah, yes. Something like this?

diff --git a/scripts/Makefile.vmlinux b/scripts/Makefile.vmlinux
index 49946cb96844..77d90b6ac53e 100644
--- a/scripts/Makefile.vmlinux
+++ b/scripts/Makefile.vmlinux
@@ -22,6 +22,10 @@ targets += .vmlinux.export.o
 vmlinux: .vmlinux.export.o
 endif

+ifdef CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX
+vmlinux: $(srctree)/arch/$(SRCARCH)/tools/vmlinux_o.sh
+endif
+
 ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)

 # Final link of vmlinux with optional arch pass after final link


Thanks,
Naveen

Re: [RFC PATCH v2 2/5] powerpc/ftrace: Remove pointer to struct module from dyn_arch_ftrace

2024-06-10 Thread Steven Rostedt

On Mon, 10 Jun 2024 14:08:15 +0530
Naveen N Rao  wrote:

> Pointer to struct module is only relevant for ftrace records belonging
> to kernel modules. Having this field in dyn_arch_ftrace wastes memory
> for all ftrace records belonging to the kernel. Remove the same in
> favour of looking up the module from the ftrace record address, similar
> to other architectures.
> 
> Signed-off-by: Naveen N Rao 
> ---
>  arch/powerpc/include/asm/ftrace.h|  1 -
>  arch/powerpc/kernel/trace/ftrace.c   | 47 ++-
>  arch/powerpc/kernel/trace/ftrace_64_pg.c | 73 +++-
>  3 files changed, 64 insertions(+), 57 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/ftrace.h 
> b/arch/powerpc/include/asm/ftrace.h
> index 107fc5a48456..201f9d15430a 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -26,7 +26,6 @@ unsigned long prepare_ftrace_return(unsigned long parent, 
> unsigned long ip,
>  struct module;
>  struct dyn_ftrace;
>  struct dyn_arch_ftrace {
> - struct module *mod;
>  };

Nice. I always hated that extra field.


>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> diff --git a/arch/powerpc/kernel/trace/ftrace.c 
> b/arch/powerpc/kernel/trace/ftrace.c
> index d8d6b4fd9a14..041be965485e 100644
> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c
> @@ -106,20 +106,36 @@ static unsigned long find_ftrace_tramp(unsigned long ip)
>   return 0;
>  }
>  
> +static struct module *ftrace_lookup_module(struct dyn_ftrace *rec)
> +{
> + struct module *mod = NULL;
> +
> +#ifdef CONFIG_MODULES
> + /*
> +  * NOTE: __module_text_address() must be called with preemption
> +  * disabled, but we can rely on ftrace_lock to ensure that 'mod'
> +  * retains its validity throughout the remainder of this code.
> + */
> + preempt_disable();
> + mod = __module_text_address(rec->ip);
> + preempt_enable();
> +
> + if (!mod)
> + pr_err("No module loaded at addr=%lx\n", rec->ip);
> +#endif
> +
> + return mod;
> +}

It may look nicer to have:

#ifdef CONFIG_MODULES
static struct module *ftrace_lookup_module(struct dyn_ftrace *rec)
{
struct module *mod = NULL;

/*
 * NOTE: __module_text_address() must be called with preemption
 * disabled, but we can rely on ftrace_lock to ensure that 'mod'
 * retains its validity throughout the remainder of this code.
*/
preempt_disable();
mod = __module_text_address(rec->ip);
preempt_enable();

if (!mod)
pr_err("No module loaded at addr=%lx\n", rec->ip);

return mod;
}
#else
static inline struct module *ftrace_lookup_module(struct dyn_ftrace *rec)
{
return NULL;
}
#endif

> +
>  static int ftrace_get_call_inst(struct dyn_ftrace *rec, unsigned long addr, 
> ppc_inst_t *call_inst)
>  {
>   unsigned long ip = rec->ip;
>   unsigned long stub;
> + struct module *mod;
>  
>   if (is_offset_in_branch_range(addr - ip)) {
>   /* Within range */
>   stub = addr;
> -#ifdef CONFIG_MODULES
> - } else if (rec->arch.mod) {
> - /* Module code would be going to one of the module stubs */
> - stub = (addr == (unsigned long)ftrace_caller ? 
> rec->arch.mod->arch.tramp :
> -
> rec->arch.mod->arch.tramp_regs);
> -#endif
>   } else if (core_kernel_text(ip)) {
>   /* We would be branching to one of our ftrace stubs */
>   stub = find_ftrace_tramp(ip);
> @@ -128,7 +144,16 @@ static int ftrace_get_call_inst(struct dyn_ftrace *rec, 
> unsigned long addr, ppc_
>   return -EINVAL;
>   }
>   } else {
> - return -EINVAL;
> + mod = ftrace_lookup_module(rec);
> + if (mod) {
> +#ifdef CONFIG_MODULES
> + /* Module code would be going to one of the module 
> stubs */
> + stub = (addr == (unsigned long)ftrace_caller ? 
> mod->arch.tramp :
> +
> mod->arch.tramp_regs);
> +#endif

You have CONFIG_MODULES here and in ftrace_lookup_module() above, which
would always return NULL. Could you combine the above to be done in
ftrace_lookup_module() so that there's no #ifdef CONFIG_MODULES here?


> + } else {
> + return -EINVAL;
> + }
>   }
>  
>   *call_inst = ftrace_create_branch_inst(ip, stub, 1);
> @@ -256,14 +281,6 @@ int ftrace_init_nop(struct module *mod, struct 
> dyn_ftrace *rec)
>   if (ret)
>   return ret;
>  
> - if (!core_kernel_text(ip)) {
> - if (!mod) {
> - pr_err("0x%lx: No module provided for non-kernel 
> address\n", ip);
> - return -EFAULT;
> - }
> - rec->arch.mod = mod;
> -

Re: [RFC PATCH v2 3/5] powerpc/ftrace: Unify 32-bit and 64-bit ftrace entry code

2024-06-10 Thread Steven Rostedt

On Mon, 10 Jun 2024 14:08:16 +0530
Naveen N Rao  wrote:

> On 32-bit powerpc, gcc generates a three instruction sequence for
> function profiling:
>   mflrr0
>   stw r0, 4(r1)
>   bl  _mcount
> 
> On kernel boot, the call to _mcount() is nop-ed out, to be patched back
> in when ftrace is actually enabled. The 'stw' instruction therefore is
> not necessary unless ftrace is enabled. Nop it out during ftrace init.
> 
> When ftrace is enabled, we want the 'stw' so that stack unwinding works
> properly. Perform the same within the ftrace handler, similar to 64-bit
> powerpc.
> 
> For 64-bit powerpc, early versions of gcc used to emit a three
> instruction sequence for function profiling (with -mprofile-kernel) with
> a 'std' instruction to mimic the 'stw' above. Address that scenario also
> by nop-ing out the 'std' instruction during ftrace init.
> 
> Signed-off-by: Naveen N Rao 

Isn't there still the race that there's a preemption between the:

stw r0, 4(r1)
and
bl  _mcount

And if this breaks stack unwinding, couldn't this cause an issue for live
kernel patching?

I know it's very unlikely, but in theory, I think the race exists.

-- Steve

Re: [RFC PATCH v2 4/5] kbuild: Add generic hook for architectures to use before the final vmlinux link

2024-06-10 Thread Masahiro Yamada

On Tue, Jun 11, 2024 at 2:20 AM Naveen N Rao  wrote:
>
> On Mon, Jun 10, 2024 at 06:14:51PM GMT, Masahiro Yamada wrote:
> > On Mon, Jun 10, 2024 at 5:39 PM Naveen N Rao  wrote:
> > >
> > > On powerpc, we would like to be able to make a pass on vmlinux.o and
> > > generate a new object file to be linked into vmlinux. Add a generic pass
> > > in link-vmlinux.sh that architectures can use for this purpose.
> > > Architectures need to select CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX and must
> > > provide arch//tools/vmlinux_o.sh, which will be invoked prior to
> > > the final vmlinux link step.
> > >
> > > Signed-off-by: Naveen N Rao 
> > > ---
> > >  arch/Kconfig|  3 +++
> > >  scripts/link-vmlinux.sh | 18 +++---
> > >  2 files changed, 18 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/Kconfig b/arch/Kconfig
> > > index 975dd22a2dbd..649f0903e7ef 100644
> > > --- a/arch/Kconfig
> > > +++ b/arch/Kconfig
> > > @@ -1643,4 +1643,7 @@ config CC_HAS_SANE_FUNCTION_ALIGNMENT
> > >  config ARCH_NEED_CMPXCHG_1_EMU
> > > bool
> > >
> > > +config ARCH_WANTS_PRE_LINK_VMLINUX
> > > +   def_bool n
> > > +
> > >  endmenu
> > > diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> > > index 46ce5d04dbeb..07f70e105d82 100755
> > > --- a/scripts/link-vmlinux.sh
> > > +++ b/scripts/link-vmlinux.sh
> ...
> > >
> > > +arch_vmlinux_o=""
> > > +if is_enabled CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX; then
> > > +   arch_vmlinux_o=.arch.vmlinux.o
> > > +   info "ARCH" ${arch_vmlinux_o}
> > > +   if ! ${srctree}/arch/${SRCARCH}/tools/vmlinux_o.sh 
> > > ${arch_vmlinux_o} ; then
> > > +   echo >&2 "Failed to generate ${arch_vmlinux_o}"
> > > +   echo >&2 "Try to disable 
> > > CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX"
> > > +   exit 1
> > > +   fi
> > > +fi
> >
> >
> >
> > This is wrong because scripts/link-vmlinux.sh is not triggered
> > even when source files under arch/powerpc/tools/ are changed.
> >
> > Presumably, scripts/Makefile.vmlinux will be the right place.
>
> Ah, yes. Something like this?
>
> diff --git a/scripts/Makefile.vmlinux b/scripts/Makefile.vmlinux
> index 49946cb96844..77d90b6ac53e 100644
> --- a/scripts/Makefile.vmlinux
> +++ b/scripts/Makefile.vmlinux
> @@ -22,6 +22,10 @@ targets += .vmlinux.export.o
>  vmlinux: .vmlinux.export.o
>  endif
>
> +ifdef CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX
> +vmlinux: $(srctree)/arch/$(SRCARCH)/tools/vmlinux_o.sh
> +endif
> +
>  ARCH_POSTLINK := $(wildcard $(srctree)/arch/$(SRCARCH)/Makefile.postlink)
>
>  # Final link of vmlinux with optional arch pass after final link
>
>
> Thanks,
> Naveen
>



No.

Something like below.

Then, you can do everything in Makefile, not a shell script.



ifdef CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX
vmlinux: .arch.vmlinux.o

.arch.vmlinux.o: FORCE
$(Q)$(MAKE) $(build)=arch/$(SRCARCH)/tools .arch.vmlinux.o

endif



I did not test it, though.








-- 
Best Regards
Masahiro Yamada

Re: [PATCH] powerpc/eeh: avoid possible crash when edev->pdev changes

2024-06-10 Thread Michael Ellerman

Hi Ganesh,

Ganesh Goudar  writes:
> If a PCI device is removed during eeh_pe_report_edev(), edev->pdev
> will change and can cause a crash, hold the PCI rescan/remove lock
> while taking a copy of edev->pdev.
>
> Signed-off-by: Ganesh Goudar 
> ---
>  arch/powerpc/kernel/eeh_pe.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
> index d1030bc52564..49f968733912 100644
> --- a/arch/powerpc/kernel/eeh_pe.c
> +++ b/arch/powerpc/kernel/eeh_pe.c
> @@ -859,7 +859,9 @@ struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe)
>  
>   /* Retrieve the parent PCI bus of first (top) PCI device */
>   edev = list_first_entry_or_null(&pe->edevs, struct eeh_dev, entry);
> + pci_lock_rescan_remove();
>   pdev = eeh_dev_to_pci_dev(edev);
> + pci_unlock_rescan_remove();
>   if (pdev)
>   return pdev->bus;

What prevents pdev being freed/reused immediately after you drop the
rescan/remove lock?

AFAICS eeh_dev_to_pci_dev() doesn't take an additional reference to the
pdev or anything.

cheers

[PATCH 1/2] function_graph: Fix up ftrace_graph_ret_addr()

2024-06-10 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Yang Li sent a patch to fix the kerneldoc of ftrace_graph_ret_addr().
While reviewing it, I realized that the comments in the entire function
header needed a rewrite. When doing that, I realized that @idx parameter
was being ignored. Every time this was called by the unwinder, it would
start the loop at the top of the shadow stack and look for the matching
stack pointer. When it found it, it would return it. When the unwinder
asked for the next function, it would search from the beginning again.

In reality, it should start from where it left off. That was the reason
for the @idx parameter in the first place. The first time the unwinder
calls this function, the @idx pointer would contain zero. That would mean
to start from the top of the stack. The function was supposed to update
the @idx with the index where it found the return address, so that the
next time the unwinder calls this function it doesn't have to search
through the previous addresses it found (making it O(n^2)!).

This speeds up the unwinder's use of ftrace_graph_ret_addr() by an order
of magnitude.

Link: 
https://lore.kernel.org/linux-trace-kernel/20240610181746.656e3...@gandalf.local.home/

Reported-by: Yang Li 
Fixes: 7aa1eaef9f428 ("function_graph: Allow multiple users to attach to 
function graph")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/fgraph.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 63d0c2f84ce1..91f1eef256af 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -870,18 +870,24 @@ ftrace_graph_get_ret_stack(struct task_struct *task, int 
idx)
 }
 
 /**
- * ftrace_graph_ret_addr - convert a potentially modified stack return address
- *to its original value
+ * ftrace_graph_ret_addr - return the original value of the return address
+ * @task: The task the unwinder is being executed on
+ * @idx: An initialized pointer to the next stack index to use
+ * @ret: The current return address (likely pointing to return_handler)
+ * @retp: The address on the stack of the current return location
  *
  * This function can be called by stack unwinding code to convert a found stack
- * return address ('ret') to its original value, in case the function graph
+ * return address (@ret) to its original value, in case the function graph
  * tracer has modified it to be 'return_to_handler'.  If the address hasn't
- * been modified, the unchanged value of 'ret' is returned.
+ * been modified, the unchanged value of @ret is returned.
  *
- * 'idx' is a state variable which should be initialized by the caller to zero
- * before the first call.
+ * @idx holds the last index used to know where to start from. It should be
+ * initialized to zero for the first iteration as that will mean to start
+ * at the top of the shadow stack. If the location is found, this pointer
+ * will be assigned that location so that if called again, it will continue
+ * where it left off.
  *
- * 'retp' is a pointer to the return address on the stack.  It's ignored if
+ * @retp is a pointer to the return address on the stack.  It's ignored if
  * the arch doesn't have HAVE_FUNCTION_GRAPH_RET_ADDR_PTR defined.
  */
 #ifdef HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
@@ -895,6 +901,10 @@ unsigned long ftrace_graph_ret_addr(struct task_struct 
*task, int *idx,
if (ret != return_handler)
return ret;
 
+   if (!idx)
+   return ret;
+
+   i = *idx ? : task->curr_ret_stack;
while (i > 0) {
ret_stack = get_ret_stack(current, i, &i);
if (!ret_stack)
@@ -908,8 +918,10 @@ unsigned long ftrace_graph_ret_addr(struct task_struct 
*task, int *idx,
 * Thus we will continue to find real return address.
 */
if (ret_stack->retp == retp &&
-   ret_stack->ret != return_handler)
+   ret_stack->ret != return_handler) {
+   *idx = i;
return ret_stack->ret;
+   }
}
 
return ret;
-- 
2.43.0

[PATCH 2/2] function_graph: Everyone uses HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, remove it

2024-06-10 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

All architectures that implement function graph also implements
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR. Remove it, as it is no longer a
differentiator.

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace-design.rst | 12 -
 arch/arm64/include/asm/ftrace.h   | 11 -
 arch/csky/include/asm/ftrace.h|  2 --
 arch/loongarch/include/asm/ftrace.h   |  1 -
 arch/powerpc/include/asm/ftrace.h |  2 --
 arch/riscv/include/asm/ftrace.h   |  1 -
 arch/s390/include/asm/ftrace.h|  1 -
 arch/x86/include/asm/ftrace.h |  2 --
 include/linux/ftrace.h|  2 --
 kernel/trace/fgraph.c | 35 +--
 10 files changed, 1 insertion(+), 68 deletions(-)

diff --git a/Documentation/trace/ftrace-design.rst 
b/Documentation/trace/ftrace-design.rst
index 6893399157f0..dc82d64b3a44 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -217,18 +217,6 @@ along to ftrace_push_return_trace() instead of a stub 
value of 0.
 
 Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer.
 
-HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
-
-
-An arch may pass in a pointer to the return address on the stack.  This
-prevents potential stack unwinding issues where the unwinder gets out of
-sync with ret_stack and the wrong addresses are reported by
-ftrace_graph_ret_addr().
-
-Adding support for it is easy: just define the macro in asm/ftrace.h and
-pass the return address pointer as the 'retp' argument to
-ftrace_push_return_trace().
-
 HAVE_SYSCALL_TRACEPOINTS
 
 
diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index ab158196480c..dc9cf0bd2a4c 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -12,17 +12,6 @@
 
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 
-/*
- * HAVE_FUNCTION_GRAPH_RET_ADDR_PTR means that the architecture can provide a
- * "return address pointer" which can be used to uniquely identify a return
- * address which has been overwritten.
- *
- * On arm64 we use the address of the caller's frame record, which remains the
- * same for the lifetime of the instrumented function, unlike the return
- * address in the LR.
- */
-#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
-
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_ARGS
 #define ARCH_SUPPORTS_FTRACE_OPS 1
 #else
diff --git a/arch/csky/include/asm/ftrace.h b/arch/csky/include/asm/ftrace.h
index fd215c38ef27..00f9f7647e3f 100644
--- a/arch/csky/include/asm/ftrace.h
+++ b/arch/csky/include/asm/ftrace.h
@@ -7,8 +7,6 @@
 
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 
-#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
-
 #define ARCH_SUPPORTS_FTRACE_OPS 1
 
 #define MCOUNT_ADDR((unsigned long)_mcount)
diff --git a/arch/loongarch/include/asm/ftrace.h 
b/arch/loongarch/include/asm/ftrace.h
index de891c2c83d4..c0a682808e07 100644
--- a/arch/loongarch/include/asm/ftrace.h
+++ b/arch/loongarch/include/asm/ftrace.h
@@ -28,7 +28,6 @@ struct dyn_ftrace;
 struct dyn_arch_ftrace { };
 
 #define ARCH_SUPPORTS_FTRACE_OPS 1
-#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
 
 #define ftrace_init_nop ftrace_init_nop
 int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec);
diff --git a/arch/powerpc/include/asm/ftrace.h 
b/arch/powerpc/include/asm/ftrace.h
index 107fc5a48456..559560286e6d 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -8,8 +8,6 @@
 #define MCOUNT_ADDR((unsigned long)(_mcount))
 #define MCOUNT_INSN_SIZE   4 /* sizeof mcount call */
 
-#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
-
 /* Ignore unused weak functions which will have larger offsets */
 #if defined(CONFIG_MPROFILE_KERNEL) || 
defined(CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY)
 #define FTRACE_MCOUNT_MAX_OFFSET   16
diff --git a/arch/riscv/include/asm/ftrace.h b/arch/riscv/include/asm/ftrace.h
index 9eb31a7ea0aa..2cddd79ff21b 100644
--- a/arch/riscv/include/asm/ftrace.h
+++ b/arch/riscv/include/asm/ftrace.h
@@ -11,7 +11,6 @@
 #if defined(CONFIG_FUNCTION_GRAPH_TRACER) && defined(CONFIG_FRAME_POINTER)
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 #endif
-#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
 
 #define ARCH_SUPPORTS_FTRACE_OPS 1
 #ifndef __ASSEMBLY__
diff --git a/arch/s390/include/asm/ftrace.h b/arch/s390/include/asm/ftrace.h
index 77e479d44f1e..fbadca645af7 100644
--- a/arch/s390/include/asm/ftrace.h
+++ b/arch/s390/include/asm/ftrace.h
@@ -2,7 +2,6 @@
 #ifndef _ASM_S390_FTRACE_H
 #define _ASM_S390_FTRACE_H
 
-#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
 #define ARCH_SUPPORTS_FTRACE_OPS 1
 #define MCOUNT_INSN_SIZE   6
 
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index 897cf02c20b1..0152a81d9b4a 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -20,8 +20,6 @@
 #define ARCH_SUPPORTS_FTRACE_OPS 1
 #endif
 
-#define HAVE_FUNCTION_GRAPH_RET

[PATCH 0/2] function_graph: ftrace_graph_ret_addr(); there can be only one!

2024-06-10 Thread Steven Rostedt



I noticed a slight bug in ftrace_graph_ret_addr() for when
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR was defined and fixed it up.
I then noticed it was buggy when not defined. Looking for an
architecture that did not have it defined, I couldn't find any.
So I removed it.

Steven Rostedt (Google) (2):
  function_graph: Fix up ftrace_graph_ret_addr()
  function_graph: Everyone uses HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, remove it


 Documentation/trace/ftrace-design.rst | 12 ---
 arch/arm64/include/asm/ftrace.h   | 11 ---
 arch/csky/include/asm/ftrace.h|  2 --
 arch/loongarch/include/asm/ftrace.h   |  1 -
 arch/powerpc/include/asm/ftrace.h |  2 --
 arch/riscv/include/asm/ftrace.h   |  1 -
 arch/s390/include/asm/ftrace.h|  1 -
 arch/x86/include/asm/ftrace.h |  2 --
 include/linux/ftrace.h|  2 --
 kernel/trace/fgraph.c | 61 ---
 10 files changed, 20 insertions(+), 75 deletions(-)

[PATCH 04/26] loop: always update discard settings in loop_reconfigure_limits

2024-06-10 Thread Christoph Hellwig

Simplify loop_reconfigure_limits by always updating the discard limits.
This adds a little more work to loop_set_block_size, but doesn't change
the outcome as the discard flag won't change.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 93a49c40a31a71..c658282454af1b 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -975,8 +975,7 @@ loop_set_status_from_info(struct loop_device *lo,
return 0;
 }
 
-static int loop_reconfigure_limits(struct loop_device *lo, unsigned short 
bsize,
-   bool update_discard_settings)
+static int loop_reconfigure_limits(struct loop_device *lo, unsigned short 
bsize)
 {
struct queue_limits lim;
 
@@ -984,8 +983,7 @@ static int loop_reconfigure_limits(struct loop_device *lo, 
unsigned short bsize,
lim.logical_block_size = bsize;
lim.physical_block_size = bsize;
lim.io_min = bsize;
-   if (update_discard_settings)
-   loop_config_discard(lo, &lim);
+   loop_config_discard(lo, &lim);
return queue_limits_commit_update(lo->lo_queue, &lim);
 }
 
@@ -1086,7 +1084,7 @@ static int loop_configure(struct loop_device *lo, 
blk_mode_t mode,
else
bsize = 512;
 
-   error = loop_reconfigure_limits(lo, bsize, true);
+   error = loop_reconfigure_limits(lo, bsize);
if (WARN_ON_ONCE(error))
goto out_unlock;
 
@@ -1496,7 +1494,7 @@ static int loop_set_block_size(struct loop_device *lo, 
unsigned long arg)
invalidate_bdev(lo->lo_device);
 
blk_mq_freeze_queue(lo->lo_queue);
-   err = loop_reconfigure_limits(lo, arg, false);
+   err = loop_reconfigure_limits(lo, arg);
loop_update_dio(lo);
blk_mq_unfreeze_queue(lo->lo_queue);
 
-- 
2.43.0

[PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-10 Thread Christoph Hellwig

Move a bit of code that sets up the zone flag and the write granularity
into sd_zbc_read_zones to be with the rest of the zoned limits.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 21 +
 drivers/scsi/sd_zbc.c | 13 -
 2 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 85b45345a27739..5bfed61c70db8f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3308,29 +3308,10 @@ static void sd_read_block_characteristics(struct 
scsi_disk *sdkp,
blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
}
 
-
-#ifdef CONFIG_BLK_DEV_ZONED /* sd_probe rejects ZBD devices early otherwise */
-   if (sdkp->device->type == TYPE_ZBC) {
-   lim->zoned = true;
-
-   /*
-* Per ZBC and ZAC specifications, writes in sequential write
-* required zones of host-managed devices must be aligned to
-* the device physical block size.
-*/
-   lim->zone_write_granularity = sdkp->physical_block_size;
-   } else {
-   /*
-* Host-aware devices are treated as conventional.
-*/
-   lim->zoned = false;
-   }
-#endif /* CONFIG_BLK_DEV_ZONED */
-
if (!sdkp->first_scan)
return;
 
-   if (lim->zoned)
+   if (sdkp->device->type == TYPE_ZBC)
sd_printk(KERN_NOTICE, sdkp, "Host-managed zoned block 
device\n");
else if (sdkp->zoned == 1)
sd_printk(KERN_NOTICE, sdkp, "Host-aware SMR disk used as 
regular disk\n");
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 422eaed8457227..e9501db0450be3 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -598,8 +598,19 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, struct 
queue_limits *lim,
u32 zone_blocks = 0;
int ret;
 
-   if (!sd_is_zoned(sdkp))
+   if (!sd_is_zoned(sdkp)) {
+   lim->zoned = false;
return 0;
+   }
+
+   lim->zoned = true;
+
+   /*
+* Per ZBC and ZAC specifications, writes in sequential write required
+* zones of host-managed devices must be aligned to the device physical
+* block size.
+*/
+   lim->zone_write_granularity = sdkp->physical_block_size;
 
/* READ16/WRITE16/SYNC16 is mandatory for ZBC devices */
sdkp->device->use_16_for_rw = 1;
-- 
2.43.0

[PATCH 01/26] sd: fix sd_is_zoned

2024-06-10 Thread Christoph Hellwig

Since commit 7437bb73f087 ("block: remove support for the host aware zone
model"), only ZBC devices expose a zoned access model.  sd_is_zoned is
used to check for that and thus return false for host aware devices.

Fixes: 7437bb73f087 ("block: remove support for the host aware zone model")
Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.h | 7 ++-
 drivers/scsi/sd_zbc.c | 7 +--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 726f1613f6cb56..65dff3c2108926 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -222,9 +222,14 @@ static inline sector_t sectors_to_logical(struct 
scsi_device *sdev, sector_t sec
 
 void sd_dif_config_host(struct scsi_disk *sdkp, struct queue_limits *lim);
 
+/*
+ * Check if we support a zoned model for this device.
+ *
+ * Note that host aware devices are treated as conventional by Linux.
+ */
 static inline int sd_is_zoned(struct scsi_disk *sdkp)
 {
-   return sdkp->zoned == 1 || sdkp->device->type == TYPE_ZBC;
+   return sdkp->device->type == TYPE_ZBC;
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index f685838d9ed214..422eaed8457227 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -598,13 +598,8 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, struct 
queue_limits *lim,
u32 zone_blocks = 0;
int ret;
 
-   if (!sd_is_zoned(sdkp)) {
-   /*
-* Device managed or normal SCSI disk, no special handling
-* required.
-*/
+   if (!sd_is_zoned(sdkp))
return 0;
-   }
 
/* READ16/WRITE16/SYNC16 is mandatory for ZBC devices */
sdkp->device->use_16_for_rw = 1;
-- 
2.43.0

[PATCH 03/26] loop: stop using loop_reconfigure_limits in __loop_clr_fd

2024-06-10 Thread Christoph Hellwig

__loop_clr_fd wants to clear all settings on the device.  Prepare for
moving more settings into the block limits by open coding
loop_reconfigure_limits.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 93780f41646b75..93a49c40a31a71 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1133,6 +1133,7 @@ static int loop_configure(struct loop_device *lo, 
blk_mode_t mode,
 
 static void __loop_clr_fd(struct loop_device *lo, bool release)
 {
+   struct queue_limits lim;
struct file *filp;
gfp_t gfp = lo->old_gfp_mask;
 
@@ -1156,7 +1157,14 @@ static void __loop_clr_fd(struct loop_device *lo, bool 
release)
lo->lo_offset = 0;
lo->lo_sizelimit = 0;
memset(lo->lo_file_name, 0, LO_NAME_SIZE);
-   loop_reconfigure_limits(lo, 512, false);
+
+   /* reset the block size to the default */
+   lim = queue_limits_start_update(lo->lo_queue);
+   lim.logical_block_size = 512;
+   lim.physical_block_size = 512;
+   lim.io_min = 512;
+   queue_limits_commit_update(lo->lo_queue, &lim);
+
invalidate_disk(lo->lo_disk);
loop_sysfs_exit(lo);
/* let user-space know about this change */
-- 
2.43.0

[PATCH 05/26] loop: regularize upgrading the lock size for direct I/O

2024-06-10 Thread Christoph Hellwig

The LOOP_CONFIGURE path automatically upgrades the block size to that
of the underlying file for O_DIRECT file descriptors, but the
LOOP_SET_BLOCK_SIZE path does not.  Fix this by lifting the code to
pick the block size into common code.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c658282454af1b..4f6d8514d19bd6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -975,10 +975,24 @@ loop_set_status_from_info(struct loop_device *lo,
return 0;
 }
 
+static unsigned short loop_default_blocksize(struct loop_device *lo,
+   struct block_device *backing_bdev)
+{
+   /* In case of direct I/O, match underlying block size */
+   if ((lo->lo_backing_file->f_flags & O_DIRECT) && backing_bdev)
+   return bdev_logical_block_size(backing_bdev);
+   return 512;
+}
+
 static int loop_reconfigure_limits(struct loop_device *lo, unsigned short 
bsize)
 {
+   struct file *file = lo->lo_backing_file;
+   struct inode *inode = file->f_mapping->host;
struct queue_limits lim;
 
+   if (!bsize)
+   bsize = loop_default_blocksize(lo, inode->i_sb->s_bdev);
+
lim = queue_limits_start_update(lo->lo_queue);
lim.logical_block_size = bsize;
lim.physical_block_size = bsize;
@@ -997,7 +1011,6 @@ static int loop_configure(struct loop_device *lo, 
blk_mode_t mode,
int error;
loff_t size;
bool partscan;
-   unsigned short bsize;
bool is_loop;
 
if (!file)
@@ -1076,15 +1089,7 @@ static int loop_configure(struct loop_device *lo, 
blk_mode_t mode,
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
blk_queue_write_cache(lo->lo_queue, true, false);
 
-   if (config->block_size)
-   bsize = config->block_size;
-   else if ((lo->lo_backing_file->f_flags & O_DIRECT) && 
inode->i_sb->s_bdev)
-   /* In case of direct I/O, match underlying block size */
-   bsize = bdev_logical_block_size(inode->i_sb->s_bdev);
-   else
-   bsize = 512;
-
-   error = loop_reconfigure_limits(lo, bsize);
+   error = loop_reconfigure_limits(lo, config->block_size);
if (WARN_ON_ONCE(error))
goto out_unlock;
 
-- 
2.43.0

[PATCH 06/26] loop: also use the default block size from an underlying block device

2024-06-10 Thread Christoph Hellwig

Fix the code in loop_reconfigure_limits to pick a default block size for
O_DIRECT file descriptors to also work when the loop device sits on top
of a block device and not just on a regular file on a block device based
file system.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4f6d8514d19bd6..d7cf6bbbfb1b86 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -988,10 +988,16 @@ static int loop_reconfigure_limits(struct loop_device 
*lo, unsigned short bsize)
 {
struct file *file = lo->lo_backing_file;
struct inode *inode = file->f_mapping->host;
+   struct block_device *backing_bdev = NULL;
struct queue_limits lim;
 
+   if (S_ISBLK(inode->i_mode))
+   backing_bdev = I_BDEV(inode);
+   else if (inode->i_sb->s_bdev)
+   backing_bdev = inode->i_sb->s_bdev;
+
if (!bsize)
-   bsize = loop_default_blocksize(lo, inode->i_sb->s_bdev);
+   bsize = loop_default_blocksize(lo, backing_bdev);
 
lim = queue_limits_start_update(lo->lo_queue);
lim.logical_block_size = bsize;
-- 
2.43.0

[PATCH 07/26] loop: fold loop_update_rotational into loop_reconfigure_limits

2024-06-10 Thread Christoph Hellwig

This prepares for moving the rotational flag into the queue_limits and
also fixes it for the case where the loop device is backed by a block
device.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/loop.c | 23 ---
 1 file changed, 4 insertions(+), 19 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d7cf6bbbfb1b86..2c4a5eb3a6a7f9 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -916,24 +916,6 @@ static void loop_free_idle_workers_timer(struct timer_list 
*timer)
return loop_free_idle_workers(lo, false);
 }
 
-static void loop_update_rotational(struct loop_device *lo)
-{
-   struct file *file = lo->lo_backing_file;
-   struct inode *file_inode = file->f_mapping->host;
-   struct block_device *file_bdev = file_inode->i_sb->s_bdev;
-   struct request_queue *q = lo->lo_queue;
-   bool nonrot = true;
-
-   /* not all filesystems (e.g. tmpfs) have a sb->s_bdev */
-   if (file_bdev)
-   nonrot = bdev_nonrot(file_bdev);
-
-   if (nonrot)
-   blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-   else
-   blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
-}
-
 /**
  * loop_set_status_from_info - configure device from loop_info
  * @lo: struct loop_device to configure
@@ -1003,6 +985,10 @@ static int loop_reconfigure_limits(struct loop_device 
*lo, unsigned short bsize)
lim.logical_block_size = bsize;
lim.physical_block_size = bsize;
lim.io_min = bsize;
+   if (!backing_bdev || bdev_nonrot(backing_bdev))
+   blk_queue_flag_set(QUEUE_FLAG_NONROT, lo->lo_queue);
+   else
+   blk_queue_flag_clear(QUEUE_FLAG_NONROT, lo->lo_queue);
loop_config_discard(lo, &lim);
return queue_limits_commit_update(lo->lo_queue, &lim);
 }
@@ -1099,7 +1085,6 @@ static int loop_configure(struct loop_device *lo, 
blk_mode_t mode,
if (WARN_ON_ONCE(error))
goto out_unlock;
 
-   loop_update_rotational(lo);
loop_update_dio(lo);
loop_sysfs_init(lo);
 
-- 
2.43.0

[PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode

2024-06-10 Thread Christoph Hellwig

virtblk_update_cache_mode boils down to a single call to
blk_queue_write_cache.  Remove it in preparation for moving the cache
control flags into the queue_limits.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/virtio_blk.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2351f411fa4680..378b241911ca87 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -1089,14 +1089,6 @@ static int virtblk_get_cache_mode(struct virtio_device 
*vdev)
return writeback;
 }
 
-static void virtblk_update_cache_mode(struct virtio_device *vdev)
-{
-   u8 writeback = virtblk_get_cache_mode(vdev);
-   struct virtio_blk *vblk = vdev->priv;
-
-   blk_queue_write_cache(vblk->disk->queue, writeback, false);
-}
-
 static const char *const virtblk_cache_types[] = {
"write through", "write back"
 };
@@ -1116,7 +1108,7 @@ cache_type_store(struct device *dev, struct 
device_attribute *attr,
return i;
 
virtio_cwrite8(vdev, offsetof(struct virtio_blk_config, wce), i);
-   virtblk_update_cache_mode(vdev);
+   blk_queue_write_cache(disk->queue, virtblk_get_cache_mode(vdev), false);
return count;
 }
 
@@ -1528,7 +1520,8 @@ static int virtblk_probe(struct virtio_device *vdev)
vblk->index = index;
 
/* configure queue flush support */
-   virtblk_update_cache_mode(vdev);
+   blk_queue_write_cache(vblk->disk->queue, virtblk_get_cache_mode(vdev),
+   false);
 
/* If disk is read-only in the host, the guest should obey */
if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
-- 
2.43.0

[PATCH 09/26] nbd: move setting the cache control flags to __nbd_set_size

2024-06-10 Thread Christoph Hellwig

Move setting the cache control flags in nbd in preparation for moving
these flags into the queue_limits structure.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/nbd.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index ad887d614d5b3f..44b8c671921e5c 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -342,6 +342,12 @@ static int __nbd_set_size(struct nbd_device *nbd, loff_t 
bytesize,
lim.max_hw_discard_sectors = UINT_MAX;
else
lim.max_hw_discard_sectors = 0;
+   if (!(nbd->config->flags & NBD_FLAG_SEND_FLUSH))
+   blk_queue_write_cache(nbd->disk->queue, false, false);
+   else if (nbd->config->flags & NBD_FLAG_SEND_FUA)
+   blk_queue_write_cache(nbd->disk->queue, true, true);
+   else
+   blk_queue_write_cache(nbd->disk->queue, true, false);
lim.logical_block_size = blksize;
lim.physical_block_size = blksize;
error = queue_limits_commit_update(nbd->disk->queue, &lim);
@@ -1286,19 +1292,10 @@ static void nbd_bdev_reset(struct nbd_device *nbd)
 
 static void nbd_parse_flags(struct nbd_device *nbd)
 {
-   struct nbd_config *config = nbd->config;
-   if (config->flags & NBD_FLAG_READ_ONLY)
+   if (nbd->config->flags & NBD_FLAG_READ_ONLY)
set_disk_ro(nbd->disk, true);
else
set_disk_ro(nbd->disk, false);
-   if (config->flags & NBD_FLAG_SEND_FLUSH) {
-   if (config->flags & NBD_FLAG_SEND_FUA)
-   blk_queue_write_cache(nbd->disk->queue, true, true);
-   else
-   blk_queue_write_cache(nbd->disk->queue, true, false);
-   }
-   else
-   blk_queue_write_cache(nbd->disk->queue, false, false);
 }
 
 static void send_disconnects(struct nbd_device *nbd)
-- 
2.43.0

[PATCH 10/26] xen-blkfront: don't disable cache flushes when they fail

2024-06-10 Thread Christoph Hellwig

blkfront always had a robust negotiation protocol for detecting a write
cache.  Stop simply disabling cache flushes when they fail as that is
a grave error.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/xen-blkfront.c | 29 +
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 9b4ec3e4908cce..9794ac2d3299d1 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -982,18 +982,6 @@ static const char *flush_info(struct blkfront_info *info)
return "barrier or flush: disabled;";
 }
 
-static void xlvbd_flush(struct blkfront_info *info)
-{
-   blk_queue_write_cache(info->rq, info->feature_flush ? true : false,
- info->feature_fua ? true : false);
-   pr_info("blkfront: %s: %s %s %s %s %s %s %s\n",
-   info->gd->disk_name, flush_info(info),
-   "persistent grants:", info->feature_persistent ?
-   "enabled;" : "disabled;", "indirect descriptors:",
-   info->max_indirect_segments ? "enabled;" : "disabled;",
-   "bounce buffer:", info->bounce ? "enabled" : "disabled;");
-}
-
 static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
 {
int major;
@@ -1162,7 +1150,15 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
info->sector_size = sector_size;
info->physical_sector_size = physical_sector_size;
 
-   xlvbd_flush(info);
+   blk_queue_write_cache(info->rq, info->feature_flush ? true : false,
+ info->feature_fua ? true : false);
+
+   pr_info("blkfront: %s: %s %s %s %s %s %s %s\n",
+   info->gd->disk_name, flush_info(info),
+   "persistent grants:", info->feature_persistent ?
+   "enabled;" : "disabled;", "indirect descriptors:",
+   info->max_indirect_segments ? "enabled;" : "disabled;",
+   "bounce buffer:", info->bounce ? "enabled" : "disabled;");
 
if (info->vdisk_info & VDISK_READONLY)
set_disk_ro(gd, 1);
@@ -1622,13 +1618,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
   info->gd->disk_name, 
op_name(bret.operation));
blkif_req(req)->error = BLK_STS_NOTSUPP;
}
-   if (unlikely(blkif_req(req)->error)) {
-   if (blkif_req(req)->error == BLK_STS_NOTSUPP)
-   blkif_req(req)->error = BLK_STS_OK;
-   info->feature_fua = 0;
-   info->feature_flush = 0;
-   xlvbd_flush(info);
-   }
fallthrough;
case BLKIF_OP_READ:
case BLKIF_OP_WRITE:
-- 
2.43.0

[PATCH 11/26] block: freeze the queue in queue_attr_store

2024-06-10 Thread Christoph Hellwig

queue_attr_store updates attributes used to control generating I/O, and
can cause malformed bios if changed with I/O in flight.  Freeze the queue
in common code instead of adding it to almost every attribute.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c| 5 +++--
 block/blk-sysfs.c | 9 ++---
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0d4cd39c3d25da..58b0d6c7cc34d6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4631,13 +4631,15 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
int ret;
unsigned long i;
 
+   if (WARN_ON_ONCE(!q->mq_freeze_depth))
+   return -EINVAL;
+
if (!set)
return -EINVAL;
 
if (q->nr_requests == nr)
return 0;
 
-   blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
 
ret = 0;
@@ -4671,7 +4673,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
}
 
blk_mq_unquiesce_queue(q);
-   blk_mq_unfreeze_queue(q);
 
return ret;
 }
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f0f9314ab65c61..5c787965b7d09e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -189,12 +189,9 @@ static ssize_t queue_discard_max_store(struct 
request_queue *q,
if ((max_discard_bytes >> SECTOR_SHIFT) > UINT_MAX)
return -EINVAL;
 
-   blk_mq_freeze_queue(q);
lim = queue_limits_start_update(q);
lim.max_user_discard_sectors = max_discard_bytes >> SECTOR_SHIFT;
err = queue_limits_commit_update(q, &lim);
-   blk_mq_unfreeze_queue(q);
-
if (err)
return err;
return ret;
@@ -241,11 +238,9 @@ queue_max_sectors_store(struct request_queue *q, const 
char *page, size_t count)
if (ret < 0)
return ret;
 
-   blk_mq_freeze_queue(q);
lim = queue_limits_start_update(q);
lim.max_user_sectors = max_sectors_kb << 1;
err = queue_limits_commit_update(q, &lim);
-   blk_mq_unfreeze_queue(q);
if (err)
return err;
return ret;
@@ -585,13 +580,11 @@ static ssize_t queue_wb_lat_store(struct request_queue 
*q, const char *page,
 * ends up either enabling or disabling wbt completely. We can't
 * have IO inflight if that happens.
 */
-   blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
 
wbt_set_min_lat(q, val);
 
blk_mq_unquiesce_queue(q);
-   blk_mq_unfreeze_queue(q);
 
return count;
 }
@@ -722,9 +715,11 @@ queue_attr_store(struct kobject *kobj, struct attribute 
*attr,
if (!entry->store)
return -EIO;
 
+   blk_mq_freeze_queue(q);
mutex_lock(&q->sysfs_lock);
res = entry->store(q, page, length);
mutex_unlock(&q->sysfs_lock);
+   blk_mq_unfreeze_queue(q);
return res;
 }
 
-- 
2.43.0

[PATCH 12/26] block: remove blk_flush_policy

2024-06-10 Thread Christoph Hellwig

Fold blk_flush_policy into the only caller to prepare for pending changes
to it.

Signed-off-by: Christoph Hellwig 
---
 block/blk-flush.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c17cf8ed8113db..2234f8b3fc05f2 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -100,23 +100,6 @@ blk_get_flush_queue(struct request_queue *q, struct 
blk_mq_ctx *ctx)
return blk_mq_map_queue(q, REQ_OP_FLUSH, ctx)->fq;
 }
 
-static unsigned int blk_flush_policy(unsigned long fflags, struct request *rq)
-{
-   unsigned int policy = 0;
-
-   if (blk_rq_sectors(rq))
-   policy |= REQ_FSEQ_DATA;
-
-   if (fflags & (1UL << QUEUE_FLAG_WC)) {
-   if (rq->cmd_flags & REQ_PREFLUSH)
-   policy |= REQ_FSEQ_PREFLUSH;
-   if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
-   (rq->cmd_flags & REQ_FUA))
-   policy |= REQ_FSEQ_POSTFLUSH;
-   }
-   return policy;
-}
-
 static unsigned int blk_flush_cur_seq(struct request *rq)
 {
return 1 << ffz(rq->flush.seq);
@@ -399,12 +382,26 @@ bool blk_insert_flush(struct request *rq)
 {
struct request_queue *q = rq->q;
unsigned long fflags = q->queue_flags;  /* may change, cache */
-   unsigned int policy = blk_flush_policy(fflags, rq);
struct blk_flush_queue *fq = blk_get_flush_queue(q, rq->mq_ctx);
+   unsigned int policy = 0;
 
/* FLUSH/FUA request must never be merged */
WARN_ON_ONCE(rq->bio != rq->biotail);
 
+   if (blk_rq_sectors(rq))
+   policy |= REQ_FSEQ_DATA;
+
+   /*
+* Check which flushes we need to sequence for this operation.
+*/
+   if (fflags & (1UL << QUEUE_FLAG_WC)) {
+   if (rq->cmd_flags & REQ_PREFLUSH)
+   policy |= REQ_FSEQ_PREFLUSH;
+   if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
+   (rq->cmd_flags & REQ_FUA))
+   policy |= REQ_FSEQ_POSTFLUSH;
+   }
+
/*
 * @policy now records what operations need to be done.  Adjust
 * REQ_PREFLUSH and FUA for the driver.
-- 
2.43.0

[PATCH 13/26] block: move cache control settings out of queue->flags

2024-06-10 Thread Christoph Hellwig

Move the cache control settings into the queue_limits so that they
can be set atomically and all I/O is frozen when changing the
flags.

Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer.  Note that we'll
eventually remove enough field from queue_limits to bring it back to the
previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplified the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0.  The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.

Signed-off-by: Christoph Hellwig 
---
 .../block/writeback_cache_control.rst | 67 +++
 arch/um/drivers/ubd_kern.c|  2 +-
 block/blk-core.c  |  2 +-
 block/blk-flush.c |  9 ++-
 block/blk-mq-debugfs.c|  2 -
 block/blk-settings.c  | 29 ++--
 block/blk-sysfs.c | 29 +---
 block/blk-wbt.c   |  4 +-
 drivers/block/drbd/drbd_main.c|  2 +-
 drivers/block/loop.c  |  9 +--
 drivers/block/nbd.c   | 14 ++--
 drivers/block/null_blk/main.c | 12 ++--
 drivers/block/ps3disk.c   |  7 +-
 drivers/block/rnbd/rnbd-clt.c | 10 +--
 drivers/block/ublk_drv.c  |  8 ++-
 drivers/block/virtio_blk.c| 20 --
 drivers/block/xen-blkfront.c  |  9 ++-
 drivers/md/bcache/super.c |  7 +-
 drivers/md/dm-table.c | 39 +++
 drivers/md/md.c   |  8 ++-
 drivers/mmc/core/block.c  | 42 ++--
 drivers/mmc/core/queue.c  | 12 ++--
 drivers/mmc/core/queue.h  |  3 +-
 drivers/mtd/mtd_blkdevs.c |  5 +-
 drivers/nvdimm/pmem.c |  4 +-
 drivers/nvme/host/core.c  |  7 +-
 drivers/nvme/host/multipath.c |  6 --
 drivers/scsi/sd.c | 28 +---
 include/linux/blkdev.h| 38 +--
 29 files changed, 227 insertions(+), 207 deletions(-)

diff --git a/Documentation/block/writeback_cache_control.rst 
b/Documentation/block/writeback_cache_control.rst
index b208488d0aae85..9cfe27f90253c7 100644
--- a/Documentation/block/writeback_cache_control.rst
+++ b/Documentation/block/writeback_cache_control.rst
@@ -46,41 +46,50 @@ worry if the underlying devices need any explicit cache 
flushing and how
 the Forced Unit Access is implemented.  The REQ_PREFLUSH and REQ_FUA flags
 may both be set on a single bio.
 
+Feature settings for block drivers
+--
 
-Implementation details for bio based block drivers
---
+For devices that do not support volatile write caches there is no driver
+support required, the block layer completes empty REQ_PREFLUSH requests before
+entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
+requests that have a payload.
 
-These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit
-directly below the submit_bio interface.  For remapping drivers the REQ_FUA
-bits need to be propagated to underlying devices, and a global flush needs
-to be implemented for bios with the REQ_PREFLUSH bit set.  For real device
-drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits
-on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without
-data can be completed successfully without doing any work.  Drivers for
-devices with volatile caches need to implement the support for these
-flags themselves without any help from the block layer.
+For devices with volatile write caches the driver needs to tell the block layer
+that it supports flushing caches by setting the
 
+   BLK_FEAT_WRITE_CACHE
 
-Implementation details for request_fn based block drivers
--
+flag in the queue_limits feature field.  For devices that also support the FUA
+bit the block layer needs to be told to pass on the REQ_FUA bit by also setting
+the
 
-For devices that do not support volatile write caches there is no driver
-support required, the block layer completes empty

[PATCH 14/26] block: move the nonrot flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the norot flag into the queue_limits feature field so that it can be
set atomically and all I/O is frozen when changing the flag.

Use the chance to switch to defaulting to non-rotational and require
the driver to opt into rotational, which matches the polarity of the
sysfs interface.

For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
rotational flag is not set as they clearly are not rotational despite
this being a behavior change.  There are some other drivers that
unconditionally set the rotational flag to keep the existing behavior
as they arguably can be used on rotational devices even if that is
probably not their main use today (e.g. virtio_blk and drbd).

The flag is automatically inherited in blk_stack_limits matching the
existing behavior in dm and md.

Signed-off-by: Christoph Hellwig 
---
 arch/m68k/emu/nfblock.c |  1 +
 arch/um/drivers/ubd_kern.c  |  1 -
 arch/xtensa/platforms/iss/simdisk.c |  5 +++-
 block/blk-mq-debugfs.c  |  1 -
 block/blk-sysfs.c   | 39 ++---
 drivers/block/amiflop.c |  5 +++-
 drivers/block/aoe/aoeblk.c  |  1 +
 drivers/block/ataflop.c |  5 +++-
 drivers/block/brd.c |  2 --
 drivers/block/drbd/drbd_main.c  |  3 ++-
 drivers/block/floppy.c  |  3 ++-
 drivers/block/loop.c|  8 +++---
 drivers/block/mtip32xx/mtip32xx.c   |  1 -
 drivers/block/n64cart.c |  2 --
 drivers/block/nbd.c |  5 
 drivers/block/null_blk/main.c   |  1 -
 drivers/block/pktcdvd.c |  1 +
 drivers/block/ps3disk.c |  3 ++-
 drivers/block/rbd.c |  3 ---
 drivers/block/rnbd/rnbd-clt.c   |  4 ---
 drivers/block/sunvdc.c  |  1 +
 drivers/block/swim.c|  5 +++-
 drivers/block/swim3.c   |  5 +++-
 drivers/block/ublk_drv.c|  9 +++
 drivers/block/virtio_blk.c  |  4 ++-
 drivers/block/xen-blkfront.c|  1 -
 drivers/block/zram/zram_drv.c   |  2 --
 drivers/cdrom/gdrom.c   |  1 +
 drivers/md/bcache/super.c   |  2 --
 drivers/md/dm-table.c   | 12 -
 drivers/md/md.c | 13 --
 drivers/mmc/core/queue.c|  1 -
 drivers/mtd/mtd_blkdevs.c   |  1 -
 drivers/nvdimm/btt.c|  1 -
 drivers/nvdimm/pmem.c   |  1 -
 drivers/nvme/host/core.c|  1 -
 drivers/nvme/host/multipath.c   |  1 -
 drivers/s390/block/dasd_genhd.c |  1 -
 drivers/s390/block/scm_blk.c|  1 -
 drivers/scsi/sd.c   |  4 +--
 include/linux/blkdev.h  | 10 
 41 files changed, 83 insertions(+), 88 deletions(-)

diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c
index 642fb80c5c4e31..8eea7ef9115146 100644
--- a/arch/m68k/emu/nfblock.c
+++ b/arch/m68k/emu/nfblock.c
@@ -98,6 +98,7 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 bsize)
 {
struct queue_limits lim = {
.logical_block_size = bsize,
+   .features   = BLK_FEAT_ROTATIONAL,
};
struct nfhd_device *dev;
int dev_id = id - NFHD_DEV_OFFSET;
diff --git a/arch/um/drivers/ubd_kern.c b/arch/um/drivers/ubd_kern.c
index 19e01691ea0ea7..9f1e76ddda5a26 100644
--- a/arch/um/drivers/ubd_kern.c
+++ b/arch/um/drivers/ubd_kern.c
@@ -882,7 +882,6 @@ static int ubd_add(int n, char **error_out)
goto out_cleanup_tags;
}
 
-   blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
disk->major = UBD_MAJOR;
disk->first_minor = n << UBD_SHIFT;
disk->minors = 1 << UBD_SHIFT;
diff --git a/arch/xtensa/platforms/iss/simdisk.c 
b/arch/xtensa/platforms/iss/simdisk.c
index defc67909a9c74..d6d2b533a5744d 100644
--- a/arch/xtensa/platforms/iss/simdisk.c
+++ b/arch/xtensa/platforms/iss/simdisk.c
@@ -263,6 +263,9 @@ static const struct proc_ops simdisk_proc_ops = {
 static int __init simdisk_setup(struct simdisk *dev, int which,
struct proc_dir_entry *procdir)
 {
+   struct queue_limits lim = {
+   .features   = BLK_FEAT_ROTATIONAL,
+   };
char tmp[2] = { '0' + which, 0 };
int err;
 
@@ -271,7 +274,7 @@ static int __init simdisk_setup(struct simdisk *dev, int 
which,
spin_lock_init(&dev->lock);
dev->users = 0;
 
-   dev->gd = blk_alloc_disk(NULL, NUMA_NO_NODE);
+   dev->gd = blk_alloc_disk(&lim, NUMA_NO_NODE);
if (IS_ERR(dev->gd)) {
err = PTR_ERR(dev->gd);
goto out;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e8b9db7c30c455..4d0e62ec88f033 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -84,7 +84,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(NOMERGES),
QUEUE_FLAG_NAME(SAME_COMP),
QUEUE_FLAG_NAME(F

[PATCH 15/26] block: move the add_random flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the add_random flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Note that this also removes code from dm to clear the flag based on
the underlying devices, which can't be reached as dm devices will
always start out without the flag set.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c|  1 -
 block/blk-sysfs.c |  6 +++---
 drivers/block/mtip32xx/mtip32xx.c |  1 -
 drivers/md/dm-table.c | 18 --
 drivers/mmc/core/queue.c  |  2 --
 drivers/mtd/mtd_blkdevs.c |  3 ---
 drivers/s390/block/scm_blk.c  |  4 
 drivers/scsi/scsi_lib.c   |  3 +--
 drivers/scsi/sd.c | 11 +++
 include/linux/blkdev.h|  5 +++--
 10 files changed, 10 insertions(+), 44 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4d0e62ec88f033..6b7edb50bfd3fa 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -86,7 +86,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(FAIL_IO),
QUEUE_FLAG_NAME(IO_STAT),
QUEUE_FLAG_NAME(NOXMERGES),
-   QUEUE_FLAG_NAME(ADD_RANDOM),
QUEUE_FLAG_NAME(SYNCHRONOUS),
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(INIT_DONE),
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 637ed3bbbfb46f..9174aca3b85526 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -323,7 +323,7 @@ queue_##name##_store(struct request_queue *q, const char 
*page, size_t count) \
 }
 
 QUEUE_SYSFS_FEATURE(rotational, BLK_FEAT_ROTATIONAL)
-QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
+QUEUE_SYSFS_FEATURE(add_random, BLK_FEAT_ADD_RANDOM)
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
 #undef QUEUE_SYSFS_BIT_FNS
@@ -561,7 +561,7 @@ static struct queue_sysfs_entry queue_hw_sector_size_entry 
= {
 
 QUEUE_RW_ENTRY(queue_rotational, "rotational");
 QUEUE_RW_ENTRY(queue_iostats, "iostats");
-QUEUE_RW_ENTRY(queue_random, "add_random");
+QUEUE_RW_ENTRY(queue_add_random, "add_random");
 QUEUE_RW_ENTRY(queue_stable_writes, "stable_writes");
 
 #ifdef CONFIG_BLK_WBT
@@ -665,7 +665,7 @@ static struct attribute *queue_attrs[] = {
&queue_nomerges_entry.attr,
&queue_iostats_entry.attr,
&queue_stable_writes_entry.attr,
-   &queue_random_entry.attr,
+   &queue_add_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
&queue_fua_entry.attr,
diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 1dbbf72659d549..c6ef0546ffc9d2 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3485,7 +3485,6 @@ static int mtip_block_initialize(struct driver_data *dd)
goto start_service_thread;
 
/* Set device limits. */
-   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, dd->queue);
dma_set_max_seg_size(&dd->pdev->dev, 0x40);
 
/* Set the capacity of the device in 512 byte sectors. */
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3514a57c2df5d2..7654babc2775c1 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1716,14 +1716,6 @@ static int device_dax_write_cache_enabled(struct 
dm_target *ti,
return false;
 }
 
-static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
-sector_t start, sector_t len, void *data)
-{
-   struct request_queue *q = bdev_get_queue(dev->bdev);
-
-   return !blk_queue_add_random(q);
-}
-
 static int device_not_write_zeroes_capable(struct dm_target *ti, struct dm_dev 
*dev,
   sector_t start, sector_t len, void 
*data)
 {
@@ -1876,16 +1868,6 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
else
blk_queue_flag_clear(QUEUE_FLAG_STABLE_WRITES, q);
 
-   /*
-* Determine whether or not this queue's I/O timings contribute
-* to the entropy pool, Only request-based targets use this.
-* Clear QUEUE_FLAG_ADD_RANDOM if any underlying device does not
-* have it set.
-*/
-   if (blk_queue_add_random(q) &&
-   dm_table_any_dev_attr(t, device_is_not_random, NULL))
-   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
-
/*
 * For a zoned target, setup the zones related queue attributes
 * and resources necessary for zone append emulation if necessary.
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index b4f62fa845864c..da00904d4a3c7e 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -387,8 +387,6 @@ static struct gendisk *mmc_alloc_disk(struct mmc_queue *mq,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, mq->queue);
blk_queue_rq_timeout(mq->queue, 60 * HZ);
 
-   blk_queu

[PATCH 16/26] block: move the io_stat flag setting to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the io_stat flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Simplify md and dm to set the flag unconditionally instead of avoiding
setting a simple flag for cases where it already is set by other means,
which is a bit pointless.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c|  1 -
 block/blk-mq.c|  6 +-
 block/blk-sysfs.c |  2 +-
 drivers/md/dm-table.c | 12 +---
 drivers/md/dm.c   | 13 +++--
 drivers/md/md.c   |  5 ++---
 drivers/nvme/host/multipath.c |  2 +-
 include/linux/blkdev.h|  9 +
 8 files changed, 26 insertions(+), 24 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6b7edb50bfd3fa..cbe99444ed1a54 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -84,7 +84,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(NOMERGES),
QUEUE_FLAG_NAME(SAME_COMP),
QUEUE_FLAG_NAME(FAIL_IO),
-   QUEUE_FLAG_NAME(IO_STAT),
QUEUE_FLAG_NAME(NOXMERGES),
QUEUE_FLAG_NAME(SYNCHRONOUS),
QUEUE_FLAG_NAME(SAME_FORCE),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 58b0d6c7cc34d6..cf67dc13f7dd4c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4116,7 +4116,11 @@ struct request_queue *blk_mq_alloc_queue(struct 
blk_mq_tag_set *set,
struct request_queue *q;
int ret;
 
-   q = blk_alloc_queue(lim ? lim : &default_lim, set->numa_node);
+   if (!lim)
+   lim = &default_lim;
+   lim->features |= BLK_FEAT_IO_STAT;
+
+   q = blk_alloc_queue(lim, set->numa_node);
if (IS_ERR(q))
return q;
q->queuedata = queuedata;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9174aca3b85526..6f58530fb3c08e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -324,7 +324,7 @@ queue_##name##_store(struct request_queue *q, const char 
*page, size_t count) \
 
 QUEUE_SYSFS_FEATURE(rotational, BLK_FEAT_ROTATIONAL)
 QUEUE_SYSFS_FEATURE(add_random, BLK_FEAT_ADD_RANDOM)
-QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
+QUEUE_SYSFS_FEATURE(iostats, BLK_FEAT_IO_STAT)
 QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 7654babc2775c1..3e3b713502f61e 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -579,6 +579,12 @@ int dm_split_args(int *argc, char ***argvp, char *input)
return 0;
 }
 
+static void dm_set_stacking_limits(struct queue_limits *limits)
+{
+   blk_set_stacking_limits(limits);
+   limits->features |= BLK_FEAT_IO_STAT;
+}
+
 /*
  * Impose necessary and sufficient conditions on a devices's table such
  * that any incoming bio which respects its logical_block_size can be
@@ -617,7 +623,7 @@ static int validate_hardware_logical_block_alignment(struct 
dm_table *t,
for (i = 0; i < t->num_targets; i++) {
ti = dm_table_get_target(t, i);
 
-   blk_set_stacking_limits(&ti_limits);
+   dm_set_stacking_limits(&ti_limits);
 
/* combine all target devices' limits */
if (ti->type->iterate_devices)
@@ -1591,7 +1597,7 @@ int dm_calculate_queue_limits(struct dm_table *t,
unsigned int zone_sectors = 0;
bool zoned = false;
 
-   blk_set_stacking_limits(limits);
+   dm_set_stacking_limits(limits);
 
t->integrity_supported = true;
for (unsigned int i = 0; i < t->num_targets; i++) {
@@ -1604,7 +1610,7 @@ int dm_calculate_queue_limits(struct dm_table *t,
for (unsigned int i = 0; i < t->num_targets; i++) {
struct dm_target *ti = dm_table_get_target(t, i);
 
-   blk_set_stacking_limits(&ti_limits);
+   dm_set_stacking_limits(&ti_limits);
 
if (!ti->type->iterate_devices) {
/* Set I/O hints portion of queue limits */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 13037d6a6f62a2..8a976cee448bed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2386,22 +2386,15 @@ int dm_setup_md_queue(struct mapped_device *md, struct 
dm_table *t)
struct table_device *td;
int r;
 
-   switch (type) {
-   case DM_TYPE_REQUEST_BASED:
+   WARN_ON_ONCE(type == DM_TYPE_NONE);
+
+   if (type == DM_TYPE_REQUEST_BASED) {
md->disk->fops = &dm_rq_blk_dops;
r = dm_mq_init_request_queue(md, t);
if (r) {
DMERR("Cannot initialize queue for request-based dm 
mapped device");
return r;
}
-   break;
-   case DM_TYPE_BIO_BASED:
-   case DM_TYPE_DAX_BIO_BASED:
-   blk_queue_flag_set(QUEUE_FLAG_IO_STAT, md->queue);
-   break;
-   case DM_TYPE_NONE:
-   WARN_ON_ONCE(true)

[PATCH 17/26] block: move the stable_write flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the io_stat flag into the queue_limits feature field so that it can
be set atomically and all I/O is frozen when changing the flag.

The flag is now inherited by blk_stack_limits, which greatly simplifies
the code in dm, and fixed md which previously did not pass on the flag
set on lower devices.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c |  1 -
 block/blk-sysfs.c  | 29 +
 drivers/block/drbd/drbd_main.c |  5 ++---
 drivers/block/rbd.c|  9 +++--
 drivers/block/zram/zram_drv.c  |  2 +-
 drivers/md/dm-table.c  | 19 ---
 drivers/md/raid5.c |  6 --
 drivers/mmc/core/queue.c   |  5 +++--
 drivers/nvme/host/core.c   |  9 +
 drivers/nvme/host/multipath.c  |  4 
 drivers/scsi/iscsi_tcp.c   |  8 
 include/linux/blkdev.h |  9 ++---
 12 files changed, 29 insertions(+), 77 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cbe99444ed1a54..eb73f1d348e5a9 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -88,7 +88,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SYNCHRONOUS),
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(STABLE_WRITES),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(DAX),
QUEUE_FLAG_NAME(STATS),
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 6f58530fb3c08e..cde525724831ef 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -296,37 +296,10 @@ static ssize_t queue_##_name##_store(struct request_queue 
*q,  \
return queue_feature_store(q, page, count, _feature);\
 }
 
-#define QUEUE_SYSFS_BIT_FNS(name, flag, neg)   \
-static ssize_t \
-queue_##name##_show(struct request_queue *q, char *page)   \
-{  \
-   int bit;\
-   bit = test_bit(QUEUE_FLAG_##flag, &q->queue_flags); \
-   return queue_var_show(neg ? !bit : bit, page);  \
-}  \
-static ssize_t \
-queue_##name##_store(struct request_queue *q, const char *page, size_t count) \
-{  \
-   unsigned long val;  \
-   ssize_t ret;\
-   ret = queue_var_store(&val, page, count);   \
-   if (ret < 0)\
-return ret;\
-   if (neg)\
-   val = !val; \
-   \
-   if (val)\
-   blk_queue_flag_set(QUEUE_FLAG_##flag, q);   \
-   else\
-   blk_queue_flag_clear(QUEUE_FLAG_##flag, q); \
-   return ret; \
-}
-
 QUEUE_SYSFS_FEATURE(rotational, BLK_FEAT_ROTATIONAL)
 QUEUE_SYSFS_FEATURE(add_random, BLK_FEAT_ADD_RANDOM)
 QUEUE_SYSFS_FEATURE(iostats, BLK_FEAT_IO_STAT)
-QUEUE_SYSFS_BIT_FNS(stable_writes, STABLE_WRITES, 0);
-#undef QUEUE_SYSFS_BIT_FNS
+QUEUE_SYSFS_FEATURE(stable_writes, BLK_FEAT_STABLE_WRITES);
 
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
 {
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 2ef29a47807550..f92673f05c7abc 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2698,7 +2698,8 @@ enum drbd_ret_code drbd_create_device(struct 
drbd_config_context *adm_ctx, unsig
 */
.max_hw_sectors = DRBD_MAX_BIO_SIZE_SAFE >> 8,
.features   = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
- BLK_FEAT_ROTATIONAL,
+ BLK_FEAT_ROTATIONAL |
+ BLK_FEAT_STABLE_WRITES,
};
 
device = minor_to_device(minor);
@@ -2737,8 +2738,6 @@ enum drbd_ret_code drbd_create_device(struct 
drbd_config_context *adm_ctx, unsig
sprintf(disk->disk_name, "drbd%d", minor);
disk->private_data = device;
 
-   blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, disk->queue);
-
device->md_io.page = alloc_page(GFP_KERNEL);
if (!device->md_io.page)

[PATCH 18/26] block: move the synchronous flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the synchronous flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c| 1 -
 drivers/block/brd.c   | 2 +-
 drivers/block/zram/zram_drv.c | 4 ++--
 drivers/nvdimm/btt.c  | 3 +--
 drivers/nvdimm/pmem.c | 4 ++--
 include/linux/blkdev.h| 7 ---
 6 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index eb73f1d348e5a9..957774e40b1d0c 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -85,7 +85,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_COMP),
QUEUE_FLAG_NAME(FAIL_IO),
QUEUE_FLAG_NAME(NOXMERGES),
-   QUEUE_FLAG_NAME(SYNCHRONOUS),
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(INIT_DONE),
QUEUE_FLAG_NAME(POLL),
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index b25dc463b5e3a6..d77deb571dbd06 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -335,6 +335,7 @@ static int brd_alloc(int i)
.max_hw_discard_sectors = UINT_MAX,
.max_discard_segments   = 1,
.discard_granularity= PAGE_SIZE,
+   .features   = BLK_FEAT_SYNCHRONOUS,
};
 
list_for_each_entry(brd, &brd_devices, brd_list)
@@ -366,7 +367,6 @@ static int brd_alloc(int i)
strscpy(disk->disk_name, buf, DISK_NAME_LEN);
set_capacity(disk, rd_size * 2);

-   blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, disk->queue);
blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
err = add_disk(disk);
if (err)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f8f1b5b54795ac..efcb8d9d274c31 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -2208,7 +2208,8 @@ static int zram_add(void)
 #if ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE
.max_write_zeroes_sectors   = UINT_MAX,
 #endif
-   .features   = BLK_FEAT_STABLE_WRITES,
+   .features   = BLK_FEAT_STABLE_WRITES |
+ BLK_FEAT_SYNCHRONOUS,
};
struct zram *zram;
int ret, device_id;
@@ -2246,7 +2247,6 @@ static int zram_add(void)
 
/* Actual capacity set using sysfs (/sys/block/zram/disksize */
set_capacity(zram->disk, 0);
-   blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, zram->disk->queue);
ret = device_add_disk(NULL, zram->disk, zram_disk_groups);
if (ret)
goto out_cleanup_disk;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index e474afa8e9f68d..e79c06d65bb77b 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1501,6 +1501,7 @@ static int btt_blk_init(struct btt *btt)
.logical_block_size = btt->sector_size,
.max_hw_sectors = UINT_MAX,
.max_integrity_segments = 1,
+   .features   = BLK_FEAT_SYNCHRONOUS,
};
int rc;
 
@@ -1518,8 +1519,6 @@ static int btt_blk_init(struct btt *btt)
btt->btt_disk->fops = &btt_fops;
btt->btt_disk->private_data = btt;
 
-   blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, btt->btt_disk->queue);
-
set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
rc = device_add_disk(&btt->nd_btt->dev, btt->btt_disk, NULL);
if (rc)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 501cf226df0187..b821dcf018f6ae 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -455,7 +455,8 @@ static int pmem_attach_disk(struct device *dev,
.logical_block_size = pmem_sector_size(ndns),
.physical_block_size= PAGE_SIZE,
.max_hw_sectors = UINT_MAX,
-   .features   = BLK_FEAT_WRITE_CACHE,
+   .features   = BLK_FEAT_WRITE_CACHE |
+ BLK_FEAT_SYNCHRONOUS,
};
int nid = dev_to_node(dev), fua;
struct resource *res = &nsio->res;
@@ -546,7 +547,6 @@ static int pmem_attach_disk(struct device *dev,
}
pmem->virt_addr = addr;
 
-   blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
if (pmem->pfn_flags & PFN_MAP)
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db14c61791e022..4d908e29c760da 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -301,6 +301,9 @@ enum {
 
/* don't modify data until writeback is done */
BLK_FEAT_STABLE_WRITES  = (1u << 5),
+
+   /* always completes in submit context */
+   BLK_FEAT_SYNCHRONOUS= (1u << 6),
 };
 
 /*
@@ -566,7 +569,6 @@ st

[PATCH 19/26] block: move the nowait flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the nowait flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c|  1 -
 block/blk-mq.c|  2 +-
 block/blk-settings.c  |  9 +
 drivers/block/brd.c   |  4 ++--
 drivers/md/dm-table.c | 16 ++--
 drivers/md/md.c   | 18 +-
 drivers/nvme/host/multipath.c |  3 +--
 include/linux/blkdev.h|  9 +
 8 files changed, 21 insertions(+), 41 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 957774e40b1d0c..62b132e9a9ce3b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -96,7 +96,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(ZONE_RESETALL),
QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
QUEUE_FLAG_NAME(HCTX_ACTIVE),
-   QUEUE_FLAG_NAME(NOWAIT),
QUEUE_FLAG_NAME(SQ_SCHED),
QUEUE_FLAG_NAME(SKIP_TAGSET_QUIESCE),
 };
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cf67dc13f7dd4c..43235acc87505f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4118,7 +4118,7 @@ struct request_queue *blk_mq_alloc_queue(struct 
blk_mq_tag_set *set,
 
if (!lim)
lim = &default_lim;
-   lim->features |= BLK_FEAT_IO_STAT;
+   lim->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
 
q = blk_alloc_queue(lim, set->numa_node);
if (IS_ERR(q))
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 536ee202fcdccb..bf4622c19b5c09 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -459,6 +459,15 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
 
t->features |= (b->features & BLK_FEAT_INHERIT_MASK);
 
+   /*
+* BLK_FEAT_NOWAIT needs to be supported both by the stacking driver
+* and all underlying devices.  The stacking driver sets the flag
+* before stacking the limits, and this will clear the flag if any
+* of the underlying devices does not support it.
+*/
+   if (!(b->features & BLK_FEAT_NOWAIT))
+   t->features &= ~BLK_FEAT_NOWAIT;
+
t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
t->max_user_sectors = min_not_zero(t->max_user_sectors,
b->max_user_sectors);
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index d77deb571dbd06..a300645cd9d4a5 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -335,7 +335,8 @@ static int brd_alloc(int i)
.max_hw_discard_sectors = UINT_MAX,
.max_discard_segments   = 1,
.discard_granularity= PAGE_SIZE,
-   .features   = BLK_FEAT_SYNCHRONOUS,
+   .features   = BLK_FEAT_SYNCHRONOUS |
+ BLK_FEAT_NOWAIT,
};
 
list_for_each_entry(brd, &brd_devices, brd_list)
@@ -367,7 +368,6 @@ static int brd_alloc(int i)
strscpy(disk->disk_name, buf, DISK_NAME_LEN);
set_capacity(disk, rd_size * 2);

-   blk_queue_flag_set(QUEUE_FLAG_NOWAIT, disk->queue);
err = add_disk(disk);
if (err)
goto out_cleanup_disk;
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index f4e1b50ffdcda5..eee43d27733f9a 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -582,7 +582,7 @@ int dm_split_args(int *argc, char ***argvp, char *input)
 static void dm_set_stacking_limits(struct queue_limits *limits)
 {
blk_set_stacking_limits(limits);
-   limits->features |= BLK_FEAT_IO_STAT;
+   limits->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
 }
 
 /*
@@ -1746,12 +1746,6 @@ static bool dm_table_supports_write_zeroes(struct 
dm_table *t)
return true;
 }
 
-static int device_not_nowait_capable(struct dm_target *ti, struct dm_dev *dev,
-sector_t start, sector_t len, void *data)
-{
-   return !bdev_nowait(dev->bdev);
-}
-
 static bool dm_table_supports_nowait(struct dm_table *t)
 {
for (unsigned int i = 0; i < t->num_targets; i++) {
@@ -1759,10 +1753,6 @@ static bool dm_table_supports_nowait(struct dm_table *t)
 
if (!dm_target_supports_nowait(ti->type))
return false;
-
-   if (!ti->type->iterate_devices ||
-   ti->type->iterate_devices(ti, device_not_nowait_capable, 
NULL))
-   return false;
}
 
return true;
@@ -1825,9 +1815,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct 
request_queue *q,
int r;
 
if (dm_table_supports_nowait(t))
-   blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
-

[PATCH 21/26] block: move the poll flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the poll flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Stacking drivers are simplified in that they now can simply set the
flag, and blk_stack_limits will clear it when the features is not
supported by any of the underlying devices.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c  |  5 ++--
 block/blk-mq-debugfs.c|  1 -
 block/blk-mq.c| 31 +++-
 block/blk-settings.c  | 10 ---
 block/blk-sysfs.c |  4 +--
 drivers/md/dm-table.c | 54 +--
 drivers/nvme/host/multipath.c | 12 +---
 include/linux/blkdev.h|  4 ++-
 8 files changed, 45 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2b45a4df9a1aa1..8d9fbd353fc7fc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -791,7 +791,7 @@ void submit_bio_noacct(struct bio *bio)
}
}
 
-   if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+   if (!(q->limits.features & BLK_FEAT_POLL))
bio_clear_polled(bio);
 
switch (bio_op(bio)) {
@@ -915,8 +915,7 @@ int bio_poll(struct bio *bio, struct io_comp_batch *iob, 
unsigned int flags)
return 0;
 
q = bdev_get_queue(bdev);
-   if (cookie == BLK_QC_T_NONE ||
-   !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+   if (cookie == BLK_QC_T_NONE || !(q->limits.features & BLK_FEAT_POLL))
return 0;
 
blk_flush_plug(current->plug, false);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f4fa820251ce83..3a21527913840d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -87,7 +87,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(NOXMERGES),
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(STATS),
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(QUIESCED),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43235acc87505f..e2b9710ddc5ad1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4109,6 +4109,12 @@ void blk_mq_release(struct request_queue *q)
blk_mq_sysfs_deinit(q);
 }
 
+static bool blk_mq_can_poll(struct blk_mq_tag_set *set)
+{
+   return set->nr_maps > HCTX_TYPE_POLL &&
+   set->map[HCTX_TYPE_POLL].nr_queues;
+}
+
 struct request_queue *blk_mq_alloc_queue(struct blk_mq_tag_set *set,
struct queue_limits *lim, void *queuedata)
 {
@@ -4119,6 +4125,8 @@ struct request_queue *blk_mq_alloc_queue(struct 
blk_mq_tag_set *set,
if (!lim)
lim = &default_lim;
lim->features |= BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
+   if (blk_mq_can_poll(set))
+   lim->features |= BLK_FEAT_POLL;
 
q = blk_alloc_queue(lim, set->numa_node);
if (IS_ERR(q))
@@ -4273,17 +4281,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set 
*set,
mutex_unlock(&q->sysfs_lock);
 }
 
-static void blk_mq_update_poll_flag(struct request_queue *q)
-{
-   struct blk_mq_tag_set *set = q->tag_set;
-
-   if (set->nr_maps > HCTX_TYPE_POLL &&
-   set->map[HCTX_TYPE_POLL].nr_queues)
-   blk_queue_flag_set(QUEUE_FLAG_POLL, q);
-   else
-   blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
-}
-
 int blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
struct request_queue *q)
 {
@@ -4311,7 +4308,6 @@ int blk_mq_init_allocated_queue(struct blk_mq_tag_set 
*set,
q->tag_set = set;
 
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
-   blk_mq_update_poll_flag(q);
 
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
INIT_LIST_HEAD(&q->flush_list);
@@ -4798,8 +4794,10 @@ static void __blk_mq_update_nr_hw_queues(struct 
blk_mq_tag_set *set,
 fallback:
blk_mq_update_queue_map(set);
list_for_each_entry(q, &set->tag_list, tag_set_list) {
+   struct queue_limits lim;
+
blk_mq_realloc_hw_ctxs(set, q);
-   blk_mq_update_poll_flag(q);
+
if (q->nr_hw_queues != set->nr_hw_queues) {
int i = prev_nr_hw_queues;
 
@@ -4811,6 +4809,13 @@ static void __blk_mq_update_nr_hw_queues(struct 
blk_mq_tag_set *set,
set->nr_hw_queues = prev_nr_hw_queues;
goto fallback;
}
+   lim = queue_limits_start_update(q);
+   if (blk_mq_can_poll(set))
+   lim.features |= BLK_FEAT_POLL;
+   else
+   lim.features &= ~BLK_FEAT_POLL;
+   if (queue_limits_commit_update(q, &lim) < 0)
+   pr_warn("updating the poll flag failed\n");
blk_mq_map_swqueue(q);
}
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index bf4622c19b

[PATCH 20/26] block: move the dax flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the dax flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/md/dm-table.c| 4 ++--
 drivers/nvdimm/pmem.c| 7 ++-
 drivers/s390/block/dcssblk.c | 2 +-
 include/linux/blkdev.h   | 6 --
 5 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 62b132e9a9ce3b..f4fa820251ce83 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -88,7 +88,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(INIT_DONE),
QUEUE_FLAG_NAME(POLL),
-   QUEUE_FLAG_NAME(DAX),
QUEUE_FLAG_NAME(STATS),
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(QUIESCED),
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index eee43d27733f9a..d3a960aee03c6a 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1834,11 +1834,11 @@ int dm_table_set_restrictions(struct dm_table *t, 
struct request_queue *q,
limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
 
if (dm_table_supports_dax(t, device_not_dax_capable)) {
-   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+   limits->features |= BLK_FEAT_DAX;
if (dm_table_supports_dax(t, 
device_not_dax_synchronous_capable))
set_dax_synchronous(t->md->dax_dev);
} else
-   blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
+   limits->features &= ~BLK_FEAT_DAX;
 
if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
dax_write_cache(t->md->dax_dev, true);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b821dcf018f6ae..1dd74c969d5a09 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -465,7 +465,6 @@ static int pmem_attach_disk(struct device *dev,
struct dax_device *dax_dev;
struct nd_pfn_sb *pfn_sb;
struct pmem_device *pmem;
-   struct request_queue *q;
struct gendisk *disk;
void *addr;
int rc;
@@ -499,6 +498,8 @@ static int pmem_attach_disk(struct device *dev,
}
if (fua)
lim.features |= BLK_FEAT_FUA;
+   if (is_nd_pfn(dev))
+   lim.features |= BLK_FEAT_DAX;
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
dev_name(&ndns->dev))) {
@@ -509,7 +510,6 @@ static int pmem_attach_disk(struct device *dev,
disk = blk_alloc_disk(&lim, nid);
if (IS_ERR(disk))
return PTR_ERR(disk);
-   q = disk->queue;
 
pmem->disk = disk;
pmem->pgmap.owner = pmem;
@@ -547,9 +547,6 @@ static int pmem_attach_disk(struct device *dev,
}
pmem->virt_addr = addr;
 
-   if (pmem->pfn_flags & PFN_MAP)
-   blk_queue_flag_set(QUEUE_FLAG_DAX, q);
-
disk->fops  = &pmem_fops;
disk->private_data  = pmem;
nvdimm_namespace_disk_name(ndns, disk->disk_name);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 6d1689a2717e5f..d5a5d11ae0dcdf 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -548,6 +548,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
 {
struct queue_limits lim = {
.logical_block_size = 4096,
+   .features   = BLK_FEAT_DAX,
};
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
@@ -643,7 +644,6 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
dev_info->gd->fops = &dcssblk_devops;
dev_info->gd->private_data = dev_info;
dev_info->gd->flags |= GENHD_FL_NO_PART;
-   blk_queue_flag_set(QUEUE_FLAG_DAX, dev_info->gd->queue);
 
seg_byte_size = (dev_info->end - dev_info->start + 1);
set_capacity(dev_info->gd, seg_byte_size >> 9); // size in sectors
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 59c2327692589b..c2545580c5b134 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -307,6 +307,9 @@ enum {
 
/* supports REQ_NOWAIT */
BLK_FEAT_NOWAIT = (1u << 7),
+
+   /* supports DAX */
+   BLK_FEAT_DAX= (1u << 8),
 };
 
 /*
@@ -575,7 +578,6 @@ struct request_queue {
 #define QUEUE_FLAG_SAME_FORCE  12  /* force complete on same CPU */
 #define QUEUE_FLAG_INIT_DONE   14  /* queue is initialized */
 #define QUEUE_FLAG_POLL16  /* IO polling enabled if set */
-#define QUEUE_FLAG_DAX 19  /* device supports DAX */
 #define QUEUE_FLAG_STATS   20  /* track IO start and completion times 
*/
 #define QUEUE_FLAG_REGISTERED  22  /*

[PATCH 22/26] block: move the zoned flag into the feature field

2024-06-10 Thread Christoph Hellwig

Move the boolean zoned field into the flags field to reclaim a little
bit of space.

Signed-off-by: Christoph Hellwig 
---
 block/blk-settings.c   |  5 ++---
 drivers/block/null_blk/zoned.c |  2 +-
 drivers/block/ublk_drv.c   |  2 +-
 drivers/block/virtio_blk.c |  5 +++--
 drivers/md/dm-table.c  | 11 ++-
 drivers/md/dm-zone.c   |  2 +-
 drivers/md/dm-zoned-target.c   |  2 +-
 drivers/nvme/host/zns.c|  2 +-
 drivers/scsi/sd_zbc.c  |  4 ++--
 include/linux/blkdev.h |  9 ++---
 10 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 026ba68d829856..96e07f24bd9aa1 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -68,7 +68,7 @@ static void blk_apply_bdi_limits(struct backing_dev_info *bdi,
 
 static int blk_validate_zoned_limits(struct queue_limits *lim)
 {
-   if (!lim->zoned) {
+   if (!(lim->features & BLK_FEAT_ZONED)) {
if (WARN_ON_ONCE(lim->max_open_zones) ||
WARN_ON_ONCE(lim->max_active_zones) ||
WARN_ON_ONCE(lim->zone_write_granularity) ||
@@ -602,8 +602,7 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
   b->max_secure_erase_sectors);
t->zone_write_granularity = max(t->zone_write_granularity,
b->zone_write_granularity);
-   t->zoned = max(t->zoned, b->zoned);
-   if (!t->zoned) {
+   if (!(t->features & BLK_FEAT_ZONED)) {
t->zone_write_granularity = 0;
t->max_zone_append_sectors = 0;
}
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index f118d304f31080..ca8e739e76b981 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -158,7 +158,7 @@ int null_init_zoned_dev(struct nullb_device *dev,
sector += dev->zone_size_sects;
}
 
-   lim->zoned = true;
+   lim->features |= BLK_FEAT_ZONED;
lim->chunk_sectors = dev->zone_size_sects;
lim->max_zone_append_sectors = dev->zone_append_max_sectors;
lim->max_open_zones = dev->zone_max_open;
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 4fcde099935868..69c16018cbb19a 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -2196,7 +2196,7 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, 
struct io_uring_cmd *cmd)
if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED))
return -EOPNOTSUPP;
 
-   lim.zoned = true;
+   lim.features |= BLK_FEAT_ZONED;
lim.max_active_zones = p->max_active_zones;
lim.max_open_zones =  p->max_open_zones;
lim.max_zone_append_sectors = p->max_zone_append_sectors;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 13a2f24f176628..cea45b296f8bec 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -728,7 +728,7 @@ static int virtblk_read_zoned_limits(struct virtio_blk 
*vblk,
 
dev_dbg(&vdev->dev, "probing host-managed zoned device\n");
 
-   lim->zoned = true;
+   lim->features |= BLK_FEAT_ZONED;
 
virtio_cread(vdev, struct virtio_blk_config,
 zoned.max_open_zones, &v);
@@ -1546,7 +1546,8 @@ static int virtblk_probe(struct virtio_device *vdev)
 * All steps that follow use the VQs therefore they need to be
 * placed after the virtio_device_ready() call above.
 */
-   if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && lim.zoned) {
+   if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
+   (lim.features & BLK_FEAT_ZONED)) {
blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, vblk->disk->queue);
err = blk_revalidate_disk_zones(vblk->disk);
if (err)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 653c253b6f7f32..48ccd9a396d8e6 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1605,12 +1605,12 @@ int dm_calculate_queue_limits(struct dm_table *t,
ti->type->iterate_devices(ti, dm_set_device_limits,
  &ti_limits);
 
-   if (!zoned && ti_limits.zoned) {
+   if (!zoned && (ti_limits.features & BLK_FEAT_ZONED)) {
/*
 * After stacking all limits, validate all devices
 * in table support this zoned model and zone sectors.
 */
-   zoned = ti_limits.zoned;
+   zoned = (ti_limits.features & BLK_FEAT_ZONED);
zone_sectors = ti_limits.chunk_sectors;
}
 
@@ -1658,12 +1658,12 @@ int dm_calculate_queue_limits(struct dm_table *t,
 *   zoned model on host-managed zoned block devices.
 * BUT...

[PATCH 23/26] block: move the zone_resetall flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the zone_resetall flag into the queue_limits feature field so that
it can be set atomically and all I/O is frozen when changing the flag.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c | 1 -
 drivers/block/null_blk/zoned.c | 3 +--
 drivers/block/ublk_drv.c   | 4 +---
 drivers/block/virtio_blk.c | 3 +--
 drivers/nvme/host/zns.c| 3 +--
 drivers/scsi/sd_zbc.c  | 5 +
 include/linux/blkdev.h | 6 --
 7 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3a21527913840d..f2fd72f4414ae8 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -91,7 +91,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(QUIESCED),
QUEUE_FLAG_NAME(PCI_P2PDMA),
-   QUEUE_FLAG_NAME(ZONE_RESETALL),
QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
QUEUE_FLAG_NAME(HCTX_ACTIVE),
QUEUE_FLAG_NAME(SQ_SCHED),
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index ca8e739e76b981..b42c00f1313254 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -158,7 +158,7 @@ int null_init_zoned_dev(struct nullb_device *dev,
sector += dev->zone_size_sects;
}
 
-   lim->features |= BLK_FEAT_ZONED;
+   lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ZONE_RESETALL;
lim->chunk_sectors = dev->zone_size_sects;
lim->max_zone_append_sectors = dev->zone_append_max_sectors;
lim->max_open_zones = dev->zone_max_open;
@@ -171,7 +171,6 @@ int null_register_zoned_dev(struct nullb *nullb)
struct request_queue *q = nullb->q;
struct gendisk *disk = nullb->disk;
 
-   blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
disk->nr_zones = bdev_nr_zones(disk->part0);
 
pr_info("%s: using %s zone append\n",
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 69c16018cbb19a..4fdff13fc23b8a 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -248,8 +248,6 @@ static int ublk_dev_param_zoned_validate(const struct 
ublk_device *ub)
 
 static void ublk_dev_param_zoned_apply(struct ublk_device *ub)
 {
-   blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, ub->ub_disk->queue);
-
ub->ub_disk->nr_zones = ublk_get_nr_zones(ub);
 }
 
@@ -2196,7 +2194,7 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, 
struct io_uring_cmd *cmd)
if (!IS_ENABLED(CONFIG_BLK_DEV_ZONED))
return -EOPNOTSUPP;
 
-   lim.features |= BLK_FEAT_ZONED;
+   lim.features |= BLK_FEAT_ZONED | BLK_FEAT_ZONE_RESETALL;
lim.max_active_zones = p->max_active_zones;
lim.max_open_zones =  p->max_open_zones;
lim.max_zone_append_sectors = p->max_zone_append_sectors;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index cea45b296f8bec..6c64a67ab9c901 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -728,7 +728,7 @@ static int virtblk_read_zoned_limits(struct virtio_blk 
*vblk,
 
dev_dbg(&vdev->dev, "probing host-managed zoned device\n");
 
-   lim->features |= BLK_FEAT_ZONED;
+   lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ZONE_RESETALL;
 
virtio_cread(vdev, struct virtio_blk_config,
 zoned.max_open_zones, &v);
@@ -1548,7 +1548,6 @@ static int virtblk_probe(struct virtio_device *vdev)
 */
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(lim.features & BLK_FEAT_ZONED)) {
-   blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, vblk->disk->queue);
err = blk_revalidate_disk_zones(vblk->disk);
if (err)
goto out_cleanup_disk;
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 06f2417aa50de7..99bb89c2495ae3 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -108,13 +108,12 @@ int nvme_query_zone_info(struct nvme_ns *ns, unsigned 
lbaf,
 void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
struct nvme_zone_info *zi)
 {
-   lim->features |= BLK_FEAT_ZONED;
+   lim->features |= BLK_FEAT_ZONED | BLK_FEAT_ZONE_RESETALL;
lim->max_open_zones = zi->max_open_zones;
lim->max_active_zones = zi->max_active_zones;
lim->max_zone_append_sectors = ns->ctrl->max_zone_append;
lim->chunk_sectors = ns->head->zsze =
nvme_lba_to_sect(ns->head, zi->zone_size);
-   blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, ns->queue);
 }
 
 static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 26b6e92350cda9..8c79f588f80d8b 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -592,8 +592,6 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
 int sd_zbc_read_zones(stru

[PATCH 25/26] block: move the skip_tagset_quiesce flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the skip_tagset_quiesce flag into the queue_limits feature field so
that it can be set atomically and all I/O is frozen when changing the
flag.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/nvme/host/core.c | 8 +---
 include/linux/blkdev.h   | 6 --
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 8b5a68861c119b..344f9e503bdb32 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -93,7 +93,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
QUEUE_FLAG_NAME(HCTX_ACTIVE),
QUEUE_FLAG_NAME(SQ_SCHED),
-   QUEUE_FLAG_NAME(SKIP_TAGSET_QUIESCE),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 31e752e8d632cd..bf410d10b12006 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4489,13 +4489,15 @@ int nvme_alloc_io_tag_set(struct nvme_ctrl *ctrl, 
struct blk_mq_tag_set *set,
return ret;
 
if (ctrl->ops->flags & NVME_F_FABRICS) {
-   ctrl->connect_q = blk_mq_alloc_queue(set, NULL, NULL);
+   struct queue_limits lim = {
+   .features   = BLK_FEAT_SKIP_TAGSET_QUIESCE,
+   };
+
+   ctrl->connect_q = blk_mq_alloc_queue(set, &lim, NULL);
if (IS_ERR(ctrl->connect_q)) {
ret = PTR_ERR(ctrl->connect_q);
goto out_free_tag_set;
}
-   blk_queue_flag_set(QUEUE_FLAG_SKIP_TAGSET_QUIESCE,
-  ctrl->connect_q);
}
 
ctrl->tagset = set;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cc4f6e64e8e3f5..d7ad25def6e50b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -322,6 +322,9 @@ enum {
 
/* supports PCI(e) p2p requests */
BLK_FEAT_PCI_P2PDMA = (1u << 12),
+
+   /* skip this queue in blk_mq_(un)quiesce_tagset */
+   BLK_FEAT_SKIP_TAGSET_QUIESCE= (1u << 13),
 };
 
 /*
@@ -594,7 +597,6 @@ struct request_queue {
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27/* record rq->alloc_time_ns */
 #define QUEUE_FLAG_HCTX_ACTIVE 28  /* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_SQ_SCHED 30 /* single queue style io dispatch */
-#define QUEUE_FLAG_SKIP_TAGSET_QUIESCE 31 /* quiesce_tagset skip the queue*/
 
 #define QUEUE_FLAG_MQ_DEFAULT  (1UL << QUEUE_FLAG_SAME_COMP)
 
@@ -629,7 +631,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct 
request_queue *q);
 #define blk_queue_registered(q)test_bit(QUEUE_FLAG_REGISTERED, 
&(q)->queue_flags)
 #define blk_queue_sq_sched(q)  test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags)
 #define blk_queue_skip_tagset_quiesce(q) \
-   test_bit(QUEUE_FLAG_SKIP_TAGSET_QUIESCE, &(q)->queue_flags)
+   ((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.43.0

[PATCH 24/26] block: move the pci_p2pdma flag to queue_limits

2024-06-10 Thread Christoph Hellwig

Move the pci_p2pdma flag into the queue_limits feature field so that it
can be set atomically and all I/O is frozen when changing the flag.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/nvme/host/core.c | 8 +++-
 include/linux/blkdev.h   | 7 ---
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f2fd72f4414ae8..8b5a68861c119b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -90,7 +90,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(STATS),
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(QUIESCED),
-   QUEUE_FLAG_NAME(PCI_P2PDMA),
QUEUE_FLAG_NAME(RQ_ALLOC_TIME),
QUEUE_FLAG_NAME(HCTX_ACTIVE),
QUEUE_FLAG_NAME(SQ_SCHED),
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5ecf762d7c8837..31e752e8d632cd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3735,6 +3735,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct 
nvme_ns_info *info)
 
if (ctrl->opts && ctrl->opts->data_digest)
lim.features |= BLK_FEAT_STABLE_WRITES;
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
+   lim.features |= BLK_FEAT_PCI_P2PDMA;
 
disk = blk_mq_alloc_disk(ctrl->tagset, &lim, ns);
if (IS_ERR(disk))
@@ -3744,11 +3747,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct 
nvme_ns_info *info)
 
ns->disk = disk;
ns->queue = disk->queue;
-
-   if (ctrl->ops->supports_pci_p2pdma &&
-   ctrl->ops->supports_pci_p2pdma(ctrl))
-   blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
-
ns->ctrl = ctrl;
kref_init(&ns->kref);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ffb7a42871b4ed..cc4f6e64e8e3f5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -319,6 +319,9 @@ enum {
 
/* supports Zone Reset All */
BLK_FEAT_ZONE_RESETALL  = (1u << 11),
+
+   /* supports PCI(e) p2p requests */
+   BLK_FEAT_PCI_P2PDMA = (1u << 12),
 };
 
 /*
@@ -588,7 +591,6 @@ struct request_queue {
 #define QUEUE_FLAG_STATS   20  /* track IO start and completion times 
*/
 #define QUEUE_FLAG_REGISTERED  22  /* queue has been registered to a disk 
*/
 #define QUEUE_FLAG_QUIESCED24  /* queue has been quiesced */
-#define QUEUE_FLAG_PCI_P2PDMA  25  /* device supports PCI p2p requests */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27/* record rq->alloc_time_ns */
 #define QUEUE_FLAG_HCTX_ACTIVE 28  /* at least one blk-mq hctx is active */
 #define QUEUE_FLAG_SQ_SCHED 30 /* single queue style io dispatch */
@@ -611,8 +613,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct 
request_queue *q);
 #define blk_queue_zone_resetall(q) \
((q)->limits.features & BLK_FEAT_ZONE_RESETALL)
 #define blk_queue_dax(q)   ((q)->limits.features & BLK_FEAT_DAX)
-#define blk_queue_pci_p2pdma(q)\
-   test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q)((q)->limits.features & 
BLK_FEAT_PCI_P2PDMA)
 #ifdef CONFIG_BLK_RQ_ALLOC_TIME
 #define blk_queue_rq_alloc_time(q) \
test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags)
-- 
2.43.0

[PATCH 26/26] block: move the bounce flag into the feature field

2024-06-10 Thread Christoph Hellwig

Move the bounce field into the flags field to reclaim a little bit of
space.

Signed-off-by: Christoph Hellwig 
---
 block/blk-settings.c| 1 -
 block/blk.h | 2 +-
 drivers/scsi/scsi_lib.c | 2 +-
 include/linux/blkdev.h  | 6 --
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 96e07f24bd9aa1..d0e9096f93ca8a 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -479,7 +479,6 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
b->max_write_zeroes_sectors);
t->max_zone_append_sectors = 
min(queue_limits_max_zone_append_sectors(t),
 
queue_limits_max_zone_append_sectors(b));
-   t->bounce = max(t->bounce, b->bounce);
 
t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
b->seg_boundary_mask);
diff --git a/block/blk.h b/block/blk.h
index 79e8d5d4fe0caf..fa32f7fad5d7e6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -394,7 +394,7 @@ struct bio *__blk_queue_bounce(struct bio *bio, struct 
request_queue *q);
 static inline bool blk_queue_may_bounce(struct request_queue *q)
 {
return IS_ENABLED(CONFIG_BOUNCE) &&
-   q->limits.bounce == BLK_BOUNCE_HIGH &&
+   (q->limits.features & BLK_FEAT_BOUNCE_HIGH) &&
max_low_pfn >= max_pfn;
 }
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 54f771ec8cfb5e..e2f7bfb2b9e450 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1986,7 +1986,7 @@ void scsi_init_limits(struct Scsi_Host *shost, struct 
queue_limits *lim)
shost->dma_alignment, dma_get_cache_alignment() - 1);
 
if (shost->no_highmem)
-   lim->bounce = BLK_BOUNCE_HIGH;
+   lim->features |= BLK_FEAT_BOUNCE_HIGH;
 
dma_set_seg_boundary(dev, shost->dma_boundary);
dma_set_max_seg_size(dev, shost->max_segment_size);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d7ad25def6e50b..d1d9787e76ce73 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -325,6 +325,9 @@ enum {
 
/* skip this queue in blk_mq_(un)quiesce_tagset */
BLK_FEAT_SKIP_TAGSET_QUIESCE= (1u << 13),
+
+   /* bounce all highmem pages */
+   BLK_FEAT_BOUNCE_HIGH= (1u << 14),
 };
 
 /*
@@ -332,7 +335,7 @@ enum {
  */
 #define BLK_FEAT_INHERIT_MASK \
(BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA | BLK_FEAT_ROTATIONAL | \
-BLK_FEAT_STABLE_WRITES | BLK_FEAT_ZONED)
+BLK_FEAT_STABLE_WRITES | BLK_FEAT_ZONED | BLK_FEAT_BOUNCE_HIGH)
 
 /* internal flags in queue_limits.flags */
 enum {
@@ -352,7 +355,6 @@ enum blk_bounce {
 struct queue_limits {
unsigned intfeatures;
unsigned intflags;
-   enum blk_bounce bounce;
unsigned long   seg_boundary_mask;
unsigned long   virt_boundary_mask;
 
-- 
2.43.0

Re: [PATCH 01/26] sd: fix sd_is_zoned

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Since commit 7437bb73f087 ("block: remove support for the host aware zone
> model"), only ZBC devices expose a zoned access model.  sd_is_zoned is
> used to check for that and thus return false for host aware devices.
> 
> Fixes: 7437bb73f087 ("block: remove support for the host aware zone model")
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move a bit of code that sets up the zone flag and the write granularity
> into sd_zbc_read_zones to be with the rest of the zoned limits.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/scsi/sd.c | 21 +
>  drivers/scsi/sd_zbc.c | 13 -
>  2 files changed, 13 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 85b45345a27739..5bfed61c70db8f 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3308,29 +3308,10 @@ static void sd_read_block_characteristics(struct 
> scsi_disk *sdkp,
>   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
>   }
>  
> -
> -#ifdef CONFIG_BLK_DEV_ZONED /* sd_probe rejects ZBD devices early otherwise 
> */
> - if (sdkp->device->type == TYPE_ZBC) {
> - lim->zoned = true;
> -
> - /*
> -  * Per ZBC and ZAC specifications, writes in sequential write
> -  * required zones of host-managed devices must be aligned to
> -  * the device physical block size.
> -  */
> - lim->zone_write_granularity = sdkp->physical_block_size;
> - } else {
> - /*
> -  * Host-aware devices are treated as conventional.
> -  */
> - lim->zoned = false;
> - }
> -#endif /* CONFIG_BLK_DEV_ZONED */
> -
>   if (!sdkp->first_scan)
>   return;
>  
> - if (lim->zoned)
> + if (sdkp->device->type == TYPE_ZBC)

Nit: use sd_is_zoned() here ?

>   sd_printk(KERN_NOTICE, sdkp, "Host-managed zoned block 
> device\n");
>   else if (sdkp->zoned == 1)
>   sd_printk(KERN_NOTICE, sdkp, "Host-aware SMR disk used as 
> regular disk\n");
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 422eaed8457227..e9501db0450be3 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -598,8 +598,19 @@ int sd_zbc_read_zones(struct scsi_disk *sdkp, struct 
> queue_limits *lim,
>   u32 zone_blocks = 0;
>   int ret;
>  
> - if (!sd_is_zoned(sdkp))
> + if (!sd_is_zoned(sdkp)) {
> + lim->zoned = false;

Maybe we should clear the other zone related limits here ? If the drive is
reformatted/converted from SMR to CMR (FORMAT WITH PRESET), the other zone
limits may be set already, no ?

>   return 0;
> + }
> +
> + lim->zoned = true;
> +
> + /*
> +  * Per ZBC and ZAC specifications, writes in sequential write required
> +  * zones of host-managed devices must be aligned to the device physical
> +  * block size.
> +  */
> + lim->zone_write_granularity = sdkp->physical_block_size;
>  
>   /* READ16/WRITE16/SYNC16 is mandatory for ZBC devices */
>   sdkp->device->use_16_for_rw = 1;

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-10 Thread Christoph Hellwig

On Tue, Jun 11, 2024 at 02:51:24PM +0900, Damien Le Moal wrote:
> > -   if (lim->zoned)
> > +   if (sdkp->device->type == TYPE_ZBC)
> 
> Nit: use sd_is_zoned() here ?

Yes.

> > -   if (!sd_is_zoned(sdkp))
> > +   if (!sd_is_zoned(sdkp)) {
> > +   lim->zoned = false;
> 
> Maybe we should clear the other zone related limits here ? If the drive is
> reformatted/converted from SMR to CMR (FORMAT WITH PRESET), the other zone
> limits may be set already, no ?

blk_validate_zoned_limits already takes care of that.

Re: [PATCH 03/26] loop: stop using loop_reconfigure_limits in __loop_clr_fd

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> __loop_clr_fd wants to clear all settings on the device.  Prepare for
> moving more settings into the block limits by open coding
> loop_reconfigure_limits.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/loop.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 93780f41646b75..93a49c40a31a71 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1133,6 +1133,7 @@ static int loop_configure(struct loop_device *lo, 
> blk_mode_t mode,
>  
>  static void __loop_clr_fd(struct loop_device *lo, bool release)
>  {
> + struct queue_limits lim;
>   struct file *filp;
>   gfp_t gfp = lo->old_gfp_mask;
>  
> @@ -1156,7 +1157,14 @@ static void __loop_clr_fd(struct loop_device *lo, bool 
> release)
>   lo->lo_offset = 0;
>   lo->lo_sizelimit = 0;
>   memset(lo->lo_file_name, 0, LO_NAME_SIZE);
> - loop_reconfigure_limits(lo, 512, false);
> +
> + /* reset the block size to the default */
> + lim = queue_limits_start_update(lo->lo_queue);
> + lim.logical_block_size = 512;

Nit: SECTOR_SIZE ? maybe ?

> + lim.physical_block_size = 512;
> + lim.io_min = 512;
> + queue_limits_commit_update(lo->lo_queue, &lim);
> +
>   invalidate_disk(lo->lo_disk);
>   loop_sysfs_exit(lo);
>   /* let user-space know about this change */

Otherwise, looks OK to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-10 Thread Christoph Hellwig

On Tue, Jun 11, 2024 at 07:52:39AM +0200, Christoph Hellwig wrote:
> > Maybe we should clear the other zone related limits here ? If the drive is
> > reformatted/converted from SMR to CMR (FORMAT WITH PRESET), the other zone
> > limits may be set already, no ?
> 
> blk_validate_zoned_limits already takes care of that.

Sorry, brainfart.  The integrity code does that, but not the zoned
code.  I suspect the core code might be a better place for it,
though.

Re: [PATCH 04/26] loop: always update discard settings in loop_reconfigure_limits

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Simplify loop_reconfigure_limits by always updating the discard limits.
> This adds a little more work to loop_set_block_size, but doesn't change
> the outcome as the discard flag won't change.
> 
> Signed-off-by: Christoph Hellwig 

Looks OK to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 03/26] loop: stop using loop_reconfigure_limits in __loop_clr_fd

2024-06-10 Thread Christoph Hellwig

On Tue, Jun 11, 2024 at 02:53:19PM +0900, Damien Le Moal wrote:
> > +   /* reset the block size to the default */
> > +   lim = queue_limits_start_update(lo->lo_queue);
> > +   lim.logical_block_size = 512;
> 
> Nit: SECTOR_SIZE ? maybe ?

Yes.  I was following the existing code, but SECTOR_SIZE is probably
a better choice here.

Re: [PATCH 05/26] loop: regularize upgrading the lock size for direct I/O

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> The LOOP_CONFIGURE path automatically upgrades the block size to that
> of the underlying file for O_DIRECT file descriptors, but the
> LOOP_SET_BLOCK_SIZE path does not.  Fix this by lifting the code to
> pick the block size into common code.

s/lock/block in the commit title.

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/loop.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index c658282454af1b..4f6d8514d19bd6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -975,10 +975,24 @@ loop_set_status_from_info(struct loop_device *lo,
>   return 0;
>  }
>  
> +static unsigned short loop_default_blocksize(struct loop_device *lo,
> + struct block_device *backing_bdev)
> +{
> + /* In case of direct I/O, match underlying block size */
> + if ((lo->lo_backing_file->f_flags & O_DIRECT) && backing_bdev)
> + return bdev_logical_block_size(backing_bdev);
> + return 512;

Nit: SECTOR_SIZE ?

> +}
> +
>  static int loop_reconfigure_limits(struct loop_device *lo, unsigned short 
> bsize)
>  {
> + struct file *file = lo->lo_backing_file;
> + struct inode *inode = file->f_mapping->host;
>   struct queue_limits lim;
>  
> + if (!bsize)
> + bsize = loop_default_blocksize(lo, inode->i_sb->s_bdev);

If bsize is specified and there is a backing dev used with direct IO, should it
be checked that bsize is a multiple of bdev_logical_block_size(backing_bdev) ?

> +
>   lim = queue_limits_start_update(lo->lo_queue);
>   lim.logical_block_size = bsize;
>   lim.physical_block_size = bsize;
> @@ -997,7 +1011,6 @@ static int loop_configure(struct loop_device *lo, 
> blk_mode_t mode,
>   int error;
>   loff_t size;
>   bool partscan;
> - unsigned short bsize;
>   bool is_loop;
>  
>   if (!file)
> @@ -1076,15 +1089,7 @@ static int loop_configure(struct loop_device *lo, 
> blk_mode_t mode,
>   if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
>   blk_queue_write_cache(lo->lo_queue, true, false);
>  
> - if (config->block_size)
> - bsize = config->block_size;
> - else if ((lo->lo_backing_file->f_flags & O_DIRECT) && 
> inode->i_sb->s_bdev)
> - /* In case of direct I/O, match underlying block size */
> - bsize = bdev_logical_block_size(inode->i_sb->s_bdev);
> - else
> - bsize = 512;
> -
> - error = loop_reconfigure_limits(lo, bsize);
> + error = loop_reconfigure_limits(lo, config->block_size);
>   if (WARN_ON_ONCE(error))
>   goto out_unlock;
>  

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 06/26] loop: also use the default block size from an underlying block device

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Fix the code in loop_reconfigure_limits to pick a default block size for
> O_DIRECT file descriptors to also work when the loop device sits on top
> of a block device and not just on a regular file on a block device based
> file system.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/loop.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 4f6d8514d19bd6..d7cf6bbbfb1b86 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -988,10 +988,16 @@ static int loop_reconfigure_limits(struct loop_device 
> *lo, unsigned short bsize)
>  {
>   struct file *file = lo->lo_backing_file;
>   struct inode *inode = file->f_mapping->host;
> + struct block_device *backing_bdev = NULL;
>   struct queue_limits lim;
>  
> + if (S_ISBLK(inode->i_mode))
> + backing_bdev = I_BDEV(inode);
> + else if (inode->i_sb->s_bdev)
> + backing_bdev = inode->i_sb->s_bdev;
> +

Why not move this hunk inside the below "if" ? (backing_dev declaration can go
there too).

>   if (!bsize)
> - bsize = loop_default_blocksize(lo, inode->i_sb->s_bdev);
> + bsize = loop_default_blocksize(lo, backing_bdev);
>  
>   lim = queue_limits_start_update(lo->lo_queue);
>   lim.logical_block_size = bsize;

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 05/26] loop: regularize upgrading the lock size for direct I/O

2024-06-10 Thread Christoph Hellwig

On Tue, Jun 11, 2024 at 02:56:59PM +0900, Damien Le Moal wrote:
> > +   if (!bsize)
> > +   bsize = loop_default_blocksize(lo, inode->i_sb->s_bdev);
> 
> If bsize is specified and there is a backing dev used with direct IO, should 
> it
> be checked that bsize is a multiple of bdev_logical_block_size(backing_bdev) ?

For direct I/O that check would be useful.  For buffered I/O we can do
read-modify-write cycles.  However this series is already huge and not
primarily about improving the loop driver parameter validation, so
I'll defer this for now.

Re: [PATCH 06/26] loop: also use the default block size from an underlying block device

2024-06-10 Thread Christoph Hellwig

On Tue, Jun 11, 2024 at 02:58:56PM +0900, Damien Le Moal wrote:
> > +   if (S_ISBLK(inode->i_mode))
> > +   backing_bdev = I_BDEV(inode);
> > +   else if (inode->i_sb->s_bdev)
> > +   backing_bdev = inode->i_sb->s_bdev;
> > +
> 
> Why not move this hunk inside the below "if" ? (backing_dev declaration can go
> there too).

Because another use will pop up a bit later :)

Re: [PATCH 07/26] loop: fold loop_update_rotational into loop_reconfigure_limits

2024-06-10 Thread Damien Le Moal

On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> This prepares for moving the rotational flag into the queue_limits and
> also fixes it for the case where the loop device is backed by a block
> device.
> 
> Signed-off-by: Christoph Hellwig 

Looks good to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research

57 matches

Mail list logo