Re: [PATCH v9] bus: mhi: host: Add tracing support
On Fri, Jan 05, 2024 at 05:53:03PM +0530, Krishna chaitanya chundru wrote: > This change adds ftrace support for the following functions, which > helps in debugging issues when there is a channel state or MHI > state change, and also when we receive data and control events: > 1. mhi_intvec_mhi_states > 2. mhi_process_data_event_ring > 3. mhi_process_ctrl_ev_ring > 4. mhi_gen_tre > 5. mhi_update_channel_state > 6. mhi_tryset_pm_state > 7. mhi_pm_st_worker > > Change the implementation of the arrays which hold the enum-to-string mappings > to make them consistent between the trace header file and the other files. > > Wherever trace events are added, debug messages are removed. > > Signed-off-by: Krishna chaitanya chundru Few nitpicks below. > Reviewed-by: "Steven Rostedt (Google)" > --- > Changes in v9: > - Change the implementation of some arrays so that the string-to-enum mapping > - is the same in both the trace header and the other files, as suggested by Steve. > - Link to v8: > https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com > > Changes in v8: > - Pass the structure and dereference the variables in TP_fast_assign as > suggested by Steve. > - Link to v7: > https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com > > Changes in v7: > - Change the log format as pointed out by Mani. > - Link to v6: > https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com > > Changes in v6: > - Use 'rp' directly as suggested by Jeffrey. > - Link to v5: > https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com > > Changes in v5: > - Use DECLARE_EVENT_CLASS for multiple events as suggested by Steve. > - Instead of converting to u64 to print the address, use %px to print the address > - to avoid warnings on some platforms. > - Link to v4: > https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com > > Changes in v4: > - Fix compilation issues in the previous patch which happened due to rebasing. 
> - In the defconfig the FTRACE config is not enabled, which is why the compilation > - issue was not seen in my workspace. > - Link to v3: > https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com > > Changes in v3: > - Move the trace header file from include/trace/events to drivers/bus/mhi/host/ > - so that we can include driver header files. > - Use macros directly in the trace events as suggested by Jeffrey Hugo. > - Reorder the structure in the events as suggested by Steve to avoid holes in > - the buffer. > - Removed the mhi_to_physical function as this can give security issues. > - Removed macros to define strings as we can get those from driver headers. > - Link to v2: > https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com > > Changes in v2: > - Pass the raw state into the trace event and use __print_symbolic() as > suggested by Bjorn. > - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by Bjorn. > - Fixed the kernel test robot issues. > - Link to v1: > https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com > --- > drivers/bus/mhi/common.h| 38 +++--- > drivers/bus/mhi/host/init.c | 63 + > drivers/bus/mhi/host/internal.h | 40 ++ > drivers/bus/mhi/host/main.c | 19 ++- > drivers/bus/mhi/host/pm.c | 7 +- > drivers/bus/mhi/host/trace.h| 275 > > 6 files changed, 378 insertions(+), 64 deletions(-) > [...] 
> +TRACE_EVENT(mhi_gen_tre, > + > + TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_chan *mhi_chan, > + struct mhi_ring_element *mhi_tre), > + > + TP_ARGS(mhi_cntrl, mhi_chan, mhi_tre), > + > + TP_STRUCT__entry( > + __string(name, mhi_cntrl->mhi_dev->name) > + __field(int, ch_num) > + __field(void *, wp) > + __field(__le64, tre_ptr) > + __field(__le32, dword0) > + __field(__le32, dword1) > + ), > + > + TP_fast_assign( > + __assign_str(name, mhi_cntrl->mhi_dev->name); > + __entry->ch_num = mhi_chan->chan; > + __entry->wp = mhi_tre; > + __entry->tre_ptr = mhi_tre->ptr; > + __entry->dword0 = mhi_tre->dword[0]; > + __entry->dword1 = mhi_tre->dword[1]; > + ), > + > + TP_printk("%s: Chan: %d Tre: 0x%p Tre buf: 0x%llx dword0: 0x%08x > dword1: 0x%08x\n", Use caps for printing the acronyms everywhere. Like TRE, DWORD etc... > + __get_str(name), __entry->ch_num, __entry->wp, > __entry->tre_ptr, > + __entry->dword0, __entry->dword1) > +); > + > +TRACE_EVENT(mhi_intvec_states, > + > + TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state), > + > + TP_ARGS(mhi_cntrl, dev_ee, dev_state), > + > + TP_STRUCT__entry( > + __string(name, mhi_cntrl->mhi_dev->name) > + __field(int, local_ee) > +
Re: [PATCH v2 1/4] remoteproc: Add TEE support
On 1/29/24 19:55, Mathieu Poirier wrote: > On Thu, Jan 18, 2024 at 11:04:30AM +0100, Arnaud Pouliquen wrote: >> From: Arnaud Pouliquen >> >> Add a remoteproc TEE (Trusted Execution Environment) device >> that will be probed by the TEE bus. If the associated Trusted >> application is supported on secure part this device offers a client >> interface to load a firmware in the secure part. >> This firmware could be authenticated and decrypted by the secure >> trusted application. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> drivers/remoteproc/Kconfig | 9 + >> drivers/remoteproc/Makefile | 1 + >> drivers/remoteproc/tee_remoteproc.c | 393 >> include/linux/tee_remoteproc.h | 99 +++ >> 4 files changed, 502 insertions(+) >> create mode 100644 drivers/remoteproc/tee_remoteproc.c >> create mode 100644 include/linux/tee_remoteproc.h >> >> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig >> index 48845dc8fa85..85299606806c 100644 >> --- a/drivers/remoteproc/Kconfig >> +++ b/drivers/remoteproc/Kconfig >> @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC >> >>It's safe to say N if not interested in using RPU r5f cores. >> >> + >> +config TEE_REMOTEPROC >> +tristate "trusted firmware support by a TEE application" >> +depends on OPTEE >> +help >> + Support for trusted remote processors firmware. The firmware >> + authentication and/or decryption are managed by a trusted application. >> + This can be either built-in or a loadable module. 
>> + >> endif # REMOTEPROC >> >> endmenu >> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile >> index 91314a9b43ce..fa8daebce277 100644 >> --- a/drivers/remoteproc/Makefile >> +++ b/drivers/remoteproc/Makefile >> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o >> obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o >> obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o >> obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o >> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o >> obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o >> obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o >> obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o >> diff --git a/drivers/remoteproc/tee_remoteproc.c >> b/drivers/remoteproc/tee_remoteproc.c >> new file mode 100644 >> index ..49e1e0caf889 >> --- /dev/null >> +++ b/drivers/remoteproc/tee_remoteproc.c >> @@ -0,0 +1,393 @@ >> +// SPDX-License-Identifier: GPL-2.0-or-later >> +/* >> + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved >> + * Author: Arnaud Pouliquen >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include "remoteproc_internal.h" >> + >> +#define MAX_TEE_PARAM_ARRY_MEMBER 4 >> + >> +/* >> + * Authentication of the firmware and load in the remote processor memory >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [in] params[1].memref: buffer containing the image of the >> buffer >> + */ >> +#define TA_RPROC_FW_CMD_LOAD_FW 1 >> + >> +/* >> + * Start the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_START_FW2 >> + >> +/* >> + * Stop the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_STOP_FW 3 >> + >> +/* >> + * Return the address 
of the resource table, or 0 if not found >> + * No check is done to verify that the address returned is accessible by >> + * the non secure context. If the resource table is loaded in a protected >> + * memory the access by the non secure context will lead to a data abort. >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [out] params[1].value.a:32bit LSB resource table memory address >> + * [out] params[1].value.b:32bit MSB resource table memory address >> + * [out] params[2].value.a:32bit LSB resource table memory size >> + * [out] params[2].value.b:32bit MSB resource table memory size >> + */ >> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE 4 >> + >> +/* >> + * Return the address of the core dump >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [out] params[1].memref: address of the core dump image if exist, >> + * else return Null >> + */ >> +#define TA_RPROC_FW_CMD_GET_COREDUMP5 >> + >> +struct tee_rproc_mem { >> +char name[20]; >> +void __iomem *cpu_addr; >> +phys_addr_t bus_addr; >> +u32 dev_addr; >> +size_t size; >> +}; >> + >> +struct tee_rproc_context { >> +struct list_head sessio
Re: [PATCH 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the X1E80100 aDSP & cDSP
On 29/01/2024 14:34, Abel Vesa wrote: > Document the aDSP and cDSP Peripheral Authentication Service on the > X1E80100 Platform. > > Signed-off-by: Abel Vesa > --- > Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml | 6 ++ > 1 file changed, 6 insertions(+) > Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH] media: dt-bindings: qcom,sc7280-venus: Allow one IOMMU entry
Hello, On 1/30/2024 1:17 PM, Luca Weiss wrote: > On Mon Jan 29, 2024 at 6:37 PM CET, Conor Dooley wrote: >> On Mon, Jan 29, 2024 at 08:48:54AM +0100, Luca Weiss wrote: >>> Some SC7280-based boards crash when providing the "secure_non_pixel" >>> context bank, so allow only one iommu in the bindings also. >>> >>> Signed-off-by: Luca Weiss >> >> Do we have any idea why this happens? How is someone supposed to know >> whether or not their system requires you to only provide one iommu? >> Yes, a crash might be the obvious answer, but is there a way of knowing >> without the crashes? > > +CC Vikash Garodia > > Unfortunately I don't really have much more information than this > message here: > https://lore.kernel.org/linux-arm-msm/ff021f49-f81b-0fd1-bd2c-895dbbb03...@quicinc.com/ > > And see also the following replies for a bit more context, like this > one: > https://lore.kernel.org/linux-arm-msm/a4e8b531-49f9-f4a1-51cb-e422c5628...@quicinc.com/ > > Maybe Vikash can add some more info regarding this. 0x2184 is a secure SID i.e any transactions with that ID would be access controlled by trustzone (TZ). SC7280 (chromebook) was designed without TZ, while some other DT deriving from SC7280 (like qcm6490) might have TZ. Hence it is good to move the iommu entry from SC7280 to chrome-common. Regards, Vikash
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 at 19:56, Linus Torvalds wrote: > > But I've been staring at this code for too long, so I'm posting this > just as a "it's broken, but _something_ like this", because I'm taking > a break from looking at this. Bah. I literally woke up and realized what the problem is. We're caching negative dentries, but eventfs_create_dir() has no way to invalidate the old negative dentry when it adds a new entry. The old "try to reuse dentry" ended up hiding that issue, because it would look up the negative dentry from the 'ei' and turn it into a positive one. And I was stupidly staring at the wrong code - all these functions with almost the same names. eventfs_create_events_dir() is fine, because it gets the parent as a dentry, so looking up the new dentry is trivial. But eventfs_create_dir() doesn't get a dentry at all. It gets the parent as a struct eventfs_inode *parent which is no good. So that explains why creating an event after deleting an old one - or after just looking it up before it was created - ends up with the nonsensical "ls" output - it gets listed by readdir() because the entry is there in the eventfs data structures, but then doing a "stat()" on it will just hit the negative dentry. So it won't actually look up the dentry. The simple solution is probably just to not cache negative dentries for eventfs. So instead of doing the "d_add(dentry, NULL);" we should just return "ERR_PTR(ENOENT)". Which is actually what /proc/ lookups do, for the same reason. I'll go back to bed, but I think the fix is something trivial like this: --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -517,9 +517,8 @@ static struct dentry *eventfs_root_lookup( } enoent: - /* Nothing found? */ - d_add(dentry, NULL); - result = NULL; + /* Don't cache negative lookups, just return an error */ + result = ERR_PTR(ENOENT); out: mutex_unlock(&eventfs_mutex); I just wanted to write it down when the likely solution struck me. Linus
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 00:43, Linus Torvalds wrote: > > I'll go back to bed, but I think the fix is something trivial like this: Almost. > + result = ERR_PTR(ENOENT); That needs a '-' in front of the ENOENT, otherwise you have a positive error number and things go wrong very quickly. And that does indeed fix the lookup problem, but you end up with the same problem later when you do the eventfs_remove_dir(). Again the eventfs data structure changes, but we don't have a reliable dentry that we can invalidate. The dentry cache is just very good at caching those old dentries, and the interface for eventfs_create_dir() and eventfs_remove_dir() is just not great. If those did an actual path lookup (like eventfs_create_events_dir() does), we'd have the dentry, and it's trivial to get from dentry to eventfs_inode. But going the other way is the broken thing because of how the dentries are just temporary caches. I suspect the solution is to make eventfs_create_dir() do the same as the events directory case does, and actually pin the directory dentry and save it off. Oh well. At least I understand what the problem is. Now I'm going to try to go back to sleep. Linus
Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On 1/26/24 18:11, Mathieu Poirier wrote: > On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote: >> The new TEE remoteproc device is used to manage remote firmware in a >> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is >> introduced to delegate the loading of the firmware to the trusted >> execution context. In such cases, the firmware should be signed and >> adhere to the image format defined by the TEE. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> V1 to V2 update: >> - remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected by >> the kernel test robot: >> WARNING: unmet direct dependencies detected for TEE_REMOTEPROC >> Depends on [n]: REMOTEPROC [=y] && OPTEE [=n] >> Selected by [y]: >> - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && REMOTEPROC >> [=y] >> - Fix initialized trproc variable in stm32_rproc_probe >> --- >> drivers/remoteproc/stm32_rproc.c | 149 +-- >> 1 file changed, 144 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/remoteproc/stm32_rproc.c >> b/drivers/remoteproc/stm32_rproc.c >> index fcc0001e2657..cf6a21bac945 100644 >> --- a/drivers/remoteproc/stm32_rproc.c >> +++ b/drivers/remoteproc/stm32_rproc.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> #include "remoteproc_internal.h" >> @@ -49,6 +50,9 @@ >> #define M4_STATE_STANDBY4 >> #define M4_STATE_CRASH 5 >> >> +/* Remote processor unique identifier aligned with the Trusted Execution >> Environment definitions */ >> +#define STM32_MP1_M4_PROC_ID0 >> + >> struct stm32_syscon { >> struct regmap *map; >> u32 reg; >> @@ -90,6 +94,8 @@ struct stm32_rproc { >> struct stm32_mbox mb[MBOX_NB_MBX]; >> struct workqueue_struct *workqueue; >> bool hold_boot_smc; >> +bool fw_loaded; >> +struct tee_rproc *trproc; >> void __iomem *rsc_va; >> }; >> >> @@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc) >> return err; >> } >> >> +static int stm32_rproc_tee_elf_sanity_check(struct rproc 
*rproc, >> +const struct firmware *fw) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> +unsigned int ret = 0; >> + >> +if (rproc->state == RPROC_DETACHED) >> +return 0; >> + >> +ret = tee_rproc_load_fw(ddata->trproc, fw); >> +if (!ret) >> +ddata->fw_loaded = true; >> + >> +return ret; >> +} >> + >> +static int stm32_rproc_tee_elf_load(struct rproc *rproc, >> +const struct firmware *fw) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> +unsigned int ret; >> + >> +/* >> + * This function can be called by remote proc for recovery >> + * without the sanity check. In this case we need to load the firmware >> + * else nothing done here as the firmware has been preloaded for the >> + * sanity check to be able to parse it for the resource table. >> + */ > > This comment is very confusing - please consider refactoring. > >> +if (ddata->fw_loaded) >> +return 0; >> + > > I'm not sure about keeping a flag to indicate the status of the loaded > firmware. > It is not done for the non-secure method, I don't see why it would be needed > for > the secure one. > The difference is in the sanity check. - In rproc_elf_sanity_check we parse the ELF file to verify that it is valid. - In stm32_rproc_tee_elf_sanity_check we have to do the same, which means authenticating the firmware. The authentication is done during the load, so this flag is used to avoid loading it twice. Refactoring the comment should help to explain this flag. An alternative would be to bypass the sanity check, but this leads to the same limitation: before loading the firmware in remoteproc_core, we call rproc_parse_fw(), which is used to get the resource table address. To get it from the TEE we need to authenticate the firmware, i.e. load it... >> +ret = tee_rproc_load_fw(ddata->trproc, fw); >> +if (ret) >> +return ret; >> +ddata->fw_loaded = true; >> + >> +/* Update the resource table parameters. */ >> +if (rproc_tee_get_rsc_table(ddata->trproc)) { >> +/* No resource table: reset the related fields. 
*/ >> +rproc->cached_table = NULL; >> +rproc->table_ptr = NULL; >> +rproc->table_sz = 0; >> +} >> + >> +return 0; >> +} >> + >> +static struct resource_table * >> +stm32_rproc_tee_elf_find_loaded_rsc_table(struct rproc *rproc, >> + const struct firmware *fw) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> + >> +return tee_rproc_get_loaded_rsc_table(ddata->trproc); >> +} >> + >> +static int stm32_rproc_tee_start(struct rproc *rproc) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> + >> +return tee_rproc_start(ddata->trproc); >> +} >>
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
Hi, On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote: > On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote: > > Declaring rodata_enabled and mark_rodata_ro() at all time > > helps removing related #ifdefery in C files. > > > > Signed-off-by: Christophe Leroy > > Very nice cleanup, thanks!, applied and pushed > > Luis On next-20240130, which has your modules-next branch, and thus this series and the other "module: Use set_memory_rox()" series applied, my kernel crashes in some very weird way. Reverting your branch makes the crash go away. I thought I'd report it right away. Maybe you folks would know what's happening here? This is on arm64. [ 10.481015] Unable to handle kernel paging request at virtual address ffde85245d30 [ 10.490369] KASAN: maybe wild-memory-access in range [0x00f42922e980-0x00f42922e987] [ 10.503744] Mem abort info: [ 10.509383] ESR = 0x9647 [ 10.514400] EC = 0x25: DABT (current EL), IL = 32 bits [ 10.522366] SET = 0, FnV = 0 [ 10.526343] EA = 0, S1PTW = 0 [ 10.530695] FSC = 0x07: level 3 translation fault [ 10.537081] Data abort info: [ 10.540839] ISV = 0, ISS = 0x0047, ISS2 = 0x [ 10.546456] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 10.551726] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000 [ 10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, pud=10023003, pmd=1001121eb003, pte= [ 10.578887] Internal error: Oops: 9647 [#1] PREEMPT SMP [ 10.585815] Modules linked in: [ 10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB 6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 3f2318148ecc5fa70d1092c2b874f9b59bdb7d60 [ 10.607021] Hardware name: Google Tentacruel board (DT) [ 10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 10.621954] pc : module_bug_finalize+0x118/0x148 [ 10.626823] lr : module_bug_finalize+0x118/0x148 [ 10.631463] sp : ffc0820478d0 [ 10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: ffde8d7d31a0 [ 
10.631477] x26: ffde85223780 x25: x24: ffde8c413cc0 [ 10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: ffde85223ba8 [ 10.631495] x20: ffde85223780 x19: ffde85245d28 x18: [ 10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: ffde8aab8124 [ 10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: ffbbd1bf9d91 [ 10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : dfc0 [ 10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : 0001 [ 10.631539] x5 : ffde8dfcec80 x4 : x3 : ffde8bbadf08 [ 10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : [ 10.631556] Call trace: [ 10.631559] module_bug_finalize+0x118/0x148 [ 10.631565] load_module+0x25ec/0x2a78 [ 10.631572] __do_sys_init_module+0x234/0x418 [ 10.631578] __arm64_sys_init_module+0x4c/0x68 [ 10.631584] invoke_syscall+0x68/0x198 [ 10.631589] el0_svc_common.constprop.0+0x11c/0x150 [ 10.631594] do_el0_svc+0x38/0x50 [ 10.631598] el0_svc+0x50/0xa0 [ 10.631604] el0t_64_sync_handler+0x120/0x130 [ 10.631609] el0t_64_sync+0x1a8/0x1b0 [ 10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675) [ 10.631624] ---[ end trace ]--- [ 10.642965] Kernel panic - not syncing: Oops: Fatal exception [ 10.642975] SMP: stopping secondary CPUs [ 10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000 [ 10.648343] PHYS_OFFSET: 0x4000 [ 10.648345] CPU features: 0x0,c061,7002814a,2100720b [ 10.648350] Memory Limit: none
Re: [PATCH RFC v3 10/35] mm: cma: Fast track allocating memory when the pages are free
On 1/25/24 22:12, Alexandru Elisei wrote: > If the pages to be allocated are free, take them directly off the buddy > allocator, instead of going through alloc_contig_range() and avoiding > costly calls to lru_cache_disable(). > > Only allocations of the same size as the CMA region order are considered, > to avoid taking the zone spinlock for too long. > > Signed-off-by: Alexandru Elisei This patch seems to be improving the standard cma_alloc() as well as the previously added new allocator, i.e. cma_alloc_range(), via a new helper cma_alloc_pages_fastpath(). Shouldn't any standard cma_alloc() improvement be discussed as an independent patch, separate from this series? Or is it somehow related to this series in a way I might be missing? > --- > > Changes since rfc v2: > > * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track > reserving tag storage when the block is free") (David Hildenbrand). > > include/linux/page-flags.h | 15 -- > mm/Kconfig | 5 + > mm/cma.c | 42 ++ > mm/memory-failure.c| 8 > mm/page_alloc.c| 23 - > 5 files changed, 73 insertions(+), 20 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 735cddc13d20..b7237bce7446 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) > #define MAGIC_HWPOISON 0x48575053U /* HWPS */ > extern void SetPageHWPoisonTakenOff(struct page *page); > extern void ClearPageHWPoisonTakenOff(struct page *page); > -extern bool take_page_off_buddy(struct page *page); > -extern bool put_page_back_buddy(struct page *page); > +extern bool PageHWPoisonTakenOff(struct page *page); > #else > PAGEFLAG_FALSE(HWPoison, hwpoison) > +TESTSCFLAG_FALSE(HWPoison, hwpoison) > #define __PG_HWPOISON 0 > +static inline void SetPageHWPoisonTakenOff(struct page *page) { } > +static inline void ClearPageHWPoisonTakenOff(struct page *page) { } > +static inline bool PageHWPoisonTakenOff(struct page 
*page) > +{ > + return false; > +} > +#endif > + > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY > +extern bool take_page_off_buddy(struct page *page, bool poison); > +extern bool put_page_back_buddy(struct page *page, bool unpoison); > #endif > > #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) > diff --git a/mm/Kconfig b/mm/Kconfig > index ffc3a2ba3a8c..341cf53898db 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR > config ARCH_SUPPORTS_MEMORY_FAILURE > bool > > +config WANTS_TAKE_PAGE_OFF_BUDDY > + bool> + > config MEMORY_FAILURE > depends on MMU > depends on ARCH_SUPPORTS_MEMORY_FAILURE > bool "Enable recovery from hardware memory errors" > select MEMORY_ISOLATION > select RAS > + select WANTS_TAKE_PAGE_OFF_BUDDY > help > Enables code to recover from some memory failures on systems > with MCA recovery. This allows a system to continue running > @@ -891,6 +895,7 @@ config CMA > depends on MMU > select MIGRATION > select MEMORY_ISOLATION > + select WANTS_TAKE_PAGE_OFF_BUDDY > help > This enables the Contiguous Memory Allocator which allows other > subsystems to allocate big physically-contiguous blocks of memory. > diff --git a/mm/cma.c b/mm/cma.c > index 2881bab12b01..15663f95d77b 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma) > static inline void cma_debug_show_areas(struct cma *cma) { } > #endif > > +/* Called with the cma mutex held. */ > +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start, > + unsigned long end) > +{ > + bool success = false; > + unsigned long i, j; > + > + /* Avoid contention on the zone lock. 
*/ > + if (start - end != 1 << cma->order_per_bit) > + return -EINVAL; > + > + for (i = start; i < end; i++) { > + if (!is_free_buddy_page(pfn_to_page(i))) > + break; > + success = take_page_off_buddy(pfn_to_page(i), false); > + if (!success) > + break; > + } > + > + if (success) > + return 0; > + > + for (j = start; j < i; j++) > + put_page_back_buddy(pfn_to_page(j), false); > + > + return -EBUSY; > +} > + > /** > * cma_alloc_range() - allocate pages in a specific range > * @cma: Contiguous memory region for which the allocation is performed. > @@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long > start, unsigned long count, > > for (i = 0; i < tries; i++) { > mutex_lock(&cma_mutex); > - err = alloc_contig_range(start, start + count, MIGRATE_CMA, > gfp); > +
Re: [PATCH v13 3/6] tracing: Add snapshot refcount
Hi Vincent, kernel test robot noticed the following build errors: [auto build test ERROR on 29142dc92c37d3259a33aef15b03e6ee25b0d188] url: https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-Zero-ring-buffer-sub-buffers/20240129-223025 base: 29142dc92c37d3259a33aef15b03e6ee25b0d188 patch link: https://lore.kernel.org/r/20240129142802.2145305-4-vdonnefort%40google.com patch subject: [PATCH v13 3/6] tracing: Add snapshot refcount config: arc-randconfig-002-20240130 (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/config) compiler: arceb-elf-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202401301740.qzzlpcyv-...@intel.com/ All errors (new ones prefixed by >>): kernel/trace/trace.c: In function 'tracing_set_tracer': kernel/trace/trace.c:6644:17: error: implicit declaration of function 'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'? [-Werror=implicit-function-declaration] 6644 | tracing_disarm_snapshot_locked(tr); | ^~ | tracing_disarm_snapshot >> kernel/trace/trace.c:6648:23: error: implicit declaration of function >> 'tracing_arm_snapshot_locked'; did you mean 'tracing_arm_snapshot'? 
>> [-Werror=implicit-function-declaration] 6648 | ret = tracing_arm_snapshot_locked(tr); | ^~~ | tracing_arm_snapshot cc1: some warnings being treated as errors vim +6648 kernel/trace/trace.c 6560 6561 int tracing_set_tracer(struct trace_array *tr, const char *buf) 6562 { 6563 struct tracer *t; 6564 #ifdef CONFIG_TRACER_MAX_TRACE 6565 bool had_max_tr; 6566 #endif 6567 int ret = 0; 6568 6569 mutex_lock(&trace_types_lock); 6570 6571 if (!tr->ring_buffer_expanded) { 6572 ret = __tracing_resize_ring_buffer(tr, trace_buf_size, 6573 RING_BUFFER_ALL_CPUS); 6574 if (ret < 0) 6575 goto out; 6576 ret = 0; 6577 } 6578 6579 for (t = trace_types; t; t = t->next) { 6580 if (strcmp(t->name, buf) == 0) 6581 break; 6582 } 6583 if (!t) { 6584 ret = -EINVAL; 6585 goto out; 6586 } 6587 if (t == tr->current_trace) 6588 goto out; 6589 6590 #ifdef CONFIG_TRACER_SNAPSHOT 6591 if (t->use_max_tr) { 6592 local_irq_disable(); 6593 arch_spin_lock(&tr->max_lock); 6594 if (tr->cond_snapshot) 6595 ret = -EBUSY; 6596 arch_spin_unlock(&tr->max_lock); 6597 local_irq_enable(); 6598 if (ret) 6599 goto out; 6600 } 6601 #endif 6602 /* Some tracers won't work on kernel command line */ 6603 if (system_state < SYSTEM_RUNNING && t->noboot) { 6604 pr_warn("Tracer '%s' is not allowed on command line, ignored\n", 6605 t->name); 6606 goto out; 6607 } 6608 6609 /* Some tracers are only allowed for the top level buffer */ 6610 if (!trace_ok_for_array(t, tr)) { 6611 ret = -EINVAL; 6612 goto out; 6613 } 6614 6615 /* If trace pipe files are being read, we can't change the tracer */ 6616 if (tr->trace_ref) { 6617 ret = -EBUSY; 6618 goto out; 6619 } 6620 6621 trace_branch_disable(); 6622 6623 tr->current_trace->enabled--; 6624 6625 if (tr->current_trace->reset) 6626 tr->current_trace->reset(tr); 6627 6628 #ifdef CONFIG_TRACER_MAX_TRACE 6629 had_max_tr = tr->current_trace->use_max_tr; 6630 6631 /* Current trace needs to be nop_trace before synchronize_rcu */ 6632 tr->current_trace = &nop_trace; 6633 6634 if (had_max_tr 
&& !t->use_max_tr) { 6635 /* 6636 * W
Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
On 1/25/24 22:12, Alexandru Elisei wrote: > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets > set in vma_alloc_zeroed_movable_folio(). > > Expand this to be more generic by adding an arch hook that modifies the gfp > flags for an allocation when the VMA is known. > > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO > is also set; from that point of view, the current behaviour is unchanged, > even though the arm64 flag is set in more places. When arm64 will have > support to reuse the tag storage for data allocation, the uses of the > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try > to reserve the corresponding tag storage for the pages being allocated. Right, but how does pushing the __GFP_ZEROTAGS addition into the gfp_t flags further down via a new arch callback, i.e. arch_calc_vma_gfp(), while still maintaining the (vma->vm_flags & VM_MTE) conditionality, improve the current scenario? The page allocator could still have analyzed the alloc flags for __GFP_ZEROTAGS for any additional work. Or does this callback just add some new core MM paths that can get __GFP_ZEROTAGS, which was not the case earlier? 
> > Signed-off-by: Alexandru Elisei > --- > arch/arm64/include/asm/page.h| 5 ++--- > arch/arm64/include/asm/pgtable.h | 3 +++ > arch/arm64/mm/fault.c| 19 ++- > include/linux/pgtable.h | 7 +++ > mm/mempolicy.c | 1 + > mm/shmem.c | 5 - > 6 files changed, 23 insertions(+), 17 deletions(-) > > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > index 2312e6ee595f..88bab032a493 100644 > --- a/arch/arm64/include/asm/page.h > +++ b/arch/arm64/include/asm/page.h > @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from, > void copy_highpage(struct page *to, struct page *from); > #define __HAVE_ARCH_COPY_HIGHPAGE > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > - unsigned long vaddr); > -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio > +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \ > + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false) > > void tag_clear_highpage(struct page *to); > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE > diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > index 79ce70fbb751..08f0904dbfc2 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, > struct folio *folio) > > #endif /* CONFIG_ARM64_MTE */ > > +#define __HAVE_ARCH_CALC_VMA_GFP > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp); > + > /* > * On AArch64, the cache coherency is handled via the set_pte_at() function. > */ > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index 55f6455a8284..4d3f0a870ad8 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long > addr_if_watchpoint, unsigned long esr, > NOKPROBE_SYMBOL(do_debug_exception); > > /* > - * Used during anonymous page fault handling. 
> + * If this is called during anonymous page fault handling, and the page is > + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as > this > + * is usually faster than separate DC ZVA and STGM. > */ > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > - unsigned long vaddr) > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) > { > - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > - > - /* > - * If the page is mapped with PROT_MTE, initialise the tags at the > - * point of allocation and page zeroing as this is usually faster than > - * separate DC ZVA and STGM. > - */ > if (vma->vm_flags & VM_MTE) > - flags |= __GFP_ZEROTAGS; > - > - return vma_alloc_folio(flags, 0, vma, vaddr, false); > + return __GFP_ZEROTAGS; > + return 0; > } > > void tag_clear_highpage(struct page *page) > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index c5ddec6b5305..98f81ca08cbe 100644 > --- a/include/linux/pgtable.h > +++ b/include/
Re: [PATCH v13 3/6] tracing: Add snapshot refcount
On Tue, Jan 30, 2024 at 05:30:38PM +0800, kernel test robot wrote: > Hi Vincent, > > kernel test robot noticed the following build errors: > > [auto build test ERROR on 29142dc92c37d3259a33aef15b03e6ee25b0d188] > > url: > https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-Zero-ring-buffer-sub-buffers/20240129-223025 > base: 29142dc92c37d3259a33aef15b03e6ee25b0d188 > patch link: > https://lore.kernel.org/r/20240129142802.2145305-4-vdonnefort%40google.com > patch subject: [PATCH v13 3/6] tracing: Add snapshot refcount > config: arc-randconfig-002-20240130 > (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/config) > compiler: arceb-elf-gcc (GCC) 13.2.0 > reproduce (this is a W=1 build): > (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version > of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot > | Closes: > https://lore.kernel.org/oe-kbuild-all/202401301740.qzzlpcyv-...@intel.com/ > > All errors (new ones prefixed by >>): > >kernel/trace/trace.c: In function 'tracing_set_tracer': >kernel/trace/trace.c:6644:17: error: implicit declaration of function > 'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'? > [-Werror=implicit-function-declaration] > 6644 | tracing_disarm_snapshot_locked(tr); > | ^~ > | tracing_disarm_snapshot > >> kernel/trace/trace.c:6648:23: error: implicit declaration of function > >> 'tracing_arm_snapshot_locked'; did you mean 'tracing_arm_snapshot'? > >> [-Werror=implicit-function-declaration] > 6648 | ret = tracing_arm_snapshot_locked(tr); > | ^~~ > | tracing_arm_snapshot >cc1: some warnings being treated as errors Right, two tracers (hwlat and osnoise) select _only_ MAX_TRACE and not TRACER_SNAPSHOT. However, AFAICT, they will not call any of the swapping functions (they don't set use_max_tr). 
So I suppose arm/disarm can be ommited in that case. > > > vim +6648 kernel/trace/trace.c > > 6560 > 6561int tracing_set_tracer(struct trace_array *tr, const char *buf) > 6562{ > 6563struct tracer *t; > 6564#ifdef CONFIG_TRACER_MAX_TRACE > 6565bool had_max_tr; > 6566#endif > 6567int ret = 0; > 6568 > 6569mutex_lock(&trace_types_lock); > 6570 > 6571if (!tr->ring_buffer_expanded) { > 6572ret = __tracing_resize_ring_buffer(tr, > trace_buf_size, > 6573 > RING_BUFFER_ALL_CPUS); > 6574if (ret < 0) > 6575goto out; > 6576ret = 0; > 6577} > 6578 > 6579for (t = trace_types; t; t = t->next) { > 6580if (strcmp(t->name, buf) == 0) > 6581break; > 6582} > 6583if (!t) { > 6584ret = -EINVAL; > 6585goto out; > 6586} > 6587if (t == tr->current_trace) > 6588goto out; > 6589 > 6590#ifdef CONFIG_TRACER_SNAPSHOT > 6591if (t->use_max_tr) { > 6592local_irq_disable(); > 6593arch_spin_lock(&tr->max_lock); > 6594if (tr->cond_snapshot) > 6595ret = -EBUSY; > 6596arch_spin_unlock(&tr->max_lock); > 6597local_irq_enable(); > 6598if (ret) > 6599goto out; > 6600} > 6601#endif > 6602/* Some tracers won't work on kernel command line */ > 6603if (system_state < SYSTEM_RUNNING && t->noboot) { > 6604pr_warn("Tracer '%s' is not allowed on command > line, ignored\n", > 6605t->name); > 6606goto out; > 6607
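[Editor's illustration] The build error above comes from callers referencing tracing_arm_snapshot_locked()/tracing_disarm_snapshot_locked() when CONFIG_TRACER_SNAPSHOT is not set. A common kernel pattern for this situation, sketched below as a standalone userspace program, is to provide no-op stubs under #else so callers keep compiling; as noted above, this is safe here because tracers that do not set use_max_tr never call the swapping functions anyway. Names and bodies are placeholders, not the actual fix:

```c
/* Userspace sketch of the stub pattern; CONFIG_TRACER_SNAPSHOT is a
 * stand-in macro, and the function bodies are placeholders. */
#include <assert.h>

#ifdef CONFIG_TRACER_SNAPSHOT
int tracing_arm_snapshot_locked(void) { return 0; /* real arming work */ }
void tracing_disarm_snapshot_locked(void) { /* real disarming work */ }
#else
/* Stubs keep callers building when the feature is configured out. */
int tracing_arm_snapshot_locked(void) { return 0; }
void tracing_disarm_snapshot_locked(void) { }
#endif

/* Simplified stand-in for the tracing_set_tracer() call sites. */
int set_tracer(int use_max_tr)
{
	int ret = 0;

	if (use_max_tr)
		ret = tracing_arm_snapshot_locked();
	else
		tracing_disarm_snapshot_locked();
	return ret;
}
```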
Re: [PATCH] media: dt-bindings: qcom,sc7280-venus: Allow one IOMMU entry
On 29/01/2024 08:48, Luca Weiss wrote: > Some SC7280-based boards crash when providing the "secure_non_pixel" > context bank, so allow only one iommu in the bindings also. > > Signed-off-by: Luca Weiss > --- Acked-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
On 30/01/2024 10:16, Chen-Yu Tsai wrote: > Hi, > > On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote: >> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote: >>> Declaring rodata_enabled and mark_rodata_ro() at all time >>> helps removing related #ifdefery in C files. >>> >>> Signed-off-by: Christophe Leroy >> >> Very nice cleanup, thanks!, applied and pushed >> >>Luis > > On next-20240130, which has your modules-next branch, and thus this > series and the other "module: Use set_memory_rox()" series applied, > my kernel crashes in some very weird way. Reverting your branch > makes the crash go away. > > I thought I'd report it right away. Maybe you folks would know what's > happening here? This is on arm64. That's strange, it seems to crash in module_bug_finalize(), which is _before_ the calls to module_enable_ro() and such. Can you try to revert the 6 patches one by one to see which one introduces the problem? In reality, only patch 677bfb9db8a3 really changes things. The other ones are more or less only cleanup. 
Thanks Christophe > > [ 10.481015] Unable to handle kernel paging request at virtual address > ffde85245d30 > [ 10.490369] KASAN: maybe wild-memory-access in range > [0x00f42922e980-0x00f42922e987] > [ 10.503744] Mem abort info: > [ 10.509383] ESR = 0x9647 > [ 10.514400] EC = 0x25: DABT (current EL), IL = 32 bits > [ 10.522366] SET = 0, FnV = 0 > [ 10.526343] EA = 0, S1PTW = 0 > [ 10.530695] FSC = 0x07: level 3 translation fault > [ 10.537081] Data abort info: > [ 10.540839] ISV = 0, ISS = 0x0047, ISS2 = 0x > [ 10.546456] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 > [ 10.551726] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [ 10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000 > [ 10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, > pud=10023003, pmd=1001121eb003, pte= > [ 10.578887] Internal error: Oops: 9647 [#1] PREEMPT SMP > [ 10.585815] Modules linked in: > [ 10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB > 6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 > 3f2318148ecc5fa70d1092c2b874f9b59bdb7d60 > [ 10.607021] Hardware name: Google Tentacruel board (DT) > [ 10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 10.621954] pc : module_bug_finalize+0x118/0x148 > [ 10.626823] lr : module_bug_finalize+0x118/0x148 > [ 10.631463] sp : ffc0820478d0 > [ 10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: > ffde8d7d31a0 > [ 10.631477] x26: ffde85223780 x25: x24: > ffde8c413cc0 > [ 10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: > ffde85223ba8 > [ 10.631495] x20: ffde85223780 x19: ffde85245d28 x18: > > [ 10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: > ffde8aab8124 > [ 10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: > ffbbd1bf9d91 > [ 10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : > dfc0 > [ 10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : > 0001 > [ 10.631539] x5 : ffde8dfcec80 x4 : x3 : > ffde8bbadf08 > [ 10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : > > [ 10.631556] Call trace: > [ 10.631559] 
module_bug_finalize+0x118/0x148 > [ 10.631565] load_module+0x25ec/0x2a78 > [ 10.631572] __do_sys_init_module+0x234/0x418 > [ 10.631578] __arm64_sys_init_module+0x4c/0x68 > [ 10.631584] invoke_syscall+0x68/0x198 > [ 10.631589] el0_svc_common.constprop.0+0x11c/0x150 > [ 10.631594] do_el0_svc+0x38/0x50 > [ 10.631598] el0_svc+0x50/0xa0 > [ 10.631604] el0t_64_sync_handler+0x120/0x130 > [ 10.631609] el0t_64_sync+0x1a8/0x1b0 > [ 10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675) > [ 10.631624] ---[ end trace ]--- > [ 10.642965] Kernel panic - not syncing: Oops: Fatal exception > [ 10.642975] SMP: stopping secondary CPUs > [ 10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000 > [ 10.648343] PHYS_OFFSET: 0x4000 > [ 10.648345] CPU features: 0x0,c061,7002814a,2100720b > [ 10.648350] Memory Limit: none >
Re: [RFC PATCH 1/2] x86/kprobes: Prohibit kprobing on INT and UD
On Mon, 29 Jan 2024 20:50:39 -0600 Jinghao Jia wrote: > On 1/29/24 19:44, Masami Hiramatsu (Google) wrote: > > On Sun, 28 Jan 2024 15:25:59 -0600 > > Jinghao Jia wrote: > > > /* Check if paddr is at an instruction boundary */ > static int can_probe(unsigned long paddr) > { > @@ -294,6 +310,16 @@ static int can_probe(unsigned long paddr) > #endif > addr += insn.length; > } > +__addr = recover_probed_instruction(buf, addr); > +if (!__addr) > +return 0; > + > +if (insn_decode_kernel(&insn, (void *)__addr) < 0) > +return 0; > + > +if (is_exception_insn(&insn)) > +return 0; > + > >>> > >>> Please don't put this outside of decoding loop. You should put these in > >>> the loop which decodes the instruction from the beginning of the function. > >>> Since the x86 instrcution is variable length, can_probe() needs to check > >>> whether that the address is instruction boundary and decodable. > >>> > >>> Thank you, > >> > >> If my understanding is correct then this is trying to decode the kprobe > >> target instruction, given that it is after the main decoding loop. Here I > >> hoisted the decoding logic out of the if(IS_ENABLED(CONFIG_CFI_CLANG)) > >> block so that we do not need to decode the same instruction twice. I left > >> the main decoding loop unchanged so it is still decoding the function from > >> the start and should handle instruction boundaries. Are there any caveats > >> that I missed? > > > > Ah, sorry I misread the patch. You're correct! > > This is a good place to do that. > > > > But hmm, I think we should add another patch to check the addr == paddr > > soon after the loop so that we will avoid decoding. > > > > Thank you, > > > > Yes, that makes sense to me. At the same time, I'm also thinking about > changing the return type of can_probe() to bool, since we are just using > int as bool in this context. 
Yes, that is also a good change :) Thank you, > > --Jinghao > > >> > >> --Jinghao > >> > >>> > if (IS_ENABLED(CONFIG_CFI_CLANG)) { > /* > * The compiler generates the following instruction > sequence > @@ -308,13 +334,6 @@ static int can_probe(unsigned long paddr) > * Also, these movl and addl are used for showing > expected > * type. So those must not be touched. > */ > -__addr = recover_probed_instruction(buf, addr); > -if (!__addr) > -return 0; > - > -if (insn_decode_kernel(&insn, (void *)__addr) < 0) > -return 0; > - > if (insn.opcode.value == 0xBA) > offset = 12; > else if (insn.opcode.value == 0x3) > -- > 2.43.0 > > >>> > >>> > > > > -- Masami Hiramatsu (Google)
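[Editor's illustration] The can_probe() discussion above hinges on one idea: because x86 instructions are variable length, you must decode forward from the function start and only treat an address as probeable if the walk lands exactly on it. A minimal userspace sketch of that boundary check, with the bool return type proposed above (the instruction-length rule here is a toy stand-in for the real x86 decoder):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: opcode byte -> instruction length for a toy ISA.
 * The real code uses insn_decode_kernel() on actual x86 bytes. */
static int insn_len(unsigned char opcode)
{
	return (opcode & 0x80) ? 2 : 1;	/* toy rule, not real x86 */
}

/* Walk from the function start; paddr is probeable only if the walk
 * lands exactly on it, i.e. it is an instruction boundary. */
static bool can_probe(const unsigned char *func, size_t func_len, size_t paddr)
{
	size_t addr = 0;

	while (addr < paddr && addr < func_len)
		addr += insn_len(func[addr]);

	return addr == paddr;
}
```

This also shows why the early `addr == paddr` exit suggested above is free: the loop already stops as soon as the boundary is reached, so no further decoding happens.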
Re: [PATCH RFC v3 09/35] mm: cma: Introduce cma_remove_mem()
Hi, I really appreciate the feedback you have given me so far. I believe the commit message isn't clear enough and there has been some confusion. A CMA user adds a CMA area to the cma_areas array with cma_declare_contiguous_nid() or cma_init_reserved_mem(). cma_init_reserved_areas() then iterates over the array and activates all cma areas (via init_cma_reserved_pageblock()). The function cma_remove_mem() is intended to be used to remove a cma area from the cma_areas array **before** the area has been activated. Usecase: a driver (in this case, the arm64 dynamic tag storage code) manages several cma areas. The driver successfully adds the first area to the cma_areas array. When the driver tries to add the second area, the function fails. Without cma_remove_mem(), the driver has no way to prevent the first area from being freed to the page allocator. cma_remove_mem() is about providing a means to do cleanup in case of error. Does that make more sense now? On Tue, Jan 30, 2024 at 11:20:56AM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > Memory is added to CMA with cma_declare_contiguous_nid() and > > cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in > > cma_init_reserved_areas(), where the page allocator can make use of it. > > cma_declare_contiguous_nid() reserves memory in memblock and marks the You forgot about cma_init_reserved_mem() which does the same thing, but yes, you are right. > for subsequent CMA usage, where as cma_init_reserved_areas() activates > these memory areas through init_cma_reserved_pageblock(). Standard page > allocator only receives these memory via free_reserved_page() - only if I don't think that's correct. init_cma_reserved_pageblock() clears the PG_reserved page flag, sets the migratetype to MIGRATE_CMA and then frees the page. After that, the page is available to the standard page allocator to use for allocation. Otherwise, what would be the point of the MIGRATE_CMA migratetype? 
> the page block activation fails. For the sake of having a complete picture, I'll add that that only happens if cma->reserve_pages_on_error is false. If the CMA user sets the field to 'true' (with cma_reserve_pages_on_error()), then the pages in the CMA region are kept PG_reserved if activation fails. > > > > > If a device manages multiple CMA areas, and there's an error when one of > > the areas is added to CMA, there is no mechanism for the device to prevent > > What kind of error ? init_cma_reserved_pageblock() fails ? But that will > not happen until cma_init_reserved_areas(). I think I haven't been clear enough. When I say that "an area is added to CMA", I mean that the memory region is added to cma_areas array, via cma_declare_contiguous_nid() or cma_init_reserved_mem(). There are several ways in which either function can fail. > > > the rest of the areas, which were added before the error occured, from > > being later added to the MIGRATE_CMA list. > > Why is this mechanism required ? cma_init_reserved_areas() scans over all > CMA areas and try and activate each of them sequentially. Why is not this > sufficient ? This patch is about removing a struct cma from the cma_areas array after it has been added to the array, with cma_declare_contiguous_nid() or cma_init_reserved_mem(), to prevent the area from being activated in cma_init_reserved_areas(). Sorry for the confusion. I'll add a check in cma_remove_mem() to fail if the cma area has been activated, and a comment to the function to explain its usage. > > > > > Add cma_remove_mem() which allows a previously reserved CMA area to be > > removed and thus it cannot be used by the page allocator. > > Successfully activated CMA areas do not get used by the buddy allocator. I don't believe that is correct, see above. > > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since rfc v2: > > > > * New patch. 
> > > > include/linux/cma.h | 1 + > > mm/cma.c| 30 +- > > 2 files changed, 30 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/cma.h b/include/linux/cma.h > > index e32559da6942..787cbec1702e 100644 > > --- a/include/linux/cma.h > > +++ b/include/linux/cma.h > > @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, > > phys_addr_t size, > > unsigned int order_per_bit, > > const char *name, > > struct cma **res_cma); > > +extern void cma_remove_mem(struct cma **res_cma); > > extern struct page *cma_alloc(struct cma *cma, unsigned long count, > > unsigned int align, > > bool no_warn); > > extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned > > long count, > > diff --git a/mm/cma.c b/mm/cma.c > > index 4a0f68b9443b..2881bab12b01 100644 > > --- a/mm/cma.c > > +++ b/mm/cma.c > > @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void
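[Editor's illustration] The usecase described above is a classic register-several-things-with-rollback pattern: when adding the Nth area fails, the driver must be able to undo the N-1 areas already added before any of them get activated. A userspace sketch of that error-cleanup shape (add_area()/remove_area() are made-up stand-ins for cma_init_reserved_mem()/cma_remove_mem(), not the kernel API):

```c
#include <assert.h>

#define MAX_AREAS 8
static int nr_areas;	/* stand-in for the cma_areas array length */

static int add_area(int ok)	/* stand-in for cma_init_reserved_mem() */
{
	if (!ok || nr_areas >= MAX_AREAS)
		return -1;
	nr_areas++;
	return 0;
}

static void remove_area(void)	/* stand-in for cma_remove_mem() */
{
	if (nr_areas > 0)
		nr_areas--;
}

/* Register n areas; the fail_at-th add fails. On failure, roll back all
 * previous adds so none of them are later activated. */
static int register_areas(int n, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		if (add_area(i != fail_at)) {
			while (i--)
				remove_area();
			return -1;
		}
	}
	return 0;
}
```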
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: > Use dax_is_supported() to validate whether the architecture has > virtually aliased caches at mount time. > > This is relevant for architectures which require a dynamic check > to validate whether they have virtually aliased data caches > (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). > > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") > Signed-off-by: Mathieu Desnoyers > Cc: Jan Kara > Cc: linux-e...@vger.kernel.org > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: linux...@kvack.org > Cc: linux-a...@vger.kernel.org > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: nvd...@lists.linux.dev > Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Honza > --- > fs/ext2/super.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 01f9addc8b1f..0398e7a90eb6 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -585,13 +585,13 @@ static int parse_options(char *options, struct > super_block *sb, > set_opt(opts->s_mount_opt, XIP); > fallthrough; > case Opt_dax: > -#ifdef CONFIG_FS_DAX > - ext2_msg(sb, KERN_WARNING, > - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); > - set_opt(opts->s_mount_opt, DAX); > -#else > - ext2_msg(sb, KERN_INFO, "dax option not supported"); > -#endif > + if (dax_is_supported()) { > + ext2_msg(sb, KERN_WARNING, > + "DAX enabled. Warning: EXPERIMENTAL, > use at your own risk"); > + set_opt(opts->s_mount_opt, DAX); > + } else { > + ext2_msg(sb, KERN_INFO, "dax option not > supported"); > + } > break; > > #if defined(CONFIG_QUOTA) > -- > 2.39.2 > -- Jan Kara SUSE Labs, CR
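[Editor's illustration] The shape of the ext2 change above is: replace a compile-time #ifdef CONFIG_FS_DAX with a runtime predicate, so architectures that can only determine cache aliasing dynamically can reject the dax mount option at mount time. A minimal userspace sketch of that pattern (dax_is_supported() is modeled as a trivial function over a flag that real hardware probing would set):

```c
#include <assert.h>
#include <stdbool.h>

static bool cache_aliasing;	/* would be probed on real hardware */

static bool dax_is_supported(void)
{
	return !cache_aliasing;
}

/* Stand-in for the Opt_dax case above; returns 1 if the dax mount
 * option was accepted, 0 if rejected. */
static int parse_dax_option(void)
{
	if (dax_is_supported()) {
		/* set_opt(opts->s_mount_opt, DAX); + EXPERIMENTAL warning */
		return 1;
	}
	/* ext2_msg(sb, KERN_INFO, "dax option not supported"); */
	return 0;
}
```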
Re: [PATCH RFC v3 10/35] mm: cma: Fast track allocating memory when the pages are free
Hi, On Tue, Jan 30, 2024 at 02:48:53PM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > If the pages to be allocated are free, take them directly off the buddy > > allocator, instead of going through alloc_contig_range() and avoiding > > costly calls to lru_cache_disable(). > > > > Only allocations of the same size as the CMA region order are considered, > > to avoid taking the zone spinlock for too long. > > > > Signed-off-by: Alexandru Elisei > > This patch seems to be improving standard cma_alloc() as well as > the previously added new allocator i.e cma_alloc_range() - via a > new helper cma_alloc_pages_fastpath(). Yes, that's correct. > > Should not any standard cma_alloc() improvement be discussed as > an independent patch separately irrespective of this series. OR > it is some how related to this series which I might be missing ? Yes, it's related to this series. I wrote this patch because it fixes a performance regression with Chrome when dynamic tag storage management is enabled [1]. I will bring back the commit message explaining that. [1] https://lore.kernel.org/linux-fsdevel/20231119165721.9849-27-alexandru.eli...@arm.com/ Thanks, Alex > > > --- > > > > Changes since rfc v2: > > > > * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track > > reserving tag storage when the block is free") (David Hildenbrand). 
> > > > include/linux/page-flags.h | 15 -- > > mm/Kconfig | 5 + > > mm/cma.c | 42 ++ > > mm/memory-failure.c| 8 > > mm/page_alloc.c| 23 - > > 5 files changed, 73 insertions(+), 20 deletions(-) > > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 735cddc13d20..b7237bce7446 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) > > #define MAGIC_HWPOISON 0x48575053U /* HWPS */ > > extern void SetPageHWPoisonTakenOff(struct page *page); > > extern void ClearPageHWPoisonTakenOff(struct page *page); > > -extern bool take_page_off_buddy(struct page *page); > > -extern bool put_page_back_buddy(struct page *page); > > +extern bool PageHWPoisonTakenOff(struct page *page); > > #else > > PAGEFLAG_FALSE(HWPoison, hwpoison) > > +TESTSCFLAG_FALSE(HWPoison, hwpoison) > > #define __PG_HWPOISON 0 > > +static inline void SetPageHWPoisonTakenOff(struct page *page) { } > > +static inline void ClearPageHWPoisonTakenOff(struct page *page) { } > > +static inline bool PageHWPoisonTakenOff(struct page *page) > > +{ > > + return false; > > +} > > +#endif > > + > > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY > > +extern bool take_page_off_buddy(struct page *page, bool poison); > > +extern bool put_page_back_buddy(struct page *page, bool unpoison); > > #endif > > > > #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) > > diff --git a/mm/Kconfig b/mm/Kconfig > > index ffc3a2ba3a8c..341cf53898db 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR > > config ARCH_SUPPORTS_MEMORY_FAILURE > > bool > > > > +config WANTS_TAKE_PAGE_OFF_BUDDY > > + bool> + > > config MEMORY_FAILURE > > depends on MMU > > depends on ARCH_SUPPORTS_MEMORY_FAILURE > > bool "Enable recovery from hardware memory errors" > > select MEMORY_ISOLATION > > select RAS > > + select WANTS_TAKE_PAGE_OFF_BUDDY > > help > > Enables code to recover from 
some memory failures on systems > > with MCA recovery. This allows a system to continue running > > @@ -891,6 +895,7 @@ config CMA > > depends on MMU > > select MIGRATION > > select MEMORY_ISOLATION > > + select WANTS_TAKE_PAGE_OFF_BUDDY > > help > > This enables the Contiguous Memory Allocator which allows other > > subsystems to allocate big physically-contiguous blocks of memory. > > diff --git a/mm/cma.c b/mm/cma.c > > index 2881bab12b01..15663f95d77b 100644 > > --- a/mm/cma.c > > +++ b/mm/cma.c > > @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma) > > static inline void cma_debug_show_areas(struct cma *cma) { } > > #endif > > > > +/* Called with the cma mutex held. */ > > +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start, > > + unsigned long end) > > +{ > > + bool success = false; > > + unsigned long i, j; > > + > > + /* Avoid contention on the zone lock. */ > > + if (start - end != 1 << cma->order_per_bit) > > + return -EINVAL; > > + > > + for (i = start; i < end; i++) { > > + if (!is_free_buddy_page(pfn_to_page(i))) > > + break; > > + success = take_page_off_buddy(pfn_to_page(i), false); > > + if (!success) > > + break; > > + } > > + > > + if (success
Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
Hi, On Tue, Jan 30, 2024 at 03:25:20PM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. > > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and > > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets > > set in vma_alloc_zeroed_movable_folio(). > > > > Expand this to be more generic by adding an arch hook that modifes the gfp > > flags for an allocation when the VMA is known. > > > > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO > > is also set; from that point of view, the current behaviour is unchanged, > > even though the arm64 flag is set in more places. When arm64 will have > > support to reuse the tag storage for data allocation, the uses of the > > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try > > to reserve the corresponding tag storage for the pages being allocated. > > Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further > down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining > (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because I'm afraid I don't follow you. > the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS > for any additional stuff. > > OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not > the case earlier via this call back. Before this patch: vma_alloc_zeroed_movable_folio() sets __GFP_ZEROTAGS. After this patch: vma_alloc_folio() sets __GFP_ZEROTAGS. This patch is about adding __GFP_ZEROTAGS for more callers. 
Thanks, Alex > > > > > The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by > > the caller; this has been done to keep an architecture from modifying the > > flags already set by the core memory management code; this is similar to > > how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been > > implemented. This can be revisited in the future if there's a need to do > > so. > > > > Signed-off-by: Alexandru Elisei > > --- > > arch/arm64/include/asm/page.h| 5 ++--- > > arch/arm64/include/asm/pgtable.h | 3 +++ > > arch/arm64/mm/fault.c| 19 ++- > > include/linux/pgtable.h | 7 +++ > > mm/mempolicy.c | 1 + > > mm/shmem.c | 5 - > > 6 files changed, 23 insertions(+), 17 deletions(-) > > > > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > > index 2312e6ee595f..88bab032a493 100644 > > --- a/arch/arm64/include/asm/page.h > > +++ b/arch/arm64/include/asm/page.h > > @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page > > *from, > > void copy_highpage(struct page *to, struct page *from); > > #define __HAVE_ARCH_COPY_HIGHPAGE > > > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > > - unsigned long vaddr); > > -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio > > +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \ > > + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false) > > > > void tag_clear_highpage(struct page *to); > > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE > > diff --git a/arch/arm64/include/asm/pgtable.h > > b/arch/arm64/include/asm/pgtable.h > > index 79ce70fbb751..08f0904dbfc2 100644 > > --- a/arch/arm64/include/asm/pgtable.h > > +++ b/arch/arm64/include/asm/pgtable.h > > @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t > > entry, struct folio *folio) > > > > #endif /* CONFIG_ARM64_MTE */ > > > > +#define __HAVE_ARCH_CALC_VMA_GFP > > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t 
gfp); > > + > > /* > > * On AArch64, the cache coherency is handled via the set_pte_at() > > function. > > */ > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > > index 55f6455a8284..4d3f0a870ad8 100644 > > --- a/arch/arm64/mm/fault.c > > +++ b/arch/arm64/mm/fault.c > > @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long > > addr_if_watchpoint, unsigned long esr, > > NOKPROBE_SYMBOL(do_debug_exception); > > > > /* > > - * Used during anonymous page fault handling. > > + * If this is called during anonymous page fault handling, and the page is > > + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing > > as this > > + * is usually faster than separate DC ZVA and STGM. > > */ > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > > - unsigned long vaddr) > > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) > > { > > - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > > - > > - /* > > -* If the page is mapped with PROT_MTE, initialise the tags at the > > -* point of allocati
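[Editor's illustration] The key property of arch_calc_vma_gfp() described in this thread is that its return value is or'ed into the caller's flags, so the architecture can only add flags (here __GFP_ZEROTAGS for VM_MTE VMAs), never clear ones the core MM already set. A userspace sketch of that contract (flag values are illustrative, not the kernel's):

```c
#include <assert.h>

#define __GFP_ZERO	0x1u
#define __GFP_ZEROTAGS	0x2u
#define VM_MTE		0x4u

struct vm_area_struct { unsigned long vm_flags; };

/* Mirrors the arm64 hook in the patch: request tag zeroing for MTE VMAs. */
static unsigned int arch_calc_vma_gfp(struct vm_area_struct *vma,
				      unsigned int gfp)
{
	(void)gfp;
	if (vma->vm_flags & VM_MTE)
		return __GFP_ZEROTAGS;
	return 0;
}

/* Caller side: the hook's result is or'ed in, never assigned. */
static unsigned int vma_alloc_gfp(struct vm_area_struct *vma, unsigned int gfp)
{
	return gfp | arch_calc_vma_gfp(vma, gfp);
}
```

Note that, as the commit message says, __GFP_ZEROTAGS without __GFP_ZERO is ignored by the page allocator, so adding the flag in more places does not change current behaviour.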
Re: [PATCH RFC v3 08/35] mm: cma: Introduce cma_alloc_range()
Hi, On Tue, Jan 30, 2024 at 10:50:00AM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > Today, cma_alloc() is used to allocate a contiguous memory region. The > > function allows the caller to specify the number of pages to allocate, but > > not the starting address. cma_alloc() will walk over the entire CMA region > > trying to allocate the first available range of the specified size. > > > > Introduce cma_alloc_range(), which makes CMA more versatile by allowing the > > caller to specify a particular range in the CMA region, defined by the > > start pfn and the size. > > > > arm64 will make use of this function when tag storage management will be > > implemented: cma_alloc_range() will be used to reserve the tag storage > > associated with a tagged page. > > Basically, you would like to pass on a preferred start address and the > allocation could just fail if a contig range is not available from such > a starting address ? > > Then why not just change cma_alloc() to take a new argument 'start_pfn'. > Why create a new but almost similar allocator ? I tried doing that, and I gave up because: - It made cma_alloc() even more complex and hard to follow. - What value should 'start_pfn' be to tell cma_alloc() that it should be ignored? Or, to put it another way, what pfn number is invalid on **all** platforms that Linux supports? I can give it another go if we can come up with an invalid value for 'start_pfn'. > > But then I am wondering why this could not be done in the arm64 platform > code itself operating on a CMA area reserved just for tag storage. Unless > this new allocator has other usage beyond MTE, this could be implemented > in the platform itself. I had the same idea in the previous iteration, David Hildenbrand suggested this approach [1]. 
[1] https://lore.kernel.org/linux-fsdevel/2aafd53f-af1f-45f3-a08c-d11962254...@redhat.com/ Thanks, Alex > > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since rfc v2: > > > > * New patch. > > > > include/linux/cma.h| 2 + > > include/trace/events/cma.h | 59 ++ > > mm/cma.c | 86 ++ > > 3 files changed, 147 insertions(+) > > > > diff --git a/include/linux/cma.h b/include/linux/cma.h > > index 63873b93deaa..e32559da6942 100644 > > --- a/include/linux/cma.h > > +++ b/include/linux/cma.h > > @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, > > phys_addr_t size, > > struct cma **res_cma); > > extern struct page *cma_alloc(struct cma *cma, unsigned long count, > > unsigned int align, > > bool no_warn); > > +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned > > long count, > > + unsigned tries, gfp_t gfp); > > extern bool cma_pages_valid(struct cma *cma, const struct page *pages, > > unsigned long count); > > extern bool cma_release(struct cma *cma, const struct page *pages, > > unsigned long count); > > > > diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h > > index 25103e67737c..a89af313a572 100644 > > --- a/include/trace/events/cma.h > > +++ b/include/trace/events/cma.h > > @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release, > > __entry->count) > > ); > > > > +TRACE_EVENT(cma_alloc_range_start, > > + > > + TP_PROTO(const char *name, unsigned long start, unsigned long count, > > +unsigned tries), > > + > > + TP_ARGS(name, start, count, tries), > > + > > + TP_STRUCT__entry( > > + __string(name, name) > > + __field(unsigned long, start) > > + __field(unsigned long, count) > > + __field(unsigned, tries) > > + ), > > + > > + TP_fast_assign( > > + __assign_str(name, name); > > + __entry->start = start; > > + __entry->count = count; > > + __entry->tries = tries; > > + ), > > + > > + TP_printk("name=%s start=%lx count=%lu tries=%u", > > + __get_str(name), > > + __entry->start, > > + __entry->count, > > + 
__entry->tries) > > +); > > + > > +TRACE_EVENT(cma_alloc_range_finish, > > + > > + TP_PROTO(const char *name, unsigned long start, unsigned long count, > > +unsigned attempts, int err), > > + > > + TP_ARGS(name, start, count, attempts, err), > > + > > + TP_STRUCT__entry( > > + __string(name, name) > > + __field(unsigned long, start) > > + __field(unsigned long, count) > > + __field(unsigned, attempts) > > + __field(int, err) > > + ), > > + > > + TP_fast_assign( > > + __assign_str(name, name); > > + __entry->start = start; > > + __entry->count = count; > > + __entry->attempts = attempts; > > + __entry->err = err; > > + ), > > + > > + TP_printk("name=%s start=%lx count=%lu a
[PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
Currently there are three page frag implementations which all try to allocate an order 3 page; if that fails, they fall back to allocating an order 0 page, and each of them allows the order 3 page allocation to fail under certain conditions by using specific gfp bits. The gfp bits for the order 3 page allocation differ between the implementations: __GFP_NOMEMALLOC is or'd to forbid access to emergency reserve memory in __page_frag_cache_refill(), but it is not or'd in the other implementations; __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in skb_page_frag_refill(), but it is not masked off in __page_frag_cache_refill(). This patch unifies the gfp bits used between the implementations by or'ing __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for the order 3 page allocation, to avoid putting possible pressure on mm. Signed-off-by: Yunsheng Lin Reviewed-by: Alexander Duyck CC: Alexander Duyck --- drivers/vhost/net.c | 2 +- mm/page_alloc.c | 4 ++-- net/core/sock.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f2ed7167c848..e574e21cc0ca 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz, /* Avoid direct reclaim but allow kswapd to wake */ pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY, + __GFP_NORETRY | __GFP_NOMEMALLOC, SKB_FRAG_PAGE_ORDER); if (likely(pfrag->page)) { pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c0f7e67c4250..636145c29f70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc, gfp_t gfp = gfp_mask; #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY | - __GFP_NOMEMALLOC; + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | 
__GFP_COMP | + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER); nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE; diff --git a/net/core/sock.c b/net/core/sock.c index 88bf810394a5..8289a3d8c375 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp) /* Avoid direct reclaim but allow kswapd to wake */ pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY, + __GFP_NORETRY | __GFP_NOMEMALLOC, SKB_FRAG_PAGE_ORDER); if (likely(pfrag->page)) { pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; -- 2.33.0
[PATCH net-next v4 4/5] vhost/net: remove vhost_net_page_frag_refill()
The page frag in vhost_net_page_frag_refill() uses the 'struct page_frag' from skb_page_frag_refill(), but its implementation is similar to page_frag_alloc_align() now. This patch removes vhost_net_page_frag_refill() by using 'struct page_frag_cache' instead of 'struct page_frag', and allocating frags using page_frag_alloc_align(). The added benefit is that it not only unifies the page frag implementation a little, but also gives about a 0.5% performance boost when testing with the vhost_net_test introduced in the last patch. Signed-off-by: Yunsheng Lin Acked-by: Jason Wang --- drivers/vhost/net.c | 91 ++--- 1 file changed, 27 insertions(+), 64 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index e574e21cc0ca..4b2fcb228a0a 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -141,10 +141,8 @@ struct vhost_net { unsigned tx_zcopy_err; /* Flush in progress. Protected by tx vq lock. */ bool tx_flush; - /* Private page frag */ - struct page_frag page_frag; - /* Refcount bias of page frag */ - int refcnt_bias; + /* Private page frag cache */ + struct page_frag_cache pf_cache; }; static unsigned vhost_net_zcopy_mask __read_mostly; @@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len) !vhost_vq_avail_empty(vq->dev, vq); } -static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz, - struct page_frag *pfrag, gfp_t gfp) -{ - if (pfrag->page) { - if (pfrag->offset + sz <= pfrag->size) - return true; - __page_frag_cache_drain(pfrag->page, net->refcnt_bias); - } - - pfrag->offset = 0; - net->refcnt_bias = 0; - if (SKB_FRAG_PAGE_ORDER) { - /* Avoid direct reclaim but allow kswapd to wake */ - pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | - __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY | __GFP_NOMEMALLOC, - SKB_FRAG_PAGE_ORDER); - if (likely(pfrag->page)) { - pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; - goto done; - } - } - pfrag->page = alloc_page(gfp); - if (likely(pfrag->page)) 
{ - pfrag->size = PAGE_SIZE; - goto done; - } - return false; - -done: - net->refcnt_bias = USHRT_MAX; - page_ref_add(pfrag->page, USHRT_MAX - 1); - return true; -} - #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD) static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, @@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev); struct socket *sock = vhost_vq_get_backend(vq); - struct page_frag *alloc_frag = &net->page_frag; struct virtio_net_hdr *gso; struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp]; struct tun_xdp_hdr *hdr; @@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, int sock_hlen = nvq->sock_hlen; void *buf; int copied; + int ret; if (unlikely(len < nvq->sock_hlen)) return -EFAULT; @@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, return -ENOSPC; buflen += SKB_DATA_ALIGN(len + pad); - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES); - if (unlikely(!vhost_net_page_frag_refill(net, buflen, -alloc_frag, GFP_KERNEL))) + buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL, + SMP_CACHE_BYTES); + if (unlikely(!buf)) return -ENOMEM; - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; - copied = copy_page_from_iter(alloc_frag->page, -alloc_frag->offset + -offsetof(struct tun_xdp_hdr, gso), -sock_hlen, from); - if (copied != sock_hlen) - return -EFAULT; + copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso), + sock_hlen, from); + if (copied != sock_hlen) { + ret = -EFAULT; + goto err; + } hdr = buf; gso = &hdr->gso; @@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, vhost16_to_cpu(vq, gso->csum_start) + vhost16_to_cpu(vq, gso->csum_offset) + 2); -
[PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
Introduce vhost_net_test, based on virtio_test, to test vhost_net changes in the kernel. Signed-off-by: Yunsheng Lin --- tools/virtio/.gitignore | 1 + tools/virtio/Makefile | 8 +- tools/virtio/vhost_net_test.c | 576 ++ 3 files changed, 582 insertions(+), 3 deletions(-) create mode 100644 tools/virtio/vhost_net_test.c diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore index 9934d48d9a55..7e47b281c442 100644 --- a/tools/virtio/.gitignore +++ b/tools/virtio/.gitignore @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only *.d virtio_test +vhost_net_test vringh_test virtio-trace/trace-agent diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile index d128925980e0..e25e99c1c3b7 100644 --- a/tools/virtio/Makefile +++ b/tools/virtio/Makefile @@ -1,8 +1,9 @@ # SPDX-License-Identifier: GPL-2.0 all: test mod -test: virtio_test vringh_test +test: virtio_test vringh_test vhost_net_test virtio_test: virtio_ring.o virtio_test.o vringh_test: vringh_test.o vringh.o virtio_ring.o +vhost_net_test: virtio_ring.o vhost_net_test.o try-run = $(shell set -e; \ if ($(1)) >/dev/null 2>&1; \ @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean .PHONY: all test mod clean vhost oot oot-clean oot-build clean: - ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \ - vhost_test/Module.symvers vhost_test/modules.order *.d + ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \ + vhost_test/.*.cmd vhost_test/Module.symvers \ + vhost_test/modules.order *.d -include *.d diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c new file mode 100644 index ..e336792a0d77 --- /dev/null +++ b/tools/virtio/vhost_net_test.c @@ -0,0 +1,576 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define RANDOM_BATCH -1 +#define 
HDR_LEN12 +#define TEST_BUF_LEN 256 +#define TEST_PTYPE ETH_P_LOOPBACK + +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */ +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end; + +struct vq_info { + int kick; + int call; + int idx; + long started; + long completed; + struct pollfd fds; + void *ring; + /* copy used for control */ + struct vring vring; + struct virtqueue *vq; +}; + +struct vdev_info { + struct virtio_device vdev; + int control; + struct vq_info vqs[2]; + int nvqs; + void *buf; + size_t buf_size; + char *test_buf; + char *res_buf; + struct vhost_memory *mem; + int sock; + int ifindex; + unsigned char mac[ETHER_ADDR_LEN]; +}; + +static int tun_alloc(struct vdev_info *dev) +{ + struct ifreq ifr; + int len = HDR_LEN; + int fd, e; + + fd = open("/dev/net/tun", O_RDWR); + if (fd < 0) { + perror("Cannot open /dev/net/tun"); + return fd; + } + + memset(&ifr, 0, sizeof(ifr)); + + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; + snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid()); + + e = ioctl(fd, TUNSETIFF, &ifr); + if (e < 0) { + perror("ioctl[TUNSETIFF]"); + close(fd); + return e; + } + + e = ioctl(fd, TUNSETVNETHDRSZ, &len); + if (e < 0) { + perror("ioctl[TUNSETVNETHDRSZ]"); + close(fd); + return e; + } + + e = ioctl(fd, SIOCGIFHWADDR, &ifr); + if (e < 0) { + perror("ioctl[SIOCGIFHWADDR]"); + close(fd); + return e; + } + + memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN); + return fd; +} + +static void vdev_create_socket(struct vdev_info *dev) +{ + struct ifreq ifr; + + dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE)); + assert(dev->sock != -1); + + snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid()); + assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0); + + dev->ifindex = ifr.ifr_ifindex; + + /* Set the flags that bring the device up */ + assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0); + ifr.ifr_flags |= (IFF_UP | IFF_RUNNING); + assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0); +} + 
+static void vdev_send_packet(struct vdev_info *dev) +{ + char *sendbuf = dev->test_buf + HDR_LEN; + struct sockaddr_ll saddrll = {0}; + int sockfd = dev->sock; + int ret; + + saddrll.sll_family = PF_PACKET; + saddrll.sll_ifindex = dev->ifindex; + saddrll.sll_halen = ETH_ALEN; + sadd
Re: [PATCH RFC v3 23/35] arm64: mte: Try to reserve tag storage in arch_alloc_page()
Hi Peter, On Mon, Jan 29, 2024 at 04:04:18PM -0800, Peter Collingbourne wrote: > On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei > wrote: > > > > Reserve tag storage for a page that is being allocated as tagged. This > > is a best effort approach, and failing to reserve tag storage is > > allowed. > > > > When all the associated tagged pages have been freed, return the tag > > storage pages back to the page allocator, where they can be used again for > > data allocations. > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since rfc v2: > > > > * Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page > > allocation"). > > * Fixed calculation of the number of associated tag storage blocks (Hyesoo > > Yu). > > * Tag storage is reserved in arch_alloc_page() instead of > > arch_prep_new_page(). > > > > arch/arm64/include/asm/mte.h | 16 +- > > arch/arm64/include/asm/mte_tag_storage.h | 31 +++ > > arch/arm64/include/asm/page.h| 5 + > > arch/arm64/include/asm/pgtable.h | 19 ++ > > arch/arm64/kernel/mte_tag_storage.c | 234 +++ > > arch/arm64/mm/fault.c| 7 + > > fs/proc/page.c | 1 + > > include/linux/kernel-page-flags.h| 1 + > > include/linux/page-flags.h | 1 + > > include/trace/events/mmflags.h | 3 +- > > mm/huge_memory.c | 1 + > > 11 files changed, 316 insertions(+), 3 deletions(-) > > > > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h > > index 8034695b3dd7..6457b7899207 100644 > > --- a/arch/arm64/include/asm/mte.h > > +++ b/arch/arm64/include/asm/mte.h > > @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf); > > #ifdef CONFIG_ARM64_MTE > > > > /* track which pages have valid allocation tags */ > > -#define PG_mte_tagged PG_arch_2 > > +#define PG_mte_tagged PG_arch_2 > > /* simple lock to avoid multiple threads tagging the same page */ > > -#define PG_mte_lockPG_arch_3 > > +#define PG_mte_lockPG_arch_3 > > +/* Track if a tagged page has tag storage reserved */ > > +#define PG_tag_storage_reservedPG_arch_4 > > + > > 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > +extern bool page_tag_storage_reserved(struct page *page); > > +#endif > > > > static inline void set_page_mte_tagged(struct page *page) > > { > > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > + /* Open code mte_tag_storage_enabled() */ > > + WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) && > > +!page_tag_storage_reserved(page)); > > +#endif > > /* > > * Ensure that the tags written prior to this function are visible > > * before the page flags update. > > diff --git a/arch/arm64/include/asm/mte_tag_storage.h > > b/arch/arm64/include/asm/mte_tag_storage.h > > index 7b3f6bff8e6f..09f1318d924e 100644 > > --- a/arch/arm64/include/asm/mte_tag_storage.h > > +++ b/arch/arm64/include/asm/mte_tag_storage.h > > @@ -5,6 +5,12 @@ > > #ifndef __ASM_MTE_TAG_STORAGE_H > > #define __ASM_MTE_TAG_STORAGE_H > > > > +#ifndef __ASSEMBLY__ > > + > > +#include > > + > > +#include > > + > > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > > > DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > @@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void) > > } > > > > void mte_init_tag_storage(void); > > + > > +static inline bool alloc_requires_tag_storage(gfp_t gfp) > > +{ > > + return gfp & __GFP_TAGGED; > > +} > > +int reserve_tag_storage(struct page *page, int order, gfp_t gfp); > > +void free_tag_storage(struct page *page, int order); > > + > > +bool page_tag_storage_reserved(struct page *page); > > #else > > static inline bool tag_storage_enabled(void) > > { > > @@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void) > > static inline void mte_init_tag_storage(void) > > { > > } > > +static inline bool alloc_requires_tag_storage(struct page *page) > > This function should take a gfp_t to match the > CONFIG_ARM64_MTE_TAG_STORAGE case. Ah, yes, it should, nice catch, the compiler didn't throw an error. Will fix, thanks! Alex
Re: [PATCH v3] virtio_net: Support RX hash XDP hint
On Thu, 25 Jan 2024 18:19:12 +0800, Liang Chen wrote: > The RSS hash report is a feature that's part of the virtio specification. > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost > (still a work in progress as per [1]) support this feature. While the > capability to obtain the RSS hash has been enabled in the normal path, > it's currently missing in the XDP path. Therefore, we are introducing > XDP hints through kfuncs to allow XDP programs to access the RSS hash. > > 1. > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r > > Signed-off-by: Liang Chen > --- > drivers/net/virtio_net.c | 98 +++- > 1 file changed, 86 insertions(+), 12 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index d7ce4a1011ea..0c845f2223da 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -349,6 +349,12 @@ struct virtio_net_common_hdr { > }; > }; > > +struct virtnet_xdp_buff { > + struct xdp_buff xdp; > + u32 hash_value; > + u16 hash_report; > +}; > + > static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); > > static bool is_xdp_frame(void *ptr) > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp) > } > } > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff *virtnet_xdp, > + struct net_device *dev, > + struct virtio_net_hdr_v1_hash *hdr_hash) > +{ > + if (dev->features & NETIF_F_RXHASH) { > + virtnet_xdp->hash_value = __le32_to_cpu(hdr_hash->hash_value); > + virtnet_xdp->hash_report = __le16_to_cpu(hdr_hash->hash_report); Could we put the __leXX_to_cpu to virtnet_xdp_rx_hash? Other looks good to me. Reviewed-by: Xuan Zhuo Thanks. 
> + } > +} > + > static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff > *xdp, > struct net_device *dev, > unsigned int *xdp_xmit, > @@ -1199,9 +1215,10 @@ static struct sk_buff *receive_small_xdp(struct > net_device *dev, > unsigned int headroom = vi->hdr_len + header_offset; > struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset; > struct page *page = virt_to_head_page(buf); > + struct virtnet_xdp_buff virtnet_xdp; > struct page *xdp_page; > + struct xdp_buff *xdp; > unsigned int buflen; > - struct xdp_buff xdp; > struct sk_buff *skb; > unsigned int metasize = 0; > u32 act; > @@ -1233,17 +1250,20 @@ static struct sk_buff *receive_small_xdp(struct > net_device *dev, > page = xdp_page; > } > > - xdp_init_buff(&xdp, buflen, &rq->xdp_rxq); > - xdp_prepare_buff(&xdp, buf + VIRTNET_RX_PAD + vi->hdr_len, > + xdp = &virtnet_xdp.xdp; > + xdp_init_buff(xdp, buflen, &rq->xdp_rxq); > + xdp_prepare_buff(xdp, buf + VIRTNET_RX_PAD + vi->hdr_len, >xdp_headroom, len, true); > > - act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats); > + virtnet_xdp_save_rx_hash(&virtnet_xdp, dev, (void *)hdr); > + > + act = virtnet_xdp_handler(xdp_prog, xdp, dev, xdp_xmit, stats); > > switch (act) { > case XDP_PASS: > /* Recalculate length in case bpf program changed it */ > - len = xdp.data_end - xdp.data; > - metasize = xdp.data - xdp.data_meta; > + len = xdp->data_end - xdp->data; > + metasize = xdp->data - xdp->data_meta; > break; > > case XDP_TX: > @@ -1254,7 +1274,7 @@ static struct sk_buff *receive_small_xdp(struct > net_device *dev, > goto err_xdp; > } > > - skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len); > + skb = virtnet_build_skb(buf, buflen, xdp->data - buf, len); > if (unlikely(!skb)) > goto err; > > @@ -1591,10 +1611,11 @@ static struct sk_buff *receive_mergeable_xdp(struct > net_device *dev, > int num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers); > struct page *page = virt_to_head_page(buf); > int offset = buf - 
page_address(page); > + struct virtnet_xdp_buff virtnet_xdp; > unsigned int xdp_frags_truesz = 0; > struct sk_buff *head_skb; > unsigned int frame_sz; > - struct xdp_buff xdp; > + struct xdp_buff *xdp; > void *data; > u32 act; > int err; > @@ -1604,16 +1625,19 @@ static struct sk_buff *receive_mergeable_xdp(struct > net_device *dev, > if (unlikely(!data)) > goto err_xdp; > > - err = virtnet_build_xdp_buff_mrg(dev, vi, rq, &xdp, data, len, frame_sz, > + xdp = &virtnet_xdp.xdp; > + err = virtnet_build_xdp_buff_mrg(dev, vi, rq, xdp, data, len, frame_sz, >&num_buf, &xdp_frags_truesz, stats); > if (unlikely(err)) > goto err_xdp; > > - a
Re: [PATCH RFC v3 01/35] mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()
Hi, On Tue, Jan 30, 2024 at 09:56:10AM +0530, Anshuman Khandual wrote: > > > On 1/29/24 17:11, Alexandru Elisei wrote: > > Hi, > > > > On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote: > >> On 1/25/24 22:12, Alexandru Elisei wrote: > >>> Extend the usefulness of arch_alloc_page() by adding the gfp_flags > >>> parameter. > >> Although the change here is harmless in itself, it will definitely benefit > >> from some additional context explaining the rationale, taking into account > >> why-how arch_alloc_page() got added particularly for s390 platform and how > >> it's going to be used in the present proposal. > > arm64 will use it to reserve tag storage if the caller requested a tagged > > page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but > > I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename > > __GFP_ZEROTAGS to __GFP_TAGGED") [1]. > > > > [1] > > https://lore.kernel.org/lkml/20240125164256.4147-19-alexandru.eli...@arm.com/ > > Makes sense, but please do update the commit message explaining how > new gfp mask argument will be used to detect tagged page allocation > requests, further requiring tag storage allocation. Will do, thanks! Alex
Re: [PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
Hi, On Tue, Jan 30, 2024 at 10:04:02AM +0530, Anshuman Khandual wrote: > > > On 1/29/24 17:16, Alexandru Elisei wrote: > > Hi, > > > > On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote: > >> > >> > >> On 1/25/24 22:12, Alexandru Elisei wrote: > >>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code") > >>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list > >>> because CMA is always allowed when __GFP_MOVABLE is set. > >>> > >>> With the introduction of the arch_alloc_cma() function, the above is not > >>> true anymore, so bring back the filter. > >> > >> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being > >> assigned to alloc_flags in gfp_to_alloc_flags_cma(). > > > > Can I add your Reviewed-by tag then? > > I think all these changes need to be reviewed in their entirety > even though some patches do look good on their own. For example > this patch depends on whether [PATCH 03/35] is acceptable or not. > > I would suggest separating out CMA patches which could be debated > and merged regardless of this series. Ah, I see, makes sense. Since basically all the core mm changes are there to enable dynamic tag storage for arm64, I'll hold on until the series stabilises before separating the core mm from the arm64 patches. Thanks, Alex
Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
Hi, On Tue, Jan 30, 2024 at 10:22:11AM +0530, Anshuman Khandual wrote: > > > On 1/29/24 17:21, Alexandru Elisei wrote: > > Hi, > > > > On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote: > >> > >> > >> On 1/25/24 22:12, Alexandru Elisei wrote: > >>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one > >>> after each cma_alloc() function call. This is done even though cma_alloc() > >>> can allocate an arbitrary number of CMA pages. When looking at > >>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls > >>> doesn't tell much with regards to how many CMA pages were allocated via > >>> cma_alloc() versus via the page allocator (regular allocation request or > >>> PCP lists refill). > >>> > >>> This can also be rather confusing to a user who isn't familiar with the > >>> code, since the unit of measurement for nr_free_cma is the number of > >>> pages, > >>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc() > >>> function calls. > >>> > >>> Let's make this consistent, and arguably more useful, by having > >>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, > >>> and > >>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to > >>> allocate. > >>> > >>> For users that wish to track the number of cma_alloc() calls, there are > >>> tracepoints for that already implemented. 
> >>> > >>> Signed-off-by: Alexandru Elisei > >>> --- > >>> mm/cma.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/mm/cma.c b/mm/cma.c > >>> index f49c95f8ee37..dbf7fe8cb1bd 100644 > >>> --- a/mm/cma.c > >>> +++ b/mm/cma.c > >>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned > >>> long count, > >>> pr_debug("%s(): returned %p\n", __func__, page); > >>> out: > >>> if (page) { > >>> - count_vm_event(CMA_ALLOC_SUCCESS); > >>> + count_vm_events(CMA_ALLOC_SUCCESS, count); > >>> cma_sysfs_account_success_pages(cma, count); > >>> } else { > >>> - count_vm_event(CMA_ALLOC_FAIL); > >>> + count_vm_events(CMA_ALLOC_FAIL, count); > >>> if (cma) > >>> cma_sysfs_account_fail_pages(cma, count); > >>> } > >> > >> Without getting into the merits of this patch - which is actually trying > >> to do > >> semantics change to /proc/vmstat, wondering how is this even related to > >> this > >> particular series ? If required this could be debated on it's on > >> separately. > > > > Having the number of CMA pages allocated and the number of CMA pages freed > > allows someone to infer how many tagged pages are in use at a given time: > > That should not be done in CMA which is a generic multi purpose allocator. Ah, ok. Let me rephrase that: Having the number of CMA pages allocated, the number of failed CMA page allocations and the number of freed CMA pages allows someone to infer how many CMA pages are in use at a given time. That's valuable information for software designers and system administrators, as it allows them to tune the number of CMA pages available in a system. Or put another way: what would you consider to be more useful? Knowing the number of cma_alloc()/cma_release() calls, or knowing the number of pages that cma_alloc()/cma_release() allocated or freed? > > > (allocated CMA pages - CMA pages allocated by drivers* - CMA pages > > released) * 32. 
That is valuable information for software and hardware > > designers. > > > > Besides that, for every iteration of the series, this has proven invaluable > > for discovering bugs with freeing and/or reserving tag storage pages. > > I am afraid that might not be enough justification for getting something > merged mainline. > > > > > *that would require userspace reading cma_alloc_success and > > cma_release_success before any tagged allocations are performed. > > While assuming that no other non-memory-tagged CMA based allocation amd free > call happens in the meantime ? That would be on real thin ice. > > I suppose arm64 tagged memory specific allocation or free related counters > need to be created on the caller side, including arch_free_pages_prepare(). I'll think about this. At the very least, I can add tracepoints. Thanks, Alex
[PATCH v3 0/5] PM: domains: Add helpers for multi PM domains to avoid open-coding
Rafael, my plan is to queue up this series via my pmdomain tree. Please let me know if you see any issues with that, especially around patch1. Updates in v3: - Added tested-by, reviewed-by and suggested-by tags. No other changes have been made. Updates in v2: - Ccing Daniel Baluta and Iuliana Prodan on the NXP remoteproc patches to request help with testing. - Fixed NULL pointer bug in patch1, pointed out by Nikunj. - Added some tested/reviewed-by tags. Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers. This series adds a pair of helper functions to manage the attach/detach of a device to its multiple PM domains. Moreover, a couple of drivers have been converted to use the new helpers as a proof of concept. Note 1) The changes in the drivers have only been compile tested, while the helpers have been tested along with a couple of local dummy drivers that I have hacked up to model both genpd providers and genpd consumers. Note 2) I was struggling to make up my mind on whether we should have a separate helper to attach all available power-domains described in DT, rather than providing "NULL" to dev_pm_domain_attach_list(). I decided not to, but please let me know if you prefer the other option. Note 3) For OPP integration, as a follow-up I am striving to make dev_pm_opp_attach_genpd() redundant. Instead I think we should move towards using dev_pm_opp_set_config()->_opp_set_required_devs(), which would allow us to use the helpers that the $subject series is adding. 
Kind regards Ulf Hansson Ulf Hansson (5): PM: domains: Add helper functions to attach/detach multiple PM domains remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list() remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list() remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list() media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec drivers/base/power/common.c | 134 +++ drivers/media/platform/qcom/venus/core.c | 12 +- drivers/media/platform/qcom/venus/core.h | 7 +- .../media/platform/qcom/venus/pm_helpers.c| 48 ++ drivers/remoteproc/imx_dsp_rproc.c| 82 + drivers/remoteproc/imx_rproc.c| 73 +--- drivers/remoteproc/qcom_q6v5_adsp.c | 160 -- include/linux/pm_domain.h | 38 + 8 files changed, 289 insertions(+), 265 deletions(-) -- 2.34.1
[PATCH v3 1/5] PM: domains: Add helper functions to attach/detach multiple PM domains
Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers. To fix up the situation, let's introduce a pair of helper functions, dev_pm_domain_attach|detach_list(), that drivers can use instead of the open-coding. Note that it seems reasonable to limit the support for these helpers to DT based platforms, as it's the only valid use case for now. Suggested-by: Daniel Baluta Tested-by: Bryan O'Donoghue Tested-by: Iuliana Prodan Signed-off-by: Ulf Hansson --- Changes in v3: - Added suggested-by and tested-by tags. Changes in v2: - Fix NULL pointer bug pointed out by Nikunj. --- drivers/base/power/common.c | 134 include/linux/pm_domain.h | 38 ++ 2 files changed, 172 insertions(+) diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c index 44ec20918a4d..327d168dd37a 100644 --- a/drivers/base/power/common.c +++ b/drivers/base/power/common.c @@ -167,6 +167,115 @@ struct device *dev_pm_domain_attach_by_name(struct device *dev, } EXPORT_SYMBOL_GPL(dev_pm_domain_attach_by_name); +/** + * dev_pm_domain_attach_list - Associate a device with its PM domains. + * @dev: The device used to lookup the PM domains for. + * @data: The data used for attaching to the PM domains. + * @list: An out-parameter with an allocated list of attached PM domains. + * + * This function helps to attach a device to its multiple PM domains. The + * caller, which is typically a driver's probe function, may provide a list of + * names for the PM domains that we should try to attach the device to, but it + * may also provide an empty list, in case the attach should be done for all of + * the available PM domains. + * + * Callers must ensure proper synchronization of this function with power + * management callbacks. + * + * Returns the number of attached PM domains or a negative error code in case of + * a failure. 
Note that, to detach the list of PM domains, the driver shall call + * dev_pm_domain_detach_list(), typically during the remove phase. + */ +int dev_pm_domain_attach_list(struct device *dev, + const struct dev_pm_domain_attach_data *data, + struct dev_pm_domain_list **list) +{ + struct device_node *np = dev->of_node; + struct dev_pm_domain_list *pds; + struct device *pd_dev = NULL; + int ret, i, num_pds = 0; + bool by_id = true; + u32 pd_flags = data ? data->pd_flags : 0; + u32 link_flags = pd_flags & PD_FLAG_NO_DEV_LINK ? 0 : + DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME; + + if (dev->pm_domain) + return -EEXIST; + + /* For now this is limited to OF based platforms. */ + if (!np) + return 0; + + if (data && data->pd_names) { + num_pds = data->num_pd_names; + by_id = false; + } else { + num_pds = of_count_phandle_with_args(np, "power-domains", +"#power-domain-cells"); + } + + if (num_pds <= 0) + return 0; + + pds = devm_kzalloc(dev, sizeof(*pds), GFP_KERNEL); + if (!pds) + return -ENOMEM; + + pds->pd_devs = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_devs), + GFP_KERNEL); + if (!pds->pd_devs) + return -ENOMEM; + + pds->pd_links = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_links), +GFP_KERNEL); + if (!pds->pd_links) + return -ENOMEM; + + if (link_flags && pd_flags & PD_FLAG_DEV_LINK_ON) + link_flags |= DL_FLAG_RPM_ACTIVE; + + for (i = 0; i < num_pds; i++) { + if (by_id) + pd_dev = dev_pm_domain_attach_by_id(dev, i); + else + pd_dev = dev_pm_domain_attach_by_name(dev, + data->pd_names[i]); + if (IS_ERR_OR_NULL(pd_dev)) { + ret = pd_dev ? PTR_ERR(pd_dev) : -ENODEV; + goto err_attach; + } + + if (link_flags) { + struct device_link *link; + + link = device_link_add(dev, pd_dev, link_flags); + if (!link) { + ret = -ENODEV; + goto err_link; + } + + pds->pd_links[i] = link; + } + + pds->pd_devs[i] = pd_dev; + } + + pds->num_pds = num_pds; + *list = pds; + return num_pds; + +err_link: + dev_pm_domain_detach(pd_dev, true); +err_attach: + while (--i >= 0) { + if (pds->pd_
[PATCH v3 2/5] remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list()
Let's avoid the boilerplate code to manage the multiple PM domain case, by converting to dev_pm_domain_attach|detach_list(). Cc: Bjorn Andersson Cc: Shawn Guo Cc: Sascha Hauer Cc: Daniel Baluta Cc: Tested-by: Iuliana Prodan Reviewed-by: Iuliana Prodan Reviewed-by: Mathieu Poirier Signed-off-by: Ulf Hansson --- Changes in v3: - Added reviewed-by and tested-by tags. Changes in v2: - None. --- drivers/remoteproc/imx_dsp_rproc.c | 82 -- 1 file changed, 9 insertions(+), 73 deletions(-) diff --git a/drivers/remoteproc/imx_dsp_rproc.c b/drivers/remoteproc/imx_dsp_rproc.c index a1c62d15f16c..d73727a5828a 100644 --- a/drivers/remoteproc/imx_dsp_rproc.c +++ b/drivers/remoteproc/imx_dsp_rproc.c @@ -103,12 +103,10 @@ enum imx_dsp_rp_mbox_messages { * @tx_ch: mailbox tx channel handle * @rx_ch: mailbox rx channel handle * @rxdb_ch: mailbox rx doorbell channel handle - * @pd_dev: power domain device - * @pd_dev_link: power domain device link + * @pd_list: power domain list * @ipc_handle: System Control Unit ipc handle * @rproc_work: work for processing virtio interrupts * @pm_comp: completion primitive to sync for suspend response - * @num_domains: power domain number * @flags: control flags */ struct imx_dsp_rproc { @@ -121,12 +119,10 @@ struct imx_dsp_rproc { struct mbox_chan*tx_ch; struct mbox_chan*rx_ch; struct mbox_chan*rxdb_ch; - struct device **pd_dev; - struct device_link **pd_dev_link; + struct dev_pm_domain_list *pd_list; struct imx_sc_ipc *ipc_handle; struct work_struct rproc_work; struct completion pm_comp; - int num_domains; u32 flags; }; @@ -955,74 +951,14 @@ static const struct rproc_ops imx_dsp_rproc_ops = { static int imx_dsp_attach_pm_domains(struct imx_dsp_rproc *priv) { struct device *dev = priv->rproc->dev.parent; - int ret, i; - - priv->num_domains = of_count_phandle_with_args(dev->of_node, - "power-domains", - "#power-domain-cells"); - - /* If only one domain, then no need to link the device */ - if (priv->num_domains <= 1) - return 0; - -
priv->pd_dev = devm_kmalloc_array(dev, priv->num_domains, - sizeof(*priv->pd_dev), - GFP_KERNEL); - if (!priv->pd_dev) - return -ENOMEM; - - priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_domains, - sizeof(*priv->pd_dev_link), - GFP_KERNEL); - if (!priv->pd_dev_link) - return -ENOMEM; - - for (i = 0; i < priv->num_domains; i++) { - priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i); - if (IS_ERR(priv->pd_dev[i])) { - ret = PTR_ERR(priv->pd_dev[i]); - goto detach_pm; - } - - /* -* device_link_add will check priv->pd_dev[i], if it is -* NULL, then will break. -*/ - priv->pd_dev_link[i] = device_link_add(dev, - priv->pd_dev[i], - DL_FLAG_STATELESS | - DL_FLAG_PM_RUNTIME); - if (!priv->pd_dev_link[i]) { - dev_pm_domain_detach(priv->pd_dev[i], false); - ret = -EINVAL; - goto detach_pm; - } - } - - return 0; - -detach_pm: - while (--i >= 0) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return ret; -} - -static int imx_dsp_detach_pm_domains(struct imx_dsp_rproc *priv) -{ - int i; + int ret; - if (priv->num_domains <= 1) + /* A single PM domain is already attached. */ + if (dev->pm_domain) return 0; - for (i = 0; i < priv->num_domains; i++) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return 0; + ret = dev_pm_domain_attach_list(dev, NULL, &priv->pd_list); + return ret < 0 ? ret : 0; } /** @@ -1154,7 +1090,7 @@ static int imx_dsp_rproc_probe(struct platform_device *pdev) return 0; err_detach_domains: - imx_dsp_detach_pm_domains(priv); + dev_pm_domain_deta
[PATCH v3 3/5] remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list()
Let's avoid the boilerplate code to manage the multiple PM domain case, by converting to dev_pm_domain_attach|detach_list(). Cc: Bjorn Andersson Cc: Shawn Guo Cc: Sascha Hauer Cc: Daniel Baluta Cc: Tested-by: Iuliana Prodan Reviewed-by: Iuliana Prodan Reviewed-by: Mathieu Poirier Signed-off-by: Ulf Hansson --- Changes in v3: - Added reviewed-by and tested-by tags. Changes in v2: - None. --- drivers/remoteproc/imx_rproc.c | 73 +- 1 file changed, 9 insertions(+), 64 deletions(-) diff --git a/drivers/remoteproc/imx_rproc.c b/drivers/remoteproc/imx_rproc.c index 8bb293b9f327..3161f14442bc 100644 --- a/drivers/remoteproc/imx_rproc.c +++ b/drivers/remoteproc/imx_rproc.c @@ -92,7 +92,6 @@ struct imx_rproc_mem { static int imx_rproc_xtr_mbox_init(struct rproc *rproc); static void imx_rproc_free_mbox(struct rproc *rproc); -static int imx_rproc_detach_pd(struct rproc *rproc); struct imx_rproc { struct device *dev; @@ -113,10 +112,8 @@ struct imx_rproc { u32 rproc_pt; /* partition id */ u32 rsrc_id;/* resource id */ u32 entry; /* cpu start address */ - int num_pd; u32 core_index; - struct device **pd_dev; - struct device_link **pd_dev_link; + struct dev_pm_domain_list *pd_list; }; static const struct imx_rproc_att imx_rproc_att_imx93[] = { @@ -853,7 +850,7 @@ static void imx_rproc_put_scu(struct rproc *rproc) return; if (imx_sc_rm_is_resource_owned(priv->ipc_handle, priv->rsrc_id)) { - imx_rproc_detach_pd(rproc); + dev_pm_domain_detach_list(priv->pd_list); return; } @@ -880,72 +877,20 @@ static int imx_rproc_partition_notify(struct notifier_block *nb, static int imx_rproc_attach_pd(struct imx_rproc *priv) { struct device *dev = priv->dev; - int ret, i; - - /* -* If there is only one power-domain entry, the platform driver framework -* will handle it, no need handle it in this driver.
-*/ - priv->num_pd = of_count_phandle_with_args(dev->of_node, "power-domains", - "#power-domain-cells"); - if (priv->num_pd <= 1) - return 0; - - priv->pd_dev = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev), GFP_KERNEL); - if (!priv->pd_dev) - return -ENOMEM; - - priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev_link), - GFP_KERNEL); - - if (!priv->pd_dev_link) - return -ENOMEM; - - for (i = 0; i < priv->num_pd; i++) { - priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i); - if (IS_ERR(priv->pd_dev[i])) { - ret = PTR_ERR(priv->pd_dev[i]); - goto detach_pd; - } - - priv->pd_dev_link[i] = device_link_add(dev, priv->pd_dev[i], DL_FLAG_STATELESS | - DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE); - if (!priv->pd_dev_link[i]) { - dev_pm_domain_detach(priv->pd_dev[i], false); - ret = -EINVAL; - goto detach_pd; - } - } - - return 0; - -detach_pd: - while (--i >= 0) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return ret; -} - -static int imx_rproc_detach_pd(struct rproc *rproc) -{ - struct imx_rproc *priv = rproc->priv; - int i; + int ret; + struct dev_pm_domain_attach_data pd_data = { + .pd_flags = PD_FLAG_DEV_LINK_ON, + }; /* * If there is only one power-domain entry, the platform driver framework * will handle it, no need handle it in this driver. */ - if (priv->num_pd <= 1) + if (dev->pm_domain) return 0; - for (i = 0; i < priv->num_pd; i++) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return 0; + ret = dev_pm_domain_attach_list(dev, &pd_data, &priv->pd_list); + return ret < 0 ? ret : 0; } static int imx_rproc_detect_mode(struct imx_rproc *priv) -- 2.34.1
[PATCH v3 4/5] remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list()
Let's avoid some of the boilerplate code to manage the various PM domain cases, by converting to dev_pm_domain_attach|detach_list(). As a part of the conversion, we are moving over to use device_links, which simplifies the runtime PM support too. Moreover, while attaching let's trust that an already attached single PM domain is the correct one. Cc: Mathieu Poirier Cc: Bjorn Andersson Cc: Konrad Dybcio Cc: Signed-off-by: Ulf Hansson --- Changes in v3: - None. Changes in v2: - None. --- drivers/remoteproc/qcom_q6v5_adsp.c | 160 +--- 1 file changed, 73 insertions(+), 87 deletions(-) diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c b/drivers/remoteproc/qcom_q6v5_adsp.c index 6c67514cc493..93f9a1537ec6 100644 --- a/drivers/remoteproc/qcom_q6v5_adsp.c +++ b/drivers/remoteproc/qcom_q6v5_adsp.c @@ -55,8 +55,6 @@ #define QDSP6SS_CORE_CBCR 0x20 #define QDSP6SS_SLEEP_CBCR 0x3c -#define QCOM_Q6V5_RPROC_PROXY_PD_MAX 3 - #define LPASS_BOOT_CORE_START BIT(0) #define LPASS_BOOT_CMD_START BIT(0) #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0 @@ -74,7 +72,8 @@ struct adsp_pil_data { const char **clk_ids; int num_clks; - const char **proxy_pd_names; + const char **pd_names; + unsigned int num_pds; const char *load_state; }; @@ -110,8 +109,7 @@ struct qcom_adsp { size_t mem_size; bool has_iommu; - struct device *proxy_pds[QCOM_Q6V5_RPROC_PROXY_PD_MAX]; - size_t proxy_pd_count; + struct dev_pm_domain_list *pd_list; struct qcom_rproc_glink glink_subdev; struct qcom_rproc_ssr ssr_subdev; @@ -120,98 +118,92 @@ struct qcom_adsp { int (*shutdown)(struct qcom_adsp *adsp); }; -static int qcom_rproc_pds_attach(struct device *dev, struct qcom_adsp *adsp, -const char **pd_names) +static int qcom_rproc_pds_attach(struct qcom_adsp *adsp, const char **pd_names, +unsigned int num_pds) { - struct device **devs = adsp->proxy_pds; - size_t num_pds = 0; + struct device *dev = adsp->dev; + struct dev_pm_domain_attach_data pd_data = { + .pd_names = pd_names, + .num_pd_names = num_pds, + }; int ret; - int i;
- - if (!pd_names) - return 0; /* Handle single power domain */ - if (dev->pm_domain) { - devs[0] = dev; - pm_runtime_enable(dev); - return 1; - } + if (dev->pm_domain) + goto out; - while (pd_names[num_pds]) - num_pds++; + if (!pd_names) + return 0; - if (num_pds > ARRAY_SIZE(adsp->proxy_pds)) - return -E2BIG; + ret = dev_pm_domain_attach_list(dev, &pd_data, &adsp->pd_list); + if (ret < 0) + return ret; - for (i = 0; i < num_pds; i++) { - devs[i] = dev_pm_domain_attach_by_name(dev, pd_names[i]); - if (IS_ERR_OR_NULL(devs[i])) { - ret = PTR_ERR(devs[i]) ? : -ENODATA; - goto unroll_attach; - } - } +out: + pm_runtime_enable(dev); + return 0; +} - return num_pds; +static void qcom_rproc_pds_detach(struct qcom_adsp *adsp) +{ + struct device *dev = adsp->dev; + struct dev_pm_domain_list *pds = adsp->pd_list; -unroll_attach: - for (i--; i >= 0; i--) - dev_pm_domain_detach(devs[i], false); + dev_pm_domain_detach_list(pds); - return ret; + if (dev->pm_domain || pds) + pm_runtime_disable(adsp->dev); } -static void qcom_rproc_pds_detach(struct qcom_adsp *adsp, struct device **pds, - size_t pd_count) +static int qcom_rproc_pds_enable(struct qcom_adsp *adsp) { struct device *dev = adsp->dev; - int i; + struct dev_pm_domain_list *pds = adsp->pd_list; + int ret, i = 0; - /* Handle single power domain */ - if (dev->pm_domain && pd_count) { - pm_runtime_disable(dev); - return; - } + if (!dev->pm_domain && !pds) + return 0; - for (i = 0; i < pd_count; i++) - dev_pm_domain_detach(pds[i], false); -} + if (dev->pm_domain) + dev_pm_genpd_set_performance_state(dev, INT_MAX); -static int qcom_rproc_pds_enable(struct qcom_adsp *adsp, struct device **pds, -size_t pd_count) -{ - int ret; - int i; - - for (i = 0; i < pd_count; i++) { - dev_pm_genpd_set_performance_state(pds[i], INT_MAX); - ret = pm_runtime_resume_and_get(pds[i]); - if (ret < 0) { - dev_pm_genpd_set_performance_state(pds[i], 0); - goto unroll_pd_votes; - } + while (pds && i < pds->num_pds)
[PATCH v3 5/5] media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec
Let's avoid some of the boilerplate code to manage the vcodec PM domains, by converting to dev_pm_domain_attach|detach_list(). Cc: Mauro Carvalho Chehab Cc: Stanimir Varbanov Cc: Vikash Garodia Cc: Bjorn Andersson Cc: Konrad Dybcio Cc: Tested-by: Bryan O'Donoghue Reviewed-by: Bryan O'Donoghue Signed-off-by: Ulf Hansson --- Changes in v3: - None. Changes in v2: - Added reviewed-by and tested-by tags. --- drivers/media/platform/qcom/venus/core.c | 12 +++-- drivers/media/platform/qcom/venus/core.h | 7 ++- .../media/platform/qcom/venus/pm_helpers.c| 48 +++ 3 files changed, 26 insertions(+), 41 deletions(-) diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c index a712dd4f02a5..ce206b709754 100644 --- a/drivers/media/platform/qcom/venus/core.c +++ b/drivers/media/platform/qcom/venus/core.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -114,7 +115,8 @@ static void venus_sys_error_handler(struct work_struct *work) pm_runtime_put_sync(core->dev); for (i = 0; i < max_attempts; i++) { - if (!core->pmdomains[0] || !pm_runtime_active(core->pmdomains[0])) + if (!core->pmdomains || + !pm_runtime_active(core->pmdomains->pd_devs[0])) break; usleep_range(1000, 1500); } @@ -705,7 +707,7 @@ static const struct venus_resources sdm845_res_v2 = { .vcodec0_clks = { "vcodec0_core", "vcodec0_bus" }, .vcodec1_clks = { "vcodec1_core", "vcodec1_bus" }, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0", "vcodec1" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0", "vcodec1" }, .vcodec_pmdomains_num = 3, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 2, @@ -754,7 +756,7 @@ static const struct venus_resources sc7180_res = { .clks_num = 3, .vcodec0_clks = { "vcodec0_core", "vcodec0_bus" }, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = 
(const char *[]) { "cx", NULL }, .vcodec_num = 1, @@ -811,7 +813,7 @@ static const struct venus_resources sm8250_res = { .resets_num = 2, .vcodec0_clks = { "vcodec0_core" }, .vcodec_clks_num = 1, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "mx", NULL }, .vcodec_num = 1, @@ -870,7 +872,7 @@ static const struct venus_resources sc7280_res = { .clks_num = 3, .vcodec0_clks = {"vcodec_core", "vcodec_bus"}, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 1, diff --git a/drivers/media/platform/qcom/venus/core.h b/drivers/media/platform/qcom/venus/core.h index 4a633261ece4..7ef341bf21cc 100644 --- a/drivers/media/platform/qcom/venus/core.h +++ b/drivers/media/platform/qcom/venus/core.h @@ -25,7 +25,6 @@ #define VIDC_CLKS_NUM_MAX 4 #define VIDC_VCODEC_CLKS_NUM_MAX 2 -#define VIDC_PMDOMAINS_NUM_MAX 3 #define VIDC_RESETS_NUM_MAX2 extern int venus_fw_debug; @@ -72,7 +71,7 @@ struct venus_resources { const char * const vcodec0_clks[VIDC_VCODEC_CLKS_NUM_MAX]; const char * const vcodec1_clks[VIDC_VCODEC_CLKS_NUM_MAX]; unsigned int vcodec_clks_num; - const char * const vcodec_pmdomains[VIDC_PMDOMAINS_NUM_MAX]; + const char **vcodec_pmdomains; unsigned int vcodec_pmdomains_num; const char **opp_pmdomain; unsigned int vcodec_num; @@ -134,7 +133,7 @@ struct venus_format { * @video_path: an interconnect handle to video to/from memory path * @cpucfg_path: an interconnect handle to cpu configuration path * @has_opp_table: does OPP table exist - * @pmdomains: an array of pmdomains struct device pointers + * @pmdomains: a pointer to a list of pmdomains * @opp_dl_venus: an device-link for device OPP * @opp_pmdomain: an OPP power-domain * @resets: an array of reset signals @@ -187,7 +186,7 @@ 
struct venus_core { struct icc_path *video_path; struct icc_path *cpucfg_path; bool has_opp_table; - struct device *pmdomains[VIDC_PMDOMAINS_NUM_MAX]; + struct dev_pm_domain_list *pmdomains; struct device_link *opp_dl_venus; struct device *opp_pmdomain; struct reset_control *resets[VIDC_RESETS_NUM_MAX]; diff --git a/drivers/media/platform/qcom/venus/pm_helpers.c b/drivers/media/platform/qcom
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Mon, 29 Jan 2024 09:29:07 -0800 Beau Belgrave wrote: > On Fri, Jan 26, 2024 at 03:04:45PM -0500, Steven Rostedt wrote: > > On Fri, 26 Jan 2024 11:10:07 -0800 > > Beau Belgrave wrote: > > > > > > OK, so the each different event has suffixed name. But this will > > > > introduce non C-variable name. > > > > > > > > Steve, do you think your library can handle these symbols? It will > > > > be something like "event:[1]" as the event name. > > > > Personally I like "event.1" style. (of course we need to ensure the > > > > user given event name is NOT including such suffix numbers) > > > > > > > > > > Just to clarify around events including a suffix number. This is why > > > multi-events use "user_events_multi" system name and the single-events > > > using just "user_events". > > > > > > Even if a user program did include a suffix, the suffix would still get > > > appended. An example is "test" vs "test:[0]" using multi-format would > > > result in two tracepoints ("test:[0]" and "test:[0]:[1]" respectively > > > (assuming these are the first multi-events on the system). > > > > > > I'm with you, we really don't want any spoofing or squatting possible. > > > By using different system names and always appending the suffix I > > > believe covers this. > > > > > > Looking forward to hearing Steven's thoughts on this as well. > > > > I'm leaning towards Masami's suggestion to use dots, as that won't conflict > > with special characters from bash, as '[' and ']' do. > > > > Thanks, yeah ideally we wouldn't use special characters. > > I'm not picky about this. However, I did want something that clearly > allowed a glob pattern to find all versions of a given register name of > user_events by user programs that record. The dot notation will pull in > more than expected if dotted namespace style names are used. > > An example is "Asserts" and "Asserts.Verbose" from different programs. 
> If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". If we use a dot for the suffix number, we can prohibit users from using it in their names. They can still use '_' (or change the group name?) I'm just concerned about whether the name can be parsed by existing tools, since ':' is used as a separator between the group and event name in some cases (e.g. tracefs "set_event" uses it, so trace-cmd and perf use it too.) > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > doesn't work if the number is higher, like 128. If we ever decide to > change the ID from an integer to say hex to save space, these globs > would break. Hmm, why can't we use a regexp? And if we limit the number of events to 1000 for each same-name event, we can use fixed-width numbers, like Asserts.[0-9][0-9][0-9] Thank you, > > Is there some scheme that fits the C-variable name that addresses the > above scenarios? Brackets gave me a simple glob that seemed to prevent a lot of this ("Asserts.\[*\]" in this case). > > Are we confident that we always want to represent the ID as a base-10 > integer vs a base-16 integer? The suffix will be ABI to ensure recording > programs can find their events easily. > > Thanks, > -Beau > > > -- Steve -- Masami Hiramatsu (Google)
Re: [RESEND PATCH v2] modules: wait do_free_init correctly
On Tue, Jan 30, 2024 at 09:40:38AM +0800, Changbin Du wrote: > On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote: > > On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote: > > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves > > > do_free_init() into a global workqueue instead of call_rcu(). So now > > > rcu_barrier() can not ensure that do_free_init has completed. We should > > > wait for it via flush_work(). > > > > > > Without this fix, we could still encounter false positive reports in > > > W+X checking, and the rcu synchronization is unnecessary. > > > > You didn't answer my question, which should be documented in the commit log. > > > > Does this mean we never freed module init because of this? If so then > > your commit log should clearly explain that. It should also explain that > > if true (you have to verify) then it means we were no longer saving > > the memory we wished to save, and that is important for distributions > > which do want to save anything on memory. You may want to do a general > > estimate on how much that means these days on any desktop / server. > > Actually, I have explained it in the commit msg. It's not about saving memory. The > synchronization here is just to ensure the module init's been freed before > doing W+X checking. The problem is that the current implementation is wrong, > rcu_barrier() cannot guarantee that. So we can encounter false positive > reports. > But anyway, the module init will be freed, and it's just a timing-related > issue. Your description here is better than the commit log. Luis
Re: [PATCH v9] bus: mhi: host: Add tracing support
On Tue, 30 Jan 2024 13:41:52 +0530 Manivannan Sadhasivam wrote: > So same trace will get printed for both mhi_channel_command_start() and > mhi_channel_command_end()? The trace output will also include the tracepoint name. That is, it will have the same content but will be preceded with: mhi_channel_command_start: ... mhi_channel_command_end: ... -- Steve
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 01:12:05 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 00:43, Linus Torvalds > wrote: > > > > I'll go back to bed, but I think the fix is something trivial like this: > > Almost. > > > + result = ERR_PTR(ENOENT); > > That needs a '-' in front of the ENOENT, otherwise you have a positive > error number and things go wrong very quickly. > > And that does indeed fix the lookup problem, but you end up with the > same problem later when you do the eventfs_remove_dir(). Again the > eventfs data structure changes, but we don't have a reliable dentry > that we can invalidate. > > The dentry cache is just very good at caching those old dentries, and > the interface for eventfs_create_dir() and eventfs_remove_dir() is > just not great. > > If those did an actual path lookup (like eventfs_create_events_dir() > does), we'd have the dentry, and it's trivial to get from dentry to > eventfs_inode. > > But going the other way is the broken thing because of how the > dentries are just temporary caches. > > I suspect the solution is to make eventfs_create_dir() do the same as > the events directory case does, and actually pin the directory dentry > and save it off. I rather not have the create do that because that happens for every event directory. On my machine that's: # find events -type d | wc -l 2102 And that's regardless if tracefs is mounted or not. And that's how many are also created with every instance creation. And doesn't pinning the dentry also require it to be positive? That is, have a full inode allocated with it? I may try something that will still let me get rid of the ei->dentry. > > Oh well. At least I understand what the problem is. Yep! > Now I'm going to try to go back to sleep. Hope you sleep better than I did. We just bought a new mattress, which felt great in the store, but now after 4 or 5 hours sleeping in it, I wake up with a sore back and have to sleep on the couch. 
And we bought the most expensive "best" mattress in the store :-p Oh, and NY State law has it that you can't return it once it is delivered. -- Steve
[kees:devel/overflow/sanitizers] [overflow] 660787b56e: UBSAN:signed-integer-overflow_in_lib/test_memcat_p.c
Hello, kernel test robot noticed "UBSAN:signed-integer-overflow_in_lib/test_memcat_p.c" on: commit: 660787b56e6e97ddc34c7882cbe1228f4040ef74 ("overflow: Reintroduce signed and unsigned overflow sanitizers") https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git devel/overflow/sanitizers in testcase: boot compiler: gcc-11 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) we noticed that this commit reintroduces the "signed and unsigned overflow sanitizers"; there is the following config diff between the parent and this commit in our builds: --- ea804316c9db5148d2bb0c1f40f70d7a83404638/.config 2024-01-26 22:09:35.046768122 +0800 +++ 660787b56e6e97ddc34c7882cbe1228f4040ef74/.config 2024-01-26 19:53:20.693434428 +0800 @@ -6706,6 +6706,7 @@ CONFIG_UBSAN_BOUNDS_STRICT=y CONFIG_UBSAN_SHIFT=y # CONFIG_UBSAN_DIV_ZERO is not set CONFIG_UBSAN_UNREACHABLE=y +CONFIG_UBSAN_SIGNED_WRAP=y # CONFIG_UBSAN_BOOL is not set # CONFIG_UBSAN_ENUM is not set # CONFIG_UBSAN_ALIGNMENT is not set while testing, we observed the following differences (and common parts) between the parent and this commit: ea804316c9db5148 660787b56e6e97ddc34c7882cbe --- fail:runs %reproduction fail:runs | | | 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_arch/x86/kernel/cpu/intel.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_arch/x86/kernel/cpu/topology.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_fs/namespace.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_fs/read_write.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_include/linux/rhashtable.h 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_include/net/tcp.h :6 100% 6:6 dmesg.UBSAN:signed-integer-overflow_in_lib/test_memcat_p.c this looks like an issue uncovered by the commit. But since it's hard for us to backport this commit to each commit while bisecting, we cannot capture the real first bad commit. Not sure if this report could help somebody investigate the real issue?
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-lkp/202401302219.db90a6d5-oliver.s...@intel.com [ 42.894536][T1] [ cut here ] [ 42.895474][T1] UBSAN: signed-integer-overflow in lib/test_memcat_p.c:47:10 [ 42.897128][T1] 6570 * 725861 cannot be represented in type 'int' [ 42.898391][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc1-7-g660787b56e6e #1 [ 42.899962][T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 42.901661][T1] Call Trace: [ 42.902009][ T1] dump_stack_lvl (??:?) [ 42.902009][ T1] dump_stack (??:?) [ 42.902009][ T1] handle_overflow (ubsan.c:?) [ 42.902009][ T1] ? kmemleak_alloc (??:?) [ 42.902009][ T1] ? kmalloc_trace (??:?) [ 42.902009][ T1] ? test_memcat_p_init (test_memcat_p.c:?) [ 42.902009][ T1] __ubsan_handle_mul_overflow (??:?) [ 42.902009][ T1] test_memcat_p_init (test_memcat_p.c:?) [ 42.902009][ T1] ? trace_hardirqs_on (??:?) [ 42.902009][ T1] ? _raw_spin_unlock_irqrestore (??:?) [ 42.902009][ T1] ? test_string_helpers_init (test_memcat_p.c:?) [ 42.902009][ T1] do_one_initcall (??:?) [ 42.902009][ T1] ? parameq (??:?) [ 42.902009][ T1] ? parse_args (??:?) [ 42.902009][ T1] do_initcalls (main.c:?) [ 42.902009][ T1] ? rdinit_setup (main.c:?) [ 42.902009][ T1] kernel_init_freeable (main.c:?) [ 42.902009][ T1] ? rest_init (main.c:?) [ 42.902009][ T1] kernel_init (main.c:?) [ 42.902009][ T1] ? schedule_tail (??:?) [ 42.902009][ T1] ret_from_fork (??:?) [ 42.902009][ T1] ? rest_init (main.c:?) [ 42.902009][ T1] ret_from_fork_asm (??:?) [ 42.902009][ T1] entry_INT80_32 (??:?) 
[ 42.924183][T1] ---[ end trace ]--- [ 42.925743][T1] test_memcat_p: test passed The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240130/202401302219.db90a6d5-oliver.s...@intel.com -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions
Hi Vincent, Thanks for update the code. On Mon, 29 Jan 2024 14:27:58 + Vincent Donnefort wrote: > In preparation for allowing the user-space to map a ring-buffer, add > a set of mapping functions: > > ring_buffer_{map,unmap}() > ring_buffer_map_fault() > > And controls on the ring-buffer: > > ring_buffer_map_get_reader() /* swap reader and head */ > > Mapping the ring-buffer also involves: > > A unique ID for each subbuf of the ring-buffer, currently they are > only identified through their in-kernel VA. > > A meta-page, where are stored ring-buffer statistics and a > description for the current reader > > The linear mapping exposes the meta-page, and each subbuf of the > ring-buffer, ordered following their unique ID, assigned during the > first mapping. > > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer > size will remain unmodified and the splice enabling functions will in > reality simply memcpy the data instead of swapping subbufs. > > Signed-off-by: Vincent Donnefort > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h > index fa802db216f9..0841ba8bab14 100644 > --- a/include/linux/ring_buffer.h > +++ b/include/linux/ring_buffer.h > @@ -6,6 +6,8 @@ > #include > #include > > +#include > + > struct trace_buffer; > struct ring_buffer_iter; > > @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct > hlist_node *node); > #define trace_rb_cpu_prepare NULL > #endif > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu); > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); > +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, > +unsigned long pgoff); > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); > #endif /* _LINUX_RING_BUFFER_H */ > diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h > new file mode 100644 > index ..d4bb67430719 > --- /dev/null > +++ b/include/uapi/linux/trace_mmap.h > @@ -0,0 +1,43 @@ > +/* 
SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _TRACE_MMAP_H_ > +#define _TRACE_MMAP_H_ > + > +#include > + > +/** > + * struct trace_buffer_meta - Ring-buffer Meta-page description > + * @meta_page_size: Size of this meta-page. > + * @meta_struct_len: Size of this structure. > + * @subbuf_size: Size of each subbuf, including the header. > + * @nr_subbufs: Number of subbfs in the ring-buffer. > + * @reader.lost_events: Number of events lost at the time of the reader > swap. > + * @reader.id: subbuf ID of the current reader. From 0 to > @nr_subbufs - 1 > + * @reader.read: Number of bytes read on the reader subbuf. > + * @entries: Number of entries in the ring-buffer. > + * @overrun: Number of entries lost in the ring-buffer. > + * @read:Number of entries that have been read. > + */ > +struct trace_buffer_meta { > + __u32 meta_page_size; > + __u32 meta_struct_len; > + > + __u32 subbuf_size; > + __u32 nr_subbufs; > + > + struct { > + __u64 lost_events; > + __u32 id; > + __u32 read; > + } reader; > + > + __u64 flags; > + > + __u64 entries; > + __u64 overrun; > + __u64 read; > + > + __u64 Reserved1; > + __u64 Reserved2; > +}; > + > +#endif /* _TRACE_MMAP_H_ */ > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 8179e0a8984e..081065e76d4a 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -338,6 +338,7 @@ struct buffer_page { > local_t entries; /* entries on this page */ > unsigned longreal_end; /* real end of data */ > unsigned order; /* order of the page */ > + u32 id;/* ID for external mapping */ > struct buffer_data_page *page; /* Actual data page */ > }; > > @@ -484,6 +485,12 @@ struct ring_buffer_per_cpu { > u64 read_stamp; > /* pages removed since last reset */ > unsigned long pages_removed; > + > + int mapped; > + struct mutexmapping_lock; > + unsigned long *subbuf_ids;/* ID to addr */ > + struct trace_buffer_meta*meta_page; > + > /* ring buffer pages to update, > 0 to add, < 0 to 
remove */ > longnr_pages_to_update; > struct list_headnew_pages; /* new pages to add */ > @@ -1548,6 +1555,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, > long nr_pages, int cpu) > init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); > init_waitqueue_head(&cpu_buffer->irq_work.waiters); > init_waitqueue_head(&cpu_buffer->irq_work.full_wa
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
On 2024-01-29 16:22, Dan Williams wrote: Mathieu Desnoyers wrote: This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. Thanks for having a look ! I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Will do. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right? Good point. It will help reduce complexity and improve test coverage. 
I also intend to rename "cache_is_aliasing()" to "dcache_is_aliasing()", so if we introduce an "icache_is_aliasing()" in the future, it won't be confusing. Having aliasing icache-dcache but not dcache-dcache seems to be fairly common. So basically: If an arch selects ARCH_HAS_CACHE_ALIASING, it implements dcache_is_aliasing() (for now), and eventually we can implement icache_is_aliasing() as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()
On 2024-01-29 21:38, Dave Chinner wrote: On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Where's the rest of this patchset? I have no idea what dax_is_supported() actually does, how it interacts with CONFIG_FS_DAX, etc. If you are changing anything to do with FSDAX, then cc-ing the -entire- patchset to linux-fsdevel is absolutely necessary so the entire patchset lands in our inboxes and not just a random patch from the middle of a bigger change. Sorry, I will Cc linux-fsdevel on all patches for the next round. Meanwhile you can find the whole series on lore: https://lore.kernel.org/lkml/20240129210631.193493-1-mathieu.desnoy...@efficios.com/ [...] Assuming that I understand what dax_is_supported() is doing, this change isn't right. We're just setting the DAX configuration flags from the mount options here; we don't validate them until we've parsed all options and rejected conflicting ones. We validate whether the options are appropriate for the underlying hardware configuration later in the mount process. dax=always suitability is checked in xfs_setup_dax_always(), called later in the mount process when we have enough context and support to open storage devices and check them for DAX support. If the hardware does not support DAX then we simply turn off DAX support; we do not reject the mount as this change does. dax=inode and dax=never are valid options on all configurations, even those without FSDAX support or with hardware that is not capable of using DAX. dax=inode only affects how an inode is instantiated in cache - if the inode has a flag that says "use DAX" and DAX is supportable by the hardware, then we turn on DAX for that inode.
Otherwise we just use the normal non-dax IO paths. Again, we don't error out the filesystem if DAX is not supported, we just don't turn it on. This check is done in xfs_inode_should_enable_dax() and I think all you need to do is replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported() call... Thanks a lot for the detailed explanation. You are right, I will move the dax_is_supported() check to xfs_inode_should_enable_dax(). Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On 2024-01-30 06:33, Jan Kara wrote: On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Hi Jan, Thanks for looking at it! I will do significant changes for v2, so I will hold on before integrating your acked-by if it's OK with you. Thanks! Mathieu Honza --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. 
Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
RE: [PATCH v8 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) { > + struct sgx_epc_cgroup *epc_cg; > struct sgx_epc_page *page; > + int ret; > + > + epc_cg = sgx_get_current_epc_cg(); > + ret = sgx_epc_cgroup_try_charge(epc_cg); > + if (ret) { > + sgx_put_epc_cg(epc_cg); > + return ERR_PTR(ret); > + } > > for ( ; ; ) { > page = __sgx_alloc_epc_page(); > @@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void > *owner, bool reclaim) > break; > } > > - if (list_empty(&sgx_active_page_list)) > - return ERR_PTR(-ENOMEM); > + if (list_empty(&sgx_active_page_list)) { > + page = ERR_PTR(-ENOMEM); > + break; > + } (Sorry for replying from Outlook because I am in travel for Chinese New Year.) Perhaps I am missing something but I don't understand this change. An empty sgx_active_page_list means you cannot reclaim any page from it, so why need to break? > > if (!reclaim) { > page = ERR_PTR(-EBUSY); > @@ -580,10 +593,25 @@ struct sgx_epc_page *sgx_alloc_epc_page(void > *owner, bool reclaim) > break; > } > > + /* > + * Need to do a global reclamation if cgroup was not full but > free > + * physical pages run out, causing __sgx_alloc_epc_page() to > fail. > + */ > sgx_reclaim_pages(); > cond_resched(); > } And why adding this comment, especially in this patch? I don't see it brings additional clarity because there's only global reclaim now, no matter whether cgroup is enabled or not.
RE: [PATCH v8 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup
> + * @lru: The LRU from which pages are reclaimed. > + * @nr_to_scan: Pointer to the target number of pages to scan, must be less > than > + * SGX_NR_TO_SCAN. > + * Return: Number of pages reclaimed. > */ > -static void sgx_reclaim_pages(void) > +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned > +int *nr_to_scan) Since the function is now returning the number of reclaimed pages, why do you need to make the @nr_to_scan as pointer? Cannot the caller just adjust @nr_to_scan when calling this function based on how many pages have reclaimed? I am not even sure whether you need @nr_to_scan at all because as we discussed I think it's just extremely rare you need to pass "< SGX_NR_TO_SCAN" to this function. Even if you need, you can always choose to try to reclaim SGX_NR_TO_SCAN pages. [...] > > +static void sgx_reclaim_pages_global(void) { > + unsigned int nr_to_scan = SGX_NR_TO_SCAN; > + > + sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan); } > + I think this function doesn't look sane at all when you have @nr_to_scan being a pointer? I am also not sure whether this function is needed -- if we don't add @nr_to_scan to sgx_reclaim_pages(), then this function is basically: sgx_reclaim_pages(&sgx_global_lru);
[PATCH v5] remoteproc: Make rproc_get_by_phandle() work for clusters
From: Mathieu Poirier Multi-cluster remoteproc designs typically have the following DT declaration: remoteproc-cluster { compatible = "soc,remoteproc-cluster"; core0: core0 { compatible = "soc,remoteproc-core"; memory-region; sram; }; core1: core1 { compatible = "soc,remoteproc-core"; memory-region; sram; }; }; A driver exists for the cluster rather than the individual cores themselves so that operation mode and HW-specific configurations applicable to the cluster can be made. Because the driver exists at the cluster level and not the individual core level, function rproc_get_by_phandle() fails to return the remoteproc associated with the phandle it is called for. This patch enhances rproc_get_by_phandle() by looking for the cluster's driver when the driver for the immediate remoteproc's parent is not found. Reported-by: Ben Levinsky Signed-off-by: Mathieu Poirier Co-developed-by: Tarak Reddy Signed-off-by: Tarak Reddy Co-developed-by: Tanmay Shah Signed-off-by: Tanmay Shah --- drivers/remoteproc/remoteproc_core.c | 29 ++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c index 695cce218e8c..f276956f2c5c 100644 --- a/drivers/remoteproc/remoteproc_core.c +++ b/drivers/remoteproc/remoteproc_core.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -2112,6 +2113,7 @@ EXPORT_SYMBOL(rproc_detach); struct rproc *rproc_get_by_phandle(phandle phandle) { struct rproc *rproc = NULL, *r; + struct device_driver *driver; struct device_node *np; np = of_find_node_by_phandle(phandle); @@ -2122,7 +2124,26 @@ struct rproc *rproc_get_by_phandle(phandle phandle) list_for_each_entry_rcu(r, &rproc_list, node) { if (r->dev.parent && device_match_of_node(r->dev.parent, np)) { /* prevent underlying implementation from being removed */ - if (!try_module_get(r->dev.parent->driver->owner)) { + + /* +* If the remoteproc's parent has a driver, the +* remoteproc is not part
of a cluster and we can use +* that driver. +*/ + driver = r->dev.parent->driver; + + /* +* If the remoteproc's parent does not have a driver, +* look for the driver associated with the cluster. +*/ + if (!driver) { + if (r->dev.parent->parent) + driver = r->dev.parent->parent->driver; + if (!driver) + break; + } + + if (!try_module_get(driver->owner)) { dev_err(&r->dev, "can't get owner\n"); break; } @@ -2533,7 +2554,11 @@ EXPORT_SYMBOL(rproc_free); */ void rproc_put(struct rproc *rproc) { - module_put(rproc->dev.parent->driver->owner); + if (rproc->dev.parent->driver) + module_put(rproc->dev.parent->driver->owner); + else + module_put(rproc->dev.parent->parent->driver->owner); + put_device(&rproc->dev); } EXPORT_SYMBOL(rproc_put); base-commit: 99f59b148871dadb9104366e3d25b120a97f897b -- 2.25.1
Re: [PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions
On Tue, Jan 30, 2024 at 11:55:10PM +0900, Masami Hiramatsu wrote: > Hi Vincent, > > Thanks for update the code. > > On Mon, 29 Jan 2024 14:27:58 + > Vincent Donnefort wrote: > > > In preparation for allowing the user-space to map a ring-buffer, add > > a set of mapping functions: > > > > ring_buffer_{map,unmap}() > > ring_buffer_map_fault() > > > > And controls on the ring-buffer: > > > > ring_buffer_map_get_reader() /* swap reader and head */ > > > > Mapping the ring-buffer also involves: > > > > A unique ID for each subbuf of the ring-buffer, currently they are > > only identified through their in-kernel VA. > > > > A meta-page, where are stored ring-buffer statistics and a > > description for the current reader > > > > The linear mapping exposes the meta-page, and each subbuf of the > > ring-buffer, ordered following their unique ID, assigned during the > > first mapping. > > > > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer > > size will remain unmodified and the splice enabling functions will in > > reality simply memcpy the data instead of swapping subbufs. 
> > > > Signed-off-by: Vincent Donnefort > > > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h > > index fa802db216f9..0841ba8bab14 100644 > > --- a/include/linux/ring_buffer.h > > +++ b/include/linux/ring_buffer.h > > @@ -6,6 +6,8 @@ > > #include > > #include > > > > +#include > > + > > struct trace_buffer; > > struct ring_buffer_iter; > > > > @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct > > hlist_node *node); > > #define trace_rb_cpu_prepare NULL > > #endif > > > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu); > > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); > > +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, > > + unsigned long pgoff); > > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); > > #endif /* _LINUX_RING_BUFFER_H */ > > diff --git a/include/uapi/linux/trace_mmap.h > > b/include/uapi/linux/trace_mmap.h > > new file mode 100644 > > index ..d4bb67430719 > > --- /dev/null > > +++ b/include/uapi/linux/trace_mmap.h > > @@ -0,0 +1,43 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +#ifndef _TRACE_MMAP_H_ > > +#define _TRACE_MMAP_H_ > > + > > +#include > > + > > +/** > > + * struct trace_buffer_meta - Ring-buffer Meta-page description > > + * @meta_page_size:Size of this meta-page. > > + * @meta_struct_len: Size of this structure. > > + * @subbuf_size: Size of each subbuf, including the header. > > + * @nr_subbufs:Number of subbfs in the ring-buffer. > > + * @reader.lost_events:Number of events lost at the time of the reader > > swap. > > + * @reader.id: subbuf ID of the current reader. From 0 to > > @nr_subbufs - 1 > > + * @reader.read: Number of bytes read on the reader subbuf. > > + * @entries: Number of entries in the ring-buffer. > > + * @overrun: Number of entries lost in the ring-buffer. > > + * @read: Number of entries that have been read. 
> > + */ > > +struct trace_buffer_meta { > > + __u32 meta_page_size; > > + __u32 meta_struct_len; > > + > > + __u32 subbuf_size; > > + __u32 nr_subbufs; > > + > > + struct { > > + __u64 lost_events; > > + __u32 id; > > + __u32 read; > > + } reader; > > + > > + __u64 flags; > > + > > + __u64 entries; > > + __u64 overrun; > > + __u64 read; > > + > > + __u64 Reserved1; > > + __u64 Reserved2; > > +}; > > + > > +#endif /* _TRACE_MMAP_H_ */ > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > index 8179e0a8984e..081065e76d4a 100644 > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > @@ -338,6 +338,7 @@ struct buffer_page { > > local_t entries; /* entries on this page */ > > unsigned longreal_end; /* real end of data */ > > unsigned order; /* order of the page */ > > + u32 id;/* ID for external mapping */ > > struct buffer_data_page *page; /* Actual data page */ > > }; > > > > @@ -484,6 +485,12 @@ struct ring_buffer_per_cpu { > > u64 read_stamp; > > /* pages removed since last reset */ > > unsigned long pages_removed; > > + > > + int mapped; > > + struct mutexmapping_lock; > > + unsigned long *subbuf_ids;/* ID to addr */ > > + struct trace_buffer_meta*meta_page; > > + > > /* ring buffer pages to update, > 0 to add, < 0 to remove */ > > longnr_pages_to_update; > > struct list_headnew_pages; /* new pages to a
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 09:39:42 -0500 Steven Rostedt wrote: > I may try something that will still let me get rid of the ei->dentry. This appears to work, but like always, I may have missed something. I need to add debugging (printks) to make sure that I don't leave any dangling dentries around (dget without a dput). Here's what I did: - Removed the ei->dentry that you loved so much - Removed the ei->d_children that you shared as much love for. - Removed SRCU. All accesses are under the eventfs_mutex. I need to add comments that the callbacks need to be aware of this. Currently, the callbacks do not take any other locks. I may comment that they should never take a lock. - Added the kref that you recommended - Created an eventfs_root_inode that has the structure of: struct eventfs_root_inode { struct eventfs_inode ei; struct dentry *root_dentry; }; The "events" directory is the only directory that allocates this. It is required that its ei->is_events is set, and no other ei has that set. This will hold the only non-dynamic dentry. - I added "parent" to the eventfs_inode that points to the parent eventfs_inode. - On removal, I got rid of the SRCU callback and the work queue. Instead, I find the dentry of the current eventfs_inode that is being deleted by walking the ei->parent until I find the events inode that has a dentry. I then use that to do a lookup walking back down to the eventfs_inode I want to delete. This gives me the dentry that I can call d_invalidate() on. This all works with light testing. I'm sure I did something wrong, but hopefully this is more in line with what you are looking for. This patch is on top of your last patch series. -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index ad11063bdd53..49d4630d5d70 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -24,17 +24,24 @@ #include #include "internal.h" -/* - * eventfs_mutex protects the eventfs_inode (ei) dentry.
Any access - * to the ei->dentry must be done under this mutex and after checking - * if ei->is_freed is not set. When ei->is_freed is set, the dentry - * is on its way to being freed after the last dput() is made on it. - */ +/* eventfs_mutex protects ei->is_freed and the ei->ref counting. */ static DEFINE_MUTEX(eventfs_mutex); /* Choose something "unique" ;-) */ #define EVENTFS_FILE_INODE_INO 0x12c4e37 +/* Used only by the "events" directory */ +struct eventfs_root_inode { + struct eventfs_inode ei; + struct dentry *root_dentry; +}; + +static struct eventfs_root_inode *get_root_inode(struct eventfs_inode *ei) +{ + WARN_ON_ONCE(!ei->is_events); + return container_of(ei, struct eventfs_root_inode, ei); +} + /* Just try to make something consistent and unique */ static int eventfs_dir_ino(struct eventfs_inode *ei) { @@ -44,14 +51,6 @@ static int eventfs_dir_ino(struct eventfs_inode *ei) return ei->ino; } -/* - * The eventfs_inode (ei) itself is protected by SRCU. It is released from - * its parent's list and will have is_freed set (under eventfs_mutex). - * After the SRCU grace period is over and the last dput() is called - * the ei is freed. - */ -DEFINE_STATIC_SRCU(eventfs_srcu); - /* Mode is unsigned short, use the upper bits for flags */ enum { EVENTFS_SAVE_MODE = BIT(16), @@ -360,7 +359,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, ti->private = ei; dentry->d_fsdata = ei; -ei->dentry = dentry; // Remove me!
+ kref_get(&ei->kref); inc_nlink(inode); d_add(dentry, inode); @@ -371,10 +370,30 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, static void free_ei(struct eventfs_inode *ei) { + struct eventfs_root_inode *rei; + kfree_const(ei->name); - kfree(ei->d_children); kfree(ei->entry_attrs); - kfree(ei); + + if (ei->is_events) { + rei = get_root_inode(ei); + kfree(rei); + } else { + kfree(ei); + } +} + +static void kref_ei_release(struct kref *kref) +{ + struct eventfs_inode *ei = container_of(kref, struct eventfs_inode, kref); + WARN_ON_ONCE(!ei->is_freed); + free_ei(ei); +} + + +static void eventfs_inode_put(struct eventfs_inode *ei) +{ + kref_put(&ei->kref, kref_ei_release); } /** @@ -387,7 +406,6 @@ static void free_ei(struct eventfs_inode *ei) void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) { struct eventfs_inode *ei; - int i; mutex_lock(&eventfs_mutex); @@ -395,20 +413,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) if (!ei) goto out; - /* This could belong to one of the files of the ei */ - if (ei->dentry != dentry) { - for (i = 0; i < ei->nr_entries; i++) { -
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 06:39, Steven Rostedt wrote: > > On Tue, 30 Jan 2024 01:12:05 -0800 > > > > I suspect the solution is to make eventfs_create_dir() do the same as > > the events directory case does, and actually pin the directory dentry > > and save it off. > > I rather not have the create do that because that happens for every event > directory. I wasn't thinking straight yesterday - the real fix is to just do a "d_revalidate()". It's literally why that thing exists: check whether a dentry is still valid. In fact, that trivially gets rid of the 'events' subdirectory issues too, so we can stop doing that horrendous simple_recursive_removal() too. Let me reboot into the trivial fix. I just needed to think about this the right way. None of this is anything that the VFS layer has any problems with. Linus
[RFC PATCH v2 0/8] Introduce dcache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CACHE_ALIASING and implements it for all architectures. The implementation of dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Note that the order of the series was completely changed based on the feedback received on v1, cache_is_aliasing() is renamed to dcache_is_aliasing(), ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, and the dax_is_supported() check was moved to its rightful place in all filesystems. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Cc: linux-fsde...@vger.kernel.org Mathieu Desnoyers (8): dax: Introduce dax_is_supported() erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Use dax_is_supported() xfs: Use dax_is_supported() Introduce dcache_is_aliasing() across all architectures dax: Fix incorrect list of dcache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 1 - fs/erofs/super.c| 5 - fs/ext2/super.c | 6 +- fs/ext4/super.c | 5 - fs/fuse/dax.c | 7 +++ fs/xfs/xfs_iops.c | 2 +- include/linux/cacheinfo.h | 6 ++ include/linux/dax.h | 9 + mm/Kconfig | 6 ++ 29 files changed, 142 insertions(+), 5 deletions(-) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 
arch/xtensa/include/asm/cachetype.h -- 2.39.2
[RFC PATCH v2 1/8] dax: Introduce dax_is_supported()
Introduce a new dax_is_supported() static inline to check whether the architecture supports DAX. This replaces the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) This is done in preparation for its use by each filesystem supporting the dax mount option to validate whether dax is indeed supported. This is done in preparation for using dcache_is_aliasing() in a following change which will properly support architectures which detect dcache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/Kconfig | 1 - include/linux/dax.h | 10 ++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..cfc8cd4a3eae 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !IS_ENABLED(CONFIG_ARM) && + !IS_ENABLED(CONFIG_MIPS) && + !IS_ENABLED(CONFIG_SPARC); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +128,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void 
set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
[RFC PATCH v2 3/8] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Print an error and disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext2/super.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..30ff57d47ed4 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (test_opt(sb, DAX)) { - if (!sbi->s_daxdev) { + if (!dax_is_supported()) { + ext2_msg(sb, KERN_ERR, + "DAX unsupported by architecture. Turning off DAX."); + clear_opt(sbi->s_mount_opt, DAX); + } else if (!sbi->s_daxdev) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); clear_opt(sbi->s_mount_opt, DAX); -- 2.39.2
[RFC PATCH v2 4/8] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Mount fails if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext4/super.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..f2c11ae3ec29 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4743,7 +4743,10 @@ static int ext4_check_feature_compatibility(struct super_block *sb, } if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) { - if (ext4_has_feature_inline_data(sb)) { + if (!dax_is_supported()) { + ext4_msg(sb, KERN_ERR, "DAX unsupported by architecture."); + return -EINVAL; + } else if (ext4_has_feature_inline_data(sb)) { ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem" " that may contain inline data"); return -EINVAL; -- 2.39.2
[RFC PATCH v2 5/8] fuse: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/dax.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 12ef91d170bb..36e1c1abbf8e 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1336,6 +1336,13 @@ static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags) if (dax_mode == FUSE_DAX_NEVER) return false; + /* +* Silently fallback to 'never' mode if the architecture does +* not support DAX. +*/ + if (!dax_is_supported()) + return false; + /* * fc->dax may be NULL in 'inode' mode when filesystem device doesn't * support DAX, in which case it will silently fallback to 'never' mode. -- 2.39.2
[RFC PATCH v2 6/8] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a0d77f5f512e..360f640159b0 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1208,7 +1208,7 @@ static bool xfs_inode_should_enable_dax( struct xfs_inode *ip) { - if (!IS_ENABLED(CONFIG_FS_DAX)) + if (!dax_is_supported()) return false; if (xfs_has_dax_never(ip->i_mount)) return false; -- 2.39.2
[RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing
caches") prevents DAX from building on architectures with virtually
aliased dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased dcache.

This is a regression introduced in the v5.13 Linux kernel where the dax
mount option is removed for 32-bit ARMv7 boards which have no dcache
aliasing, and therefore should work fine with FS_DAX.

This was turned into the following implementation of dax_is_supported()
by a preparatory change:

  return !IS_ENABLED(CONFIG_ARM) &&
         !IS_ENABLED(CONFIG_MIPS) &&
         !IS_ENABLED(CONFIG_SPARC);

Use dcache_is_aliasing() instead to figure out whether the environment
has aliasing dcaches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 include/linux/dax.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index cfc8cd4a3eae..f59e604662e4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 
 typedef unsigned long dax_entry_t;
 
@@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 }
 
 static inline bool dax_is_supported(void)
 {
-	return !IS_ENABLED(CONFIG_ARM) &&
-	       !IS_ENABLED(CONFIG_MIPS) &&
-	       !IS_ENABLED(CONFIG_SPARC);
+	return !dcache_is_aliasing();
 }
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
-- 
2.39.2
[RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased
on all architectures. Its purpose is to ensure that subsystems which are
incompatible with virtually aliased data caches (e.g. FS_DAX) can
reliably query this.

For dcache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

 * arc
 * csky
 * m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
 * sh
 * parisc

B) The dcache aliasing is statically known or depends on querying CPU
   state at runtime:

 * arm (cache_is_vivt() || cache_is_vipt_aliasing())
 * mips (cpu_has_dc_aliases)
 * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
 * sparc32 (vac_cache_size > PAGE_SIZE)
 * sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
 * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

 * alpha
 * arm64 (aarch64)
 * hexagon
 * loongarch (but with incoherent write buffers, which are disabled since
   commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
 * microblaze
 * openrisc
 * powerpc
 * riscv
 * s390
 * um
 * x86

Require architectures in A) and B) to select ARCH_HAS_CACHE_ALIASING and
implement "dcache_is_aliasing()". Architectures in C) don't select
ARCH_HAS_CACHE_ALIASING, and thus dcache_is_aliasing() simply evaluates
to "false".

Note that this leaves "icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing dcache-icache but not dcache-dcache.
Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +++++++++
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +++++++++
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +++++++++
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +++++++++
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++++++++++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +++++++++
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +++++++++
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++++++++++++++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
 include/linux/cacheinfo.h           |  6 ++++++
 mm/Kconfig                          |  6 ++++++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..969e6740bcf7 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@ config ARC
 	def_bool y
 	select ARC_TIMERS
+	select ARCH_HAS_CACHE_ALIASING
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..290e3cc85845
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include
+
+#define dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..5adeee5e421f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
 	select ARCH_32BIT_OFF_T
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
 	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..18311570d4f0 10
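The A/B/C split described in the commit message can be sketched as a small userspace model. This is purely illustrative Python, not kernel code; the real implementation is the ARCH_HAS_CACHE_ALIASING Kconfig select plus the per-architecture cachetype.h headers in the diff, and the runtime predicates named below stand in for the per-arch checks such as cache_is_vivt().

```python
# Userspace model of the proposed dcache_is_aliasing() split.
# Group A always aliases; group B asks a runtime predicate; group C
# never selects ARCH_HAS_CACHE_ALIASING and gets the constant-false
# fallback.
ALWAYS_ALIASING = {"arc", "csky", "m68k", "sh", "parisc"}
RUNTIME_CHECKED = {"arm", "mips", "nios2", "sparc32", "sparc64", "xtensa"}

def dcache_is_aliasing(arch, runtime_check=lambda: False):
    if arch in ALWAYS_ALIASING:
        return True
    if arch in RUNTIME_CHECKED:
        # e.g. cache_is_vivt() || cache_is_vipt_aliasing() on arm
        return runtime_check()
    return False  # group C: generic fallback evaluates to false

def dax_is_supported(arch, runtime_check=lambda: False):
    # Mirrors the dax.h change: DAX is usable iff the dcache is not aliasing.
    return not dcache_is_aliasing(arch, runtime_check)
```

The point of the design is visible in the model: group C architectures pay nothing (a compile-time constant), while group B architectures fold their existing CPU-state queries into one predicate.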
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 08:49, Steven Rostedt wrote:
>
> - On removal, I got rid of the SRCU callback and the work queue.
> Instead, I find the dentry of the current eventfs_inode that is being
> deleted by walking the ei->parent until I find the events inode that has
> a dentry. I then use that to do a lookup walking back down to the
> eventfs_inode I want to delete. This gives me the dentry that I can call
> d_invalidate() on.

Yes, that works.

However, I have a patch that is *much* smaller and simpler, and doesn't
need that walk.

The VFS layer already has a good interface for "should I still use this
dentry", which is needed for various network filesystems etc that want
to time out caches (or check explicitly whether the file still exists
etc): it's the dentry d_revalidate() check.

Let me just reboot into it to test that I got all the cases.

It makes the code even more obvious, and avoids all the complexity.

            Linus
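The d_revalidate() idea above can be sketched as a toy cache model. This is illustrative Python, not VFS code; the class names mirror the kernel structs in the patches that follow, but the cache, lookup helper, and constructor are hypothetical stand-ins.

```python
# Toy model of d_revalidate: the dcache keeps entries around, and a
# per-entry hook decides on each lookup whether the cached entry may
# still be used. For eventfs the hook just checks the backing
# eventfs_inode's is_freed flag.
class EventfsInode:
    def __init__(self, name):
        self.name = name
        self.is_freed = False

class Dentry:
    def __init__(self, ei):
        self.ei = ei

dcache = {}

def d_revalidate(dentry):
    # eventfs version: valid while the backing object is not freed
    return dentry.ei is not None and not dentry.ei.is_freed

def lookup(name):
    d = dcache.get(name)
    if d is not None and d_revalidate(d):
        return d                    # cached dentry is still valid
    dcache.pop(name, None)          # stale: drop it and look up again
    d = Dentry(EventfsInode(name))
    dcache[name] = d
    return d
```

This is why no teardown walk is needed in the model: the deleter only flips is_freed, and the stale cache entry is discarded lazily on the next lookup.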
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 08:55:51 -0800
Linus Torvalds wrote:

> On Tue, 30 Jan 2024 at 08:49, Steven Rostedt wrote:
> >
> > - On removal, I got rid of the SRCU callback and the work queue.
> > Instead, I find the dentry of the current eventfs_inode that is being
> > deleted by walking the ei->parent until I find the events inode that has
> > a dentry. I then use that to do a lookup walking back down to the
> > eventfs_inode I want to delete. This gives me the dentry that I can call
> > d_invalidate() on.
>
> Yes, that works.
>
> However, I have a patch that is *much* smaller and simpler, and
> doesn't need that walk.
>
> The VFS layer already has a good interface for "should I still use
> this dentry", which is needed for various network filesystems etc that
> want to time out caches (or check explicitly whether the file still
> exists etc): it's the dentry d_revalidate() check.
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

I actually had this before, but it wasn't working (likely due to
something else that wasn't working, or I did it wrong) so I reverted it.
-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 49d4630d5d70..9867b39ae24c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -451,6 +451,13 @@ lookup_file_dentry(struct dentry *dentry,
 	return dentry;
 }
 
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
+
+	return ei && !ei->is_freed;
+}
+
 /**
  * eventfs_root_lookup - lookup routine to create file/dir
  * @dir: in which a lookup is being done
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..0395459d919e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -392,8 +392,24 @@ static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
 	iput(inode);
 }
 
+static int tracefs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct inode *inode = dentry->d_inode;
+	struct tracefs_inode *ti;
+
+	if (!dentry || !inode)
+		return 0;
+
+	ti = get_tracefs(inode);
+	if (!ti || !(ti->flags & TRACEFS_EVENT_INODE))
+		return 1;
+
+	return eventfs_revalidate(dentry, flags);
+}
+
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_revalidate,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 2af78fd95c93..a1024202c4e5 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -80,5 +80,6 @@ struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags);
 
 #endif /* _TRACEFS_INTERNAL_H */
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 08:55, Linus Torvalds wrote:
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

Yup, this works at least for the case you pointed out, and it really
feels like the RightThing(tm) from a VFS standpoint.

NOTE! This patch is on top of my previous 5-patch series.

            Linus

From 67b31bb53ef65132a65bfbe915628c481a39e73a Mon Sep 17 00:00:00 2001
From: Linus Torvalds
Date: Tue, 30 Jan 2024 09:00:18 -0800
Subject: [PATCH] eventfs: clean up dentry ops and add revalidate function

In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function. We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files. But as it
is, only eventfs dentries use d_fsdata, so we don't really need to
split these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching. We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get
freed at memory pressure rather than more proactively. But that's a
separate issue.
Signed-off-by: Linus Torvalds
---
 fs/tracefs/event_inode.c | 21 +
 fs/tracefs/inode.c       | 27 ++-
 fs/tracefs/internal.h    |  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name)
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-	struct eventfs_inode *ei;
-
-	mutex_lock(&eventfs_mutex);
-	ei = dentry->d_fsdata;
-	if (ei) {
-		dentry->d_fsdata = NULL;
-		put_ei(ei);
-	}
-	mutex_unlock(&eventfs_mutex);
+	put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	}
 
  enoent:
-	/* Nothing found? */
-	d_add(dentry, NULL);
-	result = NULL;
+	/* Don't cache negative lookups, just return an error */
+	result = ERR_PTR(-ENOENT);
 
  out:
 	mutex_unlock(&eventfs_mutex);
@@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
-	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct d
Re: [PATCH v2 2/4] dt-bindings: remoteproc: Add compatibility for TEE support
On Fri, Jan 26, 2024 at 12:03:25PM +0100, Krzysztof Kozlowski wrote: > On 18/01/2024 11:04, Arnaud Pouliquen wrote: > > The "st,stm32mp1-m4-tee" compatible is utilized in a system configuration > > where the Cortex-M4 firmware is loaded by the Trusted execution Environment > > (TEE). > > For instance, this compatible is used in both the Linux and OP-TEE > > device-tree: > > - In OP-TEE, a node is defined in the device tree with the > > st,stm32mp1-m4-tee to support signed remoteproc firmware. > > Based on DT properties, OP-TEE authenticates, loads, starts, and stops > > the firmware. > > - On Linux, when the compatibility is set, the Cortex-M resets should not > > be declared in the device tree. > > > > Signed-off-by: Arnaud Pouliquen > > --- > > V1 to V2 updates > > - update "st,stm32mp1-m4" compatible description to generalize > > - remove the 'reset-names' requirement in one conditional branch, as the > > property is already part of the condition test. > > --- > > .../bindings/remoteproc/st,stm32-rproc.yaml | 52 +++ > > 1 file changed, 43 insertions(+), 9 deletions(-) > > > > diff --git > > a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > index 370af61d8f28..6af821b15736 100644 > > --- a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > +++ b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > @@ -16,7 +16,12 @@ maintainers: > > > > properties: > >compatible: > > -const: st,stm32mp1-m4 > > +enum: > > + - st,stm32mp1-m4 > > + - st,stm32mp1-m4-tee > > The patch looks good to me, but I wonder about this choice of two > compatibles. > > Basically this is the same hardware with the same interface, but two > compatibles to differentiate a bit different firmware setup. We have > already such cases for Qualcomm [1] [2] and new ones will be coming. 
[3] > > I wonder whether this should be rather the same compatible with > additional property, e.g. "st,tee-control" or "remote-control". > > [1] > https://elixir.bootlin.com/linux/v6.7.1/source/Documentation/devicetree/bindings/dma/qcom,bam-dma.yaml#L54 > > [2] > https://elixir.bootlin.com/linux/v6.7.1/source/Documentation/devicetree/bindings/net/qcom,ipa.yaml#L129 > (that's a bit different) > > [3] https://lore.kernel.org/linux-devicetree/20240124103623.GJ4906@thinkpad/ > > @Rob, > Any general guidance for this and Qualcomm? I think we have cases using compatible already as well. Either way is fine with me. Rob
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
Dear All,

On 30.01.2024 12:03, Christophe Leroy wrote:
> Le 30/01/2024 à 10:16, Chen-Yu Tsai a écrit :
>> [Vous ne recevez pas souvent de courriers de we...@chromium.org. D?couvrez
>> pourquoi ceci est important ? https://aka.ms/LearnAboutSenderIdentification ]
>>
>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>> helps removing related #ifdefery in C files.
>>>>
>>>> Signed-off-by: Christophe Leroy
>>> Very nice cleanup, thanks!, applied and pushed
>>>
>>> Luis
>>
>> On next-20240130, which has your modules-next branch, and thus this
>> series and the other "module: Use set_memory_rox()" series applied,
>> my kernel crashes in some very weird way. Reverting your branch
>> makes the crash go away.
>>
>> I thought I'd report it right away. Maybe you folks would know what's
>> happening here? This is on arm64.
>
> That's strange, it seems to bug in module_bug_finalize() which is
> _before_ calls to module_enable_ro() and such.
>
> Can you try to revert the 6 patches one by one to see which one
> introduces the problem ?
>
> In reality, only patch 677bfb9db8a3 really changes things. Other ones
> are more or less only cleanup.

I've also run into this issue with today's (20240130) linux-next on my
test farm. The issue is not fully reproducible, so it was a bit hard to
bisect it automatically. I've spent some time on manual testing and it
looks like reverting the following 2 commits on top of linux-next fixes
the problem:

65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled")
677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")

This in fact means that commit 677bfb9db8a3 is responsible for this
regression, as 65929884f868 has to be reverted only because the latter
depends on it. Let me know what I can do to help debugging this issue.
Here is the stack trace I've got on Khadas VIM3 ARM64 board:

Unable to handle kernel paging request at virtual address 80007bfeeb30
Mem abort info:
  ESR = 0x9647
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x07: level 3 translation fault
Data abort info:
  ISV = 0, ISS = 0x0047, ISS2 = 0x
  CM = 0, WnR = 1, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0a35a000
[80007bfeeb30] pgd=1000f4806003, p4d=1000f4806003, pud=17ed1003, pmd=17ed2003, pte=
Internal error: Oops: 9647 [#1] PREEMPT SMP
Modules linked in:
CPU: 4 PID: 182 Comm: (udev-worker) Not tainted 6.8.0-rc2-next-20240130 #14391
Hardware name: Khadas VIM3 (DT)
pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : module_bug_finalize+0xb0/0xdc
lr : module_bug_finalize+0x70/0xdc
...
Call trace:
 module_bug_finalize+0xb0/0xdc
 load_module+0x182c/0x1c88
 init_module_from_file+0x84/0xc0
 idempotent_init_module+0x180/0x250
 __arm64_sys_finit_module+0x64/0xa0
 invoke_syscall+0x48/0x114
 el0_svc_common.constprop.0+0xc0/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x4c/0xe4
 el0t_64_sync_handler+0xc0/0xc4
 el0t_64_sync+0x190/0x194
Code: 9116e003 f942dc01 a93e8c41 c89ffc73 (f9000433)
---[ end trace ]---

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Re: [PATCH v2 2/4] dt-bindings: remoteproc: Add compatibility for TEE support
On Thu, Jan 18, 2024 at 11:04:31AM +0100, Arnaud Pouliquen wrote: > The "st,stm32mp1-m4-tee" compatible is utilized in a system configuration > where the Cortex-M4 firmware is loaded by the Trusted execution Environment > (TEE). > For instance, this compatible is used in both the Linux and OP-TEE > device-tree: > - In OP-TEE, a node is defined in the device tree with the > st,stm32mp1-m4-tee to support signed remoteproc firmware. > Based on DT properties, OP-TEE authenticates, loads, starts, and stops > the firmware. > - On Linux, when the compatibility is set, the Cortex-M resets should not > be declared in the device tree. > > Signed-off-by: Arnaud Pouliquen > --- > V1 to V2 updates > - update "st,stm32mp1-m4" compatible description to generalize > - remove the 'reset-names' requirement in one conditional branch, as the > property is already part of the condition test. > --- > .../bindings/remoteproc/st,stm32-rproc.yaml | 52 +++ > 1 file changed, 43 insertions(+), 9 deletions(-) > > diff --git a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > index 370af61d8f28..6af821b15736 100644 > --- a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > +++ b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > @@ -16,7 +16,12 @@ maintainers: > > properties: >compatible: > -const: st,stm32mp1-m4 > +enum: > + - st,stm32mp1-m4 > + - st,stm32mp1-m4-tee > +description: > + Use "st,stm32mp1-m4" for the Cortex-M4 coprocessor management by > non-secure context > + Use "st,stm32mp1-m4-tee" for the Cortex-M4 coprocessor management by > secure context > >reg: > description: > @@ -142,21 +147,40 @@ properties: > required: >- compatible >- reg > - - resets > > allOf: >- if: >properties: > -reset-names: > - not: > -contains: > - const: hold_boot > +compatible: > + contains: > +const: st,stm32mp1-m4 > +then: > + if: > +properties: > + reset-names: > +not: > + contains: > 
+const: hold_boot Note that this is true when 'reset-names' is not present. If that is not desired, then you need 'required: [reset-names]'. Not really a new issue though. > + then: > +required: > + - st,syscfg-holdboot > + - resets > + else: > +properties: > + st,syscfg-holdboot: false > +required: > + - resets 'resets' is always required within the outer 'then' schema, so you can move this up a level. > + > + - if: > + properties: > +compatible: > + contains: > +const: st,stm32mp1-m4-tee > then: > - required: > -- st,syscfg-holdboot > -else: >properties: > st,syscfg-holdboot: false > +reset-names: false > +resets: false > > additionalProperties: false > > @@ -188,5 +212,15 @@ examples: >st,syscfg-rsc-tbl = <&tamp 0x144 0x>; >st,syscfg-m4-state = <&tamp 0x148 0x>; > }; > + - | > +#include > +m4@1000 { > + compatible = "st,stm32mp1-m4-tee"; > + reg = <0x1000 0x4>, > +<0x3000 0x4>, > +<0x3800 0x1>; > + st,syscfg-rsc-tbl = <&tamp 0x144 0x>; > + st,syscfg-m4-state = <&tamp 0x148 0x>; > +}; > > ... > -- > 2.25.1 >
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Mon, Jan 29, 2024 at 09:24:07PM -0500, Steven Rostedt wrote:
> On Mon, 29 Jan 2024 09:29:07 -0800
> Beau Belgrave wrote:
>
> > Thanks, yeah ideally we wouldn't use special characters.
> >
> > I'm not picky about this. However, I did want something that clearly
> > allowed a glob pattern to find all versions of a given register name of
> > user_events by user programs that record. The dot notation will pull in
> > more than expected if dotted namespace style names are used.
> >
> > An example is "Asserts" and "Asserts.Verbose" from different programs.
> > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it
> > will pull in "Asserts.Verbose.1" in addition to "Asserts.0".
>
> Do you prevent brackets in names?

No. However, since brackets have a start and end token that are
distinct, finding all versions of your event is trivial compared to a
single dot.

Imagine two events:
Asserts
Asserts[MyCoolIndex]

Resolves to tracepoints of:
Asserts:[0]
Asserts[MyCoolIndex]:[1]

Regardless of brackets in the names, a simple glob of Asserts:\[*\]
only finds Asserts:[0]. This is because we have that end bracket in the
glob and the full event name including the start bracket.

If I register another "version" of Asserts, then I'll have:
Asserts:[0]
Asserts[MyCoolIndex]:[1]
Asserts:[2]

The glob of Asserts:\[*\] will return both:
Asserts:[0]
Asserts:[2]

At this point the program can either record all versions or scan
further to find which version of Asserts is wanted.

> > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it
> > doesn't work if the number is higher, like 128. If we ever decide to
> > change the ID from an integer to say hex to save space, these globs
> > would break.
> >
> > Is there some scheme that fits the C-variable name that addresses the
> > above scenarios? Brackets gave me a simple glob that seemed to prevent a
> > lot of this ("Asserts.\[*\]" in this case).
>
> Prevent a lot of what?
I'm not sure what your example here is. > I'll try again :) We have 2 events registered via user_events: Asserts Asserts.Verbose Using dot notation these would result in tracepoints of: user_events_multi/Asserts.0 user_events_multi/Asserts.Verbose.1 Using bracket notation these would result in tracepoints of: user_events_multi/Asserts:[0] user_events_multi/Asserts.Verbose:[1] A recording program only wants to enable the Asserts tracepoint. It does not want to record the Asserts.Verbose tracepoint. The program must find the right tracepoint by scanning tracefs under the user_events_multi system. A single dot suffix does not allow a simple glob to be used. The glob Asserts.* will return both Asserts.0 and Asserts.Verbose.1. A simple glob of Asserts:\[*\] will only find Asserts:[0], it will not find Asserts.Verbose:[1]. We could just use brackets and not have the colon (Asserts[0] in this case). But brackets are still special for bash. > > > > Are we confident that we always want to represent the ID as a base-10 > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > programs can find their events easily. > > Is there a difference to what we choose? > If a simple glob of event_name:\[*\] cannot be used, then we must document what the suffix format is, so an appropriate regex can be created. If we start with base-10 then later move to base-16 we will break existing regex patterns on the recording side. I prefer, and have in this series, a base-16 output since it saves on the tracepoint name size. Either way we go, we need to define how recording programs should find the events they care about. So we must be very clear, IMHO, about the format of the tracepoint names in our documentation. I personally think recording programs are likely to get this wrong without proper guidance. Thanks, -Beau > -- Steve
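The glob argument above can be checked mechanically. Here is a quick sketch with Python's fnmatch; the event names are the hypothetical ones from the discussion, and note that where shell writes the bracket pattern as Asserts:\[*\], fnmatch spells a literal '[' as the character class "[[]".

```python
import fnmatch

# Tracepoint names as the two suffix schemes would create them
# (hypothetical events from the thread):
bracket_style = ["Asserts:[0]", "Asserts.Verbose:[1]", "Asserts:[2]"]
dot_style     = ["Asserts.0", "Asserts.Verbose.1", "Asserts.2"]

# Dot suffix: a recorder globbing for all versions of "Asserts"
# also picks up the unrelated "Asserts.Verbose" event.
dot_hits = fnmatch.filter(dot_style, "Asserts.*")
# -> ['Asserts.0', 'Asserts.Verbose.1', 'Asserts.2']

# Bracket suffix: the distinct start/end tokens make the match
# unambiguous, whatever base or width the ID uses.
bracket_hits = fnmatch.filter(bracket_style, "Asserts:[[]*]")
# -> ['Asserts:[0]', 'Asserts:[2]']
```

The bracket pattern also keeps working if the ID grows past one digit or switches to hex, which is exactly the failure mode of "Asserts.[0-9]".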
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Tue, Jan 30, 2024 at 11:12:22PM +0900, Masami Hiramatsu wrote: > On Mon, 29 Jan 2024 09:29:07 -0800 > Beau Belgrave wrote: > > > On Fri, Jan 26, 2024 at 03:04:45PM -0500, Steven Rostedt wrote: > > > On Fri, 26 Jan 2024 11:10:07 -0800 > > > Beau Belgrave wrote: > > > > > > > > OK, so the each different event has suffixed name. But this will > > > > > introduce non C-variable name. > > > > > > > > > > Steve, do you think your library can handle these symbols? It will > > > > > be something like "event:[1]" as the event name. > > > > > Personally I like "event.1" style. (of course we need to ensure the > > > > > user given event name is NOT including such suffix numbers) > > > > > > > > > > > > > Just to clarify around events including a suffix number. This is why > > > > multi-events use "user_events_multi" system name and the single-events > > > > using just "user_events". > > > > > > > > Even if a user program did include a suffix, the suffix would still get > > > > appended. An example is "test" vs "test:[0]" using multi-format would > > > > result in two tracepoints ("test:[0]" and "test:[0]:[1]" respectively > > > > (assuming these are the first multi-events on the system). > > > > > > > > I'm with you, we really don't want any spoofing or squatting possible. > > > > By using different system names and always appending the suffix I > > > > believe covers this. > > > > > > > > Looking forward to hearing Steven's thoughts on this as well. > > > > > > I'm leaning towards Masami's suggestion to use dots, as that won't > > > conflict > > > with special characters from bash, as '[' and ']' do. > > > > > > > Thanks, yeah ideally we wouldn't use special characters. > > > > I'm not picky about this. However, I did want something that clearly > > allowed a glob pattern to find all versions of a given register name of > > user_events by user programs that record. The dot notation will pull in > > more than expected if dotted namespace style names are used. 
> > > > An example is "Asserts" and "Asserts.Verbose" from different programs. > > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". > > If we use dot for the suffix number, we can prohibit user to use it > for their name. They still can use '_' (or change the group name?) We could, however, we have user_event integration in OpenTelemetry and I'm unsure if we should really try to restrict names. We'll also at some point have libside integration, which might not have the same restrictions on the user-tracer side as the kernel-tracer side. I'm trying to restrict the user_event group name from changing outside of an eventual tracer namespace. I'd like for each container to inherit a tracer namespace long-term which decides what the actual group name will be instead of users self-selecting names to prevent squatting or spoofing of events. > I just concerned that the name can be parsed by existing tools. Since > ':' is used as a separator for group and event name in some case (e.g. > tracefs "set_event" is using, so trace-cmd and perf is using it.) > Good point. What about just event_name[unique_id]? IE: Drop the colon. Brackets are still special in bash, but it would prevent simple glob patterns from matching to incorrect tracepoints under user_events_multi. > > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > > doesn't work if the number is higher, like 128. If we ever decide to > > change the ID from an integer to say hex to save space, these globs > > would break. > > Hmm, why can't we use regexp? We can use regex, but we'll need to agree the suffix format. We won't be able to change it after that point. I'd prefer a base-16/Hex suffix either in brackets or a simple dot. 
> And if we limit the number of events up to 1000 for each same-name event > we can use fixed numbers, like Asserts.[0-9][0-9][0-9] > I'm always wrong when I guess how many events programs will end up using. Folks always surprise me. I'd rather have a solution that scales to the max number of tracepoints allowed on the system (currently a 16-bit max value). Thanks, -Beau > Thank you, > > > > > Is there some scheme that fits the C-variable name that addresses the > > above scenarios? Brackets gave me a simple glob that seemed to prevent a > > lot of this ("Asserts.\[*\]" in this case). > > > > Are we confident that we always want to represent the ID as a base-10 > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > programs can find their events easily. > > > > Thanks, > > -Beau > > > > > -- Steve > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v4 4/5] dt-bindings: input/touchscreen: imagis: add compatible for IST3032C
On Sat, 20 Jan 2024 20:11:15 +0100, Karel Balej wrote: > From: Karel Balej > > IST3032C is a touchscreen IC which seems mostly compatible with IST3038C > except that it reports a different chip ID value. > > Signed-off-by: Karel Balej > --- > > Notes: > v4: > * Reword commit description to mention how this IC differs from the > already supported. > > .../devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml | 1 + > 1 file changed, 1 insertion(+) > Acked-by: Rob Herring
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 19:56:52 -0800 Linus Torvalds wrote: > [0005-eventfs-get-rid-of-dentry-pointers-without-refcounts.patch > text/x-patch (15652 bytes)] I missed this email when I wrote basically the same thing :-p It was sent just as I went to bed, and this morning I mistook it for an email I had already read. When you mentioned that your last patch was on top of your other 5, I said to myself, "Wait? There's five?" and went back through this thread looking at every email with an attachment. Well, even though I duplicated your work, I'll take that as a learning experience. Anyway, I'm assuming that the other 4 patches are exactly the same as the previous version, as applying patches from attachments is a manual process. I don't even develop on the machine with my email client, so it's either me downloading each one and scp'ing them over to my development box, or finding the lore link and going through them individually one by one. I know you don't send patches inlined anymore, which is a shame, because patchwork takes care of all the administering when patches are inlined, and I don't miss patches like I used to. So, I'll download patch 5 here and just assume that I already have your other 4 patches from the previous email. -- Steve
Re: [PATCH 0/4] tracing/user_events: Introduce multi-format events
On Tue, Jan 30, 2024 at 11:09:33AM +0900, Masami Hiramatsu wrote: > Hi Beau, > > On Tue, 23 Jan 2024 22:08:40 + > Beau Belgrave wrote: > > > Currently user_events supports 1 event with the same name and must have > > the exact same format when referenced by multiple programs. This opens > > an opportunity for malicious or poorly thought-through programs to > > create events that others use with different formats. Another scenario > > is user programs wishing to use the same event name but add more fields > > later when the software updates. Various versions of a program may be > > running side-by-side, which is prevented by the current single format > > requirement. > > > > Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates > > the user program wishes to use the same user_event name, but may have > > several different formats of the event in the future. When this flag is > > used, create the underlying tracepoint backing the user_event with a > > unique name per-version of the format. It's important that existing ABI > > users do not get this logic automatically, even if one of the multi > > format events matches the format. This ensures existing programs that > > create events and assume the tracepoint name will match exactly continue > > to work as expected. Add logic to only check multi-format events with > > other multi-format events and single-format events to only check > > single-format events during find. > > Thanks for this work! This will allow many instances to use the same > user-events at the same time. > > BTW, can we force this flag to be set by default? My concern is if any user > program uses this user-event interface in the container (maybe it is > possible if we bind-mount it). In this case, the user program can > detect the other program is using the event if this flag is not set. 
> Moreover, if there is a malicious program running in the container, > it can prevent using the event name from other programs even if it > is isolated by the name-space. > Multi-format events use a different system name (user_events_multi). So you cannot use the single-format flag to detect multi-format names, etc. You can only use it to find other single-format names like you could always do. Likewise, you cannot use the single-format flag to block or prevent multi-format events from being created. Changing this behavior to the default would break all of our environments. So I'm pretty sure it would break others. The current environment expects tracepoints to show up as their registered name under the user_events system name. If this changed out from under us on a specific kernel version, that would be bad. I'd like eventually to have a tracer namespace concept for containers. Then we would have a user_event_group per tracer namespace. Then all user_events within the container have a unique system name which fully isolates them. However, even with that isolation, we still need a way to allow programs in the same container to have different versions of the same event name. Multi-format events fix this problem. I think isolation should be dealt with via true namespace isolation at the tracing level. > Steve suggested that if a user program which is running in a namespace > uses user-event without this flag, we can reject that by default. > > What would you think about? > This would break all of our environments. It would keep previously compiled programs using the existing ABI from working as expected. I'd much prefer that level of isolation to happen at the namespace level, which is why user_events has plumbing for different groups to achieve this. It's also why the ABI does not allow programs to state a system name. I'm trying to reserve the system name for the group/tracer/namespace isolation work. 
Thanks, -Beau > Thank you, > > > > > > Add a register_name (reg_name) to the user_event struct which allows for > > split naming of events. We now have the name that was used to register > > within user_events as well as the unique name for the tracepoint. Upon > > registering events ensure matches based on first the reg_name, followed > > by the fields and format of the event. This allows for multiple events > > with the same registered name to have different formats. The underlying > > tracepoint will have a unique name in the format of {reg_name}:[unique_id]. > > The unique_id is the time, in nanoseconds, of the event creation converted > > to hex. Since this is done under the register mutex, it is extremely > > unlikely for these IDs to ever match. It's also very unlikely a malicious > > program could consistently guess what the name would be and attempt to > > squat on it via the single format ABI. > > > > For example, if both "test u32 value" and "test u64 value" are used with > > the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique > > tracepoints. The dynamic_events file would then show the following: > > u:test u64 count > > u:test u32 count > > > > The actual trac
Re: [PATCH v9] bus: mhi: host: Add tracing support
On Tue, Jan 30, 2024 at 09:22:52AM -0500, Steven Rostedt wrote: > On Tue, 30 Jan 2024 13:41:52 +0530 > Manivannan Sadhasivam wrote: > > > So same trace will get printed for both mhi_channel_command_start() and > > mhi_channel_command_end()? > > The trace output will also include the tracepoint name. That is, it will > have the same content but will be preceded with: > > mhi_channel_command_start: ... > mhi_channel_command_end: ... > Yes, but the message will be the same: mhi_channel_command_start: chan%d: Updating state to: mhi_channel_command_end: chan%d: Updating state to: Either only one of the traces should be present, or the second one should print, "mhi_channel_command_end: chan%d: Updated state to:" - Mani -- மணிவண்ணன் சதாசிவம்
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 19:56:52 -0800 Linus Torvalds wrote: > [0001-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch > text/x-patch (3561 bytes)] > > [0002-eventfsfs-initialize-the-tracefs-inode-properly.patch text/x-patch > (2548 bytes)] Ah these two are new. -- Steve
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Tue, 30 Jan 2024 10:05:15 -0800 Beau Belgrave wrote: > On Mon, Jan 29, 2024 at 09:24:07PM -0500, Steven Rostedt wrote: > > On Mon, 29 Jan 2024 09:29:07 -0800 > > Beau Belgrave wrote: > > > > > Thanks, yeah ideally we wouldn't use special characters. > > > > > > I'm not picky about this. However, I did want something that clearly > > > allowed a glob pattern to find all versions of a given register name of > > > user_events by user programs that record. The dot notation will pull in > > > more than expected if dotted namespace style names are used. > > > > > > An example is "Asserts" and "Asserts.Verbose" from different programs. > > > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > > > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". > > > > Do you prevent brackets in names? > > > > No. However, since brackets have a start and end token that are distinct, > finding all versions of your event is trivial compared to a single dot. > > Imagine two events: > Asserts > Asserts[MyCoolIndex] > > Resolves to tracepoints of: > Asserts:[0] > Asserts[MyCoolIndex]:[1] > > Regardless of brackets in the names, a simple glob of Asserts:\[*\] only > finds Asserts:[0]. This is because we have that end bracket in the glob > and the full event name including the start bracket. > > If I register another "version" of Asserts, then I'll have: > Asserts:[0] > Asserts[MyCoolIndex]:[1] > Asserts:[2] > > The glob of Asserts:\[*\] will return both: > Asserts:[0] > Asserts:[2] But what if you had registered "Asserts:[MyCoolIndex]:[1]"? Do you prevent colons? > > At this point the program can either record all versions or scan further > to find which version of Asserts is wanted. > > > > > > > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > > > doesn't work if the number is higher, like 128. If we ever decide to > > > change the ID from an integer to say hex to save space, these globs > > > would break. 
> > > > > > Is there some scheme that fits the C-variable name that addresses the > > > above scenarios? Brackets gave me a simple glob that seemed to prevent a > > > lot of this ("Asserts.\[*\]" in this case). > > > > Prevent a lot of what? I'm not sure what your example here is. > > > > I'll try again :) > > We have 2 events registered via user_events: > Asserts > Asserts.Verbose > > Using dot notation these would result in tracepoints of: > user_events_multi/Asserts.0 > user_events_multi/Asserts.Verbose.1 > > Using bracket notation these would result in tracepoints of: > user_events_multi/Asserts:[0] > user_events_multi/Asserts.Verbose:[1] > > A recording program only wants to enable the Asserts tracepoint. It does > not want to record the Asserts.Verbose tracepoint. > > The program must find the right tracepoint by scanning tracefs under the > user_events_multi system. > > A single dot suffix does not allow a simple glob to be used. The glob > Asserts.* will return both Asserts.0 and Asserts.Verbose.1. > > A simple glob of Asserts:\[*\] will only find Asserts:[0]; it will not > find Asserts.Verbose:[1]. > > We could just use brackets and not have the colon (Asserts[0] in this > case). But brackets are still special for bash. Are these shell scripts or programs? I use regex in programs all the time. And if you have shell scripts, use awk or something. Unless you prevent something from being added, I don't see the protection. > > > > > > > > Are we confident that we always want to represent the ID as a base-10 > > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > > programs can find their events easily. > > > > Is there a difference to what we choose? > > > > If a simple glob of event_name:\[*\] cannot be used, then we must document > what the suffix format is, so an appropriate regex can be created. If we > start with base-10 then later move to base-16 we will break existing regex > patterns on the recording side. 
> > I prefer, and have in this series, a base-16 output since it saves on > the tracepoint name size. I honestly don't care which base you use. So if you want to use base 16, I'm fine with that. > > Either way we go, we need to define how recording programs should find > the events they care about. So we must be very clear, IMHO, about the > format of the tracepoint names in our documentation. > > I personally think recording programs are likely to get this wrong > without proper guidance. > Agreed. -- Steve
[PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily
The eventfs_find_events() code tries to walk up the tree to find the event directory that a dentry belongs to, in order to then find the eventfs inode that is associated with that event directory. However, it uses an odd combination of walking the dentry parent, looking up the eventfs inode associated with that, and then looking up the dentry from there. Repeat. But the code shouldn't have back-pointers to dentries in the first place, and it should just walk the dentry parenthood chain directly. Similarly, 'set_top_events_ownership()' looks up the dentry from the eventfs inode, but the only reason it wants a dentry is to look up the superblock in order to look up the root dentry. But it already has the real filesystem inode, which has that same superblock pointer. So just pass in the superblock pointer using the information that's already there, instead of looking up extraneous data that is irrelevant. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 26 -- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1c3dd0ad4660..2d128bedd654 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry) +static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) { - struct inode *inode; + struct inode *root; /* Only update if the "events" was on the top level */ if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; /* Get the tracefs root inode. 
*/ - inode = d_inode(dentry->d_sb->s_root); - ei->attr.uid = inode->i_uid; - ei->attr.gid = inode->i_gid; + root = d_inode(sb->s_root); + ei->attr.uid = root->i_uid; + ei->attr.gid = root->i_gid; } static void set_top_events_ownership(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct eventfs_inode *ei = ti->private; - struct dentry *dentry; /* The top events directory doesn't get automatically updated */ if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; - dentry = ei->dentry; - - update_top_events_attr(ei, dentry); + update_top_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -235,8 +232,10 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) mutex_lock(&eventfs_mutex); do { - /* The parent always has an ei, except for events itself */ - ei = dentry->d_parent->d_fsdata; + // The parent is stable because we do not do renames + dentry = dentry->d_parent; + // ... and directories always have d_fsdata + ei = dentry->d_fsdata; /* * If the ei is being freed, the ownership of the children @@ -246,12 +245,11 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) ei = NULL; break; } - - dentry = ei->dentry; + // Walk upwards until you find the events inode } while (!ei->is_events); mutex_unlock(&eventfs_mutex); - update_top_events_attr(ei, dentry); + update_top_events_attr(ei, dentry->d_sb); return ei; } -- 2.43.0.5.g38fb137bdb
[PATCH 2/6] eventfs: initialize the tracefs inode properly
The tracefs-specific fields in the inode were not initialized before the inode was exposed to others through the dentry with 'd_instantiate()'. And the ->flags field was initialized incorrectly with a '|=', when the old value was stale. It should have just been a straight assignment. Move the field initializations up to before the d_instantiate, and fix the use of uninitialized data. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 2d128bedd654..c0d977e6c0f2 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_ino = EVENTFS_FILE_INODE_INO; ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE; + ti->flags = TRACEFS_EVENT_INODE; + ti->private = NULL; // Directories have 'ei', files not + d_instantiate(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); return eventfs_end_creating(dentry); @@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_ino = eventfs_dir_ino(ei); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE; + ti->flags = TRACEFS_EVENT_INODE; + ti->private = ei; inc_nlink(inode); d_instantiate(dentry, inode); @@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, static void eventfs_post_create_dir(struct eventfs_inode *ei) { struct eventfs_inode *ei_child; - struct tracefs_inode *ti; lockdep_assert_held(&eventfs_mutex); @@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) srcu_read_lock_held(&eventfs_srcu)) { ei_child->d_parent = ei->dentry; } - - ti = get_tracefs(ei->dentry->d_inode); - ti->private = ei; } /** @@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(&ei->list); ti = get_tracefs(inode); - ti->flags |= 
TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; -- 2.43.0.5.g38fb137bdb
[PATCH 3/6] tracefs: dentry lookup crapectomy
The dentry lookup for eventfs files was very broken, and had lots of signs of the old situation where the filesystem names were all created statically in the dentry tree, rather than being looked up dynamically based on the eventfs data structures. You could see it in the naming - how it claimed to "create" dentries rather than just look up the dentries that were given it. You could see it in various nonsensical and very incorrect operations, like using "simple_lookup()" on the dentries that were passed in, which only results in those dentries becoming negative dentries. Which meant that any other lookup would possibly return ENOENT if it saw that negative dentry before the data was then later filled in. You could see it in the immense amount of nonsensical code that didn't actually just do lookups. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 275 --- 1 file changed, 52 insertions(+), 223 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index c0d977e6c0f2..ad11063bdd53 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) { struct eventfs_inode *ei; - mutex_lock(&eventfs_mutex); do { // The parent is stable because we do not do renames dentry = dentry->d_parent; @@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) } // Walk upwards until you find the events inode } while (!ei->is_events); - mutex_unlock(&eventfs_mutex); update_top_events_attr(ei, dentry->d_sb); @@ -280,11 +278,10 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, } /** - * create_file - create a file in the tracefs filesystem - * @name: the name of the file to create. + * lookup_file - look up a file in the tracefs filesystem + * @dentry: the dentry to look up + * @mode: the permission that the file should have. 
* @attr: saved attributes changed by user - * @parent: parent dentry for this file. * @data: something that the caller will want to get to later on. * @fop: struct file_operations that should be used for this file. * @@ -292,13 +289,13 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, * directory. The inode.i_private pointer will point to @data in the open() * call. */ -static struct dentry *create_file(const char *name, umode_t mode, +static struct dentry *lookup_file(struct dentry *dentry, + umode_t mode, struct eventfs_attr *attr, - struct dentry *parent, void *data, + void *data, const struct file_operations *fop) { struct tracefs_inode *ti; - struct dentry *dentry; struct inode *inode; if (!(mode & S_IFMT)) @@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, umode_t mode, if (WARN_ON_ONCE(!S_ISREG(mode))) return NULL; - WARN_ON_ONCE(!parent); - dentry = eventfs_start_creating(name, parent); - - if (IS_ERR(dentry)) - return dentry; - inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) return eventfs_failed_creating(dentry); @@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, umode_t mode, ti->flags = TRACEFS_EVENT_INODE; ti->private = NULL; // Directories have 'ei', files not - d_instantiate(dentry, inode); + d_add(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); return eventfs_end_creating(dentry); }; /** - * create_dir - create a dir in the tracefs filesystem + * lookup_dir_entry - look up a dir in the tracefs filesystem + * @dentry: the directory to look up * @ei: the eventfs_inode that represents the directory to create - * @parent: parent dentry for this file. * - * This function will create a dentry for a directory represented by + * This function will look up a dentry for a directory represented by * a eventfs_inode. 
*/ -static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent) +static struct dentry *lookup_dir_entry(struct dentry *dentry, + struct eventfs_inode *pei, struct eventfs_inode *ei) { struct tracefs_inode *ti; - struct dentry *dentry; struct inode *inode; - dentry = eventfs_start_creating(ei->name, parent); - if (IS_ERR(dentry)) - return dentry; - inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) return eventfs_failed_creating(dentry); @@ -372,8 +359,11 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *par
[PATCH 4/6] eventfs: remove unused 'd_parent' pointer field
It's never used

Signed-off-by: Linus Torvalds
---
 fs/tracefs/event_inode.c | 4 +---
 fs/tracefs/internal.h    | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..1d0102bfd7da 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -685,10 +685,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 	INIT_LIST_HEAD(&ei->list);
 
 	mutex_lock(&eventfs_mutex);
-	if (!parent->is_freed) {
+	if (!parent->is_freed)
 		list_add_tail(&ei->list, &parent->children);
-		ei->d_parent = parent->dentry;
-	}
 	mutex_unlock(&eventfs_mutex);
 
 	/* Was the parent freed? */
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..8f38740bfb5b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -35,7 +35,6 @@ struct eventfs_attr {
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
  * @dentry:	the dentry of the directory
- * @d_parent:	pointer to the parent's dentry
  * @d_children: The array of dentries to represent the files when created
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
@@ -50,7 +49,6 @@ struct eventfs_inode {
 	const char		*name;
 	struct list_head	children;
 	struct dentry		*dentry; /* Check is_freed to access */
-	struct dentry		*d_parent;
 	struct dentry		**d_children;
 	struct eventfs_attr	*entry_attrs;
 	struct eventfs_attr	attr;
-- 
2.43.0.5.g38fb137bdb
[PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
The eventfs inode had pointers to dentries (and child dentries) without actually holding a refcount on said pointer. That is fundamentally broken, and while eventfs tried to then maintain coherence with dentries going away by hooking into the '.d_iput' callback, that doesn't actually work since it's not ordered wrt lookups. There were two reasons why eventfs tried to keep a pointer to a dentry: - the creation of an 'events' directory would actually have a stable dentry pointer that it created with tracefs_start_creating(). And it needed that dentry when tearing it all down again in eventfs_remove_events_dir(). This use is actually ok, because the special top-level events directory dentries are actually stable, not just a temporary cache of the eventfs data structures. - the 'eventfs_inode' (aka ei) needs to stay around as long as there are dentries that refer to it. It then used these dentry pointers as a replacement for doing reference counting: it would try to make sure that there was only ever one dentry associated with an event_inode, and keep a child dentry array around to see which dentries might still refer to the parent ei. This gets rid of the invalid dentry pointer use, and renames the one valid case to a different name to make it clear that it's not just any random dentry. The magic child dentry array that is kind of a "reverse reference list" is simply replaced by having child dentries take a ref to the ei. As do the directory dentries. That makes the broken use case go away. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 245 --- fs/tracefs/internal.h| 9 +- 2 files changed, 80 insertions(+), 174 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1d0102bfd7da..a37db0dac302 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -62,6 +62,34 @@ enum { #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) +/* + * eventfs_inode reference count management. + * + * NOTE! 
We count only references from dentries, in the + * form 'dentry->d_fsdata'. There are also references from + * directory inodes ('ti->private'), but the dentry reference + * count is always a superset of the inode reference count. + */ +static void release_ei(struct kref *ref) +{ + struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + kfree(ei->entry_attrs); + kfree(ei); +} + +static inline void put_ei(struct eventfs_inode *ei) +{ + if (ei) + kref_put(&ei->kref, release_ei); +} + +static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) +{ + if (ei) + kref_get(&ei->kref); + return ei; +} + static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); @@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, * directory. The inode.i_private pointer will point to @data in the open() * call. */ -static struct dentry *lookup_file(struct dentry *dentry, +static struct dentry *lookup_file(struct eventfs_inode *parent_ei, + struct dentry *dentry, umode_t mode, struct eventfs_attr *attr, void *data, @@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry *dentry, mode |= S_IFREG; if (WARN_ON_ONCE(!S_ISREG(mode))) - return NULL; + return ERR_PTR(-EIO); inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) - return eventfs_failed_creating(dentry); + return ERR_PTR(-ENOMEM); /* If the user updated the directory's attributes, use them */ update_inode_attr(dentry, inode, attr, mode); @@ -322,9 +351,12 @@ static struct dentry *lookup_file(struct dentry *dentry, ti->flags = TRACEFS_EVENT_INODE; ti->private = NULL; // Directories have 'ei', files not + // Files have their parent's ei as their fsdata + dentry->d_fsdata = get_ei(parent_ei); + d_add(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); - return eventfs_end_creating(dentry); + return NULL; }; /** @@ -343,7 +375,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry, inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) - return eventfs_failed_creating(dentry); + return ERR_PTR(-ENOMEM); /* If the user updated the directory's attributes, use them */ update_inode_attr(dentry, inode, &ei->attr, @@ -359,24 +391,28 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, ti->flags = TRACEFS_EVENT_
[PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
In order for the dentries to stay up-to-date with the eventfs changes, just add a 'd_revalidate' function that checks the 'is_freed' bit. Also, clean up the dentry release to actually use d_release() rather than the slightly odd d_iput() function. We don't care about the inode, all we want to do is to get rid of the refcount to the eventfs data added by dentry->d_fsdata. It would probably be cleaner to make eventfs its own filesystem, or at least set its own dentry ops when looking up eventfs files. But as it is, only eventfs dentries use d_fsdata, so we don't really need to split these things up by use. Another thing that might be worth doing is to make all eventfs lookups mark their dentries as not worth caching. We could do that with d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. As it is, the dentries are all freeable, but they only tend to get freed at memory pressure rather than more proactively. But that's a separate issue. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 21 + fs/tracefs/inode.c | 27 ++- fs/tracefs/internal.h| 3 ++- 3 files changed, 25 insertions(+), 26 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a37db0dac302..acdc797bd9c0 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name) /** - * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode - * @ti: the tracefs_inode of the dentry + * eventfs_d_release - dentry is going away * @dentry: dentry which has the reference to remove. * * Remove the association between a dentry from an eventfs_inode. 
*/ -void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) +void eventfs_d_release(struct dentry *dentry) { - struct eventfs_inode *ei; - - mutex_lock(&eventfs_mutex); - ei = dentry->d_fsdata; - if (ei) { - dentry->d_fsdata = NULL; - put_ei(ei); - } - mutex_unlock(&eventfs_mutex); + put_ei(dentry->d_fsdata); } /** @@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, } enoent: - /* Nothing found? */ - d_add(dentry, NULL); - result = NULL; + /* Don't cache negative lookups, just return an error */ + result = ERR_PTR(-ENOENT); out: mutex_unlock(&eventfs_mutex); @@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) * sticks around while the other ei->dentry are created * and destroyed dynamically. */ - simple_recursive_removal(dentry, NULL); dput(dentry); } diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..64122787e5d0 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = { .show_options = tracefs_show_options, }; -static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode) +/* + * It would be cleaner if eventfs had its own dentry ops. + * + * Note that d_revalidate is called potentially under RCU, + * so it can't take the eventfs mutex etc. It's fine - if + * we open a file just as it's marked dead, things will + * still work just fine, and just see the old stale case. 
+ */ +static void tracefs_d_release(struct dentry *dentry) { - struct tracefs_inode *ti; + if (dentry->d_fsdata) + eventfs_d_release(dentry); +} - if (!dentry || !inode) - return; +static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + struct eventfs_inode *ei = dentry->d_fsdata; - ti = get_tracefs(inode); - if (ti && ti->flags & TRACEFS_EVENT_INODE) - eventfs_set_ei_status_free(ti, dentry); - iput(inode); + return !(ei && ei->is_freed); } static const struct dentry_operations tracefs_dentry_operations = { - .d_iput = tracefs_dentry_iput, + .d_revalidate = tracefs_d_revalidate, + .d_release = tracefs_d_release, }; static int trace_fill_super(struct super_block *sb, void *data, int silent) diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 72db3bdc4dfb..d4194466b643 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb); struct dentry *eventfs_start_creating(const char *name, struct dentry *parent); struct dentry *eventfs_failed_creating(struct dentry *dentry); struct dentry *eventfs_end_creating(struct dentry *dentry); -void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry); + +void eventfs_d_release(struct dentry *dentry); #endif /* _TRACEFS_INTERNAL_H */ -- 2.43.0.5.g38fb137bdb
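The new revalidate rule in the patch above is small enough to model in plain C. The sketch below is a userspace stand-in for the logic only (the struct and function names are mine, not kernel code): a dentry with no eventfs data — i.e. a plain tracefs file — always revalidates, while one whose eventfs_inode has been marked freed does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the eventfs_inode state d_revalidate cares about. */
struct eventfs_inode_model {
	bool is_freed;
};

/*
 * Mirrors the shape of tracefs_d_revalidate(): d_fsdata is NULL for
 * plain tracefs files, so only eventfs entries whose 'ei' has been
 * marked freed fail revalidation and get looked up again.
 */
static int revalidate_model(struct eventfs_inode_model *ei)
{
	return !(ei && ei->is_freed);
}
```

Note the NULL case: because the check is `ei && ei->is_freed`, dentries that never had eventfs data attached stay valid for free, which is why one shared `dentry_operations` works for all of tracefs.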
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 10:23, Steven Rostedt wrote: > > I know you don't send patches inlined anymore, which is a shame, because > patchwork takes care of all the administering when patches are inlined, and > I don't miss patches like I use to. I just sent the whole series as individual patches, and doing a b4 am 20240130190355.11486-1-torva...@linux-foundation.org should get them all (or if you use patchwork, do that) I don't normally do these inlined patches any more, because honestly, 99% of all patches I do end up being "I can't test this very well, I think you should do something like this". In fact, for simple ones where I might not have even compile-tested them, much less booted into a kernel with them, I will actively whitespace-corrupt the patch, just to make sure they aren't applied as any kind of real workflow - they are almost always meant as a "I think you should do this, and take credit for it all". And so when I'm working on a series like this, I'll send attachments just because it's easier, and because I don't want to patch-bomb people with some kind of crazy work-in-progress thing. But I'm reasonably comfortable with this series now, so I sent it as a "real" patch series. I like it partly because it just removes a lot of lines: 3 files changed, 160 insertions(+), 433 deletions(-) but mostly because the lines it removes are what I consider actively broken code. So it's not just getting rid of LOC, it's getting rid of complexity (and bugs) IMHO. That said, I also don't think that patch series is any kind of "final". I didn't fix up the readdir iterator locking, for example. I don't think the SRCU parts are needed at all any more thanks to the refcounting - the 'ei' is no longer protected by SRCU, it's protected by virtue of us having the file open (and thus holding the dentry). So please think of that series not as any kind of final word. More as a "I strongly believe this is the direction eventfs should go". 
I am perfectly ok with you munging those patches and taking full credit for them, for example. My "testing" has not involved any real tracing usage, and while I *have* booted that thing, and have done some very basic smoke-testing in /sys/kernel/tracing, 99% of the work was literally me just going through the lookup code, removing everything I found objectionable, and replacing it with what the VFS layer generally would want. Linus
Re: [ANNOUNCE] 5.10.204-rt100
Hi! We (as in the cip project) are trying to do -cip-rt releases once a month. Are there any plans for a 5.10-rt release any time soon? That would help us ;-). Best regards, Pavel -- DENX Software Engineering GmbH, Managing Director: Erika Unter HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[RFC PATCH] kernel/module: add a safer implementation of try_module_get()
The current implementation of try_module_get() requires the module to exist and be live as a precondition. While this may seem intuitive at first glance, enforcing the precondition can be tricky, considering that modules can be unloaded at any time if not previously taken. For instance, the caller could be preempted just before calling try_module_get(), and while preempted, the module could be unloaded and freed. More subtly, the module could also be unloaded at any point while executing try_module_get() before incrementing the refcount with atomic_inc_not_zero(). Neglecting the precondition that the module must exist and be live can cause unexpected race conditions that can lead to crashes. However, ensuring that the precondition is met may require additional locking that increases the complexity of the code and can make it more error-prone. This patch adds a slower yet safer implementation of try_module_get() that checks if the module is valid by looking into the mod_tree before taking the module's refcount. This new function can be safely called on stale and invalid module pointers, relieving developers from the burden of ensuring that the module exists and is live before attempting to take it. The tree lookup and refcount increment are executed after taking the module_mutex to prevent the module from being unloaded after looking up the tree. Signed-off-by: Marco Pagani --- include/linux/module.h | 15 +++ kernel/module/main.c | 27 +++ 2 files changed, 42 insertions(+) diff --git a/include/linux/module.h b/include/linux/module.h index 08364d5cbc07..86b6ea43d204 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -695,6 +695,19 @@ extern void __module_get(struct module *module); */ extern bool try_module_get(struct module *module); +/** + * try_module_get_safe() - safely take the refcount of a module. + * @module: address of the module to be taken. + * + * Safer version of try_module_get(). 
Check first if the module exists and is alive, + * and then take its reference count. + * + * Return: + * * %true - module exists and its refcount has been incremented or module is NULL. + * * %false - module does not exist. + */ +extern bool try_module_get_safe(struct module *module); + /** * module_put() - release a reference count to a module * @module: the module we should release a reference count for @@ -815,6 +828,8 @@ static inline bool try_module_get(struct module *module) return true; } +#define try_module_get_safe(module) try_module_get(module) + static inline void module_put(struct module *module) { } diff --git a/kernel/module/main.c b/kernel/module/main.c index 98fedfdb8db5..22376b69778c 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -842,6 +842,33 @@ bool try_module_get(struct module *module) } EXPORT_SYMBOL(try_module_get); +bool try_module_get_safe(struct module *module) +{ + struct module *mod; + bool ret = true; + + if (!module) + goto out; + + mutex_lock(&module_mutex); + + /* +* Check if the address points to a valid live module and take +* the refcount only if it points to the module struct. +*/ + mod = __module_address((unsigned long)module); + if (mod && mod == module && module_is_live(mod)) + __module_get(mod); + else + ret = false; + + mutex_unlock(&module_mutex); + +out: + return ret; +} +EXPORT_SYMBOL(try_module_get_safe); + void module_put(struct module *module) { int ret; base-commit: 4515d08a742c76612b65d2f47a87d12860519842 -- 2.43.0
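The core idea of the RFC above — look the pointer up in a registry under a lock, and only then bump its refcount, so a stale pointer is rejected instead of dereferenced — can be modeled in plain userspace C. This is an illustrative sketch, not the kernel implementation: the registry array, `lookup()`, and `get_safe()` are stand-ins for mod_tree, `__module_address()`, and `try_module_get_safe()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct mod {
	int refcnt;
	bool live;
};

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct mod *registry[8];	/* the currently loaded "modules" */

/* Analogue of __module_address(): is this pointer a registered module? */
static struct mod *lookup(struct mod *p)
{
	for (size_t i = 0; i < 8; i++)
		if (registry[i] == p)
			return p;
	return NULL;
}

/*
 * Analogue of try_module_get_safe(): the validity check and the
 * refcount increment happen under one lock, so the module cannot be
 * "unloaded" (removed from the registry) in between.
 */
static bool get_safe(struct mod *p)
{
	bool ok = true;

	if (!p)
		return true;	/* NULL means built-in: always succeeds */

	pthread_mutex_lock(&registry_lock);
	if (lookup(p) && p->live)
		p->refcnt++;
	else
		ok = false;
	pthread_mutex_unlock(&registry_lock);
	return ok;
}
```

A caller can then pass a possibly-stale pointer without any prior guarantee that the object still exists, which is exactly the burden the RFC is trying to lift.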
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 11:19:01 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 10:23, Steven Rostedt wrote: > > > > I know you don't send patches inlined anymore, which is a shame, because > > patchwork takes care of all the administering when patches are inlined, and > > I don't miss patches like I use to. > > I just sent the whole series as individual patches., and doing a > > b4 am 20240130190355.11486-1-torva...@linux-foundation.org > > should get them all (or if you use patchwork, do that) Thanks. I can get the mbox with: wget https://patchwork.kernel.org/series/821406/mbox/ For those interested, the series is at: https://patchwork.kernel.org/project/linux-trace-kernel/list/?series=821406 And I got the above mbox link from the first patch and selecting the 'mbox' link. I download that and then run my scripts which adds the Link tags to them as well as Cc's. I'll add the link to this thread as well. Then I simply do a 'git am -s' Note, I mentioned the pain of pulling in your multi-attachment patches by hand on IRC, and Konstantin replied that he'll fix b4 to do those too ;-) But I still prefer patchwork, as that keeps history as well. > > I don't normally do these inlined patches any more, because honestly, > 99% of all patches I do end up being "I can't test this very well, I > think you should do something like this". > > In fact, for simple ones where I might not have even compile-tested > them, much less booted into a kernel with them, I will actively > whitespace-corrupt the patch, just to make sure they aren't applied as > any kind of real workflow - they are almost always meant as a "I think > you should do this, and take credit for it all". > > And so when I'm working on a series like this, I'll send attachments > just because it's easier, and because I don't want to patch-bomb > people with some kind of crazy work-in-progress thing. > > But I'm reasonably comfortable with this series now, so I sent it as a > "real" patch series. 
I like it partly because it just removes a lot of > lines: > > 3 files changed, 160 insertions(+), 433 deletions(-) > > but mostly because the lines it removes are what I consider actively > broken code. So it's not just getting rid of LOC, it's getting rid of > complexity (and bugs) IMHO. Do you want me to put this in my urgent branch and mark them for stable, and then send them for 6.8? > > That said, I also don't think that patch series is any kind of > "final". I didn't fix up the readdir iterator locking, for example. I > don't think the SRCU parts are needed at all any more thanks to the > refcounting - the 'ei' is no longer protected by SRCU, it's protected > by virtue of us having the file open (and thus holding the dentry). > > So please think of that series not as any kind of final word. More as > a "I strongly believe this is the direction eventfs should go". > > I am perfectly ok with you munging those patches and taking full > credit for them, for example. Before seeing your 5 patch series, I wrote that patch I showed you which did basically the same thing. It passed all the preliminary tests so I'm sure your version should work too. I'll still have to run it through my full ktest suite, but that takes several hours to complete. I do that before I send out my 'for-linus/next' or any git pulls. > > My "testing" has not involved any real tracing usage, and while I > *have* booted that thing, and have done some very basic smoke-testing > in /sys/kernel/tracing, 99% of the work was literally me just going > through the lookup code, removing everything I found objectionable, > and replacing it with what the VFS layer generally would want. I'll take your patches and start working on them. Thanks for doing this! -- Steve
Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
On Tue, 30 Jan 2024 11:03:51 -0800 Linus Torvalds wrote: > The tracefs-specific fields in the inode were not initialized before the > inode was exposed to others through the dentry with 'd_instantiate()'. > > And the ->flags field was initialized incorrectly with a '|=', when the > old value was stale. It should have just been a straight assignment. The ti is allocated from fs/tracefs/inode.c that has: static struct inode *tracefs_alloc_inode(struct super_block *sb) { struct tracefs_inode *ti; ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); if (!ti) return NULL; ti->flags = 0; return &ti->vfs_inode; } Shouldn't that make it valid? Granted, the eventfs inodes don't have any of the tracefs inode flags set. But I purposely made the ti->flags initialized to zero. Is this update really necessary? Or do I need to make sure that the iput() clears it? The flags are used by tracefs, so I want to know if there's not a bug there. -- Steve > > Move the field initializations up to before the d_instantiate, and fix > the use of uninitialized data. 
> > Signed-off-by: Linus Torvalds > --- > fs/tracefs/event_inode.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 2d128bedd654..c0d977e6c0f2 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, > umode_t mode, > inode->i_ino = EVENTFS_FILE_INODE_INO; > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > + ti->private = NULL; // Directories have 'ei', files > not > + > d_instantiate(dentry, inode); > fsnotify_create(dentry->d_parent->d_inode, dentry); > return eventfs_end_creating(dentry); > @@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode > *ei, struct dentry *parent > inode->i_ino = eventfs_dir_ino(ei); > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > + ti->private = ei; > > inc_nlink(inode); > d_instantiate(dentry, inode); > @@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, > static void eventfs_post_create_dir(struct eventfs_inode *ei) > { > struct eventfs_inode *ei_child; > - struct tracefs_inode *ti; > > lockdep_assert_held(&eventfs_mutex); > > @@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode > *ei) >srcu_read_lock_held(&eventfs_srcu)) { > ei_child->d_parent = ei->dentry; > } > - > - ti = get_tracefs(ei->dentry->d_inode); > - ti->private = ei; > } > > /** > @@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const > char *name, struct dentry > INIT_LIST_HEAD(&ei->list); > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; > + ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; > ti->private = ei; > > inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
On Tue, 30 Jan 2024 11:03:51 -0800 Linus Torvalds wrote: > @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, > umode_t mode, > inode->i_ino = EVENTFS_FILE_INODE_INO; > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > + ti->private = NULL; // Directories have 'ei', files > not Although ti->private does need to be initialized here. -- Steve > + > d_instantiate(dentry, inode); > fsnotify_create(dentry->d_parent->d_inode, dentry); > return eventfs_end_creating(dentry);
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 11:37, Steven Rostedt wrote: > > Do you want me to put this in my urgent branch and mark them for stable, > and then send them for 6.8? Hmm. I think the only one that fixes a _reported_ bug is that [PATCH 2/6]. And it turns out that while 'ti->private' really is entirely uninitialized (and I still think it's the cause of the kernel test robot report that started this thread), the ti->flags field _is_ initialized to zero in tracefs_alloc_inode(). So even in that [PATCH 2/6], these parts: - ti->flags |= TRACEFS_EVENT_INODE; + ti->flags = TRACEFS_EVENT_INODE; aren't strictly needed (but aren't wrong either). The 'make sure to initialize ti->private before exposing the dentry' part *definitely* needs to be part of 6.8, though. That has an outstanding actually triggered bug report on it. I do think that tracefs_alloc_inode() should also initialize ti->private to NULL, but while that would fix the oops that the test robot reported, it wouldn't fix the data-race on any ti->private accesses. So that "ti->private = ei" needs to be done before the d_instantiate() (that later became a d_add()) regardless. But not having random fields left uninitialized for future subtle bugs would be a good idea too. Anyway. If you do run the full tracefs tests on the whole series, and there are no other major problems, I'll happily take it all for 6.8. And yes, even mark it for stable. I think the other bugs are much harder to hit, but I do think they exist. And code deletion is always good. So give it the full test attention, and *if* it all still looks good and there are no new subtle issues that crop up, let's just put this saga behind us asap. Linus
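The ordering rule Linus spells out — assign ti->private first, then call d_instantiate()/d_add() — is the classic init-before-publish pattern. A small userspace model (the types, atomics, and names here are illustrative stand-ins, not the tracefs code) shows the safe ordering: a concurrent reader can only reach the object through the published pointer, and by then every field is already filled in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the tracefs_inode whose private data gets published. */
struct node {
	void *private_data;
};

static struct node slot;
static _Atomic(struct node *) published;	/* plays the role of d_instantiate() */

static void *reader(void *arg)
{
	struct node *n;

	(void)arg;
	/* Spin until the object is visible, as a concurrent lookup would. */
	while (!(n = atomic_load_explicit(&published, memory_order_acquire)))
		;
	/* With init-before-publish, the field can never be seen unset. */
	return n->private_data;
}

static void *publish_correctly(void)
{
	pthread_t t;
	void *seen;

	pthread_create(&t, NULL, reader, NULL);
	slot.private_data = &slot;	/* initialize first... */
	atomic_store_explicit(&published, &slot, memory_order_release);	/* ...then publish */
	pthread_join(t, &seen);
	return seen;
}
```

Reversing the two marked lines is the bug being fixed: the reader could observe the published pointer while `private_data` still held stale garbage — which is exactly what the test robot's oops looked like.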
Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
On Tue, 30 Jan 2024 14:48:02 -0500 Steven Rostedt wrote: > The ti is allocated from fs/tracefs/inode.c that has: > > static struct inode *tracefs_alloc_inode(struct super_block *sb) > { > struct tracefs_inode *ti; > > ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); I could also just add __GFP_ZERO so that all of it is initialized to zero, and then we don't need to assign NULL to any part of it. -- Steve > if (!ti) > return NULL; > > ti->flags = 0; > > return &ti->vfs_inode; > } > >
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 11:54:56 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 11:37, Steven Rostedt wrote: > > > > Do you want me to put this in my urgent branch and mark them for stable, > > and then send them for 6.8? > > Hmm. I think the only one that fixes a _reported_ bug is that [PTCH > 2/6]. And it turns out that while 'ti->private' really is entirely > uninitialized (and I still think it's the cause of the kernel test > robot report that started this thread), the ti->flags field _is_ > initialized to zero in tracefs_alloc_inode(). > > So even in that [PATCH 2/6], these parts: > > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > > aren't strictly needed (but aren't wrong either). > > The 'make sure to initialize ti->private before exposing the dentry' > part *definitely* needs to be part of 6.8, though. That has an > outstanding actually triggered bug report on it. > > I do think that tracefs_alloc_inode() should also initialize > ti->private to NULL, but while that would fix the oops that the test > robot reported, it wouldn't fix the data-race on any ti->private > accesses. > > So that "ti->private = ei" needs to be done before the d_instantiate() > (that later became a d_add()) regardless. But not having random fields > left uninitialized for future subtle bugs would be a good idea too. I'll add a patch to add __GFP_ZERO to the tracefs inode allocation, and modify your patch 2 to just move the ti->private = ei; > > Anyway. > > If you do run the full tracefs tests on the whole series, and there > are no other major problems, I'll happily take it all for 6.8. And > yes, even mark it for stable. I think the other bugs are much harder > to hit, but I do think they exist. And code deletion is always good. Sounds good. > > So give it the full test attention, and *if* it all still looks good > and there are no new subtle issues that crop up, let's just put this > saga behind us asap. 
Yes, I've been wanting to get away from eventfs for a month now. Again, I really do appreciate the time you put in to fixing this for me. I'm going to be backporting this to older Chromebooks as we really need to cut down on the memory overhead of instances. -- Steve
[PATCH] tracefs: Zero out the tracefs_inode when allocating it
From: "Steven Rostedt (Google)" eventfs uses the tracefs_inode and assumes that it's already initialized to zero. That is, it doesn't set fields to zero (like ti->private) after getting its tracefs_inode. This causes bugs due to stale values. Just initialize the entire structure to zero on allocation so there aren't any more surprises. This is a partial fix for accessing ti->private. The assignment still needs to be made before the dentry is instantiated. Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.s...@intel.com Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..f7cde61ff2fc 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -34,12 +34,10 @@ static struct inode *tracefs_alloc_inode(struct super_block *sb) { struct tracefs_inode *ti; - ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); + ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL | __GFP_ZERO); if (!ti) return NULL; - ti->flags = 0; - return &ti->vfs_inode; } -- 2.43.0
[PATCH v2] eventfs: Initialize the tracefs inode properly
From: Linus Torvalds The tracefs-specific fields in the inode were not initialized before the inode was exposed to others through the dentry with 'd_instantiate()'. Move the field initializations up to before the d_instantiate. Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.s...@intel.com Signed-off-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240130190355.11486-2-torva...@linux-foundation.org - Since another patch zeroed out the entire tracefs_inode, there's no need to initialize any of its fields to NULL. (see https://lore.kernel.org/linux-trace-kernel/20240130151737.6e97a...@gandalf.local.home/) fs/tracefs/event_inode.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1c3dd0ad4660..824b1811e342 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -370,6 +370,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; + /* Only directories have ti->private set to an ei, not files */ + ti->private = ei; inc_nlink(inode); d_instantiate(dentry, inode); @@ -515,7 +517,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, static void eventfs_post_create_dir(struct eventfs_inode *ei) { struct eventfs_inode *ei_child; - struct tracefs_inode *ti; lockdep_assert_held(&eventfs_mutex); @@ -525,9 +526,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) srcu_read_lock_held(&eventfs_srcu)) { ei_child->d_parent = ei->dentry; } - - ti = get_tracefs(ei->dentry->d_inode); - ti->private = ei; } /** -- 2.43.0
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
On Tue, Jan 30, 2024 at 06:48:11PM +0100, Marek Szyprowski wrote: > Dear All, > > On 30.01.2024 12:03, Christophe Leroy wrote: > > Le 30/01/2024 à 10:16, Chen-Yu Tsai a écrit : > >> [Vous ne recevez pas souvent de courriers de we...@chromium.org. D?couvrez > >> pourquoi ceci est important ? > >> https://aka.ms/LearnAboutSenderIdentification ] > >> > >> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote: > >>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote: > >>>> Declaring rodata_enabled and mark_rodata_ro() at all time > >>>> helps removing related #ifdefery in C files. > >>>> > >>>> Signed-off-by: Christophe Leroy > >>> Very nice cleanup, thanks!, applied and pushed > >>> > >>> Luis > >> On next-20240130, which has your modules-next branch, and thus this > >> series and the other "module: Use set_memory_rox()" series applied, > >> my kernel crashes in some very weird way. Reverting your branch > >> makes the crash go away. > >> > >> I thought I'd report it right away. Maybe you folks would know what's > >> happening here? This is on arm64. > > That's strange, it seems to bug in module_bug_finalize() which is > > _before_ calls to module_enable_ro() and such. > > > > Can you try to revert the 6 patches one by one to see which one > > introduces the problem ? > > > > In reality, only patch 677bfb9db8a3 really change things. Other ones are > > more on less only cleanup. > > I've also run into this issue with today's (20240130) linux-next on my > test farm. The issue is not fully reproducible, so it was a bit hard to > bisect it automatically. 
I've spent some time on manual testing and it > looks that reverting the following 2 commits on top of linux-next fixes > the problem: > > 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around > rodata_enabled") > 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()") > > This in fact means that commit 677bfb9db8a3 is responsible for this > regression, as 65929884f868 has to be reverted only because the latter > depends on it. Let me know what I can do to help debugging this issue. Thanks for the bisect, I've reset my tree to commit 3559ad395bf02 ("module: Change module_enable_{nx/x/ro}() to more explicit names") for now then, so to remove those commits. Luis
Re: [PATCH v9 3/3] remoteproc: zynqmp: parse TCM from device tree
On 1/17/24 12:58 PM, Mathieu Poirier wrote: > Alright, I spent several hours looking at this patchset and the driver as a > whole. I certainly salute your efforts to heed my advice and make the code > less > brittle but I'm afraid we are not there. Hi Mathieu, I am back from vacation and have started looking into this. Thanks for spending time on this and helping to make the driver's design cleaner. Please find my comments below. > See below for a different way to proceed. > > On Wed, Jan 10, 2024 at 01:35:05PM -0800, Tanmay Shah wrote: > > ZynqMP TCM information was fixed in driver. Now ZynqMP TCM information > > is available in device-tree. Parse TCM information in driver > > as per new bindings. > > > > Signed-off-by: Tanmay Shah > > --- > > > > Changes in v9: > > - Introduce new API to request and release core1 TCM power-domains in > > lockstep mode. This will be used during prepare -> add_tcm_banks > > callback to enable TCM in lockstep mode. > > - Parse TCM from device-tree in lockstep mode and split mode in > > uniform way. > > - Fix TCM representation in device-tree in lockstep mode. > > > > Changes in v8: > > - Remove pm_domains framework > > - Remove checking of pm_domain_id validation to power on/off tcm > > - Remove spurious change > > - parse power-domains property from device-tree and use EEMI calls > > to power on/off TCM instead of using pm domains framework > > > > Changes in v7: > > - move checking of pm_domain_id from previous patch > > - fix mem_bank_data memory allocation > > > > drivers/remoteproc/xlnx_r5_remoteproc.c | 245 +++- > > 1 file changed, 239 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > > b/drivers/remoteproc/xlnx_r5_remoteproc.c > > index 4395edea9a64..0f87b984850b 100644 > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > > @@ -74,8 +74,8 @@ struct mbox_info { > > }; > > > > /* > > - * Hardcoded TCM bank values. 
This will be removed once TCM bindings are > > - * accepted for system-dt specifications and upstreamed in linux kernel > > + * Hardcoded TCM bank values. This will stay in driver to maintain backward > > + * compatibility with device-tree that does not have TCM information. > > */ > > static const struct mem_bank_data zynqmp_tcm_banks_split[] = { > > {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each > > */ > > @@ -102,6 +102,7 @@ static const struct mem_bank_data > > zynqmp_tcm_banks_lockstep[] = { > > * @rproc: rproc handle > > * @pm_domain_id: RPU CPU power domain id > > * @ipi: pointer to mailbox information > > + * @lockstep_core1_np: second core's device_node to use in lockstep mode > > */ > > struct zynqmp_r5_core { > > struct device *dev; > > @@ -111,6 +112,7 @@ struct zynqmp_r5_core { > > struct rproc *rproc; > > u32 pm_domain_id; > > struct mbox_info *ipi; > > + struct device_node *lockstep_core1_np; > > }; > > > > /** > > @@ -539,6 +541,110 @@ static int tcm_mem_map(struct rproc *rproc, > > return 0; > > } > > > > +int request_core1_tcm_lockstep(struct rproc *rproc) > > +{ > > + struct zynqmp_r5_core *r5_core = rproc->priv; > > + struct of_phandle_args out_args = {0}; > > + int ret, i, num_pd, pd_id, ret_err; > > + struct device_node *np; > > + > > + np = r5_core->lockstep_core1_np; > > + > > + /* Get number of power-domains */ > > + num_pd = of_count_phandle_with_args(np, "power-domains", > > + "#power-domain-cells"); > > + if (num_pd <= 0) > > + return -EINVAL; > > + > > + /* Get individual power-domain id and enable TCM */ > > + for (i = 1; i < num_pd; i++) { > > + ret = of_parse_phandle_with_args(np, "power-domains", > > +"#power-domain-cells", > > +i, &out_args); > > + if (ret) { > > + dev_warn(r5_core->dev, > > +"failed to get tcm %d in power-domains list, > > ret %d\n", > > +i, ret); > > + goto fail_request_core1_tcm; > > + } > > + > > + pd_id = out_args.args[0]; > > + of_node_put(out_args.np); > > + > > + ret = 
zynqmp_pm_request_node(pd_id, > > ZYNQMP_PM_CAPABILITY_ACCESS, 0, > > +ZYNQMP_PM_REQUEST_ACK_BLOCKING); > > + if (ret) { > > + dev_err(r5_core->dev, "failed to request TCM node > > 0x%x\n", > > + pd_id); > > + goto fail_request_core1_tcm; > > + } > > + } > > + > > + return 0; > > + > > +fail_request_core1_tcm: > > + > > + /* Cache actual error to return later */ > > + ret_err = ret; > > + > > + /* Release previously requested TCM in case of failure */ > > + while (--i > 0)
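The error path in the quoted patch follows a common request-with-rollback shape: take resources one by one, and on failure release the ones already taken in reverse order with a `while (--i > 0)` unwind. A self-contained userspace model of that shape (the power-domain bookkeeping and the simulated failure are made up; the real code calls the EEMI `zynqmp_pm_request_node()`/release helpers):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PD 4

static bool requested[NUM_PD];

static int request_pd(int i)
{
	if (i == 3)
		return -1;	/* simulate the last TCM power domain failing */
	requested[i] = true;
	return 0;
}

static void release_pd(int i)
{
	requested[i] = false;
}

/*
 * Mirrors request_core1_tcm_lockstep(): request entries 1..NUM_PD-1
 * (index 0, the CPU power domain, is handled elsewhere), and on
 * failure release the ones already taken in reverse order.
 */
static int request_all(void)
{
	int i, ret = 0;

	for (i = 1; i < NUM_PD; i++) {
		ret = request_pd(i);
		if (ret)
			goto fail;
	}
	return 0;
fail:
	while (--i > 0)
		release_pd(i);
	return ret;
}
```

The key property is that after a failure the system is back in its initial state: every successfully requested entry has been released, and the original error code is returned to the caller.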
Re: [RFC PATCH] kernel/module: add a safer implementation of try_module_get()
On Tue, Jan 30, 2024 at 08:36:14PM +0100, Marco Pagani wrote: > The current implementation of try_module_get() requires the module to > exist and be live as a precondition. While this may seem intuitive at > first glance, enforcing the precondition can be tricky, considering that > modules can be unloaded at any time if not previously taken. For > instance, the caller could be preempted just before calling > try_module_get(), and while preempted, the module could be unloaded and > freed. More subtly, the module could also be unloaded at any point while > executing try_module_get() before incrementing the refount with > atomic_inc_not_zero(). > > Neglecting the precondition that the module must exist and be live can > cause unexpected race conditions that can lead to crashes. However, > ensuring that the precondition is met may require additional locking > that increases the complexity of the code and can make it more > error-prone. > > This patch adds a slower yet safer implementation of try_module_get() > that checks if the module is valid by looking into the mod_tree before > taking the module's refcount. This new function can be safely called on > stale and invalid module pointers, relieving developers from the burden > of ensuring that the module exists and is live before attempting to take > it. > > The tree lookup and refcount increment are executed after taking the > module_mutex to prevent the module from being unloaded after looking up > the tree. > > Signed-off-by: Marco Pagani It very much sounds like there is a desire to have this but without a user, there is no justification. 
> +bool try_module_get_safe(struct module *module) > +{ > + struct module *mod; > + bool ret = true; > + > + if (!module) > + goto out; > + > + mutex_lock(&module_mutex); If a user comes around then this should be mutex_lock_interruptible(), and add might_sleep() > + > + /* > + * Check if the address points to a valid live module and take > + * the refcount only if it points to the module struct. > + */ > + mod = __module_address((unsigned long)module); > + if (mod && mod == module && module_is_live(mod)) > + __module_get(mod); > + else > + ret = false; > + > + mutex_unlock(&module_mutex); > + > +out: > + return ret; > +} > +EXPORT_SYMBOL(try_module_get_safe); And EXPORT_SYMBOL_GPL() would need to be used. I'd also expect selftests to be expanded for this case, but again, without a user, this is just trying to resolve a problem which does not exist. Luis
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 11:03:54 -0800 Linus Torvalds wrote: > -static void free_ei(struct eventfs_inode *ei) > +static inline struct eventfs_inode *alloc_ei(const char *name) > { > - kfree_const(ei->name); > - kfree(ei->d_children); > - kfree(ei->entry_attrs); > - kfree(ei); > + int namesize = strlen(name) + 1; > + struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL); > + > + if (ei) { > + memcpy((char *)ei->name, name, namesize); > + kref_init(&ei->kref); > + } I'm going to be putting back the ei->name pointer as the above actually adds more memory usage. The reason is that all static trace events (not kprobes or other dynamic events) are created with the TRACE_EVENT() macro that points to a constant name. Passing a const to kstrdup_const() checks if the name passed in is a constant or not. If it is, it doesn't allocate the name and just passes back the pointer. Removing the pointer to "name" removed 8 bytes on 64 bit machines from the eventfs_inode structure, but added strlen(name)+1 bytes to it, where the majority of trace event names are greater than 8 bytes, and are constant strings. Another reason for converting back to the way it was is that moving the ei allocation to a slab is on my TODO list. 
-- Steve > + return ei; > } > > diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h > index 8f38740bfb5b..72db3bdc4dfb 100644 > --- a/fs/tracefs/internal.h > +++ b/fs/tracefs/internal.h > @@ -34,8 +34,7 @@ struct eventfs_attr { > * @entries: the array of entries representing the files in the directory > * @name: the name of the directory to create > * @children: link list into the child eventfs_inode > - * @dentry: the dentry of the directory > - * @d_children: The array of dentries to represent the files when created > + * @events_dir: the dentry of the events directory > * @entry_attrs: Saved mode and ownership of the @d_children > * @attr: Saved mode and ownership of eventfs_inode itself > * @data: The private data to pass to the callbacks > @@ -44,12 +43,11 @@ struct eventfs_attr { > * @nr_entries: The number of items in @entries > */ > struct eventfs_inode { > + struct kref kref; > struct list_head list; > const struct eventfs_entry *entries; > - const char *name; > struct list_head children; > - struct dentry *dentry; /* Check is_freed to access */ > - struct dentry **d_children; > + struct dentry *events_dir; > struct eventfs_attr *entry_attrs; > struct eventfs_attr attr; > void *data; > @@ -66,6 +64,7 @@ struct eventfs_inode { > struct llist_node llist; > struct rcu_head rcu; > }; > + const char name[]; > }; > > static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
On Tue, 30 Jan 2024 11:03:55 -0800 Linus Torvalds wrote: > Another thing that might be worth doing is to make all eventfs lookups > mark their dentries as not worth caching. We could do that with > d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. > > As it is, the dentries are all freeable, but they only tend to get freed > at memory pressure rather than more proactively. But that's a separate > issue. I actually find the dentries sticking around when their ref counts go to zero a feature. Tracing applications will open and close the eventfs files many times. If the dentry were to be deleted on every close, it would need to be created over again in short spurts. There are some operations that will traverse the tree many times. One operation is on reset, as there are some dependencies in the order of disabling triggers. If there's a dependency, and a trigger is to be disabled out of order, the disabling will fail with -EBUSY. Thus the tools will iterate the trigger files several times until there are no more changes. It's not a critical path, but I'd rather not add more overhead to it. -- Steve
Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
On Tue, 30 Jan 2024 at 13:25, Steven Rostedt wrote: > > I actually find the dentry's sticking around when their ref counts go to > zero a feature. Tracing applications will open and close the eventfs files > many times. If the dentry were to be deleted on every close, it would need > to be create over again in short spurts. Sure. I think this is literally a tuning thing, and we'll see if anybody ever says "the dentry cache grows too much with tracefs". Linus
Re: [RFC PATCH v2 3/8] ext2: Use dax_is_supported()
On Tue 30-01-24 11:52:50, Mathieu Desnoyers wrote: > Use dax_is_supported() to validate whether the architecture has > virtually aliased data caches at mount time. Print an error and disable > DAX if dax=always is requested as a mount option on an architecture > which does not support DAX. > > This is relevant for architectures which require a dynamic check > to validate whether they have virtually aliased data caches. > > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") > Signed-off-by: Mathieu Desnoyers > Cc: Jan Kara > Cc: linux-e...@vger.kernel.org > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: linux...@kvack.org > Cc: linux-a...@vger.kernel.org > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: Arnd Bergmann > Cc: Russell King > Cc: nvd...@lists.linux.dev > Cc: linux-...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org OK, yeah, this is better than v1. Feel free to add: Acked-by: Jan Kara Honza > --- > fs/ext2/super.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 01f9addc8b1f..30ff57d47ed4 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void > *data, int silent) > blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); > > if (test_opt(sb, DAX)) { > - if (!sbi->s_daxdev) { > + if (!dax_is_supported()) { > + ext2_msg(sb, KERN_ERR, > + "DAX unsupported by architecture. Turning off > DAX."); > + clear_opt(sbi->s_mount_opt, DAX); > + } else if (!sbi->s_daxdev) { > ext2_msg(sb, KERN_ERR, > "DAX unsupported by block device. Turning off > DAX."); > clear_opt(sbi->s_mount_opt, DAX); > -- > 2.39.2 > -- Jan Kara SUSE Labs, CR
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 at 12:55, Steven Rostedt wrote: > > I'm going to be putting back the ei->name pointer as the above actually > adds more memory usage. I did it mainly because I hate having multiple different allocation sites that then have to do that kref_init() etc individually, and once there was a single site the "name" thing really looked like an obvious simplification. That said, I think you're confused about the memory usage. Sure, 'kstrdup_const()' optimizes away the allocation for static constant strings, but what it does *not* do is to optimize away the pointer. In contrast, allocating them together gets rid of the pointer itself, because now the name is just an offset in the structure. And the pointer is 8 bytes right there. So allocating the string _with_ the ei will always save at least 8 bytes. So whenever the string is less than that in length it's *always* a win. And even when it's not an absolute win, it will take advantage of the kmalloc() quantized sizes too, and generally not be a loss even with longer names. So I'm pretty sure you are simply entirely wrong on the memory usage. Not counting the size of the pointer is overlooking a big piece of the puzzle. Btw, you can look at name lengths in tracefs with something stupid like this: find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c and you will see that (at least in my case) names of lengths 1..7 are dominating it all: 1 . 2189 .. 34 ... 2229 207 . 6833 .. 2211 ... with the rest being insignificant in comparison. The reason? There's a *lot* of those 'filter' and 'enable' and 'id' files. All of which are better off without a 'const char *name' taking 8 bytes. Linus
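Linus's arithmetic can be checked with a toy pair of structs — hypothetical stand-ins, not the real eventfs_inode. A flexible array member contributes nothing to sizeof, so the embedded-name layout starts one pointer (8 bytes on 64-bit) ahead of the pointer layout:

```c
#include <stddef.h>
#include <string.h>

/* Toy structs illustrating the argument: an embedded flexible-array
 * name removes the name pointer entirely, so any name shorter than
 * sizeof(char *) is a guaranteed win even against kstrdup_const(). */

struct with_ptr {
	long other_fields[2];   /* stand-in for the rest of the struct */
	const char *name;       /* 8 bytes on 64-bit, even when the string
	                         * itself is a shared constant */
};

struct with_flex {
	long other_fields[2];
	char name[];            /* contributes 0 bytes to sizeof */
};

/* Bytes actually requested from the allocator for the embedded layout. */
static size_t alloc_size_flex(const char *name)
{
	return sizeof(struct with_flex) + strlen(name) + 1;
}
```

With a short name like "id", the embedded layout requests fewer bytes than the pointer layout even before the string allocation that kstrdup_const() might or might not avoid.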
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Tue, Jan 30, 2024 at 01:52:30PM -0500, Steven Rostedt wrote: > On Tue, 30 Jan 2024 10:05:15 -0800 > Beau Belgrave wrote: > > > On Mon, Jan 29, 2024 at 09:24:07PM -0500, Steven Rostedt wrote: > > > On Mon, 29 Jan 2024 09:29:07 -0800 > > > Beau Belgrave wrote: > > > > > > > Thanks, yeah ideally we wouldn't use special characters. > > > > > > > > I'm not picky about this. However, I did want something that clearly > > > > allowed a glob pattern to find all versions of a given register name of > > > > user_events by user programs that record. The dot notation will pull in > > > > more than expected if dotted namespace style names are used. > > > > > > > > An example is "Asserts" and "Asserts.Verbose" from different programs. > > > > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > > > > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". > > > > > > Do you prevent brackets in names? > > > > > > > No. However, since brackets have a start and end token that are distinct > > finding all versions of your event is trivial compared to a single dot. > > > > Imagine two events: > > Asserts > > Asserts[MyCoolIndex] > > > > Resolves to tracepoints of: > > Asserts:[0] > > Asserts[MyCoolIndex]:[1] > > > > Regardless of brackets in the names, a simple glob of Asserts:\[*\] only > > finds Asserts:[0]. This is because we have that end bracket in the glob > > and the full event name including the start bracket. > > > > If I register another "version" of Asserts, then I'll have: > > Asserts:[0] > > Asserts[MyCoolIndex]:[1] > > Asserts:[2] > > > > The glob of Asserts:\[*\] will return both: > > Asserts:[0] > > Asserts:[2] > > But what if you had registered "Asserts:[MyCoolIndex]:[1]" > Good point, the above would still require a regex type pattern to not get pulled in. > Do you prevent colons? > No, nothing is prevented at this point. 
It seems we could either prevent certain characters to make it easier or define a good regex that we should document. I'm leaning toward just doing a simple suffix and documenting the regex well. > > > > At this point the program can either record all versions or scan further > > to find which version of Asserts is wanted. > > > > > > > > > > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > > > > doesn't work if the number is higher, like 128. If we ever decide to > > > > change the ID from an integer to say hex to save space, these globs > > > > would break. > > > > > > > > Is there some scheme that fits the C-variable name that addresses the > > > > above scenarios? Brackets gave me a simple glob that seemed to prevent a > > > > lot of this ("Asserts.\[*\]" in this case). > > > > > > Prevent a lot of what? I'm not sure what your example here is. > > > > > > > I'll try again :) > > > > We have 2 events registered via user_events: > > Asserts > > Asserts.Verbose > > > > Using dot notation these would result in tracepoints of: > > user_events_multi/Asserts.0 > > user_events_multi/Asserts.Verbose.1 > > > > Using bracket notation these would result in tracepoints of: > > user_events_multi/Asserts:[0] > > user_events_multi/Asserts.Verbose:[1] > > > > A recording program only wants to enable the Asserts tracepoint. It does > > not want to record the Asserts.Verbose tracepoint. > > > > The program must find the right tracepoint by scanning tracefs under the > > user_events_multi system. > > > > A single dot suffix does not allow a simple glob to be used. The glob > > Asserts.* will return both Asserts.0 and Asserts.Verbose.1. > > > > A simple glob of Asserts:\[*\] will only find Asserts:[0], it will not > > find Asserts.Verbose:[1]. > > > > We could just use brackets and not have the colon (Asserts[0] in this > > case). But brackets are still special for bash. > > Are these shell scripts or programs. I use regex in programs all the time. 
> And if you have shell scripts, use awk or something. > They could be both. In our case, it is a program. > Unless you prevent something from being added, I don't see the protection. > Yeah, it just makes it way less likely. Given that, I'm starting to lean toward just documenting the regex well and not trying to get fancy. > > > > > > > > > > Are we confident that we always want to represent the ID as a base-10 > > > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > > > programs can find their events easily. > > > > > > Is there a difference to what we choose? > > > > > > > If a simple glob of event_name:\[*\] cannot be used, then we must document > > what the suffix format is, so an appropriate regex can be created. If we > > start with base-10 then later move to base-16 we will break existing regex > > patterns on the recording side. > > > > I prefer, and have in this series, a base-16 output since it saves on > > the tracepoint name size. > > I honestly don't care which base you use. So if you want to use base 16,
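The glob behaviour being debated can be sketched with POSIX fnmatch(3) from userspace. The event names are the examples from the thread; this is an illustration of the matching semantics, not the user_events code:

```c
#include <fnmatch.h>
#include <stdbool.h>

/* Sketch of Beau's point: with escaped brackets, one glob selects every
 * numbered "version" of a single event without pulling in dotted
 * namespace siblings, regardless of whether the ID is decimal or hex. */
static bool matches(const char *pattern, const char *name)
{
	/* fnmatch() honours backslash escapes by default, so "\\[" and
	 * "\\]" in the pattern match literal brackets in the name. */
	return fnmatch(pattern, name, 0) == 0;
}
```

The same helper also shows why a plain dot suffix is ambiguous: "Asserts.*" happily matches "Asserts.Verbose.1", while "Asserts:\[*\]" does not match "Asserts.Verbose:[1]".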
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 13:52:05 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 12:55, Steven Rostedt wrote: > > > > I'm going to be putting back the ei->name pointer as the above actually > > adds more memory usage. > > I did it mainly because I hate having multiple different allocation > sites that then have to do that kref_init() etc individually, and once > there was a single site the "name" thing really looked lik ean obvious > simplification. > > That said, I think you're confused about the memory usage. > > Sure, 'kstrdup_const()' optimizes away the allocation for static > constant strings, but what it does *not* do is to optimize away the > pointer. > > In contrast, allocating them together gets rid of the pointer itself, > because now the name is just an offset in the structure. > > And the pointer is 8 bytes right there. > > So allocating the string _with_ the ei will always save at least 8 bytes. Not if the string is not allocated, and ei->name is just pointing to an existing string. Which is what kstrdup_const() does if the string is a constant. If the length of the string was one byte, so you allocate sizeof(*ei) + 2 (for the '\0' as well), is kzalloc() going to allocate anything less than 8 byte alignment? According to /proc/slabinfo, any allocation will be 8 bytes. Thus it is the same as having the pointer if the string is constant but less than 8 bytes in length. But a waste if it is more. > > So whenever the string is less than that in length it's *always* a win. > > And even when it's not an absolute win, it will take advantage of the > kmalloc() quantized sizes too, and generally not be a loss even with > longer names. > > So I'm pretty sure you are simply entirely wrong on the memory usage. > Not counting the size of the pointer is overlooking a big piece of the > puzzle. > > Btw, you can look at name lengths in tracefs with something stupid like this: > > find . | sed 's:^.*/::' | tr -c '\n' . 
| sort | uniq -c > > and you will see that (at least in my case) names of lengths 1..7 are > dominating it all: > > 1 . >2189 .. > 34 ... >2229 > 207 . >6833 .. >2211 ... > > with the rest being insignificant in comparison. > > The reason? There's a *lot* of those 'filter' and 'enable' and 'id' > files. All of which are better off without a 'const char *name' taking > 8 bytes. Remember, the files don't have an ei allocated. Only directories. # find . -type d | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c 1 . 2 .. 30 ... 14 21 . 15 .. 18 ... Above is 8 bytes (length of 7 + '\0') 27 36 . 35 .. 49 ... 63 85 . 140 .. 180 ... 180 155 . 171 .. 158 ... 134 116 . 106 .. 81 ... 61 58 . 43 .. 41 ... 23 14 . 13 .. 9 ... 7 6 . 5 .. 1 ... 2 1 . 1 The rest is all greater than the pointer size. Also, it does prevent me from switching to a slab. I added the below patch (after changing this back to kstrdup_const), that determines if the string is allocated by checking if ei->name is the same as the pointer passed in. This showed me: # cat /sys/kernel/tracing/dyn_ftrace_total_info 53698 pages:220 groups: 8 const cnt:2099 len:22446 dyn cnt: 3 len:56 That is, 2099 ei->name's point to a constant string without allocation. Granted, I built my system without many modules. If you have more code built as modules, that number may be less. I then counted the length of the string - 7 bytes (which is the same as the length of the string plus the '\0' character - 8 bytes) to accommodate the pointer size, and it's a savings of 22KB per instance. And in actuality, I should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to return 35 bytes back but 64 bytes will be allocated. I think that's a win. 
-- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index e9819d719d2a..ed9ffaac9a32 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -785,6 +785,25 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) goto out; } +int sdr_ei_const; +int sdr_ei_len; +int sdr_ei_dyn; +int s
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 at 13:52, Linus Torvalds wrote: > > Btw, you can look at name lengths in tracefs with something stupid like this: > > find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c Actually, since only directories have these 'ei' fields, that find should have "-type d", and then the stats change. Most directory filenames there are indeed longer than 8 bytes (16 bytes seems to be the most common length). That means that whether it wins or loses in allocation size tends to be mostly about the kmalloc sizes and the size of that structure. And it turns out that the real memory savings there would be this patch --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -55,15 +55,6 @@ struct eventfs_inode { unsigned int is_events:1; unsigned int nr_entries:30; unsigned int ino; - /* - * Union - used for deletion - * @llist: for calling dput() if needed after RCU - * @rcu: eventfs_inode to delete in RCU - */ - union { - struct llist_node llist; - struct rcu_head rcu; - }; const char name[]; }; since with all the other cleanups, neither of those fields are actually used. With that, the base size of 'struct eventfs_inode' actually becomes 96 bytes for me. And at least on x86-64, the kmalloc sizes are 96 and then 128 bytes. But that means that - with the added name pointer, the eventfs_inode would grow to 104 bytes, and grow to that next allocation size (128) - with the embedded name, the size is 96+strlen+1, and will also need at least a 128 byte allocation, but any name that is shorter than 32 bytes is free so it ends up being a wash. You're going to allocate 128 bytes regardless. But the deallocation is simpler. Linus
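The bucket-rounding argument can be modeled with a rough sketch of the small kmalloc size ladder. The actual slab caches are configuration-dependent; the sizes below are the common x86-64 small-object caches Linus is referring to:

```c
#include <stddef.h>

/* Rough model of the x86-64 small kmalloc buckets: an allocation is
 * served from the smallest cache that fits it. Configuration-dependent
 * in the real kernel; this ladder is only for illustration. */
static size_t kmalloc_bucket(size_t size)
{
	static const size_t buckets[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

	for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
		if (size <= buckets[i])
			return buckets[i];
	return 0; /* larger sizes are out of scope for this sketch */
}
```

With a 96-byte base struct, both layouts land in the 128-byte cache: 96 + 8 (pointer) rounds up to 128, and 96 + strlen + 1 (embedded name) also fits in 128 for any name up to 31 characters — the wash described above.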
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 at 14:55, Steven Rostedt wrote: > > Remember, the files don't have an ei allocated. Only directories. Crossed emails. > I then counted the length of the string - 7 bytes (which is the same as the > length of the string plus the '\0' character - 8 bytes) to accommodate the > pointer size, and it's a savings of 22KB per instance. And in actuality, I > should have rounded to the nearest kmalloc slab size as kzalloc() isn't going > to > return 35 bytes back but 64 bytes will be allocated. No. See my other email. The kmalloc sizes actually means that it comes out exactly even. Linus