Re: [PATCH v9] bus: mhi: host: Add tracing support
On Fri, Jan 05, 2024 at 05:53:03PM +0530, Krishna chaitanya chundru wrote: > This change adds ftrace support for the following functions, which > helps in debugging issues when there is a channel state or MHI > state change, and also when we receive data and control events: > 1. mhi_intvec_mhi_states > 2. mhi_process_data_event_ring > 3. mhi_process_ctrl_ev_ring > 4. mhi_gen_tre > 5. mhi_update_channel_state > 6. mhi_tryset_pm_state > 7. mhi_pm_st_worker > > Change the implementation of the arrays which hold the enum-to-string mappings > to make them consistent between the trace header file and the other files. > > Wherever trace events are added, debug messages are removed. > > Signed-off-by: Krishna chaitanya chundru Few nitpicks below. > Reviewed-by: "Steven Rostedt (Google)" > --- > Changes in v9: > - Change the implementation of some arrays so that the string-to-enum mapping > - is the same in both the trace header and the other files, as suggested by Steve. > - Link to v8: > https://lore.kernel.org/r/20231207-ftrace_support-v8-1-7f62d4558...@quicinc.com > > Changes in v8: > - Pass the structure and dereference the variables in TP_fast_assign as > suggested by Steve. > - Link to v7: > https://lore.kernel.org/r/20231206-ftrace_support-v7-1-aca49a042...@quicinc.com > > Changes in v7: > - Change the log format as pointed out by Mani. > - Link to v6: > https://lore.kernel.org/r/20231204-ftrace_support-v6-1-9b206546d...@quicinc.com > > Changes in v6: > - Use 'rp' directly as suggested by Jeffrey. > - Link to v5: > https://lore.kernel.org/r/20231127-ftrace_support-v5-1-eb67daead...@quicinc.com > > Changes in v5: > - Use DECLARE_EVENT_CLASS for multiple events as suggested by Steve. > - Instead of converting to u64 to print the address, use %px to print the address > - to avoid warnings on some platforms. > - Link to v4: > https://lore.kernel.org/r/2023-ftrace_support-v4-1-c83602399...@quicinc.com > > Changes in v4: > - Fix compilation issues in the previous patch which happened due to rebasing. 
> - In the defconfig the FTRACE config is not enabled, which is why the compilation > - issue was not seen in my workspace. > - Link to v3: > https://lore.kernel.org/r/2023-ftrace_support-v3-1-f358d2911...@quicinc.com > > Changes in v3: > - Move the trace header file from include/trace/events to drivers/bus/mhi/host/ > - so that we can include driver header files. > - Use macros directly in the trace events as suggested by Jeffrey Hugo. > - Reorder the structure in the events as suggested by Steve to avoid holes in > - the buffer. > - Removed the mhi_to_physical function as this can give security issues. > - Removed macros to define strings as we can get those from driver headers. > - Link to v2: > https://lore.kernel.org/r/20231013-ftrace_support-v2-1-6e893ce01...@quicinc.com > > Changes in v2: > - Pass the raw state into the trace event and use __print_symbolic() as > suggested by Bjorn. > - Change mhi_pm_st_worker to mhi_pm_st_transition as suggested by Bjorn. > - Fixed the kernel test robot issues. > - Link to v1: > https://lore.kernel.org/r/20231005-ftrace_support-v1-1-23a2f394f...@quicinc.com > --- > drivers/bus/mhi/common.h| 38 +++--- > drivers/bus/mhi/host/init.c | 63 + > drivers/bus/mhi/host/internal.h | 40 ++ > drivers/bus/mhi/host/main.c | 19 ++- > drivers/bus/mhi/host/pm.c | 7 +- > drivers/bus/mhi/host/trace.h| 275 > > 6 files changed, 378 insertions(+), 64 deletions(-) > [...] 
> +TRACE_EVENT(mhi_gen_tre, > + > + TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_chan *mhi_chan, > + struct mhi_ring_element *mhi_tre), > + > + TP_ARGS(mhi_cntrl, mhi_chan, mhi_tre), > + > + TP_STRUCT__entry( > + __string(name, mhi_cntrl->mhi_dev->name) > + __field(int, ch_num) > + __field(void *, wp) > + __field(__le64, tre_ptr) > + __field(__le32, dword0) > + __field(__le32, dword1) > + ), > + > + TP_fast_assign( > + __assign_str(name, mhi_cntrl->mhi_dev->name); > + __entry->ch_num = mhi_chan->chan; > + __entry->wp = mhi_tre; > + __entry->tre_ptr = mhi_tre->ptr; > + __entry->dword0 = mhi_tre->dword[0]; > + __entry->dword1 = mhi_tre->dword[1]; > + ), > + > + TP_printk("%s: Chan: %d Tre: 0x%p Tre buf: 0x%llx dword0: 0x%08x > dword1: 0x%08x\n", Use caps for printing the acronyms everywhere. Like TRE, DWORD etc... > + __get_str(name), __entry->ch_num, __entry->wp, > __entry->tre_ptr, > + __entry->dword0, __entry->dword1) > +); > + > +TRACE_EVENT(mhi_intvec_states, > + > + TP_PROTO(struct mhi_controller *mhi_cntrl, int dev_ee, int dev_state), > + > + TP_ARGS(mhi_cntrl, dev_ee, dev_state), > + > + TP_STRUCT__entry( > + __string(name, mhi_cntrl->mhi_dev->name) > + __field(int, local_ee) > +
Re: [PATCH v2 1/4] remoteproc: Add TEE support
On 1/29/24 19:55, Mathieu Poirier wrote: > On Thu, Jan 18, 2024 at 11:04:30AM +0100, Arnaud Pouliquen wrote: >> From: Arnaud Pouliquen >> >> Add a remoteproc TEE (Trusted Execution Environment) device >> that will be probed by the TEE bus. If the associated Trusted >> application is supported on secure part this device offers a client >> interface to load a firmware in the secure part. >> This firmware could be authenticated and decrypted by the secure >> trusted application. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> drivers/remoteproc/Kconfig | 9 + >> drivers/remoteproc/Makefile | 1 + >> drivers/remoteproc/tee_remoteproc.c | 393 >> include/linux/tee_remoteproc.h | 99 +++ >> 4 files changed, 502 insertions(+) >> create mode 100644 drivers/remoteproc/tee_remoteproc.c >> create mode 100644 include/linux/tee_remoteproc.h >> >> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig >> index 48845dc8fa85..85299606806c 100644 >> --- a/drivers/remoteproc/Kconfig >> +++ b/drivers/remoteproc/Kconfig >> @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC >> >>It's safe to say N if not interested in using RPU r5f cores. >> >> + >> +config TEE_REMOTEPROC >> +tristate "trusted firmware support by a TEE application" >> +depends on OPTEE >> +help >> + Support for trusted remote processors firmware. The firmware >> + authentication and/or decryption are managed by a trusted application. >> + This can be either built-in or a loadable module. 
>> + >> endif # REMOTEPROC >> >> endmenu >> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile >> index 91314a9b43ce..fa8daebce277 100644 >> --- a/drivers/remoteproc/Makefile >> +++ b/drivers/remoteproc/Makefile >> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o >> obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o >> obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o >> obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o >> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o >> obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o >> obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o >> obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o >> diff --git a/drivers/remoteproc/tee_remoteproc.c >> b/drivers/remoteproc/tee_remoteproc.c >> new file mode 100644 >> index ..49e1e0caf889 >> --- /dev/null >> +++ b/drivers/remoteproc/tee_remoteproc.c >> @@ -0,0 +1,393 @@ >> +// SPDX-License-Identifier: GPL-2.0-or-later >> +/* >> + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved >> + * Author: Arnaud Pouliquen >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include "remoteproc_internal.h" >> + >> +#define MAX_TEE_PARAM_ARRY_MEMBER 4 >> + >> +/* >> + * Authentication of the firmware and load in the remote processor memory >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [in] params[1].memref: buffer containing the image of the >> buffer >> + */ >> +#define TA_RPROC_FW_CMD_LOAD_FW 1 >> + >> +/* >> + * Start the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_START_FW2 >> + >> +/* >> + * Stop the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_STOP_FW 3 >> + >> +/* >> + * Return the address 
of the resource table, or 0 if not found >> + * No check is done to verify that the address returned is accessible by >> + * the non secure context. If the resource table is loaded in a protected >> + * memory the access by the non secure context will lead to a data abort. >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [out] params[1].value.a:32bit LSB resource table memory address >> + * [out] params[1].value.b:32bit MSB resource table memory address >> + * [out] params[2].value.a:32bit LSB resource table memory size >> + * [out] params[2].value.b:32bit MSB resource table memory size >> + */ >> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE 4 >> + >> +/* >> + * Return the address of the core dump >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [out] params[1].memref: address of the core dump image if exist, >> + * else return Null >> + */ >> +#define TA_RPROC_FW_CMD_GET_COREDUMP5 >> + >> +struct tee_rproc_mem { >> +char name[20]; >> +void __iomem *cpu_addr; >> +phys_addr_t bus_addr; >> +u32 dev_addr; >> +size_t size; >> +}; >> + >> +struct tee_rproc_context { >> +struct list_head sessio
Re: [PATCH 1/3] dt-bindings: remoteproc: qcom,sm8550-pas: document the X1E80100 aDSP & cDSP
On 29/01/2024 14:34, Abel Vesa wrote: > Document the aDSP and cDSP Peripheral Authentication Service on the > X1E80100 Platform. > > Signed-off-by: Abel Vesa > --- > Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml | 6 ++ > 1 file changed, 6 insertions(+) > Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH] media: dt-bindings: qcom,sc7280-venus: Allow one IOMMU entry
Hello, On 1/30/2024 1:17 PM, Luca Weiss wrote: > On Mon Jan 29, 2024 at 6:37 PM CET, Conor Dooley wrote: >> On Mon, Jan 29, 2024 at 08:48:54AM +0100, Luca Weiss wrote: >>> Some SC7280-based boards crash when providing the "secure_non_pixel" >>> context bank, so allow only one iommu in the bindings also. >>> >>> Signed-off-by: Luca Weiss >> >> Do we have any idea why this happens? How is someone supposed to know >> whether or not their system requires you to only provide one iommu? >> Yes, a crash might be the obvious answer, but is there a way of knowing >> without the crashes? > > +CC Vikash Garodia > > Unfortunately I don't really have much more information than this > message here: > https://lore.kernel.org/linux-arm-msm/ff021f49-f81b-0fd1-bd2c-895dbbb03...@quicinc.com/ > > And see also the following replies for a bit more context, like this > one: > https://lore.kernel.org/linux-arm-msm/a4e8b531-49f9-f4a1-51cb-e422c5628...@quicinc.com/ > > Maybe Vikash can add some more info regarding this. 0x2184 is a secure SID i.e any transactions with that ID would be access controlled by trustzone (TZ). SC7280 (chromebook) was designed without TZ, while some other DT deriving from SC7280 (like qcm6490) might have TZ. Hence it is good to move the iommu entry from SC7280 to chrome-common. Regards, Vikash
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 at 19:56, Linus Torvalds wrote: > > But I've been staring at this code for too long, so I'm posting this > just as a "it's broken, but _something_ like this", because I'm taking > a break from looking at this. Bah. I literally woke up and realized what the problem is. We're caching negative dentries, but eventfs_create_dir() has no way to invalidate the old negative dentry when it adds a new entry. The old "try to reuse dentry" ended up hiding that issue, because it would look up the negative dentry from the 'ei' and turn it into a positive one. And I was stupidly staring at the wrong code - all these functions with almost the same names. eventfs_create_events_dir() is fine, because it gets the parent as a dentry, so looking up the new dentry is trivial. But eventfs_create_dir() doesn't get a dentry at all. It gets the parent as a struct eventfs_inode *parent which is no good. So that explains why creating an event after deleting an old one - or after just looking it up before it was created - ends up with the nonsensical "ls" output - it gets listed by readdir() because the entry is there in the eventfs data structures, but then doing a "stat()" on it will just hit the negative dentry. So it won't actually look up the dentry. The simple solution is probably just to not cache negative dentries for eventfs. So instead of doing the "d_add(dentry, NULL);" we should just return "ERR_PTR(ENOENT)". Which is actually what /proc/ lookups do, for the same reason. I'll go back to bed, but I think the fix is something trivial like this: --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -517,9 +517,8 @@ static struct dentry *eventfs_root_lookup( } enoent: - /* Nothing found? */ - d_add(dentry, NULL); - result = NULL; + /* Don't cache negative lookups, just return an error */ + result = ERR_PTR(ENOENT); out: mutex_unlock(&eventfs_mutex); I just wanted to write it down when the likely solution struck me. Linus
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 00:43, Linus Torvalds wrote: > > I'll go back to bed, but I think the fix is something trivial like this: Almost. > + result = ERR_PTR(ENOENT); That needs a '-' in front of the ENOENT, otherwise you have a positive error number and things go wrong very quickly. And that does indeed fix the lookup problem, but you end up with the same problem later when you do the eventfs_remove_dir(). Again the eventfs data structure changes, but we don't have a reliable dentry that we can invalidate. The dentry cache is just very good at caching those old dentries, and the interface for eventfs_create_dir() and eventfs_remove_dir() is just not great. If those did an actual path lookup (like eventfs_create_events_dir() does), we'd have the dentry, and it's trivial to get from dentry to eventfs_inode. But going the other way is the broken thing because of how the dentries are just temporary caches. I suspect the solution is to make eventfs_create_dir() do the same as the events directory case does, and actually pin the directory dentry and save it off. Oh well. At least I understand what the problem is. Now I'm going to try to go back to sleep. Linus
Re: [PATCH v2 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On 1/26/24 18:11, Mathieu Poirier wrote: > On Thu, Jan 18, 2024 at 11:04:33AM +0100, Arnaud Pouliquen wrote: >> The new TEE remoteproc device is used to manage remote firmware in a >> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is >> introduced to delegate the loading of the firmware to the trusted >> execution context. In such cases, the firmware should be signed and >> adhere to the image format defined by the TEE. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> V1 to V2 update: >> - remove the select "TEE_REMOTEPROC" in STM32_RPROC config as detected by >> the kernel test robot: >> WARNING: unmet direct dependencies detected for TEE_REMOTEPROC >> Depends on [n]: REMOTEPROC [=y] && OPTEE [=n] >> Selected by [y]: >> - STM32_RPROC [=y] && (ARCH_STM32 || COMPILE_TEST [=y]) && REMOTEPROC >> [=y] >> - Fix initialized trproc variable in stm32_rproc_probe >> --- >> drivers/remoteproc/stm32_rproc.c | 149 +-- >> 1 file changed, 144 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/remoteproc/stm32_rproc.c >> b/drivers/remoteproc/stm32_rproc.c >> index fcc0001e2657..cf6a21bac945 100644 >> --- a/drivers/remoteproc/stm32_rproc.c >> +++ b/drivers/remoteproc/stm32_rproc.c >> @@ -20,6 +20,7 @@ >> #include >> #include >> #include >> +#include >> #include >> >> #include "remoteproc_internal.h" >> @@ -49,6 +50,9 @@ >> #define M4_STATE_STANDBY4 >> #define M4_STATE_CRASH 5 >> >> +/* Remote processor unique identifier aligned with the Trusted Execution >> Environment definitions */ >> +#define STM32_MP1_M4_PROC_ID0 >> + >> struct stm32_syscon { >> struct regmap *map; >> u32 reg; >> @@ -90,6 +94,8 @@ struct stm32_rproc { >> struct stm32_mbox mb[MBOX_NB_MBX]; >> struct workqueue_struct *workqueue; >> bool hold_boot_smc; >> +bool fw_loaded; >> +struct tee_rproc *trproc; >> void __iomem *rsc_va; >> }; >> >> @@ -257,6 +263,91 @@ static int stm32_rproc_release(struct rproc *rproc) >> return err; >> } >> >> +static int stm32_rproc_tee_elf_sanity_check(struct rproc 
*rproc, >> +const struct firmware *fw) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> +unsigned int ret = 0; >> + >> +if (rproc->state == RPROC_DETACHED) >> +return 0; >> + >> +ret = tee_rproc_load_fw(ddata->trproc, fw); >> +if (!ret) >> +ddata->fw_loaded = true; >> + >> +return ret; >> +} >> + >> +static int stm32_rproc_tee_elf_load(struct rproc *rproc, >> +const struct firmware *fw) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> +unsigned int ret; >> + >> +/* >> + * This function can be called by remote proc for recovery >> + * without the sanity check. In this case we need to load the firmware >> + * else nothing done here as the firmware has been preloaded for the >> + * sanity check to be able to parse it for the resource table. >> + */ > > This comment is very confusing - please consider refactoring. > >> +if (ddata->fw_loaded) >> +return 0; >> + > > I'm not sure about keeping a flag to indicate the status of the loaded > firmware. > It is not done for the non-secure method, I don't see why it would be needed > for > the secure one. > The difference is in the sanity check. - In rproc_elf_sanity_check we parse the ELF file to verify that it is valid. - In stm32_rproc_tee_elf_sanity_check we have to do the same, which means authenticating the firmware. The authentication is done during the load, so this flag is used to avoid loading it twice. Refactoring the comment should help to explain this flag. An alternative would be to bypass the sanity check, but this leads to the same limitation: before loading the firmware in remoteproc_core, we call rproc_parse_fw(), which is used to get the resource table address. To get it from the TEE we need to authenticate the firmware, i.e. load it... >> +ret = tee_rproc_load_fw(ddata->trproc, fw); >> +if (ret) >> +return ret; >> +ddata->fw_loaded = true; >> + >> +/* Update the resource table parameters. */ >> +if (rproc_tee_get_rsc_table(ddata->trproc)) { >> +/* No resource table: reset the related fields. 
*/ >> +rproc->cached_table = NULL; >> +rproc->table_ptr = NULL; >> +rproc->table_sz = 0; >> +} >> + >> +return 0; >> +} >> + >> +static struct resource_table * >> +stm32_rproc_tee_elf_find_loaded_rsc_table(struct rproc *rproc, >> + const struct firmware *fw) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> + >> +return tee_rproc_get_loaded_rsc_table(ddata->trproc); >> +} >> + >> +static int stm32_rproc_tee_start(struct rproc *rproc) >> +{ >> +struct stm32_rproc *ddata = rproc->priv; >> + >> +return tee_rproc_start(ddata->trproc); >> +} >>
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
Hi, On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote: > On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote: > > Declaring rodata_enabled and mark_rodata_ro() at all time > > helps removing related #ifdefery in C files. > > > > Signed-off-by: Christophe Leroy > > Very nice cleanup, thanks!, applied and pushed > > Luis On next-20240130, which has your modules-next branch, and thus this series and the other "module: Use set_memory_rox()" series applied, my kernel crashes in some very weird way. Reverting your branch makes the crash go away. I thought I'd report it right away. Maybe you folks would know what's happening here? This is on arm64. [ 10.481015] Unable to handle kernel paging request at virtual address ffde85245d30 [ 10.490369] KASAN: maybe wild-memory-access in range [0x00f42922e980-0x00f42922e987] [ 10.503744] Mem abort info: [ 10.509383] ESR = 0x9647 [ 10.514400] EC = 0x25: DABT (current EL), IL = 32 bits [ 10.522366] SET = 0, FnV = 0 [ 10.526343] EA = 0, S1PTW = 0 [ 10.530695] FSC = 0x07: level 3 translation fault [ 10.537081] Data abort info: [ 10.540839] ISV = 0, ISS = 0x0047, ISS2 = 0x [ 10.546456] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 10.551726] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000 [ 10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, pud=10023003, pmd=1001121eb003, pte= [ 10.578887] Internal error: Oops: 9647 [#1] PREEMPT SMP [ 10.585815] Modules linked in: [ 10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB 6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 3f2318148ecc5fa70d1092c2b874f9b59bdb7d60 [ 10.607021] Hardware name: Google Tentacruel board (DT) [ 10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 10.621954] pc : module_bug_finalize+0x118/0x148 [ 10.626823] lr : module_bug_finalize+0x118/0x148 [ 10.631463] sp : ffc0820478d0 [ 10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: ffde8d7d31a0 [ 
10.631477] x26: ffde85223780 x25: x24: ffde8c413cc0 [ 10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: ffde85223ba8 [ 10.631495] x20: ffde85223780 x19: ffde85245d28 x18: [ 10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: ffde8aab8124 [ 10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: ffbbd1bf9d91 [ 10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : dfc0 [ 10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : 0001 [ 10.631539] x5 : ffde8dfcec80 x4 : x3 : ffde8bbadf08 [ 10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : [ 10.631556] Call trace: [ 10.631559] module_bug_finalize+0x118/0x148 [ 10.631565] load_module+0x25ec/0x2a78 [ 10.631572] __do_sys_init_module+0x234/0x418 [ 10.631578] __arm64_sys_init_module+0x4c/0x68 [ 10.631584] invoke_syscall+0x68/0x198 [ 10.631589] el0_svc_common.constprop.0+0x11c/0x150 [ 10.631594] do_el0_svc+0x38/0x50 [ 10.631598] el0_svc+0x50/0xa0 [ 10.631604] el0t_64_sync_handler+0x120/0x130 [ 10.631609] el0t_64_sync+0x1a8/0x1b0 [ 10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675) [ 10.631624] ---[ end trace ]--- [ 10.642965] Kernel panic - not syncing: Oops: Fatal exception [ 10.642975] SMP: stopping secondary CPUs [ 10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000 [ 10.648343] PHYS_OFFSET: 0x4000 [ 10.648345] CPU features: 0x0,c061,7002814a,2100720b [ 10.648350] Memory Limit: none
Re: [PATCH RFC v3 10/35] mm: cma: Fast track allocating memory when the pages are free
On 1/25/24 22:12, Alexandru Elisei wrote: > If the pages to be allocated are free, take them directly off the buddy > allocator, instead of going through alloc_contig_range() and avoiding > costly calls to lru_cache_disable(). > > Only allocations of the same size as the CMA region order are considered, > to avoid taking the zone spinlock for too long. > > Signed-off-by: Alexandru Elisei This patch seems to be improving the standard cma_alloc() as well as the previously added new allocator, i.e. cma_alloc_range(), via a new helper cma_alloc_pages_fastpath(). Shouldn't any standard cma_alloc() improvement be discussed as an independent patch, separate from this series? Or is it somehow related to this series in a way I might be missing? > --- > > Changes since rfc v2: > > * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track > reserving tag storage when the block is free") (David Hildenbrand). > > include/linux/page-flags.h | 15 -- > mm/Kconfig | 5 + > mm/cma.c | 42 ++ > mm/memory-failure.c| 8 > mm/page_alloc.c| 23 - > 5 files changed, 73 insertions(+), 20 deletions(-) > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 735cddc13d20..b7237bce7446 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) > #define MAGIC_HWPOISON 0x48575053U /* HWPS */ > extern void SetPageHWPoisonTakenOff(struct page *page); > extern void ClearPageHWPoisonTakenOff(struct page *page); > -extern bool take_page_off_buddy(struct page *page); > -extern bool put_page_back_buddy(struct page *page); > +extern bool PageHWPoisonTakenOff(struct page *page); > #else > PAGEFLAG_FALSE(HWPoison, hwpoison) > +TESTSCFLAG_FALSE(HWPoison, hwpoison) > #define __PG_HWPOISON 0 > +static inline void SetPageHWPoisonTakenOff(struct page *page) { } > +static inline void ClearPageHWPoisonTakenOff(struct page *page) { } > +static inline bool PageHWPoisonTakenOff(struct page 
*page) > +{ > + return false; > +} > +#endif > + > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY > +extern bool take_page_off_buddy(struct page *page, bool poison); > +extern bool put_page_back_buddy(struct page *page, bool unpoison); > #endif > > #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) > diff --git a/mm/Kconfig b/mm/Kconfig > index ffc3a2ba3a8c..341cf53898db 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR > config ARCH_SUPPORTS_MEMORY_FAILURE > bool > > +config WANTS_TAKE_PAGE_OFF_BUDDY > + bool> + > config MEMORY_FAILURE > depends on MMU > depends on ARCH_SUPPORTS_MEMORY_FAILURE > bool "Enable recovery from hardware memory errors" > select MEMORY_ISOLATION > select RAS > + select WANTS_TAKE_PAGE_OFF_BUDDY > help > Enables code to recover from some memory failures on systems > with MCA recovery. This allows a system to continue running > @@ -891,6 +895,7 @@ config CMA > depends on MMU > select MIGRATION > select MEMORY_ISOLATION > + select WANTS_TAKE_PAGE_OFF_BUDDY > help > This enables the Contiguous Memory Allocator which allows other > subsystems to allocate big physically-contiguous blocks of memory. > diff --git a/mm/cma.c b/mm/cma.c > index 2881bab12b01..15663f95d77b 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma) > static inline void cma_debug_show_areas(struct cma *cma) { } > #endif > > +/* Called with the cma mutex held. */ > +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start, > + unsigned long end) > +{ > + bool success = false; > + unsigned long i, j; > + > + /* Avoid contention on the zone lock. 
*/ > + if (start - end != 1 << cma->order_per_bit) > + return -EINVAL; > + > + for (i = start; i < end; i++) { > + if (!is_free_buddy_page(pfn_to_page(i))) > + break; > + success = take_page_off_buddy(pfn_to_page(i), false); > + if (!success) > + break; > + } > + > + if (success) > + return 0; > + > + for (j = start; j < i; j++) > + put_page_back_buddy(pfn_to_page(j), false); > + > + return -EBUSY; > +} > + > /** > * cma_alloc_range() - allocate pages in a specific range > * @cma: Contiguous memory region for which the allocation is performed. > @@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long > start, unsigned long count, > > for (i = 0; i < tries; i++) { > mutex_lock(&cma_mutex); > - err = alloc_contig_range(start, start + count, MIGRATE_CMA, > gfp); > +
Re: [PATCH v13 3/6] tracing: Add snapshot refcount
Hi Vincent, kernel test robot noticed the following build errors: [auto build test ERROR on 29142dc92c37d3259a33aef15b03e6ee25b0d188] url: https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-Zero-ring-buffer-sub-buffers/20240129-223025 base: 29142dc92c37d3259a33aef15b03e6ee25b0d188 patch link: https://lore.kernel.org/r/20240129142802.2145305-4-vdonnefort%40google.com patch subject: [PATCH v13 3/6] tracing: Add snapshot refcount config: arc-randconfig-002-20240130 (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/config) compiler: arceb-elf-gcc (GCC) 13.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202401301740.qzzlpcyv-...@intel.com/ All errors (new ones prefixed by >>): kernel/trace/trace.c: In function 'tracing_set_tracer': kernel/trace/trace.c:6644:17: error: implicit declaration of function 'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'? [-Werror=implicit-function-declaration] 6644 | tracing_disarm_snapshot_locked(tr); | ^~ | tracing_disarm_snapshot >> kernel/trace/trace.c:6648:23: error: implicit declaration of function >> 'tracing_arm_snapshot_locked'; did you mean 'tracing_arm_snapshot'? 
>> [-Werror=implicit-function-declaration] 6648 | ret = tracing_arm_snapshot_locked(tr); | ^~~ | tracing_arm_snapshot cc1: some warnings being treated as errors vim +6648 kernel/trace/trace.c 6560 6561 int tracing_set_tracer(struct trace_array *tr, const char *buf) 6562 { 6563 struct tracer *t; 6564 #ifdef CONFIG_TRACER_MAX_TRACE 6565 bool had_max_tr; 6566 #endif 6567 int ret = 0; 6568 6569 mutex_lock(&trace_types_lock); 6570 6571 if (!tr->ring_buffer_expanded) { 6572 ret = __tracing_resize_ring_buffer(tr, trace_buf_size, 6573 RING_BUFFER_ALL_CPUS); 6574 if (ret < 0) 6575 goto out; 6576 ret = 0; 6577 } 6578 6579 for (t = trace_types; t; t = t->next) { 6580 if (strcmp(t->name, buf) == 0) 6581 break; 6582 } 6583 if (!t) { 6584 ret = -EINVAL; 6585 goto out; 6586 } 6587 if (t == tr->current_trace) 6588 goto out; 6589 6590 #ifdef CONFIG_TRACER_SNAPSHOT 6591 if (t->use_max_tr) { 6592 local_irq_disable(); 6593 arch_spin_lock(&tr->max_lock); 6594 if (tr->cond_snapshot) 6595 ret = -EBUSY; 6596 arch_spin_unlock(&tr->max_lock); 6597 local_irq_enable(); 6598 if (ret) 6599 goto out; 6600 } 6601 #endif 6602 /* Some tracers won't work on kernel command line */ 6603 if (system_state < SYSTEM_RUNNING && t->noboot) { 6604 pr_warn("Tracer '%s' is not allowed on command line, ignored\n", 6605 t->name); 6606 goto out; 6607 } 6608 6609 /* Some tracers are only allowed for the top level buffer */ 6610 if (!trace_ok_for_array(t, tr)) { 6611 ret = -EINVAL; 6612 goto out; 6613 } 6614 6615 /* If trace pipe files are being read, we can't change the tracer */ 6616 if (tr->trace_ref) { 6617 ret = -EBUSY; 6618 goto out; 6619 } 6620 6621 trace_branch_disable(); 6622 6623 tr->current_trace->enabled--; 6624 6625 if (tr->current_trace->reset) 6626 tr->current_trace->reset(tr); 6627 6628 #ifdef CONFIG_TRACER_MAX_TRACE 6629 had_max_tr = tr->current_trace->use_max_tr; 6630 6631 /* Current trace needs to be nop_trace before synchronize_rcu */ 6632 tr->current_trace = &nop_trace; 6633 6634 if (had_max_tr 
&& !t->use_max_tr) { 6635 /* 6636 * W
Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
On 1/25/24 22:12, Alexandru Elisei wrote: > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets > set in vma_alloc_zeroed_movable_folio(). > > Expand this to be more generic by adding an arch hook that modifies the gfp > flags for an allocation when the VMA is known. > > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO > is also set; from that point of view, the current behaviour is unchanged, > even though the arm64 flag is set in more places. When arm64 will have > support to reuse the tag storage for data allocation, the uses of the > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try > to reserve the corresponding tag storage for the pages being allocated. Right, but how does pushing the __GFP_ZEROTAGS addition into the gfp_t flags further down via a new arch callback, i.e. arch_calc_vma_gfp(), while still maintaining the (vma->vm_flags & VM_MTE) conditionality, improve the current scenario? The page allocator could still have analyzed the alloc flags for __GFP_ZEROTAGS for any additional work. Or does this callback just add some new core MM paths that can get __GFP_ZEROTAGS, which was not the case earlier? 
> > Signed-off-by: Alexandru Elisei > --- > arch/arm64/include/asm/page.h| 5 ++--- > arch/arm64/include/asm/pgtable.h | 3 +++ > arch/arm64/mm/fault.c| 19 ++- > include/linux/pgtable.h | 7 +++ > mm/mempolicy.c | 1 + > mm/shmem.c | 5 - > 6 files changed, 23 insertions(+), 17 deletions(-) > > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > index 2312e6ee595f..88bab032a493 100644 > --- a/arch/arm64/include/asm/page.h > +++ b/arch/arm64/include/asm/page.h > @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from, > void copy_highpage(struct page *to, struct page *from); > #define __HAVE_ARCH_COPY_HIGHPAGE > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > - unsigned long vaddr); > -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio > +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \ > + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false) > > void tag_clear_highpage(struct page *to); > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE > diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > index 79ce70fbb751..08f0904dbfc2 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, > struct folio *folio) > > #endif /* CONFIG_ARM64_MTE */ > > +#define __HAVE_ARCH_CALC_VMA_GFP > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp); > + > /* > * On AArch64, the cache coherency is handled via the set_pte_at() function. > */ > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index 55f6455a8284..4d3f0a870ad8 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long > addr_if_watchpoint, unsigned long esr, > NOKPROBE_SYMBOL(do_debug_exception); > > /* > - * Used during anonymous page fault handling. 
> + * If this is called during anonymous page fault handling, and the page is > + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as > this > + * is usually faster than separate DC ZVA and STGM. > */ > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > - unsigned long vaddr) > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) > { > - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > - > - /* > - * If the page is mapped with PROT_MTE, initialise the tags at the > - * point of allocation and page zeroing as this is usually faster than > - * separate DC ZVA and STGM. > - */ > if (vma->vm_flags & VM_MTE) > - flags |= __GFP_ZEROTAGS; > - > - return vma_alloc_folio(flags, 0, vma, vaddr, false); > + return __GFP_ZEROTAGS; > + return 0; > } > > void tag_clear_highpage(struct page *page) > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index c5ddec6b5305..98f81ca08cbe 100644 > --- a/include/linux/pgtable.h > +++ b/include/
Re: [PATCH v13 3/6] tracing: Add snapshot refcount
On Tue, Jan 30, 2024 at 05:30:38PM +0800, kernel test robot wrote: > Hi Vincent, > > kernel test robot noticed the following build errors: > > [auto build test ERROR on 29142dc92c37d3259a33aef15b03e6ee25b0d188] > > url: > https://github.com/intel-lab-lkp/linux/commits/Vincent-Donnefort/ring-buffer-Zero-ring-buffer-sub-buffers/20240129-223025 > base: 29142dc92c37d3259a33aef15b03e6ee25b0d188 > patch link: > https://lore.kernel.org/r/20240129142802.2145305-4-vdonnefort%40google.com > patch subject: [PATCH v13 3/6] tracing: Add snapshot refcount > config: arc-randconfig-002-20240130 > (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/config) > compiler: arceb-elf-gcc (GCC) 13.2.0 > reproduce (this is a W=1 build): > (https://download.01.org/0day-ci/archive/20240130/202401301740.qzzlpcyv-...@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version > of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot > | Closes: > https://lore.kernel.org/oe-kbuild-all/202401301740.qzzlpcyv-...@intel.com/ > > All errors (new ones prefixed by >>): > >kernel/trace/trace.c: In function 'tracing_set_tracer': >kernel/trace/trace.c:6644:17: error: implicit declaration of function > 'tracing_disarm_snapshot_locked'; did you mean 'tracing_disarm_snapshot'? > [-Werror=implicit-function-declaration] > 6644 | tracing_disarm_snapshot_locked(tr); > | ^~ > | tracing_disarm_snapshot > >> kernel/trace/trace.c:6648:23: error: implicit declaration of function > >> 'tracing_arm_snapshot_locked'; did you mean 'tracing_arm_snapshot'? > >> [-Werror=implicit-function-declaration] > 6648 | ret = tracing_arm_snapshot_locked(tr); > | ^~~ > | tracing_arm_snapshot >cc1: some warnings being treated as errors Right, two tracers (hwlat and osnoise) select _only_ MAX_TRACE and not TRACER_SNAPSHOT. However, AFAICT, they will not call any of the swapping functions (they don't set use_max_tr). 
So I suppose arm/disarm can be ommited in that case. > > > vim +6648 kernel/trace/trace.c > > 6560 > 6561int tracing_set_tracer(struct trace_array *tr, const char *buf) > 6562{ > 6563struct tracer *t; > 6564#ifdef CONFIG_TRACER_MAX_TRACE > 6565bool had_max_tr; > 6566#endif > 6567int ret = 0; > 6568 > 6569mutex_lock(&trace_types_lock); > 6570 > 6571if (!tr->ring_buffer_expanded) { > 6572ret = __tracing_resize_ring_buffer(tr, > trace_buf_size, > 6573 > RING_BUFFER_ALL_CPUS); > 6574if (ret < 0) > 6575goto out; > 6576ret = 0; > 6577} > 6578 > 6579for (t = trace_types; t; t = t->next) { > 6580if (strcmp(t->name, buf) == 0) > 6581break; > 6582} > 6583if (!t) { > 6584ret = -EINVAL; > 6585goto out; > 6586} > 6587if (t == tr->current_trace) > 6588goto out; > 6589 > 6590#ifdef CONFIG_TRACER_SNAPSHOT > 6591if (t->use_max_tr) { > 6592local_irq_disable(); > 6593arch_spin_lock(&tr->max_lock); > 6594if (tr->cond_snapshot) > 6595ret = -EBUSY; > 6596arch_spin_unlock(&tr->max_lock); > 6597local_irq_enable(); > 6598if (ret) > 6599goto out; > 6600} > 6601#endif > 6602/* Some tracers won't work on kernel command line */ > 6603if (system_state < SYSTEM_RUNNING && t->noboot) { > 6604pr_warn("Tracer '%s' is not allowed on command > line, ignored\n", > 6605t->name); > 6606goto out; > 6607
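[Editor's illustration] The build error above comes from callers referencing tracing_arm_snapshot_locked()/tracing_disarm_snapshot_locked() when CONFIG_TRACER_SNAPSHOT is not set. A common kernel pattern for this situation, sketched below as a standalone userspace program, is to provide no-op stubs under #else so callers keep compiling; as noted above, this is safe here because tracers that do not set use_max_tr never call the swapping functions anyway. Names and bodies are placeholders, not the actual fix:

```c
/* Userspace sketch of the stub pattern; CONFIG_TRACER_SNAPSHOT is a
 * stand-in macro, and the function bodies are placeholders. */
#include <assert.h>

#ifdef CONFIG_TRACER_SNAPSHOT
int tracing_arm_snapshot_locked(void) { return 0; /* real arming work */ }
void tracing_disarm_snapshot_locked(void) { /* real disarming work */ }
#else
/* Stubs keep callers building when the feature is configured out. */
int tracing_arm_snapshot_locked(void) { return 0; }
void tracing_disarm_snapshot_locked(void) { }
#endif

/* Simplified stand-in for the tracing_set_tracer() call sites. */
int set_tracer(int use_max_tr)
{
	int ret = 0;

	if (use_max_tr)
		ret = tracing_arm_snapshot_locked();
	else
		tracing_disarm_snapshot_locked();
	return ret;
}
```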
Re: [PATCH] media: dt-bindings: qcom,sc7280-venus: Allow one IOMMU entry
On 29/01/2024 08:48, Luca Weiss wrote: > Some SC7280-based boards crash when providing the "secure_non_pixel" > context bank, so allow only one iommu in the bindings also. > > Signed-off-by: Luca Weiss > --- Acked-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
On 30/01/2024 10:16, Chen-Yu Tsai wrote: > Hi, > > On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote: >> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote: >>> Declaring rodata_enabled and mark_rodata_ro() at all time >>> helps removing related #ifdefery in C files. >>> >>> Signed-off-by: Christophe Leroy >> >> Very nice cleanup, thanks!, applied and pushed >> >>Luis > > On next-20240130, which has your modules-next branch, and thus this > series and the other "module: Use set_memory_rox()" series applied, > my kernel crashes in some very weird way. Reverting your branch > makes the crash go away. > > I thought I'd report it right away. Maybe you folks would know what's > happening here? This is on arm64. That's strange, it seems to crash in module_bug_finalize(), which is _before_ the calls to module_enable_ro() and such. Can you try to revert the 6 patches one by one to see which one introduces the problem? In reality, only patch 677bfb9db8a3 really changes things. The other ones are more or less only cleanup. 
Thanks Christophe > > [ 10.481015] Unable to handle kernel paging request at virtual address > ffde85245d30 > [ 10.490369] KASAN: maybe wild-memory-access in range > [0x00f42922e980-0x00f42922e987] > [ 10.503744] Mem abort info: > [ 10.509383] ESR = 0x9647 > [ 10.514400] EC = 0x25: DABT (current EL), IL = 32 bits > [ 10.522366] SET = 0, FnV = 0 > [ 10.526343] EA = 0, S1PTW = 0 > [ 10.530695] FSC = 0x07: level 3 translation fault > [ 10.537081] Data abort info: > [ 10.540839] ISV = 0, ISS = 0x0047, ISS2 = 0x > [ 10.546456] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 > [ 10.551726] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > [ 10.557612] swapper pgtable: 4k pages, 39-bit VAs, pgdp=41f98000 > [ 10.565214] [ffde85245d30] pgd=10023003, p4d=10023003, > pud=10023003, pmd=1001121eb003, pte= > [ 10.578887] Internal error: Oops: 9647 [#1] PREEMPT SMP > [ 10.585815] Modules linked in: > [ 10.590235] CPU: 6 PID: 195 Comm: (udev-worker) Tainted: GB > 6.8.0-rc2-next-20240130-02908-ge8ad01d60927-dirty #163 > 3f2318148ecc5fa70d1092c2b874f9b59bdb7d60 > [ 10.607021] Hardware name: Google Tentacruel board (DT) > [ 10.613607] pstate: a049 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 10.621954] pc : module_bug_finalize+0x118/0x148 > [ 10.626823] lr : module_bug_finalize+0x118/0x148 > [ 10.631463] sp : ffc0820478d0 > [ 10.631466] x29: ffc0820478d0 x28: ffc082047ca0 x27: > ffde8d7d31a0 > [ 10.631477] x26: ffde85223780 x25: x24: > ffde8c413cc0 > [ 10.631486] x23: ffde8dfcec80 x22: ffde8dfce000 x21: > ffde85223ba8 > [ 10.631495] x20: ffde85223780 x19: ffde85245d28 x18: > > [ 10.631504] x17: ffde8aa15938 x16: ffde8aabdd90 x15: > ffde8aab8124 > [ 10.631513] x14: ffde8acdd380 x13: 41b58ab3 x12: > ffbbd1bf9d91 > [ 10.631522] x11: 1ffbd1bf9d90 x10: ffbbd1bf9d90 x9 : > dfc0 > [ 10.631531] x8 : 00442e406270 x7 : ffde8dfcec87 x6 : > 0001 > [ 10.631539] x5 : ffde8dfcec80 x4 : x3 : > ffde8bbadf08 > [ 10.631548] x2 : 0001 x1 : ffde8eaff080 x0 : > > [ 10.631556] Call trace: > [ 10.631559] 
module_bug_finalize+0x118/0x148 > [ 10.631565] load_module+0x25ec/0x2a78 > [ 10.631572] __do_sys_init_module+0x234/0x418 > [ 10.631578] __arm64_sys_init_module+0x4c/0x68 > [ 10.631584] invoke_syscall+0x68/0x198 > [ 10.631589] el0_svc_common.constprop.0+0x11c/0x150 > [ 10.631594] do_el0_svc+0x38/0x50 > [ 10.631598] el0_svc+0x50/0xa0 > [ 10.631604] el0t_64_sync_handler+0x120/0x130 > [ 10.631609] el0t_64_sync+0x1a8/0x1b0 > [ 10.631619] Code: 97c5418e c89ffef5 91002260 97c53ca7 (f9000675) > [ 10.631624] ---[ end trace ]--- > [ 10.642965] Kernel panic - not syncing: Oops: Fatal exception > [ 10.642975] SMP: stopping secondary CPUs > [ 10.648339] Kernel Offset: 0x1e0a80 from 0xffc08000 > [ 10.648343] PHYS_OFFSET: 0x4000 > [ 10.648345] CPU features: 0x0,c061,7002814a,2100720b > [ 10.648350] Memory Limit: none >
Re: [RFC PATCH 1/2] x86/kprobes: Prohibit kprobing on INT and UD
On Mon, 29 Jan 2024 20:50:39 -0600 Jinghao Jia wrote: > On 1/29/24 19:44, Masami Hiramatsu (Google) wrote: > > On Sun, 28 Jan 2024 15:25:59 -0600 > > Jinghao Jia wrote: > > > /* Check if paddr is at an instruction boundary */ > static int can_probe(unsigned long paddr) > { > @@ -294,6 +310,16 @@ static int can_probe(unsigned long paddr) > #endif > addr += insn.length; > } > +__addr = recover_probed_instruction(buf, addr); > +if (!__addr) > +return 0; > + > +if (insn_decode_kernel(&insn, (void *)__addr) < 0) > +return 0; > + > +if (is_exception_insn(&insn)) > +return 0; > + > >>> > >>> Please don't put this outside of decoding loop. You should put these in > >>> the loop which decodes the instruction from the beginning of the function. > >>> Since the x86 instrcution is variable length, can_probe() needs to check > >>> whether that the address is instruction boundary and decodable. > >>> > >>> Thank you, > >> > >> If my understanding is correct then this is trying to decode the kprobe > >> target instruction, given that it is after the main decoding loop. Here I > >> hoisted the decoding logic out of the if(IS_ENABLED(CONFIG_CFI_CLANG)) > >> block so that we do not need to decode the same instruction twice. I left > >> the main decoding loop unchanged so it is still decoding the function from > >> the start and should handle instruction boundaries. Are there any caveats > >> that I missed? > > > > Ah, sorry I misread the patch. You're correct! > > This is a good place to do that. > > > > But hmm, I think we should add another patch to check the addr == paddr > > soon after the loop so that we will avoid decoding. > > > > Thank you, > > > > Yes, that makes sense to me. At the same time, I'm also thinking about > changing the return type of can_probe() to bool, since we are just using > int as bool in this context. 
Yes, that is also a good change :) Thank you, > > --Jinghao > > >> > >> --Jinghao > >> > >>> > if (IS_ENABLED(CONFIG_CFI_CLANG)) { > /* > * The compiler generates the following instruction > sequence > @@ -308,13 +334,6 @@ static int can_probe(unsigned long paddr) > * Also, these movl and addl are used for showing > expected > * type. So those must not be touched. > */ > -__addr = recover_probed_instruction(buf, addr); > -if (!__addr) > -return 0; > - > -if (insn_decode_kernel(&insn, (void *)__addr) < 0) > -return 0; > - > if (insn.opcode.value == 0xBA) > offset = 12; > else if (insn.opcode.value == 0x3) > -- > 2.43.0 > > >>> > >>> > > > > -- Masami Hiramatsu (Google)
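[Editor's illustration] The can_probe() discussion above hinges on one idea: because x86 instructions are variable length, you must decode forward from the function start and only treat an address as probeable if the walk lands exactly on it. A minimal userspace sketch of that boundary check, with the bool return type proposed above (the instruction-length rule here is a toy stand-in for the real x86 decoder):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: opcode byte -> instruction length for a toy ISA.
 * The real code uses insn_decode_kernel() on actual x86 bytes. */
static int insn_len(unsigned char opcode)
{
	return (opcode & 0x80) ? 2 : 1;	/* toy rule, not real x86 */
}

/* Walk from the function start; paddr is probeable only if the walk
 * lands exactly on it, i.e. it is an instruction boundary. */
static bool can_probe(const unsigned char *func, size_t func_len, size_t paddr)
{
	size_t addr = 0;

	while (addr < paddr && addr < func_len)
		addr += insn_len(func[addr]);

	return addr == paddr;
}
```

This also shows why the early `addr == paddr` exit suggested above is free: the loop already stops as soon as the boundary is reached, so no further decoding happens.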
Re: [PATCH RFC v3 09/35] mm: cma: Introduce cma_remove_mem()
Hi, I really appreciate the feedback you have given me so far. I believe the commit message isn't clear enough and there has been some confusion. A CMA user adds a CMA area to the cma_areas array with cma_declare_contiguous_nid() or cma_init_reserved_mem(). cma_init_reserved_areas() then iterates over the array and activates all cma areas (via init_cma_reserved_pageblock()). The function cma_remove_mem() is intended to be used to remove a cma area from the cma_areas array **before** the area has been activated. Usecase: a driver (in this case, the arm64 dynamic tag storage code) manages several cma areas. The driver successfully adds the first area to the cma_areas array. When the driver tries to add the second area, the function fails. Without cma_remove_mem(), the driver has no way to prevent the first area from being freed to the page allocator. cma_remove_mem() is about providing a means to do cleanup in case of error. Does that make more sense now? On Tue, Jan 30, 2024 at 11:20:56AM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > Memory is added to CMA with cma_declare_contiguous_nid() and > > cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in > > cma_init_reserved_areas(), where the page allocator can make use of it. > > cma_declare_contiguous_nid() reserves memory in memblock and marks the You forgot about cma_init_reserved_mem() which does the same thing, but yes, you are right. > for subsequent CMA usage, where as cma_init_reserved_areas() activates > these memory areas through init_cma_reserved_pageblock(). Standard page > allocator only receives these memory via free_reserved_page() - only if I don't think that's correct. init_cma_reserved_pageblock() clears the PG_reserved page flag, sets the migratetype to MIGRATE_CMA and then frees the page. After that, the page is available to the standard page allocator to use for allocation. Otherwise, what would be the point of the MIGRATE_CMA migratetype? 
> the page block activation fails. For the sake of having a complete picture, I'll add that that only happens if cma->reserve_pages_on_error is false. If the CMA user sets the field to 'true' (with cma_reserve_pages_on_error()), then the pages in the CMA region are kept PG_reserved if activation fails. > > > > > If a device manages multiple CMA areas, and there's an error when one of > > the areas is added to CMA, there is no mechanism for the device to prevent > > What kind of error ? init_cma_reserved_pageblock() fails ? But that will > not happen until cma_init_reserved_areas(). I think I haven't been clear enough. When I say that "an area is added to CMA", I mean that the memory region is added to cma_areas array, via cma_declare_contiguous_nid() or cma_init_reserved_mem(). There are several ways in which either function can fail. > > > the rest of the areas, which were added before the error occured, from > > being later added to the MIGRATE_CMA list. > > Why is this mechanism required ? cma_init_reserved_areas() scans over all > CMA areas and try and activate each of them sequentially. Why is not this > sufficient ? This patch is about removing a struct cma from the cma_areas array after it has been added to the array, with cma_declare_contiguous_nid() or cma_init_reserved_mem(), to prevent the area from being activated in cma_init_reserved_areas(). Sorry for the confusion. I'll add a check in cma_remove_mem() to fail if the cma area has been activated, and a comment to the function to explain its usage. > > > > > Add cma_remove_mem() which allows a previously reserved CMA area to be > > removed and thus it cannot be used by the page allocator. > > Successfully activated CMA areas do not get used by the buddy allocator. I don't believe that is correct, see above. > > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since rfc v2: > > > > * New patch. 
> > > > include/linux/cma.h | 1 + > > mm/cma.c| 30 +- > > 2 files changed, 30 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/cma.h b/include/linux/cma.h > > index e32559da6942..787cbec1702e 100644 > > --- a/include/linux/cma.h > > +++ b/include/linux/cma.h > > @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, > > phys_addr_t size, > > unsigned int order_per_bit, > > const char *name, > > struct cma **res_cma); > > +extern void cma_remove_mem(struct cma **res_cma); > > extern struct page *cma_alloc(struct cma *cma, unsigned long count, > > unsigned int align, > > bool no_warn); > > extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned > > long count, > > diff --git a/mm/cma.c b/mm/cma.c > > index 4a0f68b9443b..2881bab12b01 100644 > > --- a/mm/cma.c > > +++ b/mm/cma.c > > @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void
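[Editor's illustration] The usecase described above is a classic register-several-things-with-rollback pattern: when adding the Nth area fails, the driver must be able to undo the N-1 areas already added before any of them get activated. A userspace sketch of that error-cleanup shape (add_area()/remove_area() are made-up stand-ins for cma_init_reserved_mem()/cma_remove_mem(), not the kernel API):

```c
#include <assert.h>

#define MAX_AREAS 8
static int nr_areas;	/* stand-in for the cma_areas array length */

static int add_area(int ok)	/* stand-in for cma_init_reserved_mem() */
{
	if (!ok || nr_areas >= MAX_AREAS)
		return -1;
	nr_areas++;
	return 0;
}

static void remove_area(void)	/* stand-in for cma_remove_mem() */
{
	if (nr_areas > 0)
		nr_areas--;
}

/* Register n areas; the fail_at-th add fails. On failure, roll back all
 * previous adds so none of them are later activated. */
static int register_areas(int n, int fail_at)
{
	int i;

	for (i = 0; i < n; i++) {
		if (add_area(i != fail_at)) {
			while (i--)
				remove_area();
			return -1;
		}
	}
	return 0;
}
```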
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: > Use dax_is_supported() to validate whether the architecture has > virtually aliased caches at mount time. > > This is relevant for architectures which require a dynamic check > to validate whether they have virtually aliased data caches > (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). > > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") > Signed-off-by: Mathieu Desnoyers > Cc: Jan Kara > Cc: linux-e...@vger.kernel.org > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: linux...@kvack.org > Cc: linux-a...@vger.kernel.org > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: nvd...@lists.linux.dev > Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Honza > --- > fs/ext2/super.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 01f9addc8b1f..0398e7a90eb6 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -585,13 +585,13 @@ static int parse_options(char *options, struct > super_block *sb, > set_opt(opts->s_mount_opt, XIP); > fallthrough; > case Opt_dax: > -#ifdef CONFIG_FS_DAX > - ext2_msg(sb, KERN_WARNING, > - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); > - set_opt(opts->s_mount_opt, DAX); > -#else > - ext2_msg(sb, KERN_INFO, "dax option not supported"); > -#endif > + if (dax_is_supported()) { > + ext2_msg(sb, KERN_WARNING, > + "DAX enabled. Warning: EXPERIMENTAL, > use at your own risk"); > + set_opt(opts->s_mount_opt, DAX); > + } else { > + ext2_msg(sb, KERN_INFO, "dax option not > supported"); > + } > break; > > #if defined(CONFIG_QUOTA) > -- > 2.39.2 > -- Jan Kara SUSE Labs, CR
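[Editor's illustration] The shape of the ext2 change above is: replace a compile-time #ifdef CONFIG_FS_DAX with a runtime predicate, so architectures that can only determine cache aliasing dynamically can reject the dax mount option at mount time. A minimal userspace sketch of that pattern (dax_is_supported() is modeled as a trivial function over a flag that real hardware probing would set):

```c
#include <assert.h>
#include <stdbool.h>

static bool cache_aliasing;	/* would be probed on real hardware */

static bool dax_is_supported(void)
{
	return !cache_aliasing;
}

/* Stand-in for the Opt_dax case above; returns 1 if the dax mount
 * option was accepted, 0 if rejected. */
static int parse_dax_option(void)
{
	if (dax_is_supported()) {
		/* set_opt(opts->s_mount_opt, DAX); + EXPERIMENTAL warning */
		return 1;
	}
	/* ext2_msg(sb, KERN_INFO, "dax option not supported"); */
	return 0;
}
```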
Re: [PATCH RFC v3 10/35] mm: cma: Fast track allocating memory when the pages are free
Hi, On Tue, Jan 30, 2024 at 02:48:53PM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > If the pages to be allocated are free, take them directly off the buddy > > allocator, instead of going through alloc_contig_range() and avoiding > > costly calls to lru_cache_disable(). > > > > Only allocations of the same size as the CMA region order are considered, > > to avoid taking the zone spinlock for too long. > > > > Signed-off-by: Alexandru Elisei > > This patch seems to be improving standard cma_alloc() as well as > the previously added new allocator i.e cma_alloc_range() - via a > new helper cma_alloc_pages_fastpath(). Yes, that's correct. > > Should not any standard cma_alloc() improvement be discussed as > an independent patch separately irrespective of this series. OR > it is some how related to this series which I might be missing ? Yes, it's related to this series. I wrote this patch because it fixes a performance regression with Chrome when dynamic tag storage management is enabled [1]. I will bring back the commit message explaining that. [1] https://lore.kernel.org/linux-fsdevel/20231119165721.9849-27-alexandru.eli...@arm.com/ Thanks, Alex > > > --- > > > > Changes since rfc v2: > > > > * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track > > reserving tag storage when the block is free") (David Hildenbrand). 
> > > > include/linux/page-flags.h | 15 -- > > mm/Kconfig | 5 + > > mm/cma.c | 42 ++ > > mm/memory-failure.c| 8 > > mm/page_alloc.c| 23 - > > 5 files changed, 73 insertions(+), 20 deletions(-) > > > > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > > index 735cddc13d20..b7237bce7446 100644 > > --- a/include/linux/page-flags.h > > +++ b/include/linux/page-flags.h > > @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY) > > #define MAGIC_HWPOISON 0x48575053U /* HWPS */ > > extern void SetPageHWPoisonTakenOff(struct page *page); > > extern void ClearPageHWPoisonTakenOff(struct page *page); > > -extern bool take_page_off_buddy(struct page *page); > > -extern bool put_page_back_buddy(struct page *page); > > +extern bool PageHWPoisonTakenOff(struct page *page); > > #else > > PAGEFLAG_FALSE(HWPoison, hwpoison) > > +TESTSCFLAG_FALSE(HWPoison, hwpoison) > > #define __PG_HWPOISON 0 > > +static inline void SetPageHWPoisonTakenOff(struct page *page) { } > > +static inline void ClearPageHWPoisonTakenOff(struct page *page) { } > > +static inline bool PageHWPoisonTakenOff(struct page *page) > > +{ > > + return false; > > +} > > +#endif > > + > > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY > > +extern bool take_page_off_buddy(struct page *page, bool poison); > > +extern bool put_page_back_buddy(struct page *page, bool unpoison); > > #endif > > > > #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT) > > diff --git a/mm/Kconfig b/mm/Kconfig > > index ffc3a2ba3a8c..341cf53898db 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR > > config ARCH_SUPPORTS_MEMORY_FAILURE > > bool > > > > +config WANTS_TAKE_PAGE_OFF_BUDDY > > + bool> + > > config MEMORY_FAILURE > > depends on MMU > > depends on ARCH_SUPPORTS_MEMORY_FAILURE > > bool "Enable recovery from hardware memory errors" > > select MEMORY_ISOLATION > > select RAS > > + select WANTS_TAKE_PAGE_OFF_BUDDY > > help > > Enables code to recover from 
some memory failures on systems > > with MCA recovery. This allows a system to continue running > > @@ -891,6 +895,7 @@ config CMA > > depends on MMU > > select MIGRATION > > select MEMORY_ISOLATION > > + select WANTS_TAKE_PAGE_OFF_BUDDY > > help > > This enables the Contiguous Memory Allocator which allows other > > subsystems to allocate big physically-contiguous blocks of memory. > > diff --git a/mm/cma.c b/mm/cma.c > > index 2881bab12b01..15663f95d77b 100644 > > --- a/mm/cma.c > > +++ b/mm/cma.c > > @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma) > > static inline void cma_debug_show_areas(struct cma *cma) { } > > #endif > > > > +/* Called with the cma mutex held. */ > > +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start, > > + unsigned long end) > > +{ > > + bool success = false; > > + unsigned long i, j; > > + > > + /* Avoid contention on the zone lock. */ > > + if (start - end != 1 << cma->order_per_bit) > > + return -EINVAL; > > + > > + for (i = start; i < end; i++) { > > + if (!is_free_buddy_page(pfn_to_page(i))) > > + break; > > + success = take_page_off_buddy(pfn_to_page(i), false); > > + if (!success) > > + break; > > + } > > + > > + if (success
Re: [PATCH RFC v3 11/35] mm: Allow an arch to hook into folio allocation when VMA is known
Hi, On Tue, Jan 30, 2024 at 03:25:20PM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA. > > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and > > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets > > set in vma_alloc_zeroed_movable_folio(). > > > > Expand this to be more generic by adding an arch hook that modifes the gfp > > flags for an allocation when the VMA is known. > > > > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO > > is also set; from that point of view, the current behaviour is unchanged, > > even though the arm64 flag is set in more places. When arm64 will have > > support to reuse the tag storage for data allocation, the uses of the > > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try > > to reserve the corresponding tag storage for the pages being allocated. > > Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further > down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining > (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because I'm afraid I don't follow you. > the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS > for any additional stuff. > > OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not > the case earlier via this call back. Before this patch: vma_alloc_zeroed_movable_folio() sets __GFP_ZEROTAGS. After this patch: vma_alloc_folio() sets __GFP_ZEROTAGS. This patch is about adding __GFP_ZEROTAGS for more callers. 
Thanks, Alex > > > > > The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by > > the caller; this has been done to keep an architecture from modifying the > > flags already set by the core memory management code; this is similar to > > how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been > > implemented. This can be revisited in the future if there's a need to do > > so. > > > > Signed-off-by: Alexandru Elisei > > --- > > arch/arm64/include/asm/page.h| 5 ++--- > > arch/arm64/include/asm/pgtable.h | 3 +++ > > arch/arm64/mm/fault.c| 19 ++- > > include/linux/pgtable.h | 7 +++ > > mm/mempolicy.c | 1 + > > mm/shmem.c | 5 - > > 6 files changed, 23 insertions(+), 17 deletions(-) > > > > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h > > index 2312e6ee595f..88bab032a493 100644 > > --- a/arch/arm64/include/asm/page.h > > +++ b/arch/arm64/include/asm/page.h > > @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page > > *from, > > void copy_highpage(struct page *to, struct page *from); > > #define __HAVE_ARCH_COPY_HIGHPAGE > > > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > > - unsigned long vaddr); > > -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio > > +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \ > > + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false) > > > > void tag_clear_highpage(struct page *to); > > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE > > diff --git a/arch/arm64/include/asm/pgtable.h > > b/arch/arm64/include/asm/pgtable.h > > index 79ce70fbb751..08f0904dbfc2 100644 > > --- a/arch/arm64/include/asm/pgtable.h > > +++ b/arch/arm64/include/asm/pgtable.h > > @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t > > entry, struct folio *folio) > > > > #endif /* CONFIG_ARM64_MTE */ > > > > +#define __HAVE_ARCH_CALC_VMA_GFP > > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t 
gfp); > > + > > /* > > * On AArch64, the cache coherency is handled via the set_pte_at() > > function. > > */ > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > > index 55f6455a8284..4d3f0a870ad8 100644 > > --- a/arch/arm64/mm/fault.c > > +++ b/arch/arm64/mm/fault.c > > @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long > > addr_if_watchpoint, unsigned long esr, > > NOKPROBE_SYMBOL(do_debug_exception); > > > > /* > > - * Used during anonymous page fault handling. > > + * If this is called during anonymous page fault handling, and the page is > > + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing > > as this > > + * is usually faster than separate DC ZVA and STGM. > > */ > > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma, > > - unsigned long vaddr) > > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp) > > { > > - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO; > > - > > - /* > > -* If the page is mapped with PROT_MTE, initialise the tags at the > > -* point of allocati
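[Editor's illustration] The key property of arch_calc_vma_gfp() described in this thread is that its return value is or'ed into the caller's flags, so the architecture can only add flags (here __GFP_ZEROTAGS for VM_MTE VMAs), never clear ones the core MM already set. A userspace sketch of that contract (flag values are illustrative, not the kernel's):

```c
#include <assert.h>

#define __GFP_ZERO	0x1u
#define __GFP_ZEROTAGS	0x2u
#define VM_MTE		0x4u

struct vm_area_struct { unsigned long vm_flags; };

/* Mirrors the arm64 hook in the patch: request tag zeroing for MTE VMAs. */
static unsigned int arch_calc_vma_gfp(struct vm_area_struct *vma,
				      unsigned int gfp)
{
	(void)gfp;
	if (vma->vm_flags & VM_MTE)
		return __GFP_ZEROTAGS;
	return 0;
}

/* Caller side: the hook's result is or'ed in, never assigned. */
static unsigned int vma_alloc_gfp(struct vm_area_struct *vma, unsigned int gfp)
{
	return gfp | arch_calc_vma_gfp(vma, gfp);
}
```

Note that, as the commit message says, __GFP_ZEROTAGS without __GFP_ZERO is ignored by the page allocator, so adding the flag in more places does not change current behaviour.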
Re: [PATCH RFC v3 08/35] mm: cma: Introduce cma_alloc_range()
Hi, On Tue, Jan 30, 2024 at 10:50:00AM +0530, Anshuman Khandual wrote: > > > On 1/25/24 22:12, Alexandru Elisei wrote: > > Today, cma_alloc() is used to allocate a contiguous memory region. The > > function allows the caller to specify the number of pages to allocate, but > > not the starting address. cma_alloc() will walk over the entire CMA region > > trying to allocate the first available range of the specified size. > > > > Introduce cma_alloc_range(), which makes CMA more versatile by allowing the > > caller to specify a particular range in the CMA region, defined by the > > start pfn and the size. > > > > arm64 will make use of this function when tag storage management will be > > implemented: cma_alloc_range() will be used to reserve the tag storage > > associated with a tagged page. > > Basically, you would like to pass on a preferred start address and the > allocation could just fail if a contig range is not available from such > a starting address ? > > Then why not just change cma_alloc() to take a new argument 'start_pfn'. > Why create a new but almost similar allocator ? I tried doing that, and I gave up because: - It made cma_alloc() even more complex and hard to follow. - What value should 'start_pfn' be to tell cma_alloc() that it should be ignored? Or, to put it another way, what pfn number is invalid on **all** platforms that Linux supports? I can give it another go if we can come up with an invalid value for 'start_pfn'. > > But then I am wondering why this could not be done in the arm64 platform > code itself operating on a CMA area reserved just for tag storage. Unless > this new allocator has other usage beyond MTE, this could be implemented > in the platform itself. I had the same idea in the previous iteration, David Hildenbrand suggested this approach [1]. 
[1] https://lore.kernel.org/linux-fsdevel/2aafd53f-af1f-45f3-a08c-d11962254...@redhat.com/ Thanks, Alex > > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since rfc v2: > > > > * New patch. > > > > include/linux/cma.h| 2 + > > include/trace/events/cma.h | 59 ++ > > mm/cma.c | 86 ++ > > 3 files changed, 147 insertions(+) > > > > diff --git a/include/linux/cma.h b/include/linux/cma.h > > index 63873b93deaa..e32559da6942 100644 > > --- a/include/linux/cma.h > > +++ b/include/linux/cma.h > > @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, > > phys_addr_t size, > > struct cma **res_cma); > > extern struct page *cma_alloc(struct cma *cma, unsigned long count, > > unsigned int align, > > bool no_warn); > > +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned > > long count, > > + unsigned tries, gfp_t gfp); > > extern bool cma_pages_valid(struct cma *cma, const struct page *pages, > > unsigned long count); > > extern bool cma_release(struct cma *cma, const struct page *pages, > > unsigned long count); > > > > diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h > > index 25103e67737c..a89af313a572 100644 > > --- a/include/trace/events/cma.h > > +++ b/include/trace/events/cma.h > > @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release, > > __entry->count) > > ); > > > > +TRACE_EVENT(cma_alloc_range_start, > > + > > + TP_PROTO(const char *name, unsigned long start, unsigned long count, > > +unsigned tries), > > + > > + TP_ARGS(name, start, count, tries), > > + > > + TP_STRUCT__entry( > > + __string(name, name) > > + __field(unsigned long, start) > > + __field(unsigned long, count) > > + __field(unsigned, tries) > > + ), > > + > > + TP_fast_assign( > > + __assign_str(name, name); > > + __entry->start = start; > > + __entry->count = count; > > + __entry->tries = tries; > > + ), > > + > > + TP_printk("name=%s start=%lx count=%lu tries=%u", > > + __get_str(name), > > + __entry->start, > > + __entry->count, > > + 
__entry->tries) > > +); > > + > > +TRACE_EVENT(cma_alloc_range_finish, > > + > > + TP_PROTO(const char *name, unsigned long start, unsigned long count, > > +unsigned attempts, int err), > > + > > + TP_ARGS(name, start, count, attempts, err), > > + > > + TP_STRUCT__entry( > > + __string(name, name) > > + __field(unsigned long, start) > > + __field(unsigned long, count) > > + __field(unsigned, attempts) > > + __field(int, err) > > + ), > > + > > + TP_fast_assign( > > + __assign_str(name, name); > > + __entry->start = start; > > + __entry->count = count; > > + __entry->attempts = attempts; > > + __entry->err = err; > > + ), > > + > > + TP_printk("name=%s start=%lx count=%lu a
[PATCH net-next v4 2/5] page_frag: unify gfp bits for order 3 page allocation
Currently there are three page frag implementations which all try to allocate an order 3 page; if that fails, they fall back to allocating an order 0 page, and each of them allows the order 3 page allocation to fail under certain conditions by using specific gfp bits. The gfp bits for the order 3 page allocation differ between the implementations: __GFP_NOMEMALLOC is or'd to forbid access to emergency reserve memory in __page_frag_cache_refill(), but it is not or'd in the other implementations; __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in skb_page_frag_refill(), but it is not masked off in __page_frag_cache_refill(). This patch unifies the gfp bits used between the implementations by or'ing __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for the order 3 page allocation, to avoid putting possible pressure on mm. Signed-off-by: Yunsheng Lin Reviewed-by: Alexander Duyck CC: Alexander Duyck --- drivers/vhost/net.c | 2 +- mm/page_alloc.c | 4 ++-- net/core/sock.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f2ed7167c848..e574e21cc0ca 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz, /* Avoid direct reclaim but allow kswapd to wake */ pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY, + __GFP_NORETRY | __GFP_NOMEMALLOC, SKB_FRAG_PAGE_ORDER); if (likely(pfrag->page)) { pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c0f7e67c4250..636145c29f70 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4685,8 +4685,8 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc, gfp_t gfp = gfp_mask; #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY | - __GFP_NOMEMALLOC; + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | 
__GFP_COMP | + __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC; page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER); nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE; diff --git a/net/core/sock.c b/net/core/sock.c index 88bf810394a5..8289a3d8c375 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2919,7 +2919,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp) /* Avoid direct reclaim but allow kswapd to wake */ pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY, + __GFP_NORETRY | __GFP_NOMEMALLOC, SKB_FRAG_PAGE_ORDER); if (likely(pfrag->page)) { pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; -- 2.33.0
[PATCH net-next v4 4/5] vhost/net: remove vhost_net_page_frag_refill()
The page frag in vhost_net_page_frag_refill() uses the 'struct page_frag' from skb_page_frag_refill(), but its implementation is similar to page_frag_alloc_align() now. This patch removes vhost_net_page_frag_refill() by using 'struct page_frag_cache' instead of 'struct page_frag', and allocating frags using page_frag_alloc_align(). The added benefit is that it not only unifies the page frag implementation a little, but also gives about a 0.5% performance boost when testing with the vhost_net_test introduced in the last patch. Signed-off-by: Yunsheng Lin Acked-by: Jason Wang --- drivers/vhost/net.c | 91 ++--- 1 file changed, 27 insertions(+), 64 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index e574e21cc0ca..4b2fcb228a0a 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -141,10 +141,8 @@ struct vhost_net { unsigned tx_zcopy_err; /* Flush in progress. Protected by tx vq lock. */ bool tx_flush; - /* Private page frag */ - struct page_frag page_frag; - /* Refcount bias of page frag */ - int refcnt_bias; + /* Private page frag cache */ + struct page_frag_cache pf_cache; }; static unsigned vhost_net_zcopy_mask __read_mostly; @@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, size_t total_len) !vhost_vq_avail_empty(vq->dev, vq); } -static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int sz, - struct page_frag *pfrag, gfp_t gfp) -{ - if (pfrag->page) { - if (pfrag->offset + sz <= pfrag->size) - return true; - __page_frag_cache_drain(pfrag->page, net->refcnt_bias); - } - - pfrag->offset = 0; - net->refcnt_bias = 0; - if (SKB_FRAG_PAGE_ORDER) { - /* Avoid direct reclaim but allow kswapd to wake */ - pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) | - __GFP_COMP | __GFP_NOWARN | - __GFP_NORETRY | __GFP_NOMEMALLOC, - SKB_FRAG_PAGE_ORDER); - if (likely(pfrag->page)) { - pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER; - goto done; - } - } - pfrag->page = alloc_page(gfp); - if (likely(pfrag->page)) 
{ - pfrag->size = PAGE_SIZE; - goto done; - } - return false; - -done: - net->refcnt_bias = USHRT_MAX; - page_ref_add(pfrag->page, USHRT_MAX - 1); - return true; -} - #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD) static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, @@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, struct vhost_net *net = container_of(vq->dev, struct vhost_net, dev); struct socket *sock = vhost_vq_get_backend(vq); - struct page_frag *alloc_frag = &net->page_frag; struct virtio_net_hdr *gso; struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp]; struct tun_xdp_hdr *hdr; @@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, int sock_hlen = nvq->sock_hlen; void *buf; int copied; + int ret; if (unlikely(len < nvq->sock_hlen)) return -EFAULT; @@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, return -ENOSPC; buflen += SKB_DATA_ALIGN(len + pad); - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES); - if (unlikely(!vhost_net_page_frag_refill(net, buflen, -alloc_frag, GFP_KERNEL))) + buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL, + SMP_CACHE_BYTES); + if (unlikely(!buf)) return -ENOMEM; - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; - copied = copy_page_from_iter(alloc_frag->page, -alloc_frag->offset + -offsetof(struct tun_xdp_hdr, gso), -sock_hlen, from); - if (copied != sock_hlen) - return -EFAULT; + copied = copy_from_iter(buf + offsetof(struct tun_xdp_hdr, gso), + sock_hlen, from); + if (copied != sock_hlen) { + ret = -EFAULT; + goto err; + } hdr = buf; gso = &hdr->gso; @@ -743,27 +705,30 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, vhost16_to_cpu(vq, gso->csum_start) + vhost16_to_cpu(vq, gso->csum_offset) + 2); -
[PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test
Introduce vhost_net_test, based on virtio_test, to test vhost_net changes in the kernel. Signed-off-by: Yunsheng Lin --- tools/virtio/.gitignore | 1 + tools/virtio/Makefile | 8 +- tools/virtio/vhost_net_test.c | 576 ++ 3 files changed, 582 insertions(+), 3 deletions(-) create mode 100644 tools/virtio/vhost_net_test.c diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore index 9934d48d9a55..7e47b281c442 100644 --- a/tools/virtio/.gitignore +++ b/tools/virtio/.gitignore @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only *.d virtio_test +vhost_net_test vringh_test virtio-trace/trace-agent diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile index d128925980e0..e25e99c1c3b7 100644 --- a/tools/virtio/Makefile +++ b/tools/virtio/Makefile @@ -1,8 +1,9 @@ # SPDX-License-Identifier: GPL-2.0 all: test mod -test: virtio_test vringh_test +test: virtio_test vringh_test vhost_net_test virtio_test: virtio_ring.o virtio_test.o vringh_test: vringh_test.o vringh.o virtio_ring.o +vhost_net_test: virtio_ring.o vhost_net_test.o try-run = $(shell set -e; \ if ($(1)) >/dev/null 2>&1; \ @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean .PHONY: all test mod clean vhost oot oot-clean oot-build clean: - ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \ - vhost_test/Module.symvers vhost_test/modules.order *.d + ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \ + vhost_test/.*.cmd vhost_test/Module.symvers \ + vhost_test/modules.order *.d -include *.d diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c new file mode 100644 index ..e336792a0d77 --- /dev/null +++ b/tools/virtio/vhost_net_test.c @@ -0,0 +1,576 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define RANDOM_BATCH -1 +#define 
HDR_LEN12 +#define TEST_BUF_LEN 256 +#define TEST_PTYPE ETH_P_LOOPBACK + +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */ +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end; + +struct vq_info { + int kick; + int call; + int idx; + long started; + long completed; + struct pollfd fds; + void *ring; + /* copy used for control */ + struct vring vring; + struct virtqueue *vq; +}; + +struct vdev_info { + struct virtio_device vdev; + int control; + struct vq_info vqs[2]; + int nvqs; + void *buf; + size_t buf_size; + char *test_buf; + char *res_buf; + struct vhost_memory *mem; + int sock; + int ifindex; + unsigned char mac[ETHER_ADDR_LEN]; +}; + +static int tun_alloc(struct vdev_info *dev) +{ + struct ifreq ifr; + int len = HDR_LEN; + int fd, e; + + fd = open("/dev/net/tun", O_RDWR); + if (fd < 0) { + perror("Cannot open /dev/net/tun"); + return fd; + } + + memset(&ifr, 0, sizeof(ifr)); + + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR; + snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid()); + + e = ioctl(fd, TUNSETIFF, &ifr); + if (e < 0) { + perror("ioctl[TUNSETIFF]"); + close(fd); + return e; + } + + e = ioctl(fd, TUNSETVNETHDRSZ, &len); + if (e < 0) { + perror("ioctl[TUNSETVNETHDRSZ]"); + close(fd); + return e; + } + + e = ioctl(fd, SIOCGIFHWADDR, &ifr); + if (e < 0) { + perror("ioctl[SIOCGIFHWADDR]"); + close(fd); + return e; + } + + memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN); + return fd; +} + +static void vdev_create_socket(struct vdev_info *dev) +{ + struct ifreq ifr; + + dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE)); + assert(dev->sock != -1); + + snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid()); + assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0); + + dev->ifindex = ifr.ifr_ifindex; + + /* Set the flags that bring the device up */ + assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0); + ifr.ifr_flags |= (IFF_UP | IFF_RUNNING); + assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0); +} + 
+static void vdev_send_packet(struct vdev_info *dev) +{ + char *sendbuf = dev->test_buf + HDR_LEN; + struct sockaddr_ll saddrll = {0}; + int sockfd = dev->sock; + int ret; + + saddrll.sll_family = PF_PACKET; + saddrll.sll_ifindex = dev->ifindex; + saddrll.sll_halen = ETH_ALEN; + sadd
Re: [PATCH RFC v3 23/35] arm64: mte: Try to reserve tag storage in arch_alloc_page()
Hi Peter, On Mon, Jan 29, 2024 at 04:04:18PM -0800, Peter Collingbourne wrote: > On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei > wrote: > > > > Reserve tag storage for a page that is being allocated as tagged. This > > is a best effort approach, and failing to reserve tag storage is > > allowed. > > > > When all the associated tagged pages have been freed, return the tag > > storage pages back to the page allocator, where they can be used again for > > data allocations. > > > > Signed-off-by: Alexandru Elisei > > --- > > > > Changes since rfc v2: > > > > * Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page > > allocation"). > > * Fixed calculation of the number of associated tag storage blocks (Hyesoo > > Yu). > > * Tag storage is reserved in arch_alloc_page() instead of > > arch_prep_new_page(). > > > > arch/arm64/include/asm/mte.h | 16 +- > > arch/arm64/include/asm/mte_tag_storage.h | 31 +++ > > arch/arm64/include/asm/page.h| 5 + > > arch/arm64/include/asm/pgtable.h | 19 ++ > > arch/arm64/kernel/mte_tag_storage.c | 234 +++ > > arch/arm64/mm/fault.c| 7 + > > fs/proc/page.c | 1 + > > include/linux/kernel-page-flags.h| 1 + > > include/linux/page-flags.h | 1 + > > include/trace/events/mmflags.h | 3 +- > > mm/huge_memory.c | 1 + > > 11 files changed, 316 insertions(+), 3 deletions(-) > > > > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h > > index 8034695b3dd7..6457b7899207 100644 > > --- a/arch/arm64/include/asm/mte.h > > +++ b/arch/arm64/include/asm/mte.h > > @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf); > > #ifdef CONFIG_ARM64_MTE > > > > /* track which pages have valid allocation tags */ > > -#define PG_mte_tagged PG_arch_2 > > +#define PG_mte_tagged PG_arch_2 > > /* simple lock to avoid multiple threads tagging the same page */ > > -#define PG_mte_lockPG_arch_3 > > +#define PG_mte_lockPG_arch_3 > > +/* Track if a tagged page has tag storage reserved */ > > +#define PG_tag_storage_reservedPG_arch_4 > > + > > 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > +extern bool page_tag_storage_reserved(struct page *page); > > +#endif > > > > static inline void set_page_mte_tagged(struct page *page) > > { > > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > + /* Open code mte_tag_storage_enabled() */ > > + WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) && > > +!page_tag_storage_reserved(page)); > > +#endif > > /* > > * Ensure that the tags written prior to this function are visible > > * before the page flags update. > > diff --git a/arch/arm64/include/asm/mte_tag_storage.h > > b/arch/arm64/include/asm/mte_tag_storage.h > > index 7b3f6bff8e6f..09f1318d924e 100644 > > --- a/arch/arm64/include/asm/mte_tag_storage.h > > +++ b/arch/arm64/include/asm/mte_tag_storage.h > > @@ -5,6 +5,12 @@ > > #ifndef __ASM_MTE_TAG_STORAGE_H > > #define __ASM_MTE_TAG_STORAGE_H > > > > +#ifndef __ASSEMBLY__ > > + > > +#include > > + > > +#include > > + > > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE > > > > DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key); > > @@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void) > > } > > > > void mte_init_tag_storage(void); > > + > > +static inline bool alloc_requires_tag_storage(gfp_t gfp) > > +{ > > + return gfp & __GFP_TAGGED; > > +} > > +int reserve_tag_storage(struct page *page, int order, gfp_t gfp); > > +void free_tag_storage(struct page *page, int order); > > + > > +bool page_tag_storage_reserved(struct page *page); > > #else > > static inline bool tag_storage_enabled(void) > > { > > @@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void) > > static inline void mte_init_tag_storage(void) > > { > > } > > +static inline bool alloc_requires_tag_storage(struct page *page) > > This function should take a gfp_t to match the > CONFIG_ARM64_MTE_TAG_STORAGE case. Ah, yes, it should, nice catch, the compiler didn't throw an error. Will fix, thanks! Alex
Re: [PATCH v3] virtio_net: Support RX hash XDP hint
On Thu, 25 Jan 2024 18:19:12 +0800, Liang Chen wrote: > The RSS hash report is a feature that's part of the virtio specification. > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost > (still a work in progress as per [1]) support this feature. While the > capability to obtain the RSS hash has been enabled in the normal path, > it's currently missing in the XDP path. Therefore, we are introducing > XDP hints through kfuncs to allow XDP programs to access the RSS hash. > > 1. > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r > > Signed-off-by: Liang Chen > --- > drivers/net/virtio_net.c | 98 +++- > 1 file changed, 86 insertions(+), 12 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index d7ce4a1011ea..0c845f2223da 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -349,6 +349,12 @@ struct virtio_net_common_hdr { > }; > }; > > +struct virtnet_xdp_buff { > + struct xdp_buff xdp; > + u32 hash_value; > + u16 hash_report; > +}; > + > static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); > > static bool is_xdp_frame(void *ptr) > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp) > } > } > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff *virtnet_xdp, > + struct net_device *dev, > + struct virtio_net_hdr_v1_hash *hdr_hash) > +{ > + if (dev->features & NETIF_F_RXHASH) { > + virtnet_xdp->hash_value = __le32_to_cpu(hdr_hash->hash_value); > + virtnet_xdp->hash_report = __le16_to_cpu(hdr_hash->hash_report); Could we put the __leXX_to_cpu to virtnet_xdp_rx_hash? Other looks good to me. Reviewed-by: Xuan Zhuo Thanks. 
> + } > +} > + > static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff > *xdp, > struct net_device *dev, > unsigned int *xdp_xmit, > @@ -1199,9 +1215,10 @@ static struct sk_buff *receive_small_xdp(struct > net_device *dev, > unsigned int headroom = vi->hdr_len + header_offset; > struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset; > struct page *page = virt_to_head_page(buf); > + struct virtnet_xdp_buff virtnet_xdp; > struct page *xdp_page; > + struct xdp_buff *xdp; > unsigned int buflen; > - struct xdp_buff xdp; > struct sk_buff *skb; > unsigned int metasize = 0; > u32 act; > @@ -1233,17 +1250,20 @@ static struct sk_buff *receive_small_xdp(struct > net_device *dev, > page = xdp_page; > } > > - xdp_init_buff(&xdp, buflen, &rq->xdp_rxq); > - xdp_prepare_buff(&xdp, buf + VIRTNET_RX_PAD + vi->hdr_len, > + xdp = &virtnet_xdp.xdp; > + xdp_init_buff(xdp, buflen, &rq->xdp_rxq); > + xdp_prepare_buff(xdp, buf + VIRTNET_RX_PAD + vi->hdr_len, >xdp_headroom, len, true); > > - act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats); > + virtnet_xdp_save_rx_hash(&virtnet_xdp, dev, (void *)hdr); > + > + act = virtnet_xdp_handler(xdp_prog, xdp, dev, xdp_xmit, stats); > > switch (act) { > case XDP_PASS: > /* Recalculate length in case bpf program changed it */ > - len = xdp.data_end - xdp.data; > - metasize = xdp.data - xdp.data_meta; > + len = xdp->data_end - xdp->data; > + metasize = xdp->data - xdp->data_meta; > break; > > case XDP_TX: > @@ -1254,7 +1274,7 @@ static struct sk_buff *receive_small_xdp(struct > net_device *dev, > goto err_xdp; > } > > - skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len); > + skb = virtnet_build_skb(buf, buflen, xdp->data - buf, len); > if (unlikely(!skb)) > goto err; > > @@ -1591,10 +1611,11 @@ static struct sk_buff *receive_mergeable_xdp(struct > net_device *dev, > int num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers); > struct page *page = virt_to_head_page(buf); > int offset = buf - 
page_address(page); > + struct virtnet_xdp_buff virtnet_xdp; > unsigned int xdp_frags_truesz = 0; > struct sk_buff *head_skb; > unsigned int frame_sz; > - struct xdp_buff xdp; > + struct xdp_buff *xdp; > void *data; > u32 act; > int err; > @@ -1604,16 +1625,19 @@ static struct sk_buff *receive_mergeable_xdp(struct > net_device *dev, > if (unlikely(!data)) > goto err_xdp; > > - err = virtnet_build_xdp_buff_mrg(dev, vi, rq, &xdp, data, len, frame_sz, > + xdp = &virtnet_xdp.xdp; > + err = virtnet_build_xdp_buff_mrg(dev, vi, rq, xdp, data, len, frame_sz, >&num_buf, &xdp_frags_truesz, stats); > if (unlikely(err)) > goto err_xdp; > > - a
Re: [PATCH RFC v3 01/35] mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()
Hi, On Tue, Jan 30, 2024 at 09:56:10AM +0530, Anshuman Khandual wrote: > > > On 1/29/24 17:11, Alexandru Elisei wrote: > > Hi, > > > > On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote: > >> On 1/25/24 22:12, Alexandru Elisei wrote: > >>> Extend the usefulness of arch_alloc_page() by adding the gfp_flags > >>> parameter. > >> Although the change here is harmless in itself, it will definitely benefit > >> from some additional context explaining the rationale, taking into account > >> why-how arch_alloc_page() got added particularly for s390 platform and how > >> it's going to be used in the present proposal. > > arm64 will use it to reserve tag storage if the caller requested a tagged > > page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but > > I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename > > __GFP_ZEROTAGS to __GFP_TAGGED") [1]. > > > > [1] > > https://lore.kernel.org/lkml/20240125164256.4147-19-alexandru.eli...@arm.com/ > > Makes sense, but please do update the commit message explaining how > new gfp mask argument will be used to detect tagged page allocation > requests, further requiring tag storage allocation. Will do, thanks! Alex
Re: [PATCH RFC v3 04/35] mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA guard code"
Hi, On Tue, Jan 30, 2024 at 10:04:02AM +0530, Anshuman Khandual wrote: > > > On 1/29/24 17:16, Alexandru Elisei wrote: > > Hi, > > > > On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote: > >> > >> > >> On 1/25/24 22:12, Alexandru Elisei wrote: > >>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code") > >>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list > >>> because CMA is always allowed when __GFP_MOVABLE is set. > >>> > >>> With the introduction of the arch_alloc_cma() function, the above is not > >>> true anymore, so bring back the filter. > >> > >> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being > >> assigned to alloc_flags in gfp_to_alloc_flags_cma(). > > > > Can I add your Reviewed-by tag then? > > I think all these changes need to be reviewed in their entirety > even though some patches do look good on their own. For example > this patch depends on whether [PATCH 03/35] is acceptable or not. > > I would suggest separating out CMA patches which could be debated > and merged regardless of this series. Ah, I see, makes sense. Since basically all the core mm changes are there to enable dynamic tag storage for arm64, I'll hold on until the series stabilises before separating the core mm from the arm64 patches. Thanks, Alex
Re: [PATCH RFC v3 06/35] mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
Hi, On Tue, Jan 30, 2024 at 10:22:11AM +0530, Anshuman Khandual wrote: > > > On 1/29/24 17:21, Alexandru Elisei wrote: > > Hi, > > > > On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote: > >> > >> > >> On 1/25/24 22:12, Alexandru Elisei wrote: > >>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one > >>> after each cma_alloc() function call. This is done even though cma_alloc() > >>> can allocate an arbitrary number of CMA pages. When looking at > >>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls > >>> doesn't tell much with regards to how many CMA pages were allocated via > >>> cma_alloc() versus via the page allocator (regular allocation request or > >>> PCP lists refill). > >>> > >>> This can also be rather confusing to a user who isn't familiar with the > >>> code, since the unit of measurement for nr_free_cma is the number of > >>> pages, > >>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc() > >>> function calls. > >>> > >>> Let's make this consistent, and arguably more useful, by having > >>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, > >>> and > >>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to > >>> allocate. > >>> > >>> For users that wish to track the number of cma_alloc() calls, there are > >>> tracepoints for that already implemented. 
> >>> > >>> Signed-off-by: Alexandru Elisei > >>> --- > >>> mm/cma.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/mm/cma.c b/mm/cma.c > >>> index f49c95f8ee37..dbf7fe8cb1bd 100644 > >>> --- a/mm/cma.c > >>> +++ b/mm/cma.c > >>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned > >>> long count, > >>> pr_debug("%s(): returned %p\n", __func__, page); > >>> out: > >>> if (page) { > >>> - count_vm_event(CMA_ALLOC_SUCCESS); > >>> + count_vm_events(CMA_ALLOC_SUCCESS, count); > >>> cma_sysfs_account_success_pages(cma, count); > >>> } else { > >>> - count_vm_event(CMA_ALLOC_FAIL); > >>> + count_vm_events(CMA_ALLOC_FAIL, count); > >>> if (cma) > >>> cma_sysfs_account_fail_pages(cma, count); > >>> } > >> > >> Without getting into the merits of this patch - which is actually trying > >> to do > >> semantics change to /proc/vmstat, wondering how is this even related to > >> this > >> particular series ? If required this could be debated on it's on > >> separately. > > > > Having the number of CMA pages allocated and the number of CMA pages freed > > allows someone to infer how many tagged pages are in use at a given time: > > That should not be done in CMA which is a generic multi purpose allocator. Ah, ok. Let me rephrase that: Having the number of CMA pages allocated, the number of failed CMA page allocations and the number of freed CMA pages allows someone to infer how many CMA pages are in use at a given time. That's valuable information for software designers and system administrators, as it allows them to tune the number of CMA pages available in a system. Or put another way: what would you consider to be more useful? Knowing the number of cma_alloc()/cma_release() calls, or knowing the number of pages that cma_alloc()/cma_release() allocated or freed? > > > (allocated CMA pages - CMA pages allocated by drivers* - CMA pages > > released) * 32. 
That is valuable information for software and hardware > > designers. > > > > Besides that, for every iteration of the series, this has proven invaluable > > for discovering bugs with freeing and/or reserving tag storage pages. > > I am afraid that might not be enough justification for getting something > merged mainline. > > > > > *that would require userspace reading cma_alloc_success and > > cma_release_success before any tagged allocations are performed. > > While assuming that no other non-memory-tagged CMA based allocation amd free > call happens in the meantime ? That would be on real thin ice. > > I suppose arm64 tagged memory specific allocation or free related counters > need to be created on the caller side, including arch_free_pages_prepare(). I'll think about this. At the very least, I can add tracepoints. Thanks, Alex
[PATCH v3 0/5] PM: domains: Add helpers for multi PM domains to avoid open-coding
Rafael, my plan is to queue up this series via my pmdomain tree. Please let me know if you see any issues with that, especially around patch1. Updates in v3: - Added tested-by, reviewed-by and suggested-by tags. No other changes have been made. Updates in v2: - Ccing Daniel Baluta and Iuliana Prodan on the NXP remoteproc patches to request help with testing. - Fixed NULL pointer bug in patch1, pointed out by Nikunj. - Added some tested/reviewed-by tags. Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers. This series adds a pair of helper functions to manage the attach/detach of a device to its multiple PM domains. Moreover, a couple of drivers have been converted to use the new helpers as a proof of concept. Note 1) The changes in the drivers have only been compile tested, while the helpers have been tested along with a couple of local dummy drivers that I have hacked up to model both genpd providers and genpd consumers. Note 2) I was struggling to make up my mind on whether we should have a separate helper to attach all available power-domains described in DT, rather than providing "NULL" to dev_pm_domain_attach_list(). I decided not to, but please let me know if you prefer the other option. Note 3) For OPP integration, as a follow-up I am striving to make dev_pm_opp_attach_genpd() redundant. Instead I think we should move towards using dev_pm_opp_set_config()->_opp_set_required_devs(), which would allow us to use the helpers that the $subject series is adding. 
Kind regards Ulf Hansson Ulf Hansson (5): PM: domains: Add helper functions to attach/detach multiple PM domains remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list() remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list() remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list() media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec drivers/base/power/common.c | 134 +++ drivers/media/platform/qcom/venus/core.c | 12 +- drivers/media/platform/qcom/venus/core.h | 7 +- .../media/platform/qcom/venus/pm_helpers.c| 48 ++ drivers/remoteproc/imx_dsp_rproc.c| 82 + drivers/remoteproc/imx_rproc.c| 73 +--- drivers/remoteproc/qcom_q6v5_adsp.c | 160 -- include/linux/pm_domain.h | 38 + 8 files changed, 289 insertions(+), 265 deletions(-) -- 2.34.1
[PATCH v3 1/5] PM: domains: Add helper functions to attach/detach multiple PM domains
Attaching/detaching of a device to multiple PM domains has started to become a common operation for many drivers, typically during ->probe() and ->remove(). In most cases, this has led to lots of boilerplate code in the drivers. To fix up the situation, let's introduce a pair of helper functions, dev_pm_domain_attach|detach_list(), that drivers can use instead of the open-coding. Note that it seems reasonable to limit the support for these helpers to DT based platforms, as it's the only valid use case for now. Suggested-by: Daniel Baluta Tested-by: Bryan O'Donoghue Tested-by: Iuliana Prodan Signed-off-by: Ulf Hansson --- Changes in v3: - Added suggested-by and tested-by tags. Changes in v2: - Fix NULL pointer bug pointed out by Nikunj. --- drivers/base/power/common.c | 134 include/linux/pm_domain.h | 38 ++ 2 files changed, 172 insertions(+) diff --git a/drivers/base/power/common.c b/drivers/base/power/common.c index 44ec20918a4d..327d168dd37a 100644 --- a/drivers/base/power/common.c +++ b/drivers/base/power/common.c @@ -167,6 +167,115 @@ struct device *dev_pm_domain_attach_by_name(struct device *dev, } EXPORT_SYMBOL_GPL(dev_pm_domain_attach_by_name); +/** + * dev_pm_domain_attach_list - Associate a device with its PM domains. + * @dev: The device used to lookup the PM domains for. + * @data: The data used for attaching to the PM domains. + * @list: An out-parameter with an allocated list of attached PM domains. + * + * This function helps to attach a device to its multiple PM domains. The + * caller, which is typically a driver's probe function, may provide a list of + * names for the PM domains that we should try to attach the device to, but it + * may also provide an empty list, in case the attach should be done for all of + * the available PM domains. + * + * Callers must ensure proper synchronization of this function with power + * management callbacks. + * + * Returns the number of attached PM domains or a negative error code in case of + * a failure. 
Note that, to detach the list of PM domains, the driver shall call + * dev_pm_domain_detach_list(), typically during the remove phase. + */ +int dev_pm_domain_attach_list(struct device *dev, + const struct dev_pm_domain_attach_data *data, + struct dev_pm_domain_list **list) +{ + struct device_node *np = dev->of_node; + struct dev_pm_domain_list *pds; + struct device *pd_dev = NULL; + int ret, i, num_pds = 0; + bool by_id = true; + u32 pd_flags = data ? data->pd_flags : 0; + u32 link_flags = pd_flags & PD_FLAG_NO_DEV_LINK ? 0 : + DL_FLAG_STATELESS | DL_FLAG_PM_RUNTIME; + + if (dev->pm_domain) + return -EEXIST; + + /* For now this is limited to OF based platforms. */ + if (!np) + return 0; + + if (data && data->pd_names) { + num_pds = data->num_pd_names; + by_id = false; + } else { + num_pds = of_count_phandle_with_args(np, "power-domains", +"#power-domain-cells"); + } + + if (num_pds <= 0) + return 0; + + pds = devm_kzalloc(dev, sizeof(*pds), GFP_KERNEL); + if (!pds) + return -ENOMEM; + + pds->pd_devs = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_devs), + GFP_KERNEL); + if (!pds->pd_devs) + return -ENOMEM; + + pds->pd_links = devm_kcalloc(dev, num_pds, sizeof(*pds->pd_links), +GFP_KERNEL); + if (!pds->pd_links) + return -ENOMEM; + + if (link_flags && pd_flags & PD_FLAG_DEV_LINK_ON) + link_flags |= DL_FLAG_RPM_ACTIVE; + + for (i = 0; i < num_pds; i++) { + if (by_id) + pd_dev = dev_pm_domain_attach_by_id(dev, i); + else + pd_dev = dev_pm_domain_attach_by_name(dev, + data->pd_names[i]); + if (IS_ERR_OR_NULL(pd_dev)) { + ret = pd_dev ? PTR_ERR(pd_dev) : -ENODEV; + goto err_attach; + } + + if (link_flags) { + struct device_link *link; + + link = device_link_add(dev, pd_dev, link_flags); + if (!link) { + ret = -ENODEV; + goto err_link; + } + + pds->pd_links[i] = link; + } + + pds->pd_devs[i] = pd_dev; + } + + pds->num_pds = num_pds; + *list = pds; + return num_pds; + +err_link: + dev_pm_domain_detach(pd_dev, true); +err_attach: + while (--i >= 0) { + if (pds->pd_
[PATCH v3 2/5] remoteproc: imx_dsp_rproc: Convert to dev_pm_domain_attach|detach_list()
Let's avoid the boilerplate code to manage the multiple PM domain case, by converting to dev_pm_domain_attach|detach_list(). Cc: Bjorn Andersson Cc: Shawn Guo Cc: Sascha Hauer Cc: Daniel Baluta Cc: Tested-by: Iuliana Prodan Reviewed-by: Iuliana Prodan Reviewed-by: Mathieu Poirier Signed-off-by: Ulf Hansson --- Changes in v3: - Added reviewed-by and tested-by tags. Changes in v2: - None. --- drivers/remoteproc/imx_dsp_rproc.c | 82 -- 1 file changed, 9 insertions(+), 73 deletions(-) diff --git a/drivers/remoteproc/imx_dsp_rproc.c b/drivers/remoteproc/imx_dsp_rproc.c index a1c62d15f16c..d73727a5828a 100644 --- a/drivers/remoteproc/imx_dsp_rproc.c +++ b/drivers/remoteproc/imx_dsp_rproc.c @@ -103,12 +103,10 @@ enum imx_dsp_rp_mbox_messages { * @tx_ch: mailbox tx channel handle * @rx_ch: mailbox rx channel handle * @rxdb_ch: mailbox rx doorbell channel handle - * @pd_dev: power domain device - * @pd_dev_link: power domain device link + * @pd_list: power domain list * @ipc_handle: System Control Unit ipc handle * @rproc_work: work for processing virtio interrupts * @pm_comp: completion primitive to sync for suspend response - * @num_domains: power domain number * @flags: control flags */ struct imx_dsp_rproc { @@ -121,12 +119,10 @@ struct imx_dsp_rproc { struct mbox_chan*tx_ch; struct mbox_chan*rx_ch; struct mbox_chan*rxdb_ch; - struct device **pd_dev; - struct device_link **pd_dev_link; + struct dev_pm_domain_list *pd_list; struct imx_sc_ipc *ipc_handle; struct work_struct rproc_work; struct completion pm_comp; - int num_domains; u32 flags; }; @@ -955,74 +951,14 @@ static const struct rproc_ops imx_dsp_rproc_ops = { static int imx_dsp_attach_pm_domains(struct imx_dsp_rproc *priv) { struct device *dev = priv->rproc->dev.parent; - int ret, i; - - priv->num_domains = of_count_phandle_with_args(dev->of_node, - "power-domains", - "#power-domain-cells"); - - /* If only one domain, then no need to link the device */ - if (priv->num_domains <= 1) - return 0; - -
priv->pd_dev = devm_kmalloc_array(dev, priv->num_domains, - sizeof(*priv->pd_dev), - GFP_KERNEL); - if (!priv->pd_dev) - return -ENOMEM; - - priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_domains, - sizeof(*priv->pd_dev_link), - GFP_KERNEL); - if (!priv->pd_dev_link) - return -ENOMEM; - - for (i = 0; i < priv->num_domains; i++) { - priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i); - if (IS_ERR(priv->pd_dev[i])) { - ret = PTR_ERR(priv->pd_dev[i]); - goto detach_pm; - } - - /* -* device_link_add will check priv->pd_dev[i], if it is -* NULL, then will break. -*/ - priv->pd_dev_link[i] = device_link_add(dev, - priv->pd_dev[i], - DL_FLAG_STATELESS | - DL_FLAG_PM_RUNTIME); - if (!priv->pd_dev_link[i]) { - dev_pm_domain_detach(priv->pd_dev[i], false); - ret = -EINVAL; - goto detach_pm; - } - } - - return 0; - -detach_pm: - while (--i >= 0) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return ret; -} - -static int imx_dsp_detach_pm_domains(struct imx_dsp_rproc *priv) -{ - int i; + int ret; - if (priv->num_domains <= 1) + /* A single PM domain is already attached. */ + if (dev->pm_domain) return 0; - for (i = 0; i < priv->num_domains; i++) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return 0; + ret = dev_pm_domain_attach_list(dev, NULL, &priv->pd_list); + return ret < 0 ? ret : 0; } /** @@ -1154,7 +1090,7 @@ static int imx_dsp_rproc_probe(struct platform_device *pdev) return 0; err_detach_domains: - imx_dsp_detach_pm_domains(priv); + dev_pm_domain_deta
[PATCH v3 3/5] remoteproc: imx_rproc: Convert to dev_pm_domain_attach|detach_list()
Let's avoid the boilerplate code to manage the multiple PM domain case, by converting to dev_pm_domain_attach|detach_list(). Cc: Bjorn Andersson Cc: Shawn Guo Cc: Sascha Hauer Cc: Daniel Baluta Cc: Tested-by: Iuliana Prodan Reviewed-by: Iuliana Prodan Reviewed-by: Mathieu Poirier Signed-off-by: Ulf Hansson --- Changes in v3: - Added reviewed-by and tested-by tags. Changes in v2: - None. --- drivers/remoteproc/imx_rproc.c | 73 +- 1 file changed, 9 insertions(+), 64 deletions(-) diff --git a/drivers/remoteproc/imx_rproc.c b/drivers/remoteproc/imx_rproc.c index 8bb293b9f327..3161f14442bc 100644 --- a/drivers/remoteproc/imx_rproc.c +++ b/drivers/remoteproc/imx_rproc.c @@ -92,7 +92,6 @@ struct imx_rproc_mem { static int imx_rproc_xtr_mbox_init(struct rproc *rproc); static void imx_rproc_free_mbox(struct rproc *rproc); -static int imx_rproc_detach_pd(struct rproc *rproc); struct imx_rproc { struct device *dev; @@ -113,10 +112,8 @@ struct imx_rproc { u32 rproc_pt; /* partition id */ u32 rsrc_id;/* resource id */ u32 entry; /* cpu start address */ - int num_pd; u32 core_index; - struct device **pd_dev; - struct device_link **pd_dev_link; + struct dev_pm_domain_list *pd_list; }; static const struct imx_rproc_att imx_rproc_att_imx93[] = { @@ -853,7 +850,7 @@ static void imx_rproc_put_scu(struct rproc *rproc) return; if (imx_sc_rm_is_resource_owned(priv->ipc_handle, priv->rsrc_id)) { - imx_rproc_detach_pd(rproc); + dev_pm_domain_detach_list(priv->pd_list); return; } @@ -880,72 +877,20 @@ static int imx_rproc_partition_notify(struct notifier_block *nb, static int imx_rproc_attach_pd(struct imx_rproc *priv) { struct device *dev = priv->dev; - int ret, i; - - /* -* If there is only one power-domain entry, the platform driver framework -* will handle it, no need handle it in this driver.
-*/ - priv->num_pd = of_count_phandle_with_args(dev->of_node, "power-domains", - "#power-domain-cells"); - if (priv->num_pd <= 1) - return 0; - - priv->pd_dev = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev), GFP_KERNEL); - if (!priv->pd_dev) - return -ENOMEM; - - priv->pd_dev_link = devm_kmalloc_array(dev, priv->num_pd, sizeof(*priv->pd_dev_link), - GFP_KERNEL); - - if (!priv->pd_dev_link) - return -ENOMEM; - - for (i = 0; i < priv->num_pd; i++) { - priv->pd_dev[i] = dev_pm_domain_attach_by_id(dev, i); - if (IS_ERR(priv->pd_dev[i])) { - ret = PTR_ERR(priv->pd_dev[i]); - goto detach_pd; - } - - priv->pd_dev_link[i] = device_link_add(dev, priv->pd_dev[i], DL_FLAG_STATELESS | - DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE); - if (!priv->pd_dev_link[i]) { - dev_pm_domain_detach(priv->pd_dev[i], false); - ret = -EINVAL; - goto detach_pd; - } - } - - return 0; - -detach_pd: - while (--i >= 0) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return ret; -} - -static int imx_rproc_detach_pd(struct rproc *rproc) -{ - struct imx_rproc *priv = rproc->priv; - int i; + int ret; + struct dev_pm_domain_attach_data pd_data = { + .pd_flags = PD_FLAG_DEV_LINK_ON, + }; /* * If there is only one power-domain entry, the platform driver framework * will handle it, no need handle it in this driver. */ - if (priv->num_pd <= 1) + if (dev->pm_domain) return 0; - for (i = 0; i < priv->num_pd; i++) { - device_link_del(priv->pd_dev_link[i]); - dev_pm_domain_detach(priv->pd_dev[i], false); - } - - return 0; + ret = dev_pm_domain_attach_list(dev, &pd_data, &priv->pd_list); + return ret < 0 ? ret : 0; } static int imx_rproc_detect_mode(struct imx_rproc *priv) -- 2.34.1
[PATCH v3 4/5] remoteproc: qcom_q6v5_adsp: Convert to dev_pm_domain_attach|detach_list()
Let's avoid some of the boilerplate code to manage the various PM domain cases, by converting to dev_pm_domain_attach|detach_list(). As a part of the conversion, we are moving over to use device_links, which simplifies the runtime PM support too. Moreover, while attaching let's trust that an already attached single PM domain is the correct one. Cc: Mathieu Poirier Cc: Bjorn Andersson Cc: Konrad Dybcio Cc: Signed-off-by: Ulf Hansson --- Changes in v3: - None. Changes in v2: - None. --- drivers/remoteproc/qcom_q6v5_adsp.c | 160 +--- 1 file changed, 73 insertions(+), 87 deletions(-) diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c b/drivers/remoteproc/qcom_q6v5_adsp.c index 6c67514cc493..93f9a1537ec6 100644 --- a/drivers/remoteproc/qcom_q6v5_adsp.c +++ b/drivers/remoteproc/qcom_q6v5_adsp.c @@ -55,8 +55,6 @@ #define QDSP6SS_CORE_CBCR 0x20 #define QDSP6SS_SLEEP_CBCR 0x3c -#define QCOM_Q6V5_RPROC_PROXY_PD_MAX 3 - #define LPASS_BOOT_CORE_START BIT(0) #define LPASS_BOOT_CMD_START BIT(0) #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0 @@ -74,7 +72,8 @@ struct adsp_pil_data { const char **clk_ids; int num_clks; - const char **proxy_pd_names; + const char **pd_names; + unsigned int num_pds; const char *load_state; }; @@ -110,8 +109,7 @@ struct qcom_adsp { size_t mem_size; bool has_iommu; - struct device *proxy_pds[QCOM_Q6V5_RPROC_PROXY_PD_MAX]; - size_t proxy_pd_count; + struct dev_pm_domain_list *pd_list; struct qcom_rproc_glink glink_subdev; struct qcom_rproc_ssr ssr_subdev; @@ -120,98 +118,92 @@ struct qcom_adsp { int (*shutdown)(struct qcom_adsp *adsp); }; -static int qcom_rproc_pds_attach(struct device *dev, struct qcom_adsp *adsp, -const char **pd_names) +static int qcom_rproc_pds_attach(struct qcom_adsp *adsp, const char **pd_names, +unsigned int num_pds) { - struct device **devs = adsp->proxy_pds; - size_t num_pds = 0; + struct device *dev = adsp->dev; + struct dev_pm_domain_attach_data pd_data = { + .pd_names = pd_names, + .num_pd_names = num_pds, + }; int ret; - int i;
- - if (!pd_names) - return 0; /* Handle single power domain */ - if (dev->pm_domain) { - devs[0] = dev; - pm_runtime_enable(dev); - return 1; - } + if (dev->pm_domain) + goto out; - while (pd_names[num_pds]) - num_pds++; + if (!pd_names) + return 0; - if (num_pds > ARRAY_SIZE(adsp->proxy_pds)) - return -E2BIG; + ret = dev_pm_domain_attach_list(dev, &pd_data, &adsp->pd_list); + if (ret < 0) + return ret; - for (i = 0; i < num_pds; i++) { - devs[i] = dev_pm_domain_attach_by_name(dev, pd_names[i]); - if (IS_ERR_OR_NULL(devs[i])) { - ret = PTR_ERR(devs[i]) ? : -ENODATA; - goto unroll_attach; - } - } +out: + pm_runtime_enable(dev); + return 0; +} - return num_pds; +static void qcom_rproc_pds_detach(struct qcom_adsp *adsp) +{ + struct device *dev = adsp->dev; + struct dev_pm_domain_list *pds = adsp->pd_list; -unroll_attach: - for (i--; i >= 0; i--) - dev_pm_domain_detach(devs[i], false); + dev_pm_domain_detach_list(pds); - return ret; + if (dev->pm_domain || pds) + pm_runtime_disable(adsp->dev); } -static void qcom_rproc_pds_detach(struct qcom_adsp *adsp, struct device **pds, - size_t pd_count) +static int qcom_rproc_pds_enable(struct qcom_adsp *adsp) { struct device *dev = adsp->dev; - int i; + struct dev_pm_domain_list *pds = adsp->pd_list; + int ret, i = 0; - /* Handle single power domain */ - if (dev->pm_domain && pd_count) { - pm_runtime_disable(dev); - return; - } + if (!dev->pm_domain && !pds) + return 0; - for (i = 0; i < pd_count; i++) - dev_pm_domain_detach(pds[i], false); -} + if (dev->pm_domain) + dev_pm_genpd_set_performance_state(dev, INT_MAX); -static int qcom_rproc_pds_enable(struct qcom_adsp *adsp, struct device **pds, -size_t pd_count) -{ - int ret; - int i; - - for (i = 0; i < pd_count; i++) { - dev_pm_genpd_set_performance_state(pds[i], INT_MAX); - ret = pm_runtime_resume_and_get(pds[i]); - if (ret < 0) { - dev_pm_genpd_set_performance_state(pds[i], 0); - goto unroll_pd_votes; - } + while (pds && i < pds->num_pds)
[PATCH v3 5/5] media: venus: Convert to dev_pm_domain_attach|detach_list() for vcodec
Let's avoid some of the boilerplate code to manage the vcodec PM domains, by converting to dev_pm_domain_attach|detach_list(). Cc: Mauro Carvalho Chehab Cc: Stanimir Varbanov Cc: Vikash Garodia Cc: Bjorn Andersson Cc: Konrad Dybcio Cc: Tested-by: Bryan O'Donoghue Reviewed-by: Bryan O'Donoghue Signed-off-by: Ulf Hansson --- Changes in v3: - None. Changes in v2: - Added reviewed-by and tested-by tags. --- drivers/media/platform/qcom/venus/core.c | 12 +++-- drivers/media/platform/qcom/venus/core.h | 7 ++- .../media/platform/qcom/venus/pm_helpers.c| 48 +++ 3 files changed, 26 insertions(+), 41 deletions(-) diff --git a/drivers/media/platform/qcom/venus/core.c b/drivers/media/platform/qcom/venus/core.c index a712dd4f02a5..ce206b709754 100644 --- a/drivers/media/platform/qcom/venus/core.c +++ b/drivers/media/platform/qcom/venus/core.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -114,7 +115,8 @@ static void venus_sys_error_handler(struct work_struct *work) pm_runtime_put_sync(core->dev); for (i = 0; i < max_attempts; i++) { - if (!core->pmdomains[0] || !pm_runtime_active(core->pmdomains[0])) + if (!core->pmdomains || + !pm_runtime_active(core->pmdomains->pd_devs[0])) break; usleep_range(1000, 1500); } @@ -705,7 +707,7 @@ static const struct venus_resources sdm845_res_v2 = { .vcodec0_clks = { "vcodec0_core", "vcodec0_bus" }, .vcodec1_clks = { "vcodec1_core", "vcodec1_bus" }, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0", "vcodec1" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0", "vcodec1" }, .vcodec_pmdomains_num = 3, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 2, @@ -754,7 +756,7 @@ static const struct venus_resources sc7180_res = { .clks_num = 3, .vcodec0_clks = { "vcodec0_core", "vcodec0_bus" }, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = 
(const char *[]) { "cx", NULL }, .vcodec_num = 1, @@ -811,7 +813,7 @@ static const struct venus_resources sm8250_res = { .resets_num = 2, .vcodec0_clks = { "vcodec0_core" }, .vcodec_clks_num = 1, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "mx", NULL }, .vcodec_num = 1, @@ -870,7 +872,7 @@ static const struct venus_resources sc7280_res = { .clks_num = 3, .vcodec0_clks = {"vcodec_core", "vcodec_bus"}, .vcodec_clks_num = 2, - .vcodec_pmdomains = { "venus", "vcodec0" }, + .vcodec_pmdomains = (const char *[]) { "venus", "vcodec0" }, .vcodec_pmdomains_num = 2, .opp_pmdomain = (const char *[]) { "cx", NULL }, .vcodec_num = 1, diff --git a/drivers/media/platform/qcom/venus/core.h b/drivers/media/platform/qcom/venus/core.h index 4a633261ece4..7ef341bf21cc 100644 --- a/drivers/media/platform/qcom/venus/core.h +++ b/drivers/media/platform/qcom/venus/core.h @@ -25,7 +25,6 @@ #define VIDC_CLKS_NUM_MAX 4 #define VIDC_VCODEC_CLKS_NUM_MAX 2 -#define VIDC_PMDOMAINS_NUM_MAX 3 #define VIDC_RESETS_NUM_MAX2 extern int venus_fw_debug; @@ -72,7 +71,7 @@ struct venus_resources { const char * const vcodec0_clks[VIDC_VCODEC_CLKS_NUM_MAX]; const char * const vcodec1_clks[VIDC_VCODEC_CLKS_NUM_MAX]; unsigned int vcodec_clks_num; - const char * const vcodec_pmdomains[VIDC_PMDOMAINS_NUM_MAX]; + const char **vcodec_pmdomains; unsigned int vcodec_pmdomains_num; const char **opp_pmdomain; unsigned int vcodec_num; @@ -134,7 +133,7 @@ struct venus_format { * @video_path: an interconnect handle to video to/from memory path * @cpucfg_path: an interconnect handle to cpu configuration path * @has_opp_table: does OPP table exist - * @pmdomains: an array of pmdomains struct device pointers + * @pmdomains: a pointer to a list of pmdomains * @opp_dl_venus: an device-link for device OPP * @opp_pmdomain: an OPP power-domain * @resets: an array of reset signals @@ -187,7 +186,7 @@ 
struct venus_core { struct icc_path *video_path; struct icc_path *cpucfg_path; bool has_opp_table; - struct device *pmdomains[VIDC_PMDOMAINS_NUM_MAX]; + struct dev_pm_domain_list *pmdomains; struct device_link *opp_dl_venus; struct device *opp_pmdomain; struct reset_control *resets[VIDC_RESETS_NUM_MAX]; diff --git a/drivers/media/platform/qcom/venus/pm_helpers.c b/drivers/media/platform/qcom
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Mon, 29 Jan 2024 09:29:07 -0800 Beau Belgrave wrote: > On Fri, Jan 26, 2024 at 03:04:45PM -0500, Steven Rostedt wrote: > > On Fri, 26 Jan 2024 11:10:07 -0800 > > Beau Belgrave wrote: > > > > > > OK, so the each different event has suffixed name. But this will > > > > introduce non C-variable name. > > > > > > > > Steve, do you think your library can handle these symbols? It will > > > > be something like "event:[1]" as the event name. > > > > Personally I like "event.1" style. (of course we need to ensure the > > > > user given event name is NOT including such suffix numbers) > > > > > > > > > > Just to clarify around events including a suffix number. This is why > > > multi-events use "user_events_multi" system name and the single-events > > > using just "user_events". > > > > > > Even if a user program did include a suffix, the suffix would still get > > > appended. An example is "test" vs "test:[0]" using multi-format would > > > result in two tracepoints ("test:[0]" and "test:[0]:[1]" respectively > > > (assuming these are the first multi-events on the system). > > > > > > I'm with you, we really don't want any spoofing or squatting possible. > > > By using different system names and always appending the suffix I > > > believe covers this. > > > > > > Looking forward to hearing Steven's thoughts on this as well. > > > > I'm leaning towards Masami's suggestion to use dots, as that won't conflict > > with special characters from bash, as '[' and ']' do. > > > > Thanks, yeah ideally we wouldn't use special characters. > > I'm not picky about this. However, I did want something that clearly > allowed a glob pattern to find all versions of a given register name of > user_events by user programs that record. The dot notation will pull in > more than expected if dotted namespace style names are used. > > An example is "Asserts" and "Asserts.Verbose" from different programs. 
> If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". If we use a dot for the suffix number, we can prohibit users from using it in their names. They can still use '_' (or change the group name?) I'm just concerned about whether the name can be parsed by existing tools, since ':' is used as a separator between the group and event name in some cases (e.g. tracefs "set_event" uses it, so trace-cmd and perf use it too.) > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > doesn't work if the number is higher, like 128. If we ever decide to > change the ID from an integer to say hex to save space, these globs > would break. Hmm, why can't we use a regexp? And if we limit the number of events to 1000 for each same-name event, we can use fixed-width numbers, like Asserts.[0-9][0-9][0-9] Thank you, > > Is there some scheme that fits the C-variable name that addresses the > above scenarios? Brackets gave me a simple glob that seemed to prevent a lot of this ("Asserts.\[*\]" in this case). > > Are we confident that we always want to represent the ID as a base-10 > integer vs a base-16 integer? The suffix will be ABI to ensure recording > programs can find their events easily. > > Thanks, > -Beau > > > -- Steve -- Masami Hiramatsu (Google)
Re: [RESEND PATCH v2] modules: wait do_free_init correctly
On Tue, Jan 30, 2024 at 09:40:38AM +0800, Changbin Du wrote: > On Mon, Jan 29, 2024 at 09:53:58AM -0800, Luis Chamberlain wrote: > > On Mon, Jan 29, 2024 at 10:03:04AM +0800, Changbin Du wrote: > > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves > > > do_free_init() into a global workqueue instead of call_rcu(). So now > > > rcu_barrier() can not ensure that do_free_init has completed. We should > > > wait for it via flush_work(). > > > > > > Without this fix, we could still encounter false positive reports in > > > W+X checking, and the rcu synchronization is unnecessary. > > > > You didn't answer my question, which should be documented in the commit log. > > > > Does this mean we never freed module init because of this? If so then > > your commit log should clearly explain that. It should also explain that > > if true (you have to verify) then it means we were no longer saving > > the memory we wished to save, and that is important for distributions > > which do want to save anything on memory. You may want to do a general > > estimate on how much that means these days on any desktop / server. > > Actually, I have explained it in the commit msg. It's not about saving memory. The > synchronization here is just to ensure the module init's been freed before > doing W+X checking. The problem is that the current implementation is wrong, > rcu_barrier() cannot guarantee that. So we can encounter false positive > reports. > But anyway, the module init will be freed, and it's just a timing-related > issue. Your description here is better than the commit log. Luis
Re: [PATCH v9] bus: mhi: host: Add tracing support
On Tue, 30 Jan 2024 13:41:52 +0530 Manivannan Sadhasivam wrote: > So same trace will get printed for both mhi_channel_command_start() and > mhi_channel_command_end()? The trace output will also include the tracepoint name. That is, it will have the same content but will be preceded with: mhi_channel_command_start: ... mhi_channel_command_end: ... -- Steve
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 01:12:05 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 00:43, Linus Torvalds > wrote: > > > > I'll go back to bed, but I think the fix is something trivial like this: > > Almost. > > > + result = ERR_PTR(ENOENT); > > That needs a '-' in front of the ENOENT, otherwise you have a positive > error number and things go wrong very quickly. > > And that does indeed fix the lookup problem, but you end up with the > same problem later when you do the eventfs_remove_dir(). Again the > eventfs data structure changes, but we don't have a reliable dentry > that we can invalidate. > > The dentry cache is just very good at caching those old dentries, and > the interface for eventfs_create_dir() and eventfs_remove_dir() is > just not great. > > If those did an actual path lookup (like eventfs_create_events_dir() > does), we'd have the dentry, and it's trivial to get from dentry to > eventfs_inode. > > But going the other way is the broken thing because of how the > dentries are just temporary caches. > > I suspect the solution is to make eventfs_create_dir() do the same as > the events directory case does, and actually pin the directory dentry > and save it off. I rather not have the create do that because that happens for every event directory. On my machine that's: # find events -type d | wc -l 2102 And that's regardless if tracefs is mounted or not. And that's how many are also created with every instance creation. And doesn't pinning the dentry also require it to be positive? That is, have a full inode allocated with it? I may try something that will still let me get rid of the ei->dentry. > > Oh well. At least I understand what the problem is. Yep! > Now I'm going to try to go back to sleep. Hope you sleep better than I did. We just bought a new mattress, which felt great in the store, but now after 4 or 5 hours sleeping in it, I wake up with a sore back and have to sleep on the couch. 
And we bought the most expensive "best" mattress in the store :-p Oh, and NY State law has it that you can't return it once it is delivered. -- Steve
[kees:devel/overflow/sanitizers] [overflow] 660787b56e: UBSAN:signed-integer-overflow_in_lib/test_memcat_p.c
Hello, kernel test robot noticed "UBSAN:signed-integer-overflow_in_lib/test_memcat_p.c" on: commit: 660787b56e6e97ddc34c7882cbe1228f4040ef74 ("overflow: Reintroduce signed and unsigned overflow sanitizers") https://git.kernel.org/cgit/linux/kernel/git/kees/linux.git devel/overflow/sanitizers in testcase: boot compiler: gcc-11 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) we noticed that this commit reintroduces the "signed and unsigned overflow sanitizers"; there is the following config diff between the parent and this commit in our builds: --- ea804316c9db5148d2bb0c1f40f70d7a83404638/.config 2024-01-26 22:09:35.046768122 +0800 +++ 660787b56e6e97ddc34c7882cbe1228f4040ef74/.config 2024-01-26 19:53:20.693434428 +0800 @@ -6706,6 +6706,7 @@ CONFIG_UBSAN_BOUNDS_STRICT=y CONFIG_UBSAN_SHIFT=y # CONFIG_UBSAN_DIV_ZERO is not set CONFIG_UBSAN_UNREACHABLE=y +CONFIG_UBSAN_SIGNED_WRAP=y # CONFIG_UBSAN_BOOL is not set # CONFIG_UBSAN_ENUM is not set # CONFIG_UBSAN_ALIGNMENT is not set while testing, we observed the following differences (and common parts) between the parent and this commit: ea804316c9db5148 660787b56e6e97ddc34c7882cbe --- fail:runs %reproduction fail:runs | | | 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_arch/x86/kernel/cpu/intel.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_arch/x86/kernel/cpu/topology.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_fs/namespace.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_fs/read_write.c 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_include/linux/rhashtable.h 6:6 0% 6:6 dmesg.UBSAN:shift-out-of-bounds_in_include/net/tcp.h :6 100% 6:6 dmesg.UBSAN:signed-integer-overflow_in_lib/test_memcat_p.c this looks like an issue uncovered by the commit. But since it's hard for us to backport this commit to each commit while bisecting, we cannot capture the real first bad commit. Not sure if this report could help somebody investigate the real issue?
If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-lkp/202401302219.db90a6d5-oliver.s...@intel.com [ 42.894536][T1] [ cut here ] [ 42.895474][T1] UBSAN: signed-integer-overflow in lib/test_memcat_p.c:47:10 [ 42.897128][T1] 6570 * 725861 cannot be represented in type 'int' [ 42.898391][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-rc1-7-g660787b56e6e #1 [ 42.899962][T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 42.901661][T1] Call Trace: [ 42.902009][ T1] dump_stack_lvl (??:?) [ 42.902009][ T1] dump_stack (??:?) [ 42.902009][ T1] handle_overflow (ubsan.c:?) [ 42.902009][ T1] ? kmemleak_alloc (??:?) [ 42.902009][ T1] ? kmalloc_trace (??:?) [ 42.902009][ T1] ? test_memcat_p_init (test_memcat_p.c:?) [ 42.902009][ T1] __ubsan_handle_mul_overflow (??:?) [ 42.902009][ T1] test_memcat_p_init (test_memcat_p.c:?) [ 42.902009][ T1] ? trace_hardirqs_on (??:?) [ 42.902009][ T1] ? _raw_spin_unlock_irqrestore (??:?) [ 42.902009][ T1] ? test_string_helpers_init (test_memcat_p.c:?) [ 42.902009][ T1] do_one_initcall (??:?) [ 42.902009][ T1] ? parameq (??:?) [ 42.902009][ T1] ? parse_args (??:?) [ 42.902009][ T1] do_initcalls (main.c:?) [ 42.902009][ T1] ? rdinit_setup (main.c:?) [ 42.902009][ T1] kernel_init_freeable (main.c:?) [ 42.902009][ T1] ? rest_init (main.c:?) [ 42.902009][ T1] kernel_init (main.c:?) [ 42.902009][ T1] ? schedule_tail (??:?) [ 42.902009][ T1] ret_from_fork (??:?) [ 42.902009][ T1] ? rest_init (main.c:?) [ 42.902009][ T1] ret_from_fork_asm (??:?) [ 42.902009][ T1] entry_INT80_32 (??:?) 
[ 42.924183][T1] ---[ end trace ]--- [ 42.925743][T1] test_memcat_p: test passed The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20240130/202401302219.db90a6d5-oliver.s...@intel.com -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions
Hi Vincent, Thanks for update the code. On Mon, 29 Jan 2024 14:27:58 + Vincent Donnefort wrote: > In preparation for allowing the user-space to map a ring-buffer, add > a set of mapping functions: > > ring_buffer_{map,unmap}() > ring_buffer_map_fault() > > And controls on the ring-buffer: > > ring_buffer_map_get_reader() /* swap reader and head */ > > Mapping the ring-buffer also involves: > > A unique ID for each subbuf of the ring-buffer, currently they are > only identified through their in-kernel VA. > > A meta-page, where are stored ring-buffer statistics and a > description for the current reader > > The linear mapping exposes the meta-page, and each subbuf of the > ring-buffer, ordered following their unique ID, assigned during the > first mapping. > > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer > size will remain unmodified and the splice enabling functions will in > reality simply memcpy the data instead of swapping subbufs. > > Signed-off-by: Vincent Donnefort > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h > index fa802db216f9..0841ba8bab14 100644 > --- a/include/linux/ring_buffer.h > +++ b/include/linux/ring_buffer.h > @@ -6,6 +6,8 @@ > #include > #include > > +#include > + > struct trace_buffer; > struct ring_buffer_iter; > > @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct > hlist_node *node); > #define trace_rb_cpu_prepare NULL > #endif > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu); > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); > +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, > +unsigned long pgoff); > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); > #endif /* _LINUX_RING_BUFFER_H */ > diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h > new file mode 100644 > index ..d4bb67430719 > --- /dev/null > +++ b/include/uapi/linux/trace_mmap.h > @@ -0,0 +1,43 @@ > +/* 
SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +#ifndef _TRACE_MMAP_H_ > +#define _TRACE_MMAP_H_ > + > +#include > + > +/** > + * struct trace_buffer_meta - Ring-buffer Meta-page description > + * @meta_page_size: Size of this meta-page. > + * @meta_struct_len: Size of this structure. > + * @subbuf_size: Size of each subbuf, including the header. > + * @nr_subbufs: Number of subbfs in the ring-buffer. > + * @reader.lost_events: Number of events lost at the time of the reader > swap. > + * @reader.id: subbuf ID of the current reader. From 0 to > @nr_subbufs - 1 > + * @reader.read: Number of bytes read on the reader subbuf. > + * @entries: Number of entries in the ring-buffer. > + * @overrun: Number of entries lost in the ring-buffer. > + * @read:Number of entries that have been read. > + */ > +struct trace_buffer_meta { > + __u32 meta_page_size; > + __u32 meta_struct_len; > + > + __u32 subbuf_size; > + __u32 nr_subbufs; > + > + struct { > + __u64 lost_events; > + __u32 id; > + __u32 read; > + } reader; > + > + __u64 flags; > + > + __u64 entries; > + __u64 overrun; > + __u64 read; > + > + __u64 Reserved1; > + __u64 Reserved2; > +}; > + > +#endif /* _TRACE_MMAP_H_ */ > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 8179e0a8984e..081065e76d4a 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -338,6 +338,7 @@ struct buffer_page { > local_t entries; /* entries on this page */ > unsigned longreal_end; /* real end of data */ > unsigned order; /* order of the page */ > + u32 id;/* ID for external mapping */ > struct buffer_data_page *page; /* Actual data page */ > }; > > @@ -484,6 +485,12 @@ struct ring_buffer_per_cpu { > u64 read_stamp; > /* pages removed since last reset */ > unsigned long pages_removed; > + > + int mapped; > + struct mutexmapping_lock; > + unsigned long *subbuf_ids;/* ID to addr */ > + struct trace_buffer_meta*meta_page; > + > /* ring buffer pages to update, > 0 to add, < 0 to 
remove */ > longnr_pages_to_update; > struct list_headnew_pages; /* new pages to add */ > @@ -1548,6 +1555,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, > long nr_pages, int cpu) > init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); > init_waitqueue_head(&cpu_buffer->irq_work.waiters); > init_waitqueue_head(&cpu_buffer->irq_work.full_wa
Re: [RFC PATCH 0/7] Introduce cache_is_aliasing() to fix DAX regression
On 2024-01-29 16:22, Dan Williams wrote: Mathieu Desnoyers wrote: This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using dax over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces cache_is_aliasing() with new Kconfig options: * ARCH_HAS_CACHE_ALIASING * ARCH_HAS_CACHE_ALIASING_DYNAMIC and implements it for all architectures. The "DYNAMIC" implementation implements cache_is_aliasing() as a runtime check, which is what is needed on architectures like 32-bit ARMV6 and ARMV6K. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Feedback is welcome, Hi Mathieu, this looks good overall, just some quibbling about the ordering. Thanks for having a look ! I would introduce dax_is_supported() with the current overly broad interpretation of "!(ARM || MIPS || SPARC)" using IS_ENABLED(), then fixup the filesystems to use the new helper, and finally go back and convert dax_is_supported() to use cache_is_aliasing() internally. Will do. Separately, it is not clear to me why ARCH_HAS_CACHE_ALIASING_DYNAMIC needs to exist. As long as all paths that care are calling cache_is_aliasing() then whether it is dynamic or not is something only the compiler cares about. If those dynamic archs do not want to pay the .text size increase they can always do CONFIG_FS_DAX=n, right? Good point. It will help reduce complexity and improve test coverage. 
I also intend to rename "cache_is_aliasing()" to "dcache_is_aliasing()", so if we introduce an "icache_is_aliasing()" in the future, it won't be confusing. Having aliasing icache-dcache but not dcache-dcache seems to be fairly common. So basically: If an arch selects ARCH_HAS_CACHE_ALIASING, it implements dcache_is_aliasing() (for now), and eventually we can implement icache_is_aliasing() as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 7/7] xfs: Use dax_is_supported()
On 2024-01-29 21:38, Dave Chinner wrote: On Mon, Jan 29, 2024 at 04:06:31PM -0500, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Where's the rest of this patchset? I have no idea what dax_is_supported() actually does, how it interacts with CONFIG_FS_DAX, etc. If you are changing anything to do with FSDAX, then cc-ing the -entire- patchset to linux-fsdevel is absolutely necessary so the entire patchset lands in our inboxes and not just a random patch from the middle of a bigger change. Sorry, I will Cc linux-fsdevel on all patches for the next round. Meanwhile you can find the whole series on lore: https://lore.kernel.org/lkml/20240129210631.193493-1-mathieu.desnoy...@efficios.com/ [...] Assuming that I understand what dax_is_supported() is doing, this change isn't right. We're just setting the DAX configuration flags from the mount options here; we don't validate them until we've parsed all options and rejected conflicting ones. We validate whether the options are appropriate for the underlying hardware configuration later in the mount process. dax=always suitability is checked in xfs_setup_dax_always(), called later in the mount process when we have enough context and support to open storage devices and check them for DAX support. If the hardware does not support DAX then we simply turn off DAX support; we do not reject the mount as this change does. dax=inode and dax=never are valid options on all configurations, even those without FSDAX support or with hardware that is not capable of using DAX. dax=inode only affects how an inode is instantiated in cache - if the inode has a flag that says "use DAX" and DAX is supportable by the hardware, then we turn on DAX for that inode.
Otherwise we just use the normal non-dax IO paths. Again, we don't error out the filesystem if DAX is not supported, we just don't turn it on. This check is done in xfs_inode_should_enable_dax() and I think all you need to do is replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported() call... Thanks a lot for the detailed explanation. You are right, I will move the dax_is_supported() check to xfs_inode_should_enable_dax(). Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 4/7] ext2: Use dax_is_supported()
On 2024-01-30 06:33, Jan Kara wrote: On Mon 29-01-24 16:06:28, Mathieu Desnoyers wrote: Use dax_is_supported() to validate whether the architecture has virtually aliased caches at mount time. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches (ARCH_HAS_CACHE_ALIASING_DYNAMIC=y). Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Looks good to me (although I share Dave's opinion it would be nice to CC the whole series to fsdevel - luckily we have lore these days so it is not that tedious to find the whole series :)). Feel free to add: Acked-by: Jan Kara Hi Jan, Thanks for looking at it! I will do significant changes for v2, so I will hold on before integrating your acked-by if it's OK with you. Thanks! Mathieu Honza --- fs/ext2/super.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..0398e7a90eb6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -585,13 +585,13 @@ static int parse_options(char *options, struct super_block *sb, set_opt(opts->s_mount_opt, XIP); fallthrough; case Opt_dax: -#ifdef CONFIG_FS_DAX - ext2_msg(sb, KERN_WARNING, - "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"); - set_opt(opts->s_mount_opt, DAX); -#else - ext2_msg(sb, KERN_INFO, "dax option not supported"); -#endif + if (dax_is_supported()) { + ext2_msg(sb, KERN_WARNING, +"DAX enabled. 
Warning: EXPERIMENTAL, use at your own risk"); + set_opt(opts->s_mount_opt, DAX); + } else { + ext2_msg(sb, KERN_INFO, "dax option not supported"); + } break; #if defined(CONFIG_QUOTA) -- 2.39.2 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
RE: [PATCH v8 04/15] x86/sgx: Implement basic EPC misc cgroup functionality
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) { > + struct sgx_epc_cgroup *epc_cg; > struct sgx_epc_page *page; > + int ret; > + > + epc_cg = sgx_get_current_epc_cg(); > + ret = sgx_epc_cgroup_try_charge(epc_cg); > + if (ret) { > + sgx_put_epc_cg(epc_cg); > + return ERR_PTR(ret); > + } > > for ( ; ; ) { > page = __sgx_alloc_epc_page(); > @@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void > *owner, bool reclaim) > break; > } > > - if (list_empty(&sgx_active_page_list)) > - return ERR_PTR(-ENOMEM); > + if (list_empty(&sgx_active_page_list)) { > + page = ERR_PTR(-ENOMEM); > + break; > + } (Sorry for replying from Outlook because I am in travel for Chinese New Year.) Perhaps I am missing something but I don't understand this change. An empty sgx_active_page_list means you cannot reclaim any page from it, so why need to break? > > if (!reclaim) { > page = ERR_PTR(-EBUSY); > @@ -580,10 +593,25 @@ struct sgx_epc_page *sgx_alloc_epc_page(void > *owner, bool reclaim) > break; > } > > + /* > + * Need to do a global reclamation if cgroup was not full but > free > + * physical pages run out, causing __sgx_alloc_epc_page() to > fail. > + */ > sgx_reclaim_pages(); > cond_resched(); > } And why adding this comment, especially in this patch? I don't see it brings additional clarity because there's only global reclaim now, no matter whether cgroup is enabled or not.
RE: [PATCH v8 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup
> + * @lru: The LRU from which pages are reclaimed. > + * @nr_to_scan: Pointer to the target number of pages to scan, must be less > than > + * SGX_NR_TO_SCAN. > + * Return: Number of pages reclaimed. > */ > -static void sgx_reclaim_pages(void) > +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned > +int *nr_to_scan) Since the function is now returning the number of reclaimed pages, why do you need to make the @nr_to_scan as pointer? Cannot the caller just adjust @nr_to_scan when calling this function based on how many pages have reclaimed? I am not even sure whether you need @nr_to_scan at all because as we discussed I think it's just extremely rare you need to pass "< SGX_NR_TO_SCAN" to this function. Even if you need, you can always choose to try to reclaim SGX_NR_TO_SCAN pages. [...] > > +static void sgx_reclaim_pages_global(void) { > + unsigned int nr_to_scan = SGX_NR_TO_SCAN; > + > + sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan); } > + I think this function doesn't look sane at all when you have @nr_to_scan being a pointer? I am also not sure whether this function is needed -- if we don't add @nr_to_scan to sgx_reclaim_pages(), then this function is basically: sgx_reclaim_pages(&sgx_global_lru);
[PATCH v5] remoteproc: Make rproc_get_by_phandle() work for clusters
From: Mathieu Poirier Multi-cluster remoteproc designs typically have the following DT declaration: remoteproc-cluster { compatible = "soc,remoteproc-cluster"; core0: core0 { compatible = "soc,remoteproc-core"; memory-region; sram; }; core1: core1 { compatible = "soc,remoteproc-core"; memory-region; sram; }; }; A driver exists for the cluster rather than the individual cores themselves so that operation mode and HW-specific configurations applicable to the cluster can be made. Because the driver exists at the cluster level and not the individual core level, function rproc_get_by_phandle() fails to return the remoteproc associated with the phandle it is called for. This patch enhances rproc_get_by_phandle() by looking for the cluster's driver when the driver for the immediate remoteproc's parent is not found. Reported-by: Ben Levinsky Signed-off-by: Mathieu Poirier Co-developed-by: Tarak Reddy Signed-off-by: Tarak Reddy Co-developed-by: Tanmay Shah Signed-off-by: Tanmay Shah --- drivers/remoteproc/remoteproc_core.c | 29 ++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c index 695cce218e8c..f276956f2c5c 100644 --- a/drivers/remoteproc/remoteproc_core.c +++ b/drivers/remoteproc/remoteproc_core.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -2112,6 +2113,7 @@ EXPORT_SYMBOL(rproc_detach); struct rproc *rproc_get_by_phandle(phandle phandle) { struct rproc *rproc = NULL, *r; + struct device_driver *driver; struct device_node *np; np = of_find_node_by_phandle(phandle); @@ -2122,7 +2124,26 @@ struct rproc *rproc_get_by_phandle(phandle phandle) list_for_each_entry_rcu(r, &rproc_list, node) { if (r->dev.parent && device_match_of_node(r->dev.parent, np)) { /* prevent underlying implementation from being removed */ - if (!try_module_get(r->dev.parent->driver->owner)) { + + /* +* If the remoteproc's parent has a driver, the +* remoteproc is not part
of a cluster and we can use +* that driver. +*/ + driver = r->dev.parent->driver; + + /* +* If the remoteproc's parent does not have a driver, +* look for the driver associated with the cluster. +*/ + if (!driver) { + if (r->dev.parent->parent) + driver = r->dev.parent->parent->driver; + if (!driver) + break; + } + + if (!try_module_get(driver->owner)) { dev_err(&r->dev, "can't get owner\n"); break; } @@ -2533,7 +2554,11 @@ EXPORT_SYMBOL(rproc_free); */ void rproc_put(struct rproc *rproc) { - module_put(rproc->dev.parent->driver->owner); + if (rproc->dev.parent->driver) + module_put(rproc->dev.parent->driver->owner); + else + module_put(rproc->dev.parent->parent->driver->owner); + put_device(&rproc->dev); } EXPORT_SYMBOL(rproc_put); base-commit: 99f59b148871dadb9104366e3d25b120a97f897b -- 2.25.1
Re: [PATCH v13 2/6] ring-buffer: Introducing ring-buffer mapping functions
On Tue, Jan 30, 2024 at 11:55:10PM +0900, Masami Hiramatsu wrote: > Hi Vincent, > > Thanks for update the code. > > On Mon, 29 Jan 2024 14:27:58 + > Vincent Donnefort wrote: > > > In preparation for allowing the user-space to map a ring-buffer, add > > a set of mapping functions: > > > > ring_buffer_{map,unmap}() > > ring_buffer_map_fault() > > > > And controls on the ring-buffer: > > > > ring_buffer_map_get_reader() /* swap reader and head */ > > > > Mapping the ring-buffer also involves: > > > > A unique ID for each subbuf of the ring-buffer, currently they are > > only identified through their in-kernel VA. > > > > A meta-page, where are stored ring-buffer statistics and a > > description for the current reader > > > > The linear mapping exposes the meta-page, and each subbuf of the > > ring-buffer, ordered following their unique ID, assigned during the > > first mapping. > > > > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer > > size will remain unmodified and the splice enabling functions will in > > reality simply memcpy the data instead of swapping subbufs. 
> > > > Signed-off-by: Vincent Donnefort > > > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h > > index fa802db216f9..0841ba8bab14 100644 > > --- a/include/linux/ring_buffer.h > > +++ b/include/linux/ring_buffer.h > > @@ -6,6 +6,8 @@ > > #include > > #include > > > > +#include > > + > > struct trace_buffer; > > struct ring_buffer_iter; > > > > @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct > > hlist_node *node); > > #define trace_rb_cpu_prepare NULL > > #endif > > > > +int ring_buffer_map(struct trace_buffer *buffer, int cpu); > > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); > > +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, > > + unsigned long pgoff); > > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); > > #endif /* _LINUX_RING_BUFFER_H */ > > diff --git a/include/uapi/linux/trace_mmap.h > > b/include/uapi/linux/trace_mmap.h > > new file mode 100644 > > index ..d4bb67430719 > > --- /dev/null > > +++ b/include/uapi/linux/trace_mmap.h > > @@ -0,0 +1,43 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +#ifndef _TRACE_MMAP_H_ > > +#define _TRACE_MMAP_H_ > > + > > +#include > > + > > +/** > > + * struct trace_buffer_meta - Ring-buffer Meta-page description > > + * @meta_page_size:Size of this meta-page. > > + * @meta_struct_len: Size of this structure. > > + * @subbuf_size: Size of each subbuf, including the header. > > + * @nr_subbufs:Number of subbfs in the ring-buffer. > > + * @reader.lost_events:Number of events lost at the time of the reader > > swap. > > + * @reader.id: subbuf ID of the current reader. From 0 to > > @nr_subbufs - 1 > > + * @reader.read: Number of bytes read on the reader subbuf. > > + * @entries: Number of entries in the ring-buffer. > > + * @overrun: Number of entries lost in the ring-buffer. > > + * @read: Number of entries that have been read. 
> > + */ > > +struct trace_buffer_meta { > > + __u32 meta_page_size; > > + __u32 meta_struct_len; > > + > > + __u32 subbuf_size; > > + __u32 nr_subbufs; > > + > > + struct { > > + __u64 lost_events; > > + __u32 id; > > + __u32 read; > > + } reader; > > + > > + __u64 flags; > > + > > + __u64 entries; > > + __u64 overrun; > > + __u64 read; > > + > > + __u64 Reserved1; > > + __u64 Reserved2; > > +}; > > + > > +#endif /* _TRACE_MMAP_H_ */ > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > index 8179e0a8984e..081065e76d4a 100644 > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > @@ -338,6 +338,7 @@ struct buffer_page { > > local_t entries; /* entries on this page */ > > unsigned longreal_end; /* real end of data */ > > unsigned order; /* order of the page */ > > + u32 id;/* ID for external mapping */ > > struct buffer_data_page *page; /* Actual data page */ > > }; > > > > @@ -484,6 +485,12 @@ struct ring_buffer_per_cpu { > > u64 read_stamp; > > /* pages removed since last reset */ > > unsigned long pages_removed; > > + > > + int mapped; > > + struct mutexmapping_lock; > > + unsigned long *subbuf_ids;/* ID to addr */ > > + struct trace_buffer_meta*meta_page; > > + > > /* ring buffer pages to update, > 0 to add, < 0 to remove */ > > longnr_pages_to_update; > > struct list_headnew_pages; /* new pages to a
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 09:39:42 -0500 Steven Rostedt wrote: > I may try something that will still let me get rid of the ei->dentry. This appears to work, but like always, I may have missed something. I need to add debugging (printks) to make sure that I don't leave any dangling dentries around (dget without a dput). Here's what I did: - Removed the ei->dentry that you loved so much - Removed the ei->d_children that you shared as much love for. - Removed SRCU. All accesses are under the eventfs_mutex. I need to add comments that the callbacks need to be aware of this. Currently, the callbacks do not take any other locks. I may comment that they should never take a lock. - Added the kref that you recommended - Created an eventfs_root_inode that has the structure of: struct eventfs_root_inode { struct eventfs_inode ei; struct dentry *root_dentry; }; The "events" directory is the only directory that allocates this. It is required that its ei->is_events is set, and no other ei has that set. This will hold the only non-dynamic dentry. - I added "parent" to the eventfs_inode that points to the parent eventfs_inode. - On removal, I got rid of the SRCU callback and the work queue. Instead, I find the dentry of the current eventfs_inode that is being deleted by walking the ei->parent until I find the events inode that has a dentry. I then use that to do a lookup walking back down to the eventfs_inode I want to delete. This gives me the dentry that I can call d_invalidate() on. This all works with light testing. I'm sure I did something wrong, but hopefully this is more in line with what you are looking for. This patch is on top of your last patch series. -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index ad11063bdd53..49d4630d5d70 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -24,17 +24,24 @@ #include #include "internal.h" -/* - * eventfs_mutex protects the eventfs_inode (ei) dentry.
Any access - * to the ei->dentry must be done under this mutex and after checking - * if ei->is_freed is not set. When ei->is_freed is set, the dentry - * is on its way to being freed after the last dput() is made on it. - */ +/* eventfs_mutex protects ei->is_freed and the ei->ref counting. */ static DEFINE_MUTEX(eventfs_mutex); /* Choose something "unique" ;-) */ #define EVENTFS_FILE_INODE_INO 0x12c4e37 +/* Used only by the "events" directory */ +struct eventfs_root_inode { + struct eventfs_inode ei; + struct dentry *root_dentry; +}; + +static struct eventfs_root_inode *get_root_inode(struct eventfs_inode *ei) +{ + WARN_ON_ONCE(!ei->is_events); + return container_of(ei, struct eventfs_root_inode, ei); +} + /* Just try to make something consistent and unique */ static int eventfs_dir_ino(struct eventfs_inode *ei) { @@ -44,14 +51,6 @@ static int eventfs_dir_ino(struct eventfs_inode *ei) return ei->ino; } -/* - * The eventfs_inode (ei) itself is protected by SRCU. It is released from - * its parent's list and will have is_freed set (under eventfs_mutex). - * After the SRCU grace period is over and the last dput() is called - * the ei is freed. - */ -DEFINE_STATIC_SRCU(eventfs_srcu); - /* Mode is unsigned short, use the upper bits for flags */ enum { EVENTFS_SAVE_MODE = BIT(16), @@ -360,7 +359,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, ti->private = ei; dentry->d_fsdata = ei; -ei->dentry = dentry; // Remove me!
+ kref_get(&ei->kref); inc_nlink(inode); d_add(dentry, inode); @@ -371,10 +370,30 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, static void free_ei(struct eventfs_inode *ei) { + struct eventfs_root_inode *rei; + kfree_const(ei->name); - kfree(ei->d_children); kfree(ei->entry_attrs); - kfree(ei); + + if (ei->is_events) { + rei = get_root_inode(ei); + kfree(rei); + } else { + kfree(ei); + } +} + +static void kref_ei_release(struct kref *kref) +{ + struct eventfs_inode *ei = container_of(kref, struct eventfs_inode, kref); + WARN_ON_ONCE(!ei->is_freed); + free_ei(ei); +} + + +static void eventfs_inode_put(struct eventfs_inode *ei) +{ + kref_put(&ei->kref, kref_ei_release); } /** @@ -387,7 +406,6 @@ static void free_ei(struct eventfs_inode *ei) void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) { struct eventfs_inode *ei; - int i; mutex_lock(&eventfs_mutex); @@ -395,20 +413,7 @@ void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) if (!ei) goto out; - /* This could belong to one of the files of the ei */ - if (ei->dentry != dentry) { - for (i = 0; i < ei->nr_entries; i++) { -
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 06:39, Steven Rostedt wrote: > > On Tue, 30 Jan 2024 01:12:05 -0800 > > > > I suspect the solution is to make eventfs_create_dir() do the same as > > the events directory case does, and actually pin the directory dentry > > and save it off. > > I rather not have the create do that because that happens for every event > directory. I wasn't thinking straight yesterday - the real fix is to just do a "d_revalidate()". It's literally why that thing exists: check whether a dentry is still valid. In fact, that trivially gets rid of the 'events' subdirectory issues too, so we can stop doing that horrendous simple_recursive_removal() too. Let me reboot into the trivial fix. I just needed to think about this the right way. None of this is anything that the VFS layer has any problems with. Linus
[RFC PATCH v2 0/8] Introduce dcache_is_aliasing() to fix DAX regression
This commit introduced in v5.13 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased dcaches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliased dcache (VIVT and VIPT with aliased dcache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings on virtually aliased dcaches. This patch series introduces dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CACHE_ALIASING and implements it for all architectures. The implementation of dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Note that the order of the series was completely changed based on the feedback received on v1, cache_is_aliasing() is renamed to dcache_is_aliasing(), ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, and the dax_is_supported() check was moved to its rightful place in all filesystems. 
Feedback is welcome, Thanks, Mathieu Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-...@vger.kernel.org Cc: nvd...@lists.linux.dev Cc: linux-fsde...@vger.kernel.org Mathieu Desnoyers (8): dax: Introduce dax_is_supported() erofs: Use dax_is_supported() ext2: Use dax_is_supported() ext4: Use dax_is_supported() fuse: Use dax_is_supported() xfs: Use dax_is_supported() Introduce dcache_is_aliasing() across all architectures dax: Fix incorrect list of dcache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ fs/Kconfig | 1 - fs/erofs/super.c| 5 - fs/ext2/super.c | 6 +- fs/ext4/super.c | 5 - fs/fuse/dax.c | 7 +++ fs/xfs/xfs_iops.c | 2 +- include/linux/cacheinfo.h | 6 ++ include/linux/dax.h | 9 + mm/Kconfig | 6 ++ 29 files changed, 142 insertions(+), 5 deletions(-) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 
arch/xtensa/include/asm/cachetype.h -- 2.39.2
[RFC PATCH v2 1/8] dax: Introduce dax_is_supported()
Introduce a new dax_is_supported() static inline to check whether the architecture supports DAX. This replaces the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) This is done in preparation for its use by each filesystem supporting the dax mount option to validate whether dax is indeed supported. This is done in preparation for using dcache_is_aliasing() in a following change which will properly support architectures which detect dcache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/Kconfig | 1 - include/linux/dax.h | 10 ++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..cfc8cd4a3eae 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -78,6 +78,12 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, return false; return dax_synchronous(dax_dev); } +static inline bool dax_is_supported(void) +{ + return !IS_ENABLED(CONFIG_ARM) && + !IS_ENABLED(CONFIG_MIPS) && + !IS_ENABLED(CONFIG_SPARC); +} #else static inline void *dax_holder(struct dax_device *dax_dev) { @@ -122,6 +128,10 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, { return 0; } +static inline bool dax_is_supported(void) +{ + return false; +} #endif void 
set_dax_nocache(struct dax_device *dax_dev); -- 2.39.2
[RFC PATCH v2 3/8] ext2: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Print an error and disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Jan Kara Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext2/super.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 01f9addc8b1f..30ff57d47ed4 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); if (test_opt(sb, DAX)) { - if (!sbi->s_daxdev) { + if (!dax_is_supported()) { + ext2_msg(sb, KERN_ERR, + "DAX unsupported by architecture. Turning off DAX."); + clear_opt(sbi->s_mount_opt, DAX); + } else if (!sbi->s_daxdev) { ext2_msg(sb, KERN_ERR, "DAX unsupported by block device. Turning off DAX."); clear_opt(sbi->s_mount_opt, DAX); -- 2.39.2
[RFC PATCH v2 4/8] ext4: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Mount fails if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: "Theodore Ts'o" Cc: Andreas Dilger Cc: linux-e...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org --- fs/ext4/super.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index c5fcf377ab1f..f2c11ae3ec29 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4743,7 +4743,10 @@ static int ext4_check_feature_compatibility(struct super_block *sb, } if (sbi->s_mount_opt & EXT4_MOUNT_DAX_ALWAYS) { - if (ext4_has_feature_inline_data(sb)) { + if (!dax_is_supported()) { + ext4_msg(sb, KERN_ERR, "DAX unsupported by architecture."); + return -EINVAL; + } else if (ext4_has_feature_inline_data(sb)) { ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem" " that may contain inline data"); return -EINVAL; -- 2.39.2
[RFC PATCH v2 5/8] fuse: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Miklos Szeredi Cc: linux-fsde...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/fuse/dax.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index 12ef91d170bb..36e1c1abbf8e 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1336,6 +1336,13 @@ static bool fuse_should_enable_dax(struct inode *inode, unsigned int flags) if (dax_mode == FUSE_DAX_NEVER) return false; + /* +* Silently fallback to 'never' mode if the architecture does +* not support DAX. +*/ + if (!dax_is_supported()) + return false; + /* * fc->dax may be NULL in 'inode' mode when filesystem device doesn't * support DAX, in which case it will silently fallback to 'never' mode. -- 2.39.2
[RFC PATCH v2 6/8] xfs: Use dax_is_supported()
Use dax_is_supported() to validate whether the architecture has virtually aliased data caches at mount time. Silently disable DAX if dax=always is requested as a mount option on an architecture which does not support DAX. This is relevant for architectures which require a dynamic check to validate whether they have virtually aliased data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Chandan Babu R Cc: Darrick J. Wong Cc: linux-...@vger.kernel.org Cc: Andrew Morton Cc: Linus Torvalds Cc: linux...@kvack.org Cc: linux-a...@vger.kernel.org Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: nvd...@lists.linux.dev Cc: linux-...@vger.kernel.org --- fs/xfs/xfs_iops.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index a0d77f5f512e..360f640159b0 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -1208,7 +1208,7 @@ static bool xfs_inode_should_enable_dax( struct xfs_inode *ip) { - if (!IS_ENABLED(CONFIG_FS_DAX)) + if (!dax_is_supported()) return false; if (xfs_has_dax_never(ip->i_mount)) return false; -- 2.39.2
[RFC PATCH v2 8/8] dax: Fix incorrect list of dcache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing
caches") prevents DAX from building on architectures with virtually
aliased dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased dcache.

This is a regression introduced in the v5.13 Linux kernel where the dax
mount option is removed for 32-bit ARMv7 boards which have no dcache
aliasing, and therefore should work fine with FS_DAX.

This was turned into the following implementation of dax_is_supported()
by a preparatory change:

  return !IS_ENABLED(CONFIG_ARM) &&
         !IS_ENABLED(CONFIG_MIPS) &&
         !IS_ENABLED(CONFIG_SPARC);

Use dcache_is_aliasing() instead to figure out whether the environment
has aliasing dcaches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 include/linux/dax.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index cfc8cd4a3eae..f59e604662e4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,7 @@
 #include
 #include
 #include
+#include
 
 typedef unsigned long dax_entry_t;
 
@@ -80,9 +81,7 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 }
 
 static inline bool dax_is_supported(void)
 {
-	return !IS_ENABLED(CONFIG_ARM) &&
-	       !IS_ENABLED(CONFIG_MIPS) &&
-	       !IS_ENABLED(CONFIG_SPARC);
+	return !dcache_is_aliasing();
 }
 #else
 static inline void *dax_holder(struct dax_device *dax_dev)
-- 
2.39.2
[RFC PATCH v2 7/8] Introduce dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the dcache is virtually aliased
on all architectures. Its purpose is to ensure that subsystems which are
incompatible with virtually aliased data caches (e.g. FS_DAX) can
reliably query this.

For dcache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The dcache is always aliasing:

 * arc
 * csky
 * m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
 * sh
 * parisc

B) The dcache aliasing is statically known or depends on querying CPU
   state at runtime:

 * arm (cache_is_vivt() || cache_is_vipt_aliasing())
 * mips (cpu_has_dc_aliases)
 * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
 * sparc32 (vac_cache_size > PAGE_SIZE)
 * sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
 * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The dcache is never aliasing:

 * alpha
 * arm64 (aarch64)
 * hexagon
 * loongarch (but with incoherent write buffers, which are disabled since
   commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
 * microblaze
 * openrisc
 * powerpc
 * riscv
 * s390
 * um
 * x86

Require architectures in A) and B) to select ARCH_HAS_CACHE_ALIASING and
implement "dcache_is_aliasing()". Architectures in C) don't select
ARCH_HAS_CACHE_ALIASING, and thus dcache_is_aliasing() simply evaluates
to "false".

Note that this leaves "icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing dcache-icache but not dcache-dcache.
Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Matthew Wilcox
Cc: Arnd Bergmann
Cc: Russell King
Cc: linux-...@vger.kernel.org
Cc: nvd...@lists.linux.dev
Cc: linux-...@vger.kernel.org
---
 arch/arc/Kconfig                    |  1 +
 arch/arc/include/asm/cachetype.h    |  9 +++++++++
 arch/arm/Kconfig                    |  1 +
 arch/arm/include/asm/cachetype.h    |  2 ++
 arch/csky/Kconfig                   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +++++++++
 arch/m68k/Kconfig                   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +++++++++
 arch/mips/Kconfig                   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +++++++++
 arch/nios2/Kconfig                  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++++++++++
 arch/parisc/Kconfig                 |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +++++++++
 arch/sh/Kconfig                     |  1 +
 arch/sh/include/asm/cachetype.h     |  9 +++++++++
 arch/sparc/Kconfig                  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++++++++++++++
 arch/xtensa/Kconfig                 |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++++++++++
 include/linux/cacheinfo.h           |  6 ++++++
 mm/Kconfig                          |  6 ++++++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..969e6740bcf7 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@ config ARC
 	def_bool y
 	select ARC_TIMERS
+	select ARCH_HAS_CACHE_ALIASING
 	select ARCH_HAS_CACHE_LINE_SIZE
 	select ARCH_HAS_DEBUG_VM_PGTABLE
 	select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..290e3cc85845
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include
+
+#define dcache_is_aliasing()	true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..5adeee5e421f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
 	select ARCH_32BIT_OFF_T
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
 	select ARCH_HAS_BINFMT_FLAT
+	select ARCH_HAS_CACHE_ALIASING
 	select ARCH_HAS_CPU_FINALIZE_INIT if MMU
 	select ARCH_HAS_CURRENT_STACK_POINTER
 	select ARCH_HAS_DEBUG_VIRTUAL if MMU
diff --git a/arch/arm/include/asm/cachetype.h b/arch/arm/include/asm/cachetype.h
index e8c30430be33..18311570d4f0 10
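The A/B/C split described in the commit message can be sketched as a small userspace model. This is purely illustrative Python, not kernel code; the real implementation is the ARCH_HAS_CACHE_ALIASING Kconfig select plus the per-architecture cachetype.h headers in the diff, and the runtime predicates named below stand in for the per-arch checks such as cache_is_vivt().

```python
# Userspace model of the proposed dcache_is_aliasing() split.
# Group A always aliases; group B asks a runtime predicate; group C
# never selects ARCH_HAS_CACHE_ALIASING and gets the constant-false
# fallback.
ALWAYS_ALIASING = {"arc", "csky", "m68k", "sh", "parisc"}
RUNTIME_CHECKED = {"arm", "mips", "nios2", "sparc32", "sparc64", "xtensa"}

def dcache_is_aliasing(arch, runtime_check=lambda: False):
    if arch in ALWAYS_ALIASING:
        return True
    if arch in RUNTIME_CHECKED:
        # e.g. cache_is_vivt() || cache_is_vipt_aliasing() on arm
        return runtime_check()
    return False  # group C: generic fallback evaluates to false

def dax_is_supported(arch, runtime_check=lambda: False):
    # Mirrors the dax.h change: DAX is usable iff the dcache is not aliasing.
    return not dcache_is_aliasing(arch, runtime_check)
```

The point of the design is visible in the model: group C architectures pay nothing (a compile-time constant), while group B architectures fold their existing CPU-state queries into one predicate.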
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 08:49, Steven Rostedt wrote:
>
> - On removal, I got rid of the SRCU callback and the work queue.
> Instead, I find the dentry of the current eventfs_inode that is being
> deleted by walking the ei->parent until I find the events inode that has
> a dentry. I then use that to do a lookup walking back down to the
> eventfs_inode I want to delete. This gives me the dentry that I can call
> d_invalidate() on.

Yes, that works.

However, I have a patch that is *much* smaller and simpler, and doesn't
need that walk.

The VFS layer already has a good interface for "should I still use this
dentry", which is needed for various network filesystems etc that want
to time out caches (or check explicitly whether the file still exists
etc): it's the dentry d_revalidate() check.

Let me just reboot into it to test that I got all the cases.

It makes the code even more obvious, and avoids all the complexity.

            Linus
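The d_revalidate() idea above can be sketched as a toy cache model. This is illustrative Python, not VFS code; the class names mirror the kernel structs in the patches that follow, but the cache, lookup helper, and constructor are hypothetical stand-ins.

```python
# Toy model of d_revalidate: the dcache keeps entries around, and a
# per-entry hook decides on each lookup whether the cached entry may
# still be used. For eventfs the hook just checks the backing
# eventfs_inode's is_freed flag.
class EventfsInode:
    def __init__(self, name):
        self.name = name
        self.is_freed = False

class Dentry:
    def __init__(self, ei):
        self.ei = ei

dcache = {}

def d_revalidate(dentry):
    # eventfs version: valid while the backing object is not freed
    return dentry.ei is not None and not dentry.ei.is_freed

def lookup(name):
    d = dcache.get(name)
    if d is not None and d_revalidate(d):
        return d                    # cached dentry is still valid
    dcache.pop(name, None)          # stale: drop it and look up again
    d = Dentry(EventfsInode(name))
    dcache[name] = d
    return d
```

This is why no teardown walk is needed in the model: the deleter only flips is_freed, and the stale cache entry is discarded lazily on the next lookup.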
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 08:55:51 -0800
Linus Torvalds wrote:

> On Tue, 30 Jan 2024 at 08:49, Steven Rostedt wrote:
> >
> > - On removal, I got rid of the SRCU callback and the work queue.
> > Instead, I find the dentry of the current eventfs_inode that is being
> > deleted by walking the ei->parent until I find the events inode that has
> > a dentry. I then use that to do a lookup walking back down to the
> > eventfs_inode I want to delete. This gives me the dentry that I can call
> > d_invalidate() on.
>
> Yes, that works.
>
> However, I have a patch that is *much* smaller and simpler, and
> doesn't need that walk.
>
> The VFS layer already has a good interface for "should I still use
> this dentry", which is needed for various network filesystems etc that
> want to time out caches (or check explicitly whether the file still
> exists etc): it's the dentry d_revalidate() check.
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

I actually had this before, but it wasn't working (likely due to
something else that wasn't working, or I did it wrong) so I reverted it.
-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 49d4630d5d70..9867b39ae24c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -451,6 +451,13 @@ lookup_file_dentry(struct dentry *dentry,
 	return dentry;
 }
 
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
+
+	return ei && !ei->is_freed;
+}
+
 /**
  * eventfs_root_lookup - lookup routine to create file/dir
  * @dir: in which a lookup is being done
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..0395459d919e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -392,8 +392,24 @@ static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
 	iput(inode);
 }
 
+static int tracefs_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct inode *inode = dentry->d_inode;
+	struct tracefs_inode *ti;
+
+	if (!dentry || !inode)
+		return 0;
+
+	ti = get_tracefs(inode);
+	if (!ti || !(ti->flags & TRACEFS_EVENT_INODE))
+		return 1;
+
+	return eventfs_revalidate(dentry, flags);
+}
+
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_revalidate,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 2af78fd95c93..a1024202c4e5 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -80,5 +80,6 @@ struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
 void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+int eventfs_revalidate(struct dentry *dentry, unsigned int flags);
 
 #endif /* _TRACEFS_INTERNAL_H */
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 08:55, Linus Torvalds wrote:
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

Yup, this works at least for the case you pointed out, and it really
feels like the RightThing(tm) from a VFS standpoint.

NOTE! This patch is on top of my previous 5-patch series.

            Linus

From 67b31bb53ef65132a65bfbe915628c481a39e73a Mon Sep 17 00:00:00 2001
From: Linus Torvalds
Date: Tue, 30 Jan 2024 09:00:18 -0800
Subject: [PATCH] eventfs: clean up dentry ops and add revalidate function

In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function. We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files. But as it
is, only eventfs dentries use d_fsdata, so we don't really need to
split these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching. We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get
freed at memory pressure rather than more proactively. But that's a
separate issue.
Signed-off-by: Linus Torvalds
---
 fs/tracefs/event_inode.c | 21 +
 fs/tracefs/inode.c       | 27 ++-
 fs/tracefs/internal.h    |  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name)
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-	struct eventfs_inode *ei;
-
-	mutex_lock(&eventfs_mutex);
-	ei = dentry->d_fsdata;
-	if (ei) {
-		dentry->d_fsdata = NULL;
-		put_ei(ei);
-	}
-	mutex_unlock(&eventfs_mutex);
+	put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	}
 
  enoent:
-	/* Nothing found? */
-	d_add(dentry, NULL);
-	result = NULL;
+	/* Don't cache negative lookups, just return an error */
+	result = ERR_PTR(-ENOENT);
 
  out:
 	mutex_unlock(&eventfs_mutex);
@@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
-	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct d
Re: [PATCH v2 2/4] dt-bindings: remoteproc: Add compatibility for TEE support
On Fri, Jan 26, 2024 at 12:03:25PM +0100, Krzysztof Kozlowski wrote: > On 18/01/2024 11:04, Arnaud Pouliquen wrote: > > The "st,stm32mp1-m4-tee" compatible is utilized in a system configuration > > where the Cortex-M4 firmware is loaded by the Trusted execution Environment > > (TEE). > > For instance, this compatible is used in both the Linux and OP-TEE > > device-tree: > > - In OP-TEE, a node is defined in the device tree with the > > st,stm32mp1-m4-tee to support signed remoteproc firmware. > > Based on DT properties, OP-TEE authenticates, loads, starts, and stops > > the firmware. > > - On Linux, when the compatibility is set, the Cortex-M resets should not > > be declared in the device tree. > > > > Signed-off-by: Arnaud Pouliquen > > --- > > V1 to V2 updates > > - update "st,stm32mp1-m4" compatible description to generalize > > - remove the 'reset-names' requirement in one conditional branch, as the > > property is already part of the condition test. > > --- > > .../bindings/remoteproc/st,stm32-rproc.yaml | 52 +++ > > 1 file changed, 43 insertions(+), 9 deletions(-) > > > > diff --git > > a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > index 370af61d8f28..6af821b15736 100644 > > --- a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > +++ b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > > @@ -16,7 +16,12 @@ maintainers: > > > > properties: > >compatible: > > -const: st,stm32mp1-m4 > > +enum: > > + - st,stm32mp1-m4 > > + - st,stm32mp1-m4-tee > > The patch looks good to me, but I wonder about this choice of two > compatibles. > > Basically this is the same hardware with the same interface, but two > compatibles to differentiate a bit different firmware setup. We have > already such cases for Qualcomm [1] [2] and new ones will be coming. 
[3] > > I wonder whether this should be rather the same compatible with > additional property, e.g. "st,tee-control" or "remote-control". > > [1] > https://elixir.bootlin.com/linux/v6.7.1/source/Documentation/devicetree/bindings/dma/qcom,bam-dma.yaml#L54 > > [2] > https://elixir.bootlin.com/linux/v6.7.1/source/Documentation/devicetree/bindings/net/qcom,ipa.yaml#L129 > (that's a bit different) > > [3] https://lore.kernel.org/linux-devicetree/20240124103623.GJ4906@thinkpad/ > > @Rob, > Any general guidance for this and Qualcomm? I think we have cases using compatible already as well. Either way is fine with me. Rob
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
Dear All,

On 30.01.2024 12:03, Christophe Leroy wrote:
> Le 30/01/2024 à 10:16, Chen-Yu Tsai a écrit :
>> [Vous ne recevez pas souvent de courriers de we...@chromium.org. D?couvrez
>> pourquoi ceci est important ? https://aka.ms/LearnAboutSenderIdentification ]
>>
>> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote:
>>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote:
>>>> Declaring rodata_enabled and mark_rodata_ro() at all time
>>>> helps removing related #ifdefery in C files.
>>>>
>>>> Signed-off-by: Christophe Leroy
>>> Very nice cleanup, thanks!, applied and pushed
>>>
>>> Luis
>>
>> On next-20240130, which has your modules-next branch, and thus this
>> series and the other "module: Use set_memory_rox()" series applied,
>> my kernel crashes in some very weird way. Reverting your branch
>> makes the crash go away.
>>
>> I thought I'd report it right away. Maybe you folks would know what's
>> happening here? This is on arm64.
>
> That's strange, it seems to bug in module_bug_finalize() which is
> _before_ calls to module_enable_ro() and such.
>
> Can you try to revert the 6 patches one by one to see which one
> introduces the problem ?
>
> In reality, only patch 677bfb9db8a3 really changes things. Other ones
> are more or less only cleanup.

I've also run into this issue with today's (20240130) linux-next on my
test farm. The issue is not fully reproducible, so it was a bit hard to
bisect it automatically. I've spent some time on manual testing and it
looks like reverting the following 2 commits on top of linux-next fixes
the problem:

65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled")
677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()")

This in fact means that commit 677bfb9db8a3 is responsible for this
regression, as 65929884f868 has to be reverted only because the latter
depends on it. Let me know what I can do to help debugging this issue.
Here is the stack trace I've got on Khadas VIM3 ARM64 board:

Unable to handle kernel paging request at virtual address 80007bfeeb30
Mem abort info:
  ESR = 0x9647
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x07: level 3 translation fault
Data abort info:
  ISV = 0, ISS = 0x0047, ISS2 = 0x
  CM = 0, WnR = 1, TnD = 0, TagAccess = 0
  GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
swapper pgtable: 4k pages, 48-bit VAs, pgdp=0a35a000
[80007bfeeb30] pgd=1000f4806003, p4d=1000f4806003, pud=17ed1003, pmd=17ed2003, pte=
Internal error: Oops: 9647 [#1] PREEMPT SMP
Modules linked in:
CPU: 4 PID: 182 Comm: (udev-worker) Not tainted 6.8.0-rc2-next-20240130 #14391
Hardware name: Khadas VIM3 (DT)
pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : module_bug_finalize+0xb0/0xdc
lr : module_bug_finalize+0x70/0xdc
...
Call trace:
 module_bug_finalize+0xb0/0xdc
 load_module+0x182c/0x1c88
 init_module_from_file+0x84/0xc0
 idempotent_init_module+0x180/0x250
 __arm64_sys_finit_module+0x64/0xa0
 invoke_syscall+0x48/0x114
 el0_svc_common.constprop.0+0xc0/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x4c/0xe4
 el0t_64_sync_handler+0xc0/0xc4
 el0t_64_sync+0x190/0x194
Code: 9116e003 f942dc01 a93e8c41 c89ffc73 (f9000433)
---[ end trace ]---

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland
Re: [PATCH v2 2/4] dt-bindings: remoteproc: Add compatibility for TEE support
On Thu, Jan 18, 2024 at 11:04:31AM +0100, Arnaud Pouliquen wrote: > The "st,stm32mp1-m4-tee" compatible is utilized in a system configuration > where the Cortex-M4 firmware is loaded by the Trusted execution Environment > (TEE). > For instance, this compatible is used in both the Linux and OP-TEE > device-tree: > - In OP-TEE, a node is defined in the device tree with the > st,stm32mp1-m4-tee to support signed remoteproc firmware. > Based on DT properties, OP-TEE authenticates, loads, starts, and stops > the firmware. > - On Linux, when the compatibility is set, the Cortex-M resets should not > be declared in the device tree. > > Signed-off-by: Arnaud Pouliquen > --- > V1 to V2 updates > - update "st,stm32mp1-m4" compatible description to generalize > - remove the 'reset-names' requirement in one conditional branch, as the > property is already part of the condition test. > --- > .../bindings/remoteproc/st,stm32-rproc.yaml | 52 +++ > 1 file changed, 43 insertions(+), 9 deletions(-) > > diff --git a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > index 370af61d8f28..6af821b15736 100644 > --- a/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > +++ b/Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml > @@ -16,7 +16,12 @@ maintainers: > > properties: >compatible: > -const: st,stm32mp1-m4 > +enum: > + - st,stm32mp1-m4 > + - st,stm32mp1-m4-tee > +description: > + Use "st,stm32mp1-m4" for the Cortex-M4 coprocessor management by > non-secure context > + Use "st,stm32mp1-m4-tee" for the Cortex-M4 coprocessor management by > secure context > >reg: > description: > @@ -142,21 +147,40 @@ properties: > required: >- compatible >- reg > - - resets > > allOf: >- if: >properties: > -reset-names: > - not: > -contains: > - const: hold_boot > +compatible: > + contains: > +const: st,stm32mp1-m4 > +then: > + if: > +properties: > + reset-names: > +not: > + contains: > 
+const: hold_boot Note that this is true when 'reset-names' is not present. If that is not desired, then you need 'required: [reset-names]'. Not really a new issue though. > + then: > +required: > + - st,syscfg-holdboot > + - resets > + else: > +properties: > + st,syscfg-holdboot: false > +required: > + - resets 'resets' is always required within the outer 'then' schema, so you can move this up a level. > + > + - if: > + properties: > +compatible: > + contains: > +const: st,stm32mp1-m4-tee > then: > - required: > -- st,syscfg-holdboot > -else: >properties: > st,syscfg-holdboot: false > +reset-names: false > +resets: false > > additionalProperties: false > > @@ -188,5 +212,15 @@ examples: >st,syscfg-rsc-tbl = <&tamp 0x144 0x>; >st,syscfg-m4-state = <&tamp 0x148 0x>; > }; > + - | > +#include > +m4@1000 { > + compatible = "st,stm32mp1-m4-tee"; > + reg = <0x1000 0x4>, > +<0x3000 0x4>, > +<0x3800 0x1>; > + st,syscfg-rsc-tbl = <&tamp 0x144 0x>; > + st,syscfg-m4-state = <&tamp 0x148 0x>; > +}; > > ... > -- > 2.25.1 >
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Mon, Jan 29, 2024 at 09:24:07PM -0500, Steven Rostedt wrote:
> On Mon, 29 Jan 2024 09:29:07 -0800
> Beau Belgrave wrote:
>
> > Thanks, yeah ideally we wouldn't use special characters.
> >
> > I'm not picky about this. However, I did want something that clearly
> > allowed a glob pattern to find all versions of a given register name of
> > user_events by user programs that record. The dot notation will pull in
> > more than expected if dotted namespace style names are used.
> >
> > An example is "Asserts" and "Asserts.Verbose" from different programs.
> > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it
> > will pull in "Asserts.Verbose.1" in addition to "Asserts.0".
>
> Do you prevent brackets in names?

No. However, since brackets have a start and end token that are
distinct, finding all versions of your event is trivial compared to a
single dot.

Imagine two events:
Asserts
Asserts[MyCoolIndex]

Resolves to tracepoints of:
Asserts:[0]
Asserts[MyCoolIndex]:[1]

Regardless of brackets in the names, a simple glob of Asserts:\[*\]
only finds Asserts:[0]. This is because we have that end bracket in the
glob and the full event name including the start bracket.

If I register another "version" of Asserts, then I'll have:
Asserts:[0]
Asserts[MyCoolIndex]:[1]
Asserts:[2]

The glob of Asserts:\[*\] will return both:
Asserts:[0]
Asserts:[2]

At this point the program can either record all versions or scan
further to find which version of Asserts is wanted.

> > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it
> > doesn't work if the number is higher, like 128. If we ever decide to
> > change the ID from an integer to say hex to save space, these globs
> > would break.
> >
> > Is there some scheme that fits the C-variable name that addresses the
> > above scenarios? Brackets gave me a simple glob that seemed to prevent a
> > lot of this ("Asserts.\[*\]" in this case).
>
> Prevent a lot of what?
I'm not sure what your example here is. > I'll try again :) We have 2 events registered via user_events: Asserts Asserts.Verbose Using dot notation these would result in tracepoints of: user_events_multi/Asserts.0 user_events_multi/Asserts.Verbose.1 Using bracket notation these would result in tracepoints of: user_events_multi/Asserts:[0] user_events_multi/Asserts.Verbose:[1] A recording program only wants to enable the Asserts tracepoint. It does not want to record the Asserts.Verbose tracepoint. The program must find the right tracepoint by scanning tracefs under the user_events_multi system. A single dot suffix does not allow a simple glob to be used. The glob Asserts.* will return both Asserts.0 and Asserts.Verbose.1. A simple glob of Asserts:\[*\] will only find Asserts:[0], it will not find Asserts.Verbose:[1]. We could just use brackets and not have the colon (Asserts[0] in this case). But brackets are still special for bash. > > > > Are we confident that we always want to represent the ID as a base-10 > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > programs can find their events easily. > > Is there a difference to what we choose? > If a simple glob of event_name:\[*\] cannot be used, then we must document what the suffix format is, so an appropriate regex can be created. If we start with base-10 then later move to base-16 we will break existing regex patterns on the recording side. I prefer, and have in this series, a base-16 output since it saves on the tracepoint name size. Either way we go, we need to define how recording programs should find the events they care about. So we must be very clear, IMHO, about the format of the tracepoint names in our documentation. I personally think recording programs are likely to get this wrong without proper guidance. Thanks, -Beau > -- Steve
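The glob argument above can be checked mechanically. Here is a quick sketch with Python's fnmatch; the event names are the hypothetical ones from the discussion, and note that where shell writes the bracket pattern as Asserts:\[*\], fnmatch spells a literal '[' as the character class "[[]".

```python
import fnmatch

# Tracepoint names as the two suffix schemes would create them
# (hypothetical events from the thread):
bracket_style = ["Asserts:[0]", "Asserts.Verbose:[1]", "Asserts:[2]"]
dot_style     = ["Asserts.0", "Asserts.Verbose.1", "Asserts.2"]

# Dot suffix: a recorder globbing for all versions of "Asserts"
# also picks up the unrelated "Asserts.Verbose" event.
dot_hits = fnmatch.filter(dot_style, "Asserts.*")
# -> ['Asserts.0', 'Asserts.Verbose.1', 'Asserts.2']

# Bracket suffix: the distinct start/end tokens make the match
# unambiguous, whatever base or width the ID uses.
bracket_hits = fnmatch.filter(bracket_style, "Asserts:[[]*]")
# -> ['Asserts:[0]', 'Asserts:[2]']
```

The bracket pattern also keeps working if the ID grows past one digit or switches to hex, which is exactly the failure mode of "Asserts.[0-9]".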
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Tue, Jan 30, 2024 at 11:12:22PM +0900, Masami Hiramatsu wrote: > On Mon, 29 Jan 2024 09:29:07 -0800 > Beau Belgrave wrote: > > > On Fri, Jan 26, 2024 at 03:04:45PM -0500, Steven Rostedt wrote: > > > On Fri, 26 Jan 2024 11:10:07 -0800 > > > Beau Belgrave wrote: > > > > > > > > OK, so the each different event has suffixed name. But this will > > > > > introduce non C-variable name. > > > > > > > > > > Steve, do you think your library can handle these symbols? It will > > > > > be something like "event:[1]" as the event name. > > > > > Personally I like "event.1" style. (of course we need to ensure the > > > > > user given event name is NOT including such suffix numbers) > > > > > > > > > > > > > Just to clarify around events including a suffix number. This is why > > > > multi-events use "user_events_multi" system name and the single-events > > > > using just "user_events". > > > > > > > > Even if a user program did include a suffix, the suffix would still get > > > > appended. An example is "test" vs "test:[0]" using multi-format would > > > > result in two tracepoints ("test:[0]" and "test:[0]:[1]" respectively > > > > (assuming these are the first multi-events on the system). > > > > > > > > I'm with you, we really don't want any spoofing or squatting possible. > > > > By using different system names and always appending the suffix I > > > > believe covers this. > > > > > > > > Looking forward to hearing Steven's thoughts on this as well. > > > > > > I'm leaning towards Masami's suggestion to use dots, as that won't > > > conflict > > > with special characters from bash, as '[' and ']' do. > > > > > > > Thanks, yeah ideally we wouldn't use special characters. > > > > I'm not picky about this. However, I did want something that clearly > > allowed a glob pattern to find all versions of a given register name of > > user_events by user programs that record. The dot notation will pull in > > more than expected if dotted namespace style names are used. 
> > > > An example is "Asserts" and "Asserts.Verbose" from different programs. > > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". > > If we use dot for the suffix number, we can prohibit user to use it > for their name. They still can use '_' (or change the group name?) We could, however, we have user_event integration in OpenTelemetry and I'm unsure if we should really try to restrict names. We'll also at some point have libside integration, which might not have the same restrictions on the user-tracer side as the kernel-tracer side. I'm trying to restrict the user_event group name from changing outside of an eventual tracer namespace. I'd like for each container to inherit a tracer namespace long-term which decides what the actual group name will be instead of users self-selecting names to prevent squatting or spoofing of events. > I just concerned that the name can be parsed by existing tools. Since > ':' is used as a separator for group and event name in some case (e.g. > tracefs "set_event" is using, so trace-cmd and perf is using it.) > Good point. What about just event_name[unique_id]? IE: Drop the colon. Brackets are still special in bash, but it would prevent simple glob patterns from matching to incorrect tracepoints under user_events_multi. > > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > > doesn't work if the number is higher, like 128. If we ever decide to > > change the ID from an integer to say hex to save space, these globs > > would break. > > Hmm, why can't we use regexp? We can use regex, but we'll need to agree the suffix format. We won't be able to change it after that point. I'd prefer a base-16/Hex suffix either in brackets or a simple dot. 
> And if we limit the number of events up to 1000 for each same-name event > we can use fixed numbers, like Asserts.[0-9][0-9][0-9] > I'm always wrong when I guess how many events programs will end up using. Folks always surprise me. I'd rather have a solution that scales to the max number of tracepoints allowed on the system (currently a 16-bit max value). Thanks, -Beau > Thank you, > > > > > Is there some scheme that fits the C-variable name that addresses the > > above scenarios? Brackets gave me a simple glob that seemed to prevent a > > lot of this ("Asserts.\[*\]" in this case). > > > > Are we confident that we always want to represent the ID as a base-10 > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > programs can find their events easily. > > > > Thanks, > > -Beau > > > > > -- Steve > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v4 4/5] dt-bindings: input/touchscreen: imagis: add compatible for IST3032C
On Sat, 20 Jan 2024 20:11:15 +0100, Karel Balej wrote: > From: Karel Balej > > IST3032C is a touchscreen IC which seems mostly compatible with IST3038C > except that it reports a different chip ID value. > > Signed-off-by: Karel Balej > --- > > Notes: > v4: > * Reword commit description to mention how this IC differs from the > already supported. > > .../devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml | 1 + > 1 file changed, 1 insertion(+) > Acked-by: Rob Herring
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 19:56:52 -0800 Linus Torvalds wrote: > [0005-eventfs-get-rid-of-dentry-pointers-without-refcounts.patch > text/x-patch (15652 bytes)] I missed this email when I wrote basically the same thing :-p It was sent just as I went to bed, and this morning I mistook it for an email I had already read. When you mentioned that your last patch was on top of your other 5, I said to myself, "Wait? There's five?" and went back through this thread looking at every email with an attachment. Well, even though I duplicated your work, I'll take that as a learning experience. Anyway, I'm assuming that the other 4 patches are exactly the same as the previous version, as applying patches from attachments is a manual process. I don't even develop on the machine with my email client, so it's either me downloading each one and scp'ing them over to my development box, or finding the lore link and going through them individually one by one. I know you don't send patches inlined anymore, which is a shame, because patchwork takes care of all the administering when patches are inlined, and I don't miss patches like I used to. So, I'll download patch 5 here and just assume that I already have your other 4 patches from the previous email. -- Steve
Re: [PATCH 0/4] tracing/user_events: Introduce multi-format events
On Tue, Jan 30, 2024 at 11:09:33AM +0900, Masami Hiramatsu wrote: > Hi Beau, > > On Tue, 23 Jan 2024 22:08:40 + > Beau Belgrave wrote: > > > Currently user_events supports 1 event with the same name and must have > > the exact same format when referenced by multiple programs. This opens > > an opportunity for malicious or poorly thought-through programs to > > create events that others use with different formats. Another scenario > > is user programs wishing to use the same event name but add more fields > > later when the software updates. Various versions of a program may be > > running side-by-side, which is prevented by the current single format > > requirement. > > > > Add a new register flag (USER_EVENT_REG_MULTI_FORMAT) which indicates > > the user program wishes to use the same user_event name, but may have > > several different formats of the event in the future. When this flag is > > used, create the underlying tracepoint backing the user_event with a > > unique name per-version of the format. It's important that existing ABI > > users do not get this logic automatically, even if one of the multi > > format events matches the format. This ensures existing programs that > > create events and assume the tracepoint name will match exactly continue > > to work as expected. Add logic to only check multi-format events with > > other multi-format events and single-format events to only check > > single-format events during find. > > Thanks for this work! This will allow many instances to use the same > user-events at the same time. > > BTW, can we force this flag to be set by default? My concern is if any user > program uses this user-event interface in the container (maybe it is > possible if we bind-mount it). In this case, the user program can > detect the other program is using the event if this flag is not set. 
> Moreover, if there is a malicious program running in the container, > it can prevent using the event name from other programs even if it > is isolated by the name-space. > Multi-format events use a different system name (user_events_multi). So you cannot use the single-format flag to detect multi-format names, etc. You can only use it to find other single-format names like you could always do. Likewise, you cannot use the single-format flag to block or prevent multi-format events from being created. Changing this behavior to the default would break all of our environments. So I'm pretty sure it would break others. The current environment expects tracepoints to show up as their registered name under the user_events system name. If this changed out from under us on a specific kernel version, that would be bad. I'd like eventually to have a tracer namespace concept for containers. Then we would have a user_event_group per tracer namespace. Then all user_events within the container have a unique system name which fully isolates them. However, even with that isolation, we still need a way to allow programs in the same container to have different versions of the same event name. Multi-format events fix this problem. I think isolation should be dealt with via true namespace isolation at the tracing level. > Steve suggested that if a user program which is running in a namespace > uses user-event without this flag, we can reject that by default. > > What would you think about? > This would break all of our environments. It would keep previously compiled programs using the existing ABI from working as expected. I'd much prefer that level of isolation to happen at the namespace level, which is why user_events has plumbing for different groups to achieve this. It's also why the ABI does not allow programs to state a system name. I'm trying to reserve the system name for the group/tracer/namespace isolation work. 
Thanks, -Beau > Thank you, > > > > > > Add a register_name (reg_name) to the user_event struct which allows for > > split naming of events. We now have the name that was used to register > > within user_events as well as the unique name for the tracepoint. Upon > > registering events ensure matches based on first the reg_name, followed > > by the fields and format of the event. This allows for multiple events > > with the same registered name to have different formats. The underlying > > tracepoint will have a unique name in the format of {reg_name}:[unique_id]. > > The unique_id is the time, in nanoseconds, of the event creation converted > > to hex. Since this is done under the register mutex, it is extremely > > unlikely for these IDs to ever match. It's also very unlikely a malicious > > program could consistently guess what the name would be and attempt to > > squat on it via the single format ABI. > > > > For example, if both "test u32 value" and "test u64 value" are used with > > the USER_EVENT_REG_MULTI_FORMAT the system would have 2 unique > > tracepoints. The dynamic_events file would then show the following: > > u:test u64 count > > u:test u32 count > > > > The actual trac
Re: [PATCH v9] bus: mhi: host: Add tracing support
On Tue, Jan 30, 2024 at 09:22:52AM -0500, Steven Rostedt wrote: > On Tue, 30 Jan 2024 13:41:52 +0530 > Manivannan Sadhasivam wrote: > > > So same trace will get printed for both mhi_channel_command_start() and > > mhi_channel_command_end()? > > The trace output will also include the tracepoint name. That is, it will > have the same content but will be preceded with: > > mhi_channel_command_start: ... > mhi_channel_command_end: ... > Yes, but the message will be the same: mhi_channel_command_start: chan%d: Updating state to: mhi_channel_command_end: chan%d: Updating state to: Either only one of the traces should be present, or the second one should print, "mhi_channel_command_end: chan%d: Updated state to:" - Mani -- மணிவண்ணன் சதாசிவம்
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Mon, 29 Jan 2024 19:56:52 -0800 Linus Torvalds wrote: > [0001-tracefs-avoid-using-the-ei-dentry-pointer-unnecessar.patch > text/x-patch (3561 bytes)] > > [0002-eventfsfs-initialize-the-tracefs-inode-properly.patch text/x-patch > (2548 bytes)] Ah these two are new. -- Steve
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Tue, 30 Jan 2024 10:05:15 -0800 Beau Belgrave wrote: > On Mon, Jan 29, 2024 at 09:24:07PM -0500, Steven Rostedt wrote: > > On Mon, 29 Jan 2024 09:29:07 -0800 > > Beau Belgrave wrote: > > > > > Thanks, yeah ideally we wouldn't use special characters. > > > > > > I'm not picky about this. However, I did want something that clearly > > > allowed a glob pattern to find all versions of a given register name of > > > user_events by user programs that record. The dot notation will pull in > > > more than expected if dotted namespace style names are used. > > > > > > An example is "Asserts" and "Asserts.Verbose" from different programs. > > > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > > > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". > > > > Do you prevent brackets in names? > > > > No. However, since brackets have a start and end token that are distinct, > finding all versions of your event is trivial compared to a single dot. > > Imagine two events: > Asserts > Asserts[MyCoolIndex] > > Resolves to tracepoints of: > Asserts:[0] > Asserts[MyCoolIndex]:[1] > > Regardless of brackets in the names, a simple glob of Asserts:\[*\] only > finds Asserts:[0]. This is because we have that end bracket in the glob > and the full event name including the start bracket. > > If I register another "version" of Asserts, then I'll have: > Asserts:[0] > Asserts[MyCoolIndex]:[1] > Asserts:[2] > > The glob of Asserts:\[*\] will return both: > Asserts:[0] > Asserts:[2] But what if you had registered "Asserts:[MyCoolIndex]:[1]"? Do you prevent colons? > > At this point the program can either record all versions or scan further > to find which version of Asserts is wanted. > > > > > > > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > > > doesn't work if the number is higher, like 128. If we ever decide to > > > change the ID from an integer to say hex to save space, these globs > > > would break. 
> > > > > > Is there some scheme that fits the C-variable name that addresses the > > > above scenarios? Brackets gave me a simple glob that seemed to prevent a > > > lot of this ("Asserts.\[*\]" in this case). > > > > Prevent a lot of what? I'm not sure what your example here is. > > > > I'll try again :) > > We have 2 events registered via user_events: > Asserts > Asserts.Verbose > > Using dot notation these would result in tracepoints of: > user_events_multi/Asserts.0 > user_events_multi/Asserts.Verbose.1 > > Using bracket notation these would result in tracepoints of: > user_events_multi/Asserts:[0] > user_events_multi/Asserts.Verbose:[1] > > A recording program only wants to enable the Asserts tracepoint. It does > not want to record the Asserts.Verbose tracepoint. > > The program must find the right tracepoint by scanning tracefs under the > user_events_multi system. > > A single dot suffix does not allow a simple glob to be used. The glob > Asserts.* will return both Asserts.0 and Asserts.Verbose.1. > > A simple glob of Asserts:\[*\] will only find Asserts:[0]; it will not > find Asserts.Verbose:[1]. > > We could just use brackets and not have the colon (Asserts[0] in this > case). But brackets are still special for bash. Are these shell scripts or programs? I use regex in programs all the time. And if you have shell scripts, use awk or something. Unless you prevent something from being added, I don't see the protection. > > > > > > > > Are we confident that we always want to represent the ID as a base-10 > > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > > programs can find their events easily. > > > > Is there a difference to what we choose? > > > > If a simple glob of event_name:\[*\] cannot be used, then we must document > what the suffix format is, so an appropriate regex can be created. If we > start with base-10 then later move to base-16 we will break existing regex > patterns on the recording side. 
> > I prefer, and have in this series, a base-16 output since it saves on > the tracepoint name size. I honestly don't care which base you use. So if you want to use base 16, I'm fine with that. > > Either way we go, we need to define how recording programs should find > the events they care about. So we must be very clear, IMHO, about the > format of the tracepoint names in our documentation. > > I personally think recording programs are likely to get this wrong > without proper guidance. > Agreed. -- Steve
[PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily
The eventfs_find_events() code tries to walk up the tree to find the event directory that a dentry belongs to, in order to then find the eventfs inode that is associated with that event directory. However, it uses an odd combination of walking the dentry parent, looking up the eventfs inode associated with that, and then looking up the dentry from there. Repeat. But the code shouldn't have back-pointers to dentries in the first place, and it should just walk the dentry parenthood chain directly. Similarly, 'set_top_events_ownership()' looks up the dentry from the eventfs inode, but the only reason it wants a dentry is to look up the superblock in order to look up the root dentry. But it already has the real filesystem inode, which has that same superblock pointer. So just pass in the superblock pointer using the information that's already there, instead of looking up extraneous data that is irrelevant. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 26 -- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1c3dd0ad4660..2d128bedd654 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry) +static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) { - struct inode *inode; + struct inode *root; /* Only update if the "events" was on the top level */ if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; /* Get the tracefs root inode. 
*/ - inode = d_inode(dentry->d_sb->s_root); - ei->attr.uid = inode->i_uid; - ei->attr.gid = inode->i_gid; + root = d_inode(sb->s_root); + ei->attr.uid = root->i_uid; + ei->attr.gid = root->i_gid; } static void set_top_events_ownership(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct eventfs_inode *ei = ti->private; - struct dentry *dentry; /* The top events directory doesn't get automatically updated */ if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) return; - dentry = ei->dentry; - - update_top_events_attr(ei, dentry); + update_top_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -235,8 +232,10 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) mutex_lock(&eventfs_mutex); do { - /* The parent always has an ei, except for events itself */ - ei = dentry->d_parent->d_fsdata; + // The parent is stable because we do not do renames + dentry = dentry->d_parent; + // ... and directories always have d_fsdata + ei = dentry->d_fsdata; /* * If the ei is being freed, the ownership of the children @@ -246,12 +245,11 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) ei = NULL; break; } - - dentry = ei->dentry; + // Walk upwards until you find the events inode } while (!ei->is_events); mutex_unlock(&eventfs_mutex); - update_top_events_attr(ei, dentry); + update_top_events_attr(ei, dentry->d_sb); return ei; } -- 2.43.0.5.g38fb137bdb
[PATCH 2/6] eventfs: initialize the tracefs inode properly
The tracefs-specific fields in the inode were not initialized before the inode was exposed to others through the dentry with 'd_instantiate()'. And the ->flags field was initialized incorrectly with a '|=', when the old value was stale. It should have just been a straight assignment. Move the field initializations up to before the d_instantiate, and fix the use of uninitialized data. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 2d128bedd654..c0d977e6c0f2 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode, inode->i_ino = EVENTFS_FILE_INODE_INO; ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE; + ti->flags = TRACEFS_EVENT_INODE; + ti->private = NULL; // Directories have 'ei', files not + d_instantiate(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); return eventfs_end_creating(dentry); @@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent inode->i_ino = eventfs_dir_ino(ei); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE; + ti->flags = TRACEFS_EVENT_INODE; + ti->private = ei; inc_nlink(inode); d_instantiate(dentry, inode); @@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, static void eventfs_post_create_dir(struct eventfs_inode *ei) { struct eventfs_inode *ei_child; - struct tracefs_inode *ti; lockdep_assert_held(&eventfs_mutex); @@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) srcu_read_lock_held(&eventfs_srcu)) { ei_child->d_parent = ei->dentry; } - - ti = get_tracefs(ei->dentry->d_inode); - ti->private = ei; } /** @@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(&ei->list); ti = get_tracefs(inode); - ti->flags |= 
TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; -- 2.43.0.5.g38fb137bdb
[PATCH 3/6] tracefs: dentry lookup crapectomy
The dentry lookup for eventfs files was very broken, and had lots of signs of the old situation where the filesystem names were all created statically in the dentry tree, rather than being looked up dynamically based on the eventfs data structures. You could see it in the naming - how it claimed to "create" dentries rather than just look up the dentries that were given it. You could see it in various nonsensical and very incorrect operations, like using "simple_lookup()" on the dentries that were passed in, which only results in those dentries becoming negative dentries. Which meant that any other lookup would possibly return ENOENT if it saw that negative dentry before the data was then later filled in. You could see it in the immense amount of nonsensical code that didn't actually just do lookups. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 275 --- 1 file changed, 52 insertions(+), 223 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index c0d977e6c0f2..ad11063bdd53 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) { struct eventfs_inode *ei; - mutex_lock(&eventfs_mutex); do { // The parent is stable because we do not do renames dentry = dentry->d_parent; @@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) } // Walk upwards until you find the events inode } while (!ei->is_events); - mutex_unlock(&eventfs_mutex); update_top_events_attr(ei, dentry->d_sb); @@ -280,11 +278,10 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, } /** - * create_file - create a file in the tracefs filesystem - * @name: the name of the file to create. + * lookup_file - look up a file in the tracefs filesystem + * @dentry: the dentry to look up + * @mode: the permission that the file should have. 
* @attr: saved attributes changed by user - * @parent: parent dentry for this file. * @data: something that the caller will want to get to later on. * @fop: struct file_operations that should be used for this file. * @@ -292,13 +289,13 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, * directory. The inode.i_private pointer will point to @data in the open() * call. */ -static struct dentry *create_file(const char *name, umode_t mode, +static struct dentry *lookup_file(struct dentry *dentry, + umode_t mode, struct eventfs_attr *attr, - struct dentry *parent, void *data, + void *data, const struct file_operations *fop) { struct tracefs_inode *ti; - struct dentry *dentry; struct inode *inode; if (!(mode & S_IFMT)) @@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, umode_t mode, if (WARN_ON_ONCE(!S_ISREG(mode))) return NULL; - WARN_ON_ONCE(!parent); - dentry = eventfs_start_creating(name, parent); - - if (IS_ERR(dentry)) - return dentry; - inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) return eventfs_failed_creating(dentry); @@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, umode_t mode, ti->flags = TRACEFS_EVENT_INODE; ti->private = NULL; // Directories have 'ei', files not - d_instantiate(dentry, inode); + d_add(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); return eventfs_end_creating(dentry); }; /** - * create_dir - create a dir in the tracefs filesystem + * lookup_dir_entry - look up a dir in the tracefs filesystem + * @dentry: the directory to look up * @ei: the eventfs_inode that represents the directory to create - * @parent: parent dentry for this file. * - * This function will create a dentry for a directory represented by + * This function will look up a dentry for a directory represented by * a eventfs_inode. 
*/ -static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent) +static struct dentry *lookup_dir_entry(struct dentry *dentry, + struct eventfs_inode *pei, struct eventfs_inode *ei) { struct tracefs_inode *ti; - struct dentry *dentry; struct inode *inode; - dentry = eventfs_start_creating(ei->name, parent); - if (IS_ERR(dentry)) - return dentry; - inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) return eventfs_failed_creating(dentry); @@ -372,8 +359,11 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *par
[PATCH 4/6] eventfs: remove unused 'd_parent' pointer field
It's never used

Signed-off-by: Linus Torvalds
---
 fs/tracefs/event_inode.c | 4 +---
 fs/tracefs/internal.h    | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..1d0102bfd7da 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -685,10 +685,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode
 	INIT_LIST_HEAD(&ei->list);
 
 	mutex_lock(&eventfs_mutex);
-	if (!parent->is_freed) {
+	if (!parent->is_freed)
 		list_add_tail(&ei->list, &parent->children);
-		ei->d_parent = parent->dentry;
-	}
 	mutex_unlock(&eventfs_mutex);
 
 	/* Was the parent freed? */
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..8f38740bfb5b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -35,7 +35,6 @@ struct eventfs_attr {
  * @name:	the name of the directory to create
  * @children:	link list into the child eventfs_inode
  * @dentry:	the dentry of the directory
- * @d_parent:	pointer to the parent's dentry
  * @d_children: The array of dentries to represent the files when created
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:	Saved mode and ownership of eventfs_inode itself
@@ -50,7 +49,6 @@ struct eventfs_inode {
 	const char		*name;
 	struct list_head	children;
 	struct dentry		*dentry; /* Check is_freed to access */
-	struct dentry		*d_parent;
 	struct dentry		**d_children;
 	struct eventfs_attr	*entry_attrs;
 	struct eventfs_attr	attr;
-- 
2.43.0.5.g38fb137bdb
[PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
The eventfs inode had pointers to dentries (and child dentries) without actually holding a refcount on said pointer. That is fundamentally broken, and while eventfs tried to then maintain coherence with dentries going away by hooking into the '.d_iput' callback, that doesn't actually work since it's not ordered wrt lookups. There were two reasons why eventfs tried to keep a pointer to a dentry: - the creation of an 'events' directory would actually have a stable dentry pointer that it created with tracefs_start_creating(). And it needed that dentry when tearing it all down again in eventfs_remove_events_dir(). This use is actually ok, because the special top-level events directory dentries are actually stable, not just a temporary cache of the eventfs data structures. - the 'eventfs_inode' (aka ei) needs to stay around as long as there are dentries that refer to it. It then used these dentry pointers as a replacement for doing reference counting: it would try to make sure that there was only ever one dentry associated with an event_inode, and keep a child dentry array around to see which dentries might still refer to the parent ei. This gets rid of the invalid dentry pointer use, and renames the one valid case to a different name to make it clear that it's not just any random dentry. The magic child dentry array that is kind of a "reverse reference list" is simply replaced by having child dentries take a ref to the ei. As do the directory dentries. That makes the broken use case go away. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 245 --- fs/tracefs/internal.h| 9 +- 2 files changed, 80 insertions(+), 174 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1d0102bfd7da..a37db0dac302 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -62,6 +62,34 @@ enum { #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) +/* + * eventfs_inode reference count management. + * + * NOTE! 
We count only references from dentries, in the + * form 'dentry->d_fsdata'. There are also references from + * directory inodes ('ti->private'), but the dentry reference + * count is always a superset of the inode reference count. + */ +static void release_ei(struct kref *ref) +{ + struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + kfree(ei->entry_attrs); + kfree(ei); +} + +static inline void put_ei(struct eventfs_inode *ei) +{ + if (ei) + kref_put(&ei->kref, release_ei); +} + +static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) +{ + if (ei) + kref_get(&ei->kref); + return ei; +} + static struct dentry *eventfs_root_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags); @@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode, * directory. The inode.i_private pointer will point to @data in the open() * call. */ -static struct dentry *lookup_file(struct dentry *dentry, +static struct dentry *lookup_file(struct eventfs_inode *parent_ei, + struct dentry *dentry, umode_t mode, struct eventfs_attr *attr, void *data, @@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry *dentry, mode |= S_IFREG; if (WARN_ON_ONCE(!S_ISREG(mode))) - return NULL; + return ERR_PTR(-EIO); inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) - return eventfs_failed_creating(dentry); + return ERR_PTR(-ENOMEM); /* If the user updated the directory's attributes, use them */ update_inode_attr(dentry, inode, attr, mode); @@ -322,9 +351,12 @@ static struct dentry *lookup_file(struct dentry *dentry, ti->flags = TRACEFS_EVENT_INODE; ti->private = NULL; // Directories have 'ei', files not + // Files have their parent's ei as their fsdata + dentry->d_fsdata = get_ei(parent_ei); + d_add(dentry, inode); fsnotify_create(dentry->d_parent->d_inode, dentry); - return eventfs_end_creating(dentry); + return NULL; }; /** @@ -343,7 +375,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry, inode = tracefs_get_inode(dentry->d_sb); if (unlikely(!inode)) - return eventfs_failed_creating(dentry); + return ERR_PTR(-ENOMEM); /* If the user updated the directory's attributes, use them */ update_inode_attr(dentry, inode, &ei->attr, @@ -359,24 +391,28 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, ti->flags = TRACEFS_EVENT_
[PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
In order for the dentries to stay up-to-date with the eventfs changes, just add a 'd_revalidate' function that checks the 'is_freed' bit. Also, clean up the dentry release to actually use d_release() rather than the slightly odd d_iput() function. We don't care about the inode, all we want to do is to get rid of the refcount to the eventfs data added by dentry->d_fsdata. It would probably be cleaner to make eventfs its own filesystem, or at least set its own dentry ops when looking up eventfs files. But as it is, only eventfs dentries use d_fsdata, so we don't really need to split these things up by use. Another thing that might be worth doing is to make all eventfs lookups mark their dentries as not worth caching. We could do that with d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. As it is, the dentries are all freeable, but they only tend to get freed at memory pressure rather than more proactively. But that's a separate issue. Signed-off-by: Linus Torvalds --- fs/tracefs/event_inode.c | 21 + fs/tracefs/inode.c | 27 ++- fs/tracefs/internal.h| 3 ++- 3 files changed, 25 insertions(+), 26 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a37db0dac302..acdc797bd9c0 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name) /** - * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode - * @ti: the tracefs_inode of the dentry + * eventfs_d_release - dentry is going away * @dentry: dentry which has the reference to remove. * * Remove the association between a dentry from an eventfs_inode. 
*/ -void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry) +void eventfs_d_release(struct dentry *dentry) { - struct eventfs_inode *ei; - - mutex_lock(&eventfs_mutex); - ei = dentry->d_fsdata; - if (ei) { - dentry->d_fsdata = NULL; - put_ei(ei); - } - mutex_unlock(&eventfs_mutex); + put_ei(dentry->d_fsdata); } /** @@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir, } enoent: - /* Nothing found? */ - d_add(dentry, NULL); - result = NULL; + /* Don't cache negative lookups, just return an error */ + result = ERR_PTR(-ENOENT); out: mutex_unlock(&eventfs_mutex); @@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei) * sticks around while the other ei->dentry are created * and destroyed dynamically. */ - simple_recursive_removal(dentry, NULL); dput(dentry); } diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..64122787e5d0 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = { .show_options = tracefs_show_options, }; -static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode) +/* + * It would be cleaner if eventfs had its own dentry ops. + * + * Note that d_revalidate is called potentially under RCU, + * so it can't take the eventfs mutex etc. It's fine - if + * we open a file just as it's marked dead, things will + * still work just fine, and just see the old stale case. 
+ */ +static void tracefs_d_release(struct dentry *dentry) { - struct tracefs_inode *ti; + if (dentry->d_fsdata) + eventfs_d_release(dentry); +} - if (!dentry || !inode) - return; +static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags) +{ + struct eventfs_inode *ei = dentry->d_fsdata; - ti = get_tracefs(inode); - if (ti && ti->flags & TRACEFS_EVENT_INODE) - eventfs_set_ei_status_free(ti, dentry); - iput(inode); + return !(ei && ei->is_freed); } static const struct dentry_operations tracefs_dentry_operations = { - .d_iput = tracefs_dentry_iput, + .d_revalidate = tracefs_d_revalidate, + .d_release = tracefs_d_release, }; static int trace_fill_super(struct super_block *sb, void *data, int silent) diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 72db3bdc4dfb..d4194466b643 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb); struct dentry *eventfs_start_creating(const char *name, struct dentry *parent); struct dentry *eventfs_failed_creating(struct dentry *dentry); struct dentry *eventfs_end_creating(struct dentry *dentry); -void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry); + +void eventfs_d_release(struct dentry *dentry); #endif /* _TRACEFS_INTERNAL_H */ -- 2.43.0.5.g38fb137bdb
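The new revalidate rule in the patch above is small enough to model in plain C. The sketch below is a userspace stand-in for the logic only (the struct and function names are mine, not kernel code): a dentry with no eventfs data — i.e. a plain tracefs file — always revalidates, while one whose eventfs_inode has been marked freed does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the eventfs_inode state d_revalidate cares about. */
struct eventfs_inode_model {
	bool is_freed;
};

/*
 * Mirrors the shape of tracefs_d_revalidate(): d_fsdata is NULL for
 * plain tracefs files, so only eventfs entries whose 'ei' has been
 * marked freed fail revalidation and get looked up again.
 */
static int revalidate_model(struct eventfs_inode_model *ei)
{
	return !(ei && ei->is_freed);
}
```

Note the NULL case: because the check is `ei && ei->is_freed`, dentries that never had eventfs data attached stay valid for free, which is why one shared `dentry_operations` works for all of tracefs.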
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 10:23, Steven Rostedt wrote: > > I know you don't send patches inlined anymore, which is a shame, because > patchwork takes care of all the administering when patches are inlined, and > I don't miss patches like I use to. I just sent the whole series as individual patches, and doing a b4 am 20240130190355.11486-1-torva...@linux-foundation.org should get them all (or if you use patchwork, do that) I don't normally do these inlined patches any more, because honestly, 99% of all patches I do end up being "I can't test this very well, I think you should do something like this". In fact, for simple ones where I might not have even compile-tested them, much less booted into a kernel with them, I will actively whitespace-corrupt the patch, just to make sure they aren't applied as any kind of real workflow - they are almost always meant as a "I think you should do this, and take credit for it all". And so when I'm working on a series like this, I'll send attachments just because it's easier, and because I don't want to patch-bomb people with some kind of crazy work-in-progress thing. But I'm reasonably comfortable with this series now, so I sent it as a "real" patch series. I like it partly because it just removes a lot of lines: 3 files changed, 160 insertions(+), 433 deletions(-) but mostly because the lines it removes are what I consider actively broken code. So it's not just getting rid of LOC, it's getting rid of complexity (and bugs) IMHO. That said, I also don't think that patch series is any kind of "final". I didn't fix up the readdir iterator locking, for example. I don't think the SRCU parts are needed at all any more thanks to the refcounting - the 'ei' is no longer protected by SRCU, it's protected by virtue of us having the file open (and thus holding the dentry). So please think of that series not as any kind of final word. More as a "I strongly believe this is the direction eventfs should go". 
I am perfectly ok with you munging those patches and taking full credit for them, for example. My "testing" has not involved any real tracing usage, and while I *have* booted that thing, and have done some very basic smoke-testing in /sys/kernel/tracing, 99% of the work was literally me just going through the lookup code, removing everything I found objectionable, and replacing it with what the VFS layer generally would want. Linus
Re: [ANNOUNCE] 5.10.204-rt100
Hi! We (as in the cip project) are trying to do -cip-rt releases once a month. Are there any plans for a 5.10-rt release any time soon? That would help us ;-). Best regards, Pavel -- DENX Software Engineering GmbH, Managing Director: Erika Unter HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
[RFC PATCH] kernel/module: add a safer implementation of try_module_get()
The current implementation of try_module_get() requires the module to exist and be live as a precondition. While this may seem intuitive at first glance, enforcing the precondition can be tricky, considering that modules can be unloaded at any time if not previously taken. For instance, the caller could be preempted just before calling try_module_get(), and while preempted, the module could be unloaded and freed. More subtly, the module could also be unloaded at any point while executing try_module_get() before incrementing the refcount with atomic_inc_not_zero(). Neglecting the precondition that the module must exist and be live can cause unexpected race conditions that can lead to crashes. However, ensuring that the precondition is met may require additional locking that increases the complexity of the code and can make it more error-prone. This patch adds a slower yet safer implementation of try_module_get() that checks if the module is valid by looking into the mod_tree before taking the module's refcount. This new function can be safely called on stale and invalid module pointers, relieving developers from the burden of ensuring that the module exists and is live before attempting to take it. The tree lookup and refcount increment are executed after taking the module_mutex to prevent the module from being unloaded after looking up the tree. Signed-off-by: Marco Pagani --- include/linux/module.h | 15 +++ kernel/module/main.c | 27 +++ 2 files changed, 42 insertions(+) diff --git a/include/linux/module.h b/include/linux/module.h index 08364d5cbc07..86b6ea43d204 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -695,6 +695,19 @@ extern void __module_get(struct module *module); */ extern bool try_module_get(struct module *module); +/** + * try_module_get_safe() - safely take the refcount of a module. + * @module: address of the module to be taken. + * + * Safer version of try_module_get(). 
Check first if the module exists and is alive, + * and then take its reference count. + * + * Return: + * * %true - module exists and its refcount has been incremented or module is NULL. + * * %false - module does not exist. + */ +extern bool try_module_get_safe(struct module *module); + /** * module_put() - release a reference count to a module * @module: the module we should release a reference count for @@ -815,6 +828,8 @@ static inline bool try_module_get(struct module *module) return true; } +#define try_module_get_safe(module) try_module_get(module) + static inline void module_put(struct module *module) { } diff --git a/kernel/module/main.c b/kernel/module/main.c index 98fedfdb8db5..22376b69778c 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -842,6 +842,33 @@ bool try_module_get(struct module *module) } EXPORT_SYMBOL(try_module_get); +bool try_module_get_safe(struct module *module) +{ + struct module *mod; + bool ret = true; + + if (!module) + goto out; + + mutex_lock(&module_mutex); + + /* +* Check if the address points to a valid live module and take +* the refcount only if it points to the module struct. +*/ + mod = __module_address((unsigned long)module); + if (mod && mod == module && module_is_live(mod)) + __module_get(mod); + else + ret = false; + + mutex_unlock(&module_mutex); + +out: + return ret; +} +EXPORT_SYMBOL(try_module_get_safe); + void module_put(struct module *module) { int ret; base-commit: 4515d08a742c76612b65d2f47a87d12860519842 -- 2.43.0
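The core idea of the RFC above — look the pointer up in a registry under a lock, and only then bump its refcount, so a stale pointer is rejected instead of dereferenced — can be modeled in plain userspace C. This is an illustrative sketch, not the kernel implementation: the registry array, `lookup()`, and `get_safe()` are stand-ins for mod_tree, `__module_address()`, and `try_module_get_safe()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct mod {
	int refcnt;
	bool live;
};

static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;
static struct mod *registry[8];	/* the currently loaded "modules" */

/* Analogue of __module_address(): is this pointer a registered module? */
static struct mod *lookup(struct mod *p)
{
	for (size_t i = 0; i < 8; i++)
		if (registry[i] == p)
			return p;
	return NULL;
}

/*
 * Analogue of try_module_get_safe(): the validity check and the
 * refcount increment happen under one lock, so the module cannot be
 * "unloaded" (removed from the registry) in between.
 */
static bool get_safe(struct mod *p)
{
	bool ok = true;

	if (!p)
		return true;	/* NULL means built-in: always succeeds */

	pthread_mutex_lock(&registry_lock);
	if (lookup(p) && p->live)
		p->refcnt++;
	else
		ok = false;
	pthread_mutex_unlock(&registry_lock);
	return ok;
}
```

A caller can then pass a possibly-stale pointer without any prior guarantee that the object still exists, which is exactly the burden the RFC is trying to lift.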
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 11:19:01 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 10:23, Steven Rostedt wrote: > > > > I know you don't send patches inlined anymore, which is a shame, because > > patchwork takes care of all the administering when patches are inlined, and > > I don't miss patches like I use to. > > I just sent the whole series as individual patches., and doing a > > b4 am 20240130190355.11486-1-torva...@linux-foundation.org > > should get them all (or if you use patchwork, do that) Thanks. I can get the mbox with: wget https://patchwork.kernel.org/series/821406/mbox/ For those interested, the series is at: https://patchwork.kernel.org/project/linux-trace-kernel/list/?series=821406 And I got the above mbox link from the first patch and selecting the 'mbox' link. I download that and then run my scripts which adds the Link tags to them as well as Cc's. I'll add the link to this thread as well. Then I simply do a 'git am -s' Note, I mentioned the pain of pulling in your multi-attachment patches by hand on IRC, and Konstantin replied that he'll fix b4 to do those too ;-) But I still prefer patchwork, as that keeps history as well. > > I don't normally do these inlined patches any more, because honestly, > 99% of all patches I do end up being "I can't test this very well, I > think you should do something like this". > > In fact, for simple ones where I might not have even compile-tested > them, much less booted into a kernel with them, I will actively > whitespace-corrupt the patch, just to make sure they aren't applied as > any kind of real workflow - they are almost always meant as a "I think > you should do this, and take credit for it all". > > And so when I'm working on a series like this, I'll send attachments > just because it's easier, and because I don't want to patch-bomb > people with some kind of crazy work-in-progress thing. > > But I'm reasonably comfortable with this series now, so I sent it as a > "real" patch series. 
I like it partly because it just removes a lot of > lines: > > 3 files changed, 160 insertions(+), 433 deletions(-) > > but mostly because the lines it removes are what I consider actively > broken code. So it's not just getting rid of LOC, it's getting rid of > complexity (and bugs) IMHO. Do you want me to put this in my urgent branch and mark them for stable, and then send them for 6.8? > > That said, I also don't think that patch series is any kind of > "final". I didn't fix up the readdir iterator locking, for example. I > don't think the SRCU parts are needed at all any more thanks to the > refcounting - the 'ei' is no longer protected by SRCU, it's protected > by virtue of us having the file open (and thus holding the dentry). > > So please think of that series not as any kind of final word. More as > a "I strongly believe this is the direction eventfs should go". > > I am perfectly ok with you munging those patches and taking full > credit for them, for example. Before seeing your 5 patch series, I wrote that patch I showed you which did basically the same thing. It passed all the preliminary tests so I'm sure your version should work too. I'll still have to run it through my full ktest suite, but that takes several hours to complete. I do that before I send out my 'for-linus/next' or any git pulls. > > My "testing" has not involved any real tracing usage, and while I > *have* booted that thing, and have done some very basic smoke-testing > in /sys/kernel/tracing, 99% of the work was literally me just going > through the lookup code, removing everything I found objectionable, > and replacing it with what the VFS layer generally would want. I'll take your patches and start working on them. Thanks for doing this! -- Steve
Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
On Tue, 30 Jan 2024 11:03:51 -0800 Linus Torvalds wrote: > The tracefs-specific fields in the inode were not initialized before the > inode was exposed to others through the dentry with 'd_instantiate()'. > > And the ->flags field was initialized incorrectly with a '|=', when the > old value was stale. It should have just been a straight assignment. The ti is allocated from fs/tracefs/inode.c that has: static struct inode *tracefs_alloc_inode(struct super_block *sb) { struct tracefs_inode *ti; ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); if (!ti) return NULL; ti->flags = 0; return &ti->vfs_inode; } Shouldn't that make it valid? Granted, the eventfs inodes don't have any of the tracefs inode flags set. But I purposely made the ti->flags initialized to zero. Is this update really necessary? Or do I need to make sure that the iput() clears it? The flags are used by tracefs, so I want to know if there's not a bug there. -- Steve > > Move the field initializations up to before the d_instantiate, and fix > the use of uninitialized data. 
> > Signed-off-by: Linus Torvalds > --- > fs/tracefs/event_inode.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c > index 2d128bedd654..c0d977e6c0f2 100644 > --- a/fs/tracefs/event_inode.c > +++ b/fs/tracefs/event_inode.c > @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, > umode_t mode, > inode->i_ino = EVENTFS_FILE_INODE_INO; > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > + ti->private = NULL; // Directories have 'ei', files > not > + > d_instantiate(dentry, inode); > fsnotify_create(dentry->d_parent->d_inode, dentry); > return eventfs_end_creating(dentry); > @@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode > *ei, struct dentry *parent > inode->i_ino = eventfs_dir_ino(ei); > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > + ti->private = ei; > > inc_nlink(inode); > d_instantiate(dentry, inode); > @@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, > static void eventfs_post_create_dir(struct eventfs_inode *ei) > { > struct eventfs_inode *ei_child; > - struct tracefs_inode *ti; > > lockdep_assert_held(&eventfs_mutex); > > @@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode > *ei) >srcu_read_lock_held(&eventfs_srcu)) { > ei_child->d_parent = ei->dentry; > } > - > - ti = get_tracefs(ei->dentry->d_inode); > - ti->private = ei; > } > > /** > @@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const > char *name, struct dentry > INIT_LIST_HEAD(&ei->list); > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; > + ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; > ti->private = ei; > > inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
On Tue, 30 Jan 2024 11:03:51 -0800 Linus Torvalds wrote: > @@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, > umode_t mode, > inode->i_ino = EVENTFS_FILE_INODE_INO; > > ti = get_tracefs(inode); > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > + ti->private = NULL; // Directories have 'ei', files > not Although ti->private does need to be initialized here. -- Steve > + > d_instantiate(dentry, inode); > fsnotify_create(dentry->d_parent->d_inode, dentry); > return eventfs_end_creating(dentry);
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 at 11:37, Steven Rostedt wrote: > > Do you want me to put this in my urgent branch and mark them for stable, > and then send them for 6.8? Hmm. I think the only one that fixes a _reported_ bug is that [PATCH 2/6]. And it turns out that while 'ti->private' really is entirely uninitialized (and I still think it's the cause of the kernel test robot report that started this thread), the ti->flags field _is_ initialized to zero in tracefs_alloc_inode(). So even in that [PATCH 2/6], these parts: - ti->flags |= TRACEFS_EVENT_INODE; + ti->flags = TRACEFS_EVENT_INODE; aren't strictly needed (but aren't wrong either). The 'make sure to initialize ti->private before exposing the dentry' part *definitely* needs to be part of 6.8, though. That has an outstanding actually triggered bug report on it. I do think that tracefs_alloc_inode() should also initialize ti->private to NULL, but while that would fix the oops that the test robot reported, it wouldn't fix the data-race on any ti->private accesses. So that "ti->private = ei" needs to be done before the d_instantiate() (that later became a d_add()) regardless. But not having random fields left uninitialized for future subtle bugs would be a good idea too. Anyway. If you do run the full tracefs tests on the whole series, and there are no other major problems, I'll happily take it all for 6.8. And yes, even mark it for stable. I think the other bugs are much harder to hit, but I do think they exist. And code deletion is always good. So give it the full test attention, and *if* it all still looks good and there are no new subtle issues that crop up, let's just put this saga behind us asap. Linus
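The ordering rule Linus spells out — assign ti->private first, then call d_instantiate()/d_add() — is the classic init-before-publish pattern. A small userspace model (the types, atomics, and names here are illustrative stand-ins, not the tracefs code) shows the safe ordering: a concurrent reader can only reach the object through the published pointer, and by then every field is already filled in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the tracefs_inode whose private data gets published. */
struct node {
	void *private_data;
};

static struct node slot;
static _Atomic(struct node *) published;	/* plays the role of d_instantiate() */

static void *reader(void *arg)
{
	struct node *n;

	(void)arg;
	/* Spin until the object is visible, as a concurrent lookup would. */
	while (!(n = atomic_load_explicit(&published, memory_order_acquire)))
		;
	/* With init-before-publish, the field can never be seen unset. */
	return n->private_data;
}

static void *publish_correctly(void)
{
	pthread_t t;
	void *seen;

	pthread_create(&t, NULL, reader, NULL);
	slot.private_data = &slot;	/* initialize first... */
	atomic_store_explicit(&published, &slot, memory_order_release);	/* ...then publish */
	pthread_join(t, &seen);
	return seen;
}
```

Reversing the two marked lines is the bug being fixed: the reader could observe the published pointer while `private_data` still held stale garbage — which is exactly what the test robot's oops looked like.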
Re: [PATCH 2/6] eventfsfs: initialize the tracefs inode properly
On Tue, 30 Jan 2024 14:48:02 -0500 Steven Rostedt wrote: > The ti is allocated from fs/tracefs/inode.c that has: > > static struct inode *tracefs_alloc_inode(struct super_block *sb) > { > struct tracefs_inode *ti; > > ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); I could also just add __GFP_ZERO so that all of it is initialized to zero, and then we don't need to assign NULL to any part of it. -- Steve > if (!ti) > return NULL; > > ti->flags = 0; > > return &ti->vfs_inode; > } > >
Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address
On Tue, 30 Jan 2024 11:54:56 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 11:37, Steven Rostedt wrote: > > > > Do you want me to put this in my urgent branch and mark them for stable, > > and then send them for 6.8? > > Hmm. I think the only one that fixes a _reported_ bug is that [PTCH > 2/6]. And it turns out that while 'ti->private' really is entirely > uninitialized (and I still think it's the cause of the kernel test > robot report that started this thread), the ti->flags field _is_ > initialized to zero in tracefs_alloc_inode(). > > So even in that [PATCH 2/6], these parts: > > - ti->flags |= TRACEFS_EVENT_INODE; > + ti->flags = TRACEFS_EVENT_INODE; > > aren't strictly needed (but aren't wrong either). > > The 'make sure to initialize ti->private before exposing the dentry' > part *definitely* needs to be part of 6.8, though. That has an > outstanding actually triggered bug report on it. > > I do think that tracefs_alloc_inode() should also initialize > ti->private to NULL, but while that would fix the oops that the test > robot reported, it wouldn't fix the data-race on any ti->private > accesses. > > So that "ti->private = ei" needs to be done before the d_instantiate() > (that later became a d_add()) regardless. But not having random fields > left uninitialized for future subtle bugs would be a good idea too. I'll add a patch to add __GFP_ZERO to the tracefs inode allocation, and modify your patch 2 to just move the ti->private = ei; > > Anyway. > > If you do run the full tracefs tests on the whole series, and there > are no other major problems, I'll happily take it all for 6.8. And > yes, even mark it for stable. I think the other bugs are much harder > to hit, but I do think they exist. And code deletion is always good. Sounds good. > > So give it the full test attention, and *if* it all still looks good > and there are no new subtle issues that crop up, let's just put this > saga behind us asap. 
Yes, I've been wanting to get away from eventfs for a month now. Again, I really do appreciate the time you put in to fixing this for me. I'm going to be backporting this to older Chromebooks as we really need to cut down on the memory overhead of instances. -- Steve
[PATCH] tracefs: Zero out the tracefs_inode when allocating it
From: "Steven Rostedt (Google)" eventfs uses the tracefs_inode and assumes that it's already initialized to zero. That is, it doesn't set fields to zero (like ti->private) after getting its tracefs_inode. This causes bugs due to stale values. Just initialize the entire structure to zero on allocation so there aren't any more surprises. This is a partial fix for accessing ti->private. The assignment still needs to be made before the dentry is instantiated. Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.s...@intel.com Suggested-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index e1b172c0e091..f7cde61ff2fc 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -34,12 +34,10 @@ static struct inode *tracefs_alloc_inode(struct super_block *sb) { struct tracefs_inode *ti; - ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); + ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL | __GFP_ZERO); if (!ti) return NULL; - ti->flags = 0; - return &ti->vfs_inode; } -- 2.43.0
[PATCH v2] eventfs: Initialize the tracefs inode properly
From: Linus Torvalds The tracefs-specific fields in the inode were not initialized before the inode was exposed to others through the dentry with 'd_instantiate()'. Move the field initializations up to before the d_instantiate. Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202401291043.e62e89dc-oliver.s...@intel.com Signed-off-by: Linus Torvalds Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240130190355.11486-2-torva...@linux-foundation.org - Since another patch zeroed out the entire tracefs_inode, there's no need to initialize any of its fields to NULL. (see https://lore.kernel.org/linux-trace-kernel/20240130151737.6e97a...@gandalf.local.home/) fs/tracefs/event_inode.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 1c3dd0ad4660..824b1811e342 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -370,6 +370,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent ti = get_tracefs(inode); ti->flags |= TRACEFS_EVENT_INODE; + /* Only directories have ti->private set to an ei, not files */ + ti->private = ei; inc_nlink(inode); d_instantiate(dentry, inode); @@ -515,7 +517,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx, static void eventfs_post_create_dir(struct eventfs_inode *ei) { struct eventfs_inode *ei_child; - struct tracefs_inode *ti; lockdep_assert_held(&eventfs_mutex); @@ -525,9 +526,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei) srcu_read_lock_held(&eventfs_srcu)) { ei_child->d_parent = ei->dentry; } - - ti = get_tracefs(ei->dentry->d_inode); - ti->private = ei; } /** -- 2.43.0
Re: [PATCH 1/3] init: Declare rodata_enabled and mark_rodata_ro() at all time
On Tue, Jan 30, 2024 at 06:48:11PM +0100, Marek Szyprowski wrote: > Dear All, > > On 30.01.2024 12:03, Christophe Leroy wrote: > > Le 30/01/2024 à 10:16, Chen-Yu Tsai a écrit : > >> [Vous ne recevez pas souvent de courriers de we...@chromium.org. D?couvrez > >> pourquoi ceci est important ? > >> https://aka.ms/LearnAboutSenderIdentification ] > >> > >> On Mon, Jan 29, 2024 at 12:09:50PM -0800, Luis Chamberlain wrote: > >>> On Thu, Dec 21, 2023 at 10:02:46AM +0100, Christophe Leroy wrote: > >>>> Declaring rodata_enabled and mark_rodata_ro() at all time > >>>> helps removing related #ifdefery in C files. > >>>> > >>>> Signed-off-by: Christophe Leroy > >>> Very nice cleanup, thanks!, applied and pushed > >>> > >>> Luis > >> On next-20240130, which has your modules-next branch, and thus this > >> series and the other "module: Use set_memory_rox()" series applied, > >> my kernel crashes in some very weird way. Reverting your branch > >> makes the crash go away. > >> > >> I thought I'd report it right away. Maybe you folks would know what's > >> happening here? This is on arm64. > > That's strange, it seems to bug in module_bug_finalize() which is > > _before_ calls to module_enable_ro() and such. > > > > Can you try to revert the 6 patches one by one to see which one > > introduces the problem ? > > > > In reality, only patch 677bfb9db8a3 really change things. Other ones are > > more on less only cleanup. > > I've also run into this issue with today's (20240130) linux-next on my > test farm. The issue is not fully reproducible, so it was a bit hard to > bisect it automatically. 
I've spent some time on manual testing and it > looks that reverting the following 2 commits on top of linux-next fixes > the problem: > > 65929884f868 ("modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around > rodata_enabled") > 677bfb9db8a3 ("module: Don't ignore errors from set_memory_XX()") > > This in fact means that commit 677bfb9db8a3 is responsible for this > regression, as 65929884f868 has to be reverted only because the latter > depends on it. Let me know what I can do to help debugging this issue. Thanks for the bisect, I've reset my tree to commit 3559ad395bf02 ("module: Change module_enable_{nx/x/ro}() to more explicit names") for now then, so to remove those commits. Luis
Re: [PATCH v9 3/3] remoteproc: zynqmp: parse TCM from device tree
On 1/17/24 12:58 PM, Mathieu Poirier wrote: > Alright, I spent several hours looking at this patchset and the driver as a > whole. I certainly salute your efforts to heed my advice and make the code > less > brittle but I'm afraid we are not there. Hi Mathieu, I am back from vacation and have started looking into this. Thanks for spending time on this and helping to make the driver's design cleaner. Please find my comments below. > See below for a different way to proceed. > > On Wed, Jan 10, 2024 at 01:35:05PM -0800, Tanmay Shah wrote: > > ZynqMP TCM information was fixed in driver. Now ZynqMP TCM information > > is available in device-tree. Parse TCM information in driver > > as per new bindings. > > > > Signed-off-by: Tanmay Shah > > --- > > > > Changes in v9: > > - Introduce new API to request and release core1 TCM power-domains in > > lockstep mode. This will be used during prepare -> add_tcm_banks > > callback to enable TCM in lockstep mode. > > - Parse TCM from device-tree in lockstep mode and split mode in > > uniform way. > > - Fix TCM representation in device-tree in lockstep mode. > > > > Changes in v8: > > - Remove pm_domains framework > > - Remove checking of pm_domain_id validation to power on/off tcm > > - Remove spurious change > > - parse power-domains property from device-tree and use EEMI calls > > to power on/off TCM instead of using pm domains framework > > > > Changes in v7: > > - move checking of pm_domain_id from previous patch > > - fix mem_bank_data memory allocation > > > > drivers/remoteproc/xlnx_r5_remoteproc.c | 245 +++- > > 1 file changed, 239 insertions(+), 6 deletions(-) > > > > diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > > b/drivers/remoteproc/xlnx_r5_remoteproc.c > > index 4395edea9a64..0f87b984850b 100644 > > --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > > +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > > @@ -74,8 +74,8 @@ struct mbox_info { > > }; > > > > /* > > - * Hardcoded TCM bank values. 
This will be removed once TCM bindings are > > - * accepted for system-dt specifications and upstreamed in linux kernel > > + * Hardcoded TCM bank values. This will stay in driver to maintain backward > > + * compatibility with device-tree that does not have TCM information. > > */ > > static const struct mem_bank_data zynqmp_tcm_banks_split[] = { > > {0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each > > */ > > @@ -102,6 +102,7 @@ static const struct mem_bank_data > > zynqmp_tcm_banks_lockstep[] = { > > * @rproc: rproc handle > > * @pm_domain_id: RPU CPU power domain id > > * @ipi: pointer to mailbox information > > + * @lockstep_core1_np: second core's device_node to use in lockstep mode > > */ > > struct zynqmp_r5_core { > > struct device *dev; > > @@ -111,6 +112,7 @@ struct zynqmp_r5_core { > > struct rproc *rproc; > > u32 pm_domain_id; > > struct mbox_info *ipi; > > + struct device_node *lockstep_core1_np; > > }; > > > > /** > > @@ -539,6 +541,110 @@ static int tcm_mem_map(struct rproc *rproc, > > return 0; > > } > > > > +int request_core1_tcm_lockstep(struct rproc *rproc) > > +{ > > + struct zynqmp_r5_core *r5_core = rproc->priv; > > + struct of_phandle_args out_args = {0}; > > + int ret, i, num_pd, pd_id, ret_err; > > + struct device_node *np; > > + > > + np = r5_core->lockstep_core1_np; > > + > > + /* Get number of power-domains */ > > + num_pd = of_count_phandle_with_args(np, "power-domains", > > + "#power-domain-cells"); > > + if (num_pd <= 0) > > + return -EINVAL; > > + > > + /* Get individual power-domain id and enable TCM */ > > + for (i = 1; i < num_pd; i++) { > > + ret = of_parse_phandle_with_args(np, "power-domains", > > +"#power-domain-cells", > > +i, &out_args); > > + if (ret) { > > + dev_warn(r5_core->dev, > > +"failed to get tcm %d in power-domains list, > > ret %d\n", > > +i, ret); > > + goto fail_request_core1_tcm; > > + } > > + > > + pd_id = out_args.args[0]; > > + of_node_put(out_args.np); > > + > > + ret = 
zynqmp_pm_request_node(pd_id, > > ZYNQMP_PM_CAPABILITY_ACCESS, 0, > > +ZYNQMP_PM_REQUEST_ACK_BLOCKING); > > + if (ret) { > > + dev_err(r5_core->dev, "failed to request TCM node > > 0x%x\n", > > + pd_id); > > + goto fail_request_core1_tcm; > > + } > > + } > > + > > + return 0; > > + > > +fail_request_core1_tcm: > > + > > + /* Cache actual error to return later */ > > + ret_err = ret; > > + > > + /* Release previously requested TCM in case of failure */ > > + while (--i > 0)
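The error path in the quoted patch follows a common request-with-rollback shape: take resources one by one, and on failure release the ones already taken in reverse order with a `while (--i > 0)` unwind. A self-contained userspace model of that shape (the power-domain bookkeeping and the simulated failure are made up; the real code calls the EEMI `zynqmp_pm_request_node()`/release helpers):

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PD 4

static bool requested[NUM_PD];

static int request_pd(int i)
{
	if (i == 3)
		return -1;	/* simulate the last TCM power domain failing */
	requested[i] = true;
	return 0;
}

static void release_pd(int i)
{
	requested[i] = false;
}

/*
 * Mirrors request_core1_tcm_lockstep(): request entries 1..NUM_PD-1
 * (index 0, the CPU power domain, is handled elsewhere), and on
 * failure release the ones already taken in reverse order.
 */
static int request_all(void)
{
	int i, ret = 0;

	for (i = 1; i < NUM_PD; i++) {
		ret = request_pd(i);
		if (ret)
			goto fail;
	}
	return 0;
fail:
	while (--i > 0)
		release_pd(i);
	return ret;
}
```

The key property is that after a failure the system is back in its initial state: every successfully requested entry has been released, and the original error code is returned to the caller.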
Re: [RFC PATCH] kernel/module: add a safer implementation of try_module_get()
On Tue, Jan 30, 2024 at 08:36:14PM +0100, Marco Pagani wrote: > The current implementation of try_module_get() requires the module to > exist and be live as a precondition. While this may seem intuitive at > first glance, enforcing the precondition can be tricky, considering that > modules can be unloaded at any time if not previously taken. For > instance, the caller could be preempted just before calling > try_module_get(), and while preempted, the module could be unloaded and > freed. More subtly, the module could also be unloaded at any point while > executing try_module_get() before incrementing the refount with > atomic_inc_not_zero(). > > Neglecting the precondition that the module must exist and be live can > cause unexpected race conditions that can lead to crashes. However, > ensuring that the precondition is met may require additional locking > that increases the complexity of the code and can make it more > error-prone. > > This patch adds a slower yet safer implementation of try_module_get() > that checks if the module is valid by looking into the mod_tree before > taking the module's refcount. This new function can be safely called on > stale and invalid module pointers, relieving developers from the burden > of ensuring that the module exists and is live before attempting to take > it. > > The tree lookup and refcount increment are executed after taking the > module_mutex to prevent the module from being unloaded after looking up > the tree. > > Signed-off-by: Marco Pagani It very much sounds like there is a desire to have this but without a user, there is no justification. 
> +bool try_module_get_safe(struct module *module) > +{ > + struct module *mod; > + bool ret = true; > + > + if (!module) > + goto out; > + > + mutex_lock(&module_mutex); If a user comes around then this should be mutex_lock_interruptible(), and add might_sleep() > + > + /* > + * Check if the address points to a valid live module and take > + * the refcount only if it points to the module struct. > + */ > + mod = __module_address((unsigned long)module); > + if (mod && mod == module && module_is_live(mod)) > + __module_get(mod); > + else > + ret = false; > + > + mutex_unlock(&module_mutex); > + > +out: > + return ret; > +} > +EXPORT_SYMBOL(try_module_get_safe); And EXPORT_SYMBOL_GPL() would need to be used. I'd also expect selftests to be expanded for this case, but again, without a user, this is just trying to resolve a problem which does not exist. Luis
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 11:03:54 -0800 Linus Torvalds wrote: > -static void free_ei(struct eventfs_inode *ei) > +static inline struct eventfs_inode *alloc_ei(const char *name) > { > - kfree_const(ei->name); > - kfree(ei->d_children); > - kfree(ei->entry_attrs); > - kfree(ei); > + int namesize = strlen(name) + 1; > + struct eventfs_inode *ei = kzalloc(sizeof(*ei) + namesize, GFP_KERNEL); > + > + if (ei) { > + memcpy((char *)ei->name, name, namesize); > + kref_init(&ei->kref); > + } I'm going to be putting back the ei->name pointer as the above actually adds more memory usage. The reason is that all static trace events (not kprobes or other dynamic events) are created with the TRACE_EVENT() macro that points to a constant name. Passing a const to kstrdup_const() checks if the name passed in is a constant or not. If it is, it doesn't allocate the name and just passes back the pointer. Removing the pointer to "name" removed 8 bytes on 64 bit machines from the eventfs_inode structure, but added strlen(name)+1 bytes to it, where the majority of trace event names are greater than 8 bytes, and are constant strings. Another reason for converting back to the way it was is that moving the ei allocation to a slab is on my TODO list. 
-- Steve > + return ei; > } > > diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h > index 8f38740bfb5b..72db3bdc4dfb 100644 > --- a/fs/tracefs/internal.h > +++ b/fs/tracefs/internal.h > @@ -34,8 +34,7 @@ struct eventfs_attr { > * @entries: the array of entries representing the files in the directory > * @name: the name of the directory to create > * @children: link list into the child eventfs_inode > - * @dentry: the dentry of the directory > - * @d_children: The array of dentries to represent the files when created > + * @events_dir: the dentry of the events directory > * @entry_attrs: Saved mode and ownership of the @d_children > * @attr: Saved mode and ownership of eventfs_inode itself > * @data: The private data to pass to the callbacks > @@ -44,12 +43,11 @@ struct eventfs_attr { > * @nr_entries: The number of items in @entries > */ > struct eventfs_inode { > + struct kref kref; > struct list_head list; > const struct eventfs_entry *entries; > - const char *name; > struct list_head children; > - struct dentry *dentry; /* Check is_freed to access */ > - struct dentry **d_children; > + struct dentry *events_dir; > struct eventfs_attr *entry_attrs; > struct eventfs_attr attr; > void *data; > @@ -66,6 +64,7 @@ struct eventfs_inode { > struct llist_node llist; > struct rcu_head rcu; > }; > + const char name[]; > }; > > static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
On Tue, 30 Jan 2024 11:03:55 -0800 Linus Torvalds wrote: > Another thing that might be worth doing is to make all eventfs lookups > mark their dentries as not worth caching. We could do that with > d_delete(), but the DCACHE_DONTCACHE flag would likely be even better. > > As it is, the dentries are all freeable, but they only tend to get freed > at memory pressure rather than more proactively. But that's a separate > issue. I actually find the dentries sticking around when their ref counts go to zero a feature. Tracing applications will open and close the eventfs files many times. If the dentry were to be deleted on every close, it would need to be created over again in short spurts. There are some operations that will traverse the tree many times. One operation is on reset, as there are some dependencies in the order of disabling triggers. If there's a dependency, and a trigger is to be disabled out of order, the disabling will fail with -EBUSY. Thus the tools will iterate the trigger files several times until there are no more changes. It's not a critical path, but I'd rather not add more overhead to it. -- Steve
Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function
On Tue, 30 Jan 2024 at 13:25, Steven Rostedt wrote: > > I actually find the dentry's sticking around when their ref counts go to > zero a feature. Tracing applications will open and close the eventfs files > many times. If the dentry were to be deleted on every close, it would need > to be create over again in short spurts. Sure. I think this is literally a tuning thing, and we'll see if anybody ever says "the dentry cache grows too much with tracefs". Linus
Re: [RFC PATCH v2 3/8] ext2: Use dax_is_supported()
On Tue 30-01-24 11:52:50, Mathieu Desnoyers wrote: > Use dax_is_supported() to validate whether the architecture has > virtually aliased data caches at mount time. Print an error and disable > DAX if dax=always is requested as a mount option on an architecture > which does not support DAX. > > This is relevant for architectures which require a dynamic check > to validate whether they have virtually aliased data caches. > > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") > Signed-off-by: Mathieu Desnoyers > Cc: Jan Kara > Cc: linux-e...@vger.kernel.org > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: linux...@kvack.org > Cc: linux-a...@vger.kernel.org > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: Arnd Bergmann > Cc: Russell King > Cc: nvd...@lists.linux.dev > Cc: linux-...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org OK, yeah, this is better than v1. Feel free to add: Acked-by: Jan Kara Honza > --- > fs/ext2/super.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/fs/ext2/super.c b/fs/ext2/super.c > index 01f9addc8b1f..30ff57d47ed4 100644 > --- a/fs/ext2/super.c > +++ b/fs/ext2/super.c > @@ -955,7 +955,11 @@ static int ext2_fill_super(struct super_block *sb, void > *data, int silent) > blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size); > > if (test_opt(sb, DAX)) { > - if (!sbi->s_daxdev) { > + if (!dax_is_supported()) { > + ext2_msg(sb, KERN_ERR, > + "DAX unsupported by architecture. Turning off > DAX."); > + clear_opt(sbi->s_mount_opt, DAX); > + } else if (!sbi->s_daxdev) { > ext2_msg(sb, KERN_ERR, > "DAX unsupported by block device. Turning off > DAX."); > clear_opt(sbi->s_mount_opt, DAX); > -- > 2.39.2 > -- Jan Kara SUSE Labs, CR
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 at 12:55, Steven Rostedt wrote: > > I'm going to be putting back the ei->name pointer as the above actually > adds more memory usage. I did it mainly because I hate having multiple different allocation sites that then have to do that kref_init() etc individually, and once there was a single site the "name" thing really looked like an obvious simplification. That said, I think you're confused about the memory usage. Sure, 'kstrdup_const()' optimizes away the allocation for static constant strings, but what it does *not* do is to optimize away the pointer. In contrast, allocating them together gets rid of the pointer itself, because now the name is just an offset in the structure. And the pointer is 8 bytes right there. So allocating the string _with_ the ei will always save at least 8 bytes. So whenever the string is less than that in length it's *always* a win. And even when it's not an absolute win, it will take advantage of the kmalloc() quantized sizes too, and generally not be a loss even with longer names. So I'm pretty sure you are simply entirely wrong on the memory usage. Not counting the size of the pointer is overlooking a big piece of the puzzle. Btw, you can look at name lengths in tracefs with something stupid like this: find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c and you will see that (at least in my case) names of lengths 1..7 are dominating it all: 1 . 2189 .. 34 ... 2229 207 . 6833 .. 2211 ... with the rest being insignificant in comparison. The reason? There's a *lot* of those 'filter' and 'enable' and 'id' files. All of which are better off without a 'const char *name' taking 8 bytes. Linus
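Linus's arithmetic can be checked with a toy pair of structs — hypothetical stand-ins, not the real eventfs_inode. A flexible array member contributes nothing to sizeof, so the embedded-name layout starts one pointer (8 bytes on 64-bit) ahead of the pointer layout:

```c
#include <stddef.h>
#include <string.h>

/* Toy structs illustrating the argument: an embedded flexible-array
 * name removes the name pointer entirely, so any name shorter than
 * sizeof(char *) is a guaranteed win even against kstrdup_const(). */

struct with_ptr {
	long other_fields[2];   /* stand-in for the rest of the struct */
	const char *name;       /* 8 bytes on 64-bit, even when the string
	                         * itself is a shared constant */
};

struct with_flex {
	long other_fields[2];
	char name[];            /* contributes 0 bytes to sizeof */
};

/* Bytes actually requested from the allocator for the embedded layout. */
static size_t alloc_size_flex(const char *name)
{
	return sizeof(struct with_flex) + strlen(name) + 1;
}
```

With a short name like "id", the embedded layout requests fewer bytes than the pointer layout even before the string allocation that kstrdup_const() might or might not avoid.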
Re: [PATCH 2/4] tracing/user_events: Introduce multi-format events
On Tue, Jan 30, 2024 at 01:52:30PM -0500, Steven Rostedt wrote: > On Tue, 30 Jan 2024 10:05:15 -0800 > Beau Belgrave wrote: > > > On Mon, Jan 29, 2024 at 09:24:07PM -0500, Steven Rostedt wrote: > > > On Mon, 29 Jan 2024 09:29:07 -0800 > > > Beau Belgrave wrote: > > > > > > > Thanks, yeah ideally we wouldn't use special characters. > > > > > > > > I'm not picky about this. However, I did want something that clearly > > > > allowed a glob pattern to find all versions of a given register name of > > > > user_events by user programs that record. The dot notation will pull in > > > > more than expected if dotted namespace style names are used. > > > > > > > > An example is "Asserts" and "Asserts.Verbose" from different programs. > > > > If we tried to find all versions of "Asserts" via glob of "Asserts.*" it > > > > will pull in "Asserts.Verbose.1" in addition to "Asserts.0". > > > > > > Do you prevent brackets in names? > > > > > > > No. However, since brackets have a start and end token that are distinct > > finding all versions of your event is trivial compared to a single dot. > > > > Imagine two events: > > Asserts > > Asserts[MyCoolIndex] > > > > Resolves to tracepoints of: > > Asserts:[0] > > Asserts[MyCoolIndex]:[1] > > > > Regardless of brackets in the names, a simple glob of Asserts:\[*\] only > > finds Asserts:[0]. This is because we have that end bracket in the glob > > and the full event name including the start bracket. > > > > If I register another "version" of Asserts, then I'll have: > > Asserts:[0] > > Asserts[MyCoolIndex]:[1] > > Asserts:[2] > > > > The glob of Asserts:\[*\] will return both: > > Asserts:[0] > > Asserts:[2] > > But what if you had registered "Asserts:[MyCoolIndex]:[1]" > Good point, the above would still require a regex type pattern to not get pulled in. > Do you prevent colons? > No, nothing is prevented at this point. 
It seems we could either prevent certain characters to make it easier or define a good regex that we should document. I'm leaning toward just doing a simple suffix and documenting the regex well. > > > > At this point the program can either record all versions or scan further > > to find which version of Asserts is wanted. > > > > > > > > > > While a glob of "Asserts.[0-9]" works when the unique ID is 0-9, it > > > > doesn't work if the number is higher, like 128. If we ever decide to > > > > change the ID from an integer to say hex to save space, these globs > > > > would break. > > > > > > > > Is there some scheme that fits the C-variable name that addresses the > > > > above scenarios? Brackets gave me a simple glob that seemed to prevent a > > > > lot of this ("Asserts.\[*\]" in this case). > > > > > > Prevent a lot of what? I'm not sure what your example here is. > > > > > > > I'll try again :) > > > > We have 2 events registered via user_events: > > Asserts > > Asserts.Verbose > > > > Using dot notation these would result in tracepoints of: > > user_events_multi/Asserts.0 > > user_events_multi/Asserts.Verbose.1 > > > > Using bracket notation these would result in tracepoints of: > > user_events_multi/Asserts:[0] > > user_events_multi/Asserts.Verbose:[1] > > > > A recording program only wants to enable the Asserts tracepoint. It does > > not want to record the Asserts.Verbose tracepoint. > > > > The program must find the right tracepoint by scanning tracefs under the > > user_events_multi system. > > > > A single dot suffix does not allow a simple glob to be used. The glob > > Asserts.* will return both Asserts.0 and Asserts.Verbose.1. > > > > A simple glob of Asserts:\[*\] will only find Asserts:[0], it will not > > find Asserts.Verbose:[1]. > > > > We could just use brackets and not have the colon (Asserts[0] in this > > case). But brackets are still special for bash. > > Are these shell scripts or programs. I use regex in programs all the time. 
> And if you have shell scripts, use awk or something. > They could be both. In our case, it is a program. > Unless you prevent something from being added, I don't see the protection. > Yeah, it just makes it way less likely. Given that, I'm starting to lean toward just documenting the regex well and not trying to get fancy. > > > > > > > > > > Are we confident that we always want to represent the ID as a base-10 > > > > integer vs a base-16 integer? The suffix will be ABI to ensure recording > > > > programs can find their events easily. > > > > > > Is there a difference to what we choose? > > > > > > > If a simple glob of event_name:\[*\] cannot be used, then we must document > > what the suffix format is, so an appropriate regex can be created. If we > > start with base-10 then later move to base-16 we will break existing regex > > patterns on the recording side. > > > > I prefer, and have in this series, a base-16 output since it saves on > > the tracepoint name size. > > I honestly don't care which base you use. So if you want to use base 16,
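The glob behaviour being debated can be sketched with POSIX fnmatch(3) from userspace. The event names are the examples from the thread; this is an illustration of the matching semantics, not the user_events code:

```c
#include <fnmatch.h>
#include <stdbool.h>

/* Sketch of Beau's point: with escaped brackets, one glob selects every
 * numbered "version" of a single event without pulling in dotted
 * namespace siblings, regardless of whether the ID is decimal or hex. */
static bool matches(const char *pattern, const char *name)
{
	/* fnmatch() honours backslash escapes by default, so "\\[" and
	 * "\\]" in the pattern match literal brackets in the name. */
	return fnmatch(pattern, name, 0) == 0;
}
```

The same helper also shows why a plain dot suffix is ambiguous: "Asserts.*" happily matches "Asserts.Verbose.1", while "Asserts:\[*\]" does not match "Asserts.Verbose:[1]".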
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 13:52:05 -0800 Linus Torvalds wrote: > On Tue, 30 Jan 2024 at 12:55, Steven Rostedt wrote: > > > > I'm going to be putting back the ei->name pointer as the above actually > > adds more memory usage. > > I did it mainly because I hate having multiple different allocation > sites that then have to do that kref_init() etc individually, and once > there was a single site the "name" thing really looked lik ean obvious > simplification. > > That said, I think you're confused about the memory usage. > > Sure, 'kstrdup_const()' optimizes away the allocation for static > constant strings, but what it does *not* do is to optimize away the > pointer. > > In contrast, allocating them together gets rid of the pointer itself, > because now the name is just an offset in the structure. > > And the pointer is 8 bytes right there. > > So allocating the string _with_ the ei will always save at least 8 bytes. Not if the string is not allocated, and ei->name is just pointing to an existing string. Which is what kstrdup_const() does if the string is a constant. If the length of the string was one byte, so you allocate sizeof(*ei) + 2 (for the '\0' as well), is kzalloc() going to allocate anything less than 8 byte alignment? According to /proc/slabinfo, any allocation will be 8 bytes. Thus it is the same as having the pointer if the string is constant but less than 8 bytes in length. But a waste if it is more. > > So whenever the string is less than that in length it's *always* a win. > > And even when it's not an absolute win, it will take advantage of the > kmalloc() quantized sizes too, and generally not be a loss even with > longer names. > > So I'm pretty sure you are simply entirely wrong on the memory usage. > Not counting the size of the pointer is overlooking a big piece of the > puzzle. > > Btw, you can look at name lengths in tracefs with something stupid like this: > > find . | sed 's:^.*/::' | tr -c '\n' . 
| sort | uniq -c > > and you will see that (at least in my case) names of lengths 1..7 are > dominating it all: > > 1 . >2189 .. > 34 ... >2229 > 207 . >6833 .. >2211 ... > > with the rest being insignificant in comparison. > > The reason? There's a *lot* of those 'filter' and 'enable' and 'id' > files. All of which are better off without a 'const char *name' taking > 8 bytes. Remember, the files don't have an ei allocated. Only directories. # find . -type d | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c 1 . 2 .. 30 ... 14 21 . 15 .. 18 ... Above is 8 bytes (length of 7 + '\0') 27 36 . 35 .. 49 ... 63 85 . 140 .. 180 ... 180 155 . 171 .. 158 ... 134 116 . 106 .. 81 ... 61 58 . 43 .. 41 ... 23 14 . 13 .. 9 ... 7 6 . 5 .. 1 ... 2 1 . 1 The rest is all greater than the pointer size. Also, it does prevent me from switching to a slab. I added the below patch (after changing this back to kstrdup_const), that determines if the string is allocated by checking if ei->name is the same as the pointer passed in. This showed me: # cat /sys/kernel/tracing/dyn_ftrace_total_info 53698 pages:220 groups: 8 const cnt:2099 len:22446 dyn cnt: 3 len:56 That is, 2099 ei->name's point to a constant string without allocation. Granted, I built my system without many modules. If you have more code built as modules, that number may be less. I then counted the length of the string - 7 bytes (which is the same as the length of the string plus the '\0' character - 8 bytes) to accommodate the pointer size, and it's a savings of 22KB per instance. And in actuality, I should have rounded to the nearest kmalloc slab size as kzalloc() isn't going to return 35 bytes back but 64 bytes will be allocated. I think that's a win. 
-- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index e9819d719d2a..ed9ffaac9a32 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -785,6 +785,25 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx) goto out; } +int sdr_ei_const; +int sdr_ei_len; +int sdr_ei_dyn; +int s
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 at 13:52, Linus Torvalds wrote: > > Btw, you can look at name lengths in tracefs with something stupid like this: > > find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c Actually, since only directories have these 'ei' fields, that find should have "-type d", and then the stats change. Most directory filenames there are indeed longer than 8 bytes (16 bytes seems to be the most common length). That means that whether it wins or loses in allocation size tends to be mostly about the kmalloc sizes and the size of that structure. And it turns out that the real memory savings there would be this patch --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -55,15 +55,6 @@ struct eventfs_inode { unsigned int is_events:1; unsigned int nr_entries:30; unsigned int ino; - /* - * Union - used for deletion - * @llist: for calling dput() if needed after RCU - * @rcu: eventfs_inode to delete in RCU - */ - union { - struct llist_node llist; - struct rcu_head rcu; - }; const char name[]; }; since with all the other cleanups, neither of those fields are actually used. With that, the base size of 'struct eventfs_inode' actually becomes 96 bytes for me. And at least on x86-64, the kmalloc sizes are 96 and then 128 bytes. But that means that - with the added name pointer, the eventfs_inode would grow to 104 bytes, and grow to that next allocation size (128) - with the embedded name, the size is 96+strlen+1, and will also need at least a 128 byte allocation, but any name that is shorter than 32 bytes is free so it ends up being a wash. You're going to allocate 128 bytes regardless. But the deallocation is simpler. Linus
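The bucket-rounding argument can be modeled with a rough sketch of the small kmalloc size ladder. The actual slab caches are configuration-dependent; the sizes below are the common x86-64 small-object caches Linus is referring to:

```c
#include <stddef.h>

/* Rough model of the x86-64 small kmalloc buckets: an allocation is
 * served from the smallest cache that fits it. Configuration-dependent
 * in the real kernel; this ladder is only for illustration. */
static size_t kmalloc_bucket(size_t size)
{
	static const size_t buckets[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

	for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
		if (size <= buckets[i])
			return buckets[i];
	return 0; /* larger sizes are out of scope for this sketch */
}
```

With a 96-byte base struct, both layouts land in the 128-byte cache: 96 + 8 (pointer) rounds up to 128, and 96 + strlen + 1 (embedded name) also fits in 128 for any name up to 31 characters — the wash described above.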
Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts
On Tue, 30 Jan 2024 at 14:55, Steven Rostedt wrote: > > Remember, the files don't have an ei allocated. Only directories. Crossed emails. > I then counted the length of the string - 7 bytes (which is the same as the > length of the string plus the '\0' character - 8 bytes) to accommodate the > pointer size, and it's a savings of 22KB per instance. And in actuality, I > should have rounded to the nearest kmalloc slab size as kzalloc() isn't going > to > return 35 bytes back but 64 bytes will be allocated. No. See my other email. The kmalloc sizes actually means that it comes out exactly even. Linus