[PATCH 0/4] Add new driver for WCSS secure PIL loading
This series depends on the q6 clock removal series [1].

- Secure PIL firmware consists of signed, split images that only TrustZone (TZ)
  can authenticate and load. The Linux kernel sends a request to TZ to
  authenticate and load the PIL images.

- When secure PIL support was added to the existing WCSS PIL driver earlier
  in [2], Bjorn suggested not to overload the existing WCSS rproc driver and
  instead post a new driver for secure PIL alone. This series adds that new
  secure PIL driver.

[1] https://patchwork.kernel.org/project/linux-arm-msm/cover/20240820055618.267554-1-quic_gokul...@quicinc.com/
[2] https://patchwork.kernel.org/project/linux-arm-msm/patch/1611984013-10201-3-git-send-email-gokul...@codeaurora.org/

Manikanta Mylavarapu (3):
  dt-bindings: remoteproc: qcom: document hexagon based WCSS secure PIL
  arm64: dts: qcom: ipq5332: add nodes to bringup q6
  arm64: dts: qcom: ipq9574: add nodes to bring up q6

Vignesh Viswanathan (1):
  remoteproc: qcom: add hexagon based WCSS secure PIL driver

 .../remoteproc/qcom,wcss-sec-pil.yaml   | 125 ++
 arch/arm64/boot/dts/qcom/ipq5332.dtsi   |  62 +++
 arch/arm64/boot/dts/qcom/ipq9574.dtsi   |  58 +++
 drivers/remoteproc/Kconfig              |  22 ++
 drivers/remoteproc/Makefile             |   1 +
 drivers/remoteproc/qcom_q6v5_wcss_sec.c | 360 ++
 6 files changed, 628 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
 create mode 100644 drivers/remoteproc/qcom_q6v5_wcss_sec.c

--
2.34.1
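[Editor's note] For readers less familiar with secure PIL, the flow described in the cover letter maps onto existing kernel helpers: stage the split, signed image into the reserved region with the MDT loader, then ask TrustZone's Peripheral Authentication System (PAS) to authenticate it and release the core from reset. The sketch below only illustrates that flow, modeled on the wcss_sec_load()/wcss_sec_start() paths in patch 2 of this series; the header paths and the pas_id/device parameters are assumptions for illustration, not code from the series.

/*
 * Minimal sketch of the secure PIL boot flow (illustration only).
 * Header locations follow recent kernels and may differ on older trees.
 */
#include <linux/firmware.h>
#include <linux/firmware/qcom/qcom_scm.h>
#include <linux/soc/qcom/mdt_loader.h>

static int secure_pil_boot_sketch(struct device *dev, const char *fw_name,
				  u32 pas_id, void *mem_region,
				  phys_addr_t mem_phys, size_t mem_size)
{
	const struct firmware *fw;
	int ret;

	ret = request_firmware(&fw, fw_name, dev);
	if (ret)
		return ret;

	/* Stage the split, signed image into the reserved memory region */
	ret = qcom_mdt_load(dev, fw, fw_name, pas_id, mem_region,
			    mem_phys, mem_size, NULL);
	release_firmware(fw);
	if (ret)
		return ret;

	/* Ask TZ (PAS) to authenticate the image and bring the Q6 out of reset */
	return qcom_scm_pas_auth_and_reset(pas_id);
}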
[PATCH 1/2] dt-bindings: remoteproc: qcom: document hexagon based WCSS secure PIL
From: Manikanta Mylavarapu Add new binding document for hexagon based WCSS secure PIL remoteproc. IPQ5332, IPQ9574 follows secure PIL remoteproc. Signed-off-by: Manikanta Mylavarapu Signed-off-by: Gokul Sriram Palanisamy --- .../remoteproc/qcom,wcss-sec-pil.yaml | 125 ++ 1 file changed, 125 insertions(+) create mode 100644 Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml new file mode 100644 index ..c69401b6cec1 --- /dev/null +++ b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml @@ -0,0 +1,125 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/remoteproc/qcom,wcss-sec-pil.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Qualcomm WCSS Secure Peripheral Image Loader + +maintainers: + - Manikanta Mylavarapu + +description: + WCSS Secure Peripheral Image Loader loads firmware and power up QDSP6 + remoteproc's on the Qualcomm IPQ9574, IPQ5332 SoC. + +properties: + compatible: +enum: + - qcom,ipq5332-wcss-sec-pil + - qcom,ipq9574-wcss-sec-pil + + reg: +maxItems: 1 + + firmware-name: +$ref: /schemas/types.yaml#/definitions/string +description: Firmware name for the Hexagon core + + interrupts: +items: + - description: Watchdog interrupt + - description: Fatal interrupt + - description: Ready interrupt + - description: Handover interrupt + - description: Stop acknowledge interrupt + + interrupt-names: +items: + - const: wdog + - const: fatal + - const: ready + - const: handover + - const: stop-ack + + clocks: +items: + - description: IM SLEEP clock + + clock-names: +items: + - const: im_sleep + + qcom,smem-states: +$ref: /schemas/types.yaml#/definitions/phandle-array +description: States used by the AP to signal the remote processor +items: + - description: Shutdown Q6 + - description: Stop Q6 + + qcom,smem-state-names: +description: + Names of the states used by the AP to signal the remote processor +items: + - const: shutdown + - const: stop + + memory-region: +items: + - description: Q6 reserved region + + glink-edge: +$ref: /schemas/remoteproc/qcom,glink-edge.yaml# +description: + Qualcomm G-Link subnode which represents communication edge, channels + and devices related to the Modem. +unevaluatedProperties: false + +required: + - compatible + - firmware-name + - reg + - interrupts + - interrupt-names + - qcom,smem-states + - qcom,smem-state-names + - memory-region + +additionalProperties: false + +examples: + - | +#include +#include +q6v5_wcss: remoteproc@d10 { + compatible = "qcom,ipq5332-wcss-sec-pil"; + reg = <0xd10 0x4040>; + firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt"; + interrupts-extended = <&intc GIC_SPI 291 IRQ_TYPE_EDGE_RISING>, +<&wcss_smp2p_in 0 IRQ_TYPE_NONE>, +<&wcss_smp2p_in 1 IRQ_TYPE_NONE>, +<&wcss_smp2p_in 2 IRQ_TYPE_NONE>, +<&wcss_smp2p_in 3 IRQ_TYPE_NONE>; + interrupt-names = "wdog", +"fatal", +"ready", +"handover", +"stop-ack"; + + clocks = <&gcc GCC_IM_SLEEP_CLK>; + clock-names = "im_sleep"; + + qcom,smem-states = <&wcss_smp2p_out 0>, + <&wcss_smp2p_out 1>; + qcom,smem-state-names = "shutdown", + "stop"; + + memory-region = <&q6_region>; + + glink-edge { +interrupts = ; +label = "rtr"; +qcom,remote-pid = <1>; +mboxes = <&apcs_glb 8>; + }; +}; -- 2.34.1
[PATCH 2/2] remoteproc: qcom: add hexagon based WCSS secure PIL driver
From: Vignesh Viswanathan Add support to bring up hexagon based WCSS secure PIL remoteproc. IPQ5332, IPQ9574 supports secure PIL remoteproc. Signed-off-by: Vignesh Viswanathan Signed-off-by: Manikanta Mylavarapu Signed-off-by: Gokul Sriram Palanisamy --- drivers/remoteproc/Kconfig | 22 ++ drivers/remoteproc/Makefile | 1 + drivers/remoteproc/qcom_q6v5_wcss_sec.c | 363 3 files changed, 386 insertions(+) create mode 100644 drivers/remoteproc/qcom_q6v5_wcss_sec.c diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig index dda2ada215b7..368405825d83 100644 --- a/drivers/remoteproc/Kconfig +++ b/drivers/remoteproc/Kconfig @@ -255,6 +255,28 @@ config QCOM_Q6V5_WCSS Hexagon V5 based WCSS remote processors on e.g. IPQ8074. This is a non-TrustZone wireless subsystem. +config QCOM_Q6V5_WCSS_SEC + tristate "Qualcomm Hexagon based WCSS Secure Peripheral Image Loader" + depends on OF && ARCH_QCOM + depends on QCOM_SMEM + depends on RPMSG_QCOM_SMD || RPMSG_QCOM_SMD=n + depends on RPMSG_QCOM_GLINK_SMEM || RPMSG_QCOM_GLINK_SMEM=n + depends on QCOM_SYSMON || QCOM_SYSMON=n + depends on RPMSG_QCOM_GLINK || RPMSG_QCOM_GLINK=n + depends on QCOM_AOSS_QMP || QCOM_AOSS_QMP=n + select QCOM_MDT_LOADER + select QCOM_PIL_INFO + select QCOM_Q6V5_COMMON + select QCOM_RPROC_COMMON + select QCOM_SCM + help + Say y here to support the Qualcomm Secure Peripheral Image Loader + for the Hexagon based remote processors on e.g. IPQ5332. + + This is TrustZone wireless subsystem. The firmware is + verified and booted with the help of the Peripheral Authentication + System (PAS) in TrustZone. + config QCOM_SYSMON tristate "Qualcomm sysmon driver" depends on RPMSG diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile index 91314a9b43ce..54530d7fa510 100644 --- a/drivers/remoteproc/Makefile +++ b/drivers/remoteproc/Makefile @@ -28,6 +28,7 @@ obj-$(CONFIG_QCOM_Q6V5_ADSP) += qcom_q6v5_adsp.o obj-$(CONFIG_QCOM_Q6V5_MSS)+= qcom_q6v5_mss.o obj-$(CONFIG_QCOM_Q6V5_PAS)+= qcom_q6v5_pas.o obj-$(CONFIG_QCOM_Q6V5_WCSS) += qcom_q6v5_wcss.o +obj-$(CONFIG_QCOM_Q6V5_WCSS_SEC) += qcom_q6v5_wcss_sec.o obj-$(CONFIG_QCOM_SYSMON) += qcom_sysmon.o obj-$(CONFIG_QCOM_WCNSS_PIL) += qcom_wcnss_pil.o qcom_wcnss_pil-y += qcom_wcnss.o diff --git a/drivers/remoteproc/qcom_q6v5_wcss_sec.c b/drivers/remoteproc/qcom_q6v5_wcss_sec.c new file mode 100644 index ..8c47cf35a00b --- /dev/null +++ b/drivers/remoteproc/qcom_q6v5_wcss_sec.c @@ -0,0 +1,363 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2016-2018 Linaro Ltd. + * Copyright (C) 2014 Sony Mobile Communications AB + * Copyright (c) 2012-2018, 2024 The Linux Foundation. All rights reserved. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "qcom_common.h" +#include "qcom_q6v5.h" + +#include "remoteproc_internal.h" + +#define WCSS_CRASH_REASON 421 + +#define WCSS_PAS_ID0x6 +#define MPD_WCSS_PAS_ID0xD + +struct wcss_sec { + struct device *dev; + struct qcom_rproc_glink glink_subdev; + struct qcom_rproc_ssr ssr_subdev; + struct qcom_q6v5 q6; + phys_addr_t mem_phys; + phys_addr_t mem_reloc; + void *mem_region; + size_t mem_size; + const struct wcss_data *desc; + const char *fw_name; + + struct clk *im_sleep; +}; + +struct wcss_data { + u32 pasid; + const struct rproc_ops *ops; + bool need_auto_boot; + int (*init_clk)(struct wcss_sec *wcss); +}; + +static int wcss_sec_start(struct rproc *rproc) +{ + struct wcss_sec *wcss = rproc->priv; + struct device *dev = wcss->dev; + const struct wcss_data *desc = of_device_get_match_data(dev); + int ret; + + qcom_q6v5_prepare(&wcss->q6); + + ret = qcom_scm_pas_auth_and_reset(desc->pasid); + if (ret) { + dev_err(dev, "wcss_reset failed\n"); + return ret; + } + + ret = qcom_q6v5_wait_for_start(&wcss->q6, 5 * HZ); + if (ret == -ETIMEDOUT) + dev_err(dev, "start timed out\n"); + + return ret; +} + +static int wcss_sec_stop(struct rproc *rproc) +{ + struct wcss_sec *wcss = rproc->priv; + struct device *dev = wcss->dev; + const struct wcss_data *desc = of_device_get_match_data(dev); + int ret; + + ret = qcom_scm_pas_shutdown(desc->pasid); + if (ret) { + dev_err(dev, "not able to shutdown\n"); + return ret; + } + qcom_q6v5_unprepare(&wcss->q6); + + retur
[PATCH 3/4] arm64: dts: qcom: ipq5332: add nodes to bringup q6
From: Manikanta Mylavarapu Enable nodes required for q6 remoteproc bring up. Signed-off-by: Manikanta Mylavarapu Signed-off-by: Gokul Sriram Palanisamy --- arch/arm64/boot/dts/qcom/ipq5332.dtsi | 62 +++ 1 file changed, 62 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/ipq5332.dtsi b/arch/arm64/boot/dts/qcom/ipq5332.dtsi index 0a74ed4f72cc..ec93e7b64b9e 100644 --- a/arch/arm64/boot/dts/qcom/ipq5332.dtsi +++ b/arch/arm64/boot/dts/qcom/ipq5332.dtsi @@ -145,6 +145,11 @@ smem@4a80 { hwlocks = <&tcsr_mutex 3>; }; + + q6_region: wcnss@4a90 { + reg = <0x0 0x4a90 0x0 0x2b0>; + no-map; + }; }; soc@0 { @@ -476,6 +481,39 @@ frame@b128000 { status = "disabled"; }; }; + + q6v5_wcss: remoteproc@d10 { + compatible = "qcom,ipq5332-wcss-sec-pil"; + reg = <0xd10 0x4040>; + firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt"; + interrupts-extended = <&intc GIC_SPI 421 IRQ_TYPE_EDGE_RISING>, + <&wcss_smp2p_in 0 IRQ_TYPE_NONE>, + <&wcss_smp2p_in 1 IRQ_TYPE_NONE>, + <&wcss_smp2p_in 2 IRQ_TYPE_NONE>, + <&wcss_smp2p_in 3 IRQ_TYPE_NONE>; + interrupt-names = "wdog", + "fatal", + "ready", + "handover", + "stop-ack"; + + clocks = <&gcc GCC_IM_SLEEP_CLK>; + clock-names = "im_sleep"; + + qcom,smem-states = <&wcss_smp2p_out 0>, + <&wcss_smp2p_out 1>; + qcom,smem-state-names = "shutdown", + "stop"; + + memory-region = <&q6_region>; + + glink-edge { + interrupts = ; + label = "rtr"; + qcom,remote-pid = <1>; + mboxes = <&apcs_glb 8>; + }; + }; }; timer { @@ -485,4 +523,28 @@ timer { , ; }; + + wcss: wcss-smp2p { + compatible = "qcom,smp2p"; + qcom,smem = <435>, <428>; + + interrupt-parent = <&intc>; + interrupts = ; + + mboxes = <&apcs_glb 9>; + + qcom,local-pid = <0>; + qcom,remote-pid = <1>; + + wcss_smp2p_out: master-kernel { + qcom,entry-name = "master-kernel"; + #qcom,smem-state-cells = <1>; + }; + + wcss_smp2p_in: slave-kernel { + qcom,entry-name = "slave-kernel"; + interrupt-controller; + #interrupt-cells = <2>; + }; + }; }; -- 2.34.1
[PATCH 4/4] arm64: dts: qcom: ipq9574: add nodes to bring up q6
From: Manikanta Mylavarapu Enable nodes required for q6 remoteproc bring up. Signed-off-by: Manikanta Mylavarapu Signed-off-by: Gokul Sriram Palanisamy --- arch/arm64/boot/dts/qcom/ipq9574.dtsi | 58 +++ 1 file changed, 58 insertions(+) diff --git a/arch/arm64/boot/dts/qcom/ipq9574.dtsi b/arch/arm64/boot/dts/qcom/ipq9574.dtsi index 48dfafea46a7..bca3a50d21d7 100644 --- a/arch/arm64/boot/dts/qcom/ipq9574.dtsi +++ b/arch/arm64/boot/dts/qcom/ipq9574.dtsi @@ -213,6 +213,11 @@ smem@4aa0 { hwlocks = <&tcsr_mutex 3>; no-map; }; + + q6_region: wcnss@4ab0 { + reg = <0x0 0x4ab0 0x0 0x2b0>; + no-map; + }; }; soc: soc@0 { @@ -756,6 +761,35 @@ frame@b128000 { status = "disabled"; }; }; + + q6v5_wcss: remoteproc@cd0 { + compatible = "qcom,ipq9574-wcss-sec-pil"; + reg = <0x0cd0 0x4040>; + firmware-name = "ath11k/IPQ9574/hw1.0/q6_fw.mdt"; + interrupts-extended = <&intc GIC_SPI 325 IRQ_TYPE_EDGE_RISING>, + <&wcss_smp2p_in 0 IRQ_TYPE_NONE>, + <&wcss_smp2p_in 1 IRQ_TYPE_NONE>, + <&wcss_smp2p_in 2 IRQ_TYPE_NONE>, + <&wcss_smp2p_in 3 IRQ_TYPE_NONE>; + interrupt-names = "wdog", + "fatal", + "ready", + "handover", + "stop-ack"; + + qcom,smem-states = <&wcss_smp2p_out 0>, + <&wcss_smp2p_out 1>; + qcom,smem-state-names = "shutdown", + "stop"; + memory-region = <&q6_region>; + + glink-edge { + interrupts = ; + label = "rtr"; + qcom,remote-pid = <1>; + mboxes = <&apcs_glb 8>; + }; + }; }; thermal-zones { @@ -987,4 +1021,28 @@ timer { , ; }; + + wcss: wcss-smp2p { + compatible = "qcom,smp2p"; + qcom,smem = <435>, <428>; + + interrupt-parent = <&intc>; + interrupts = ; + + mboxes = <&apcs_glb 9>; + + qcom,local-pid = <0>; + qcom,remote-pid = <1>; + + wcss_smp2p_out: master-kernel { + qcom,entry-name = "master-kernel"; + #qcom,smem-state-cells = <1>; + }; + + wcss_smp2p_in: slave-kernel { + qcom,entry-name = "slave-kernel"; + interrupt-controller; + #interrupt-cells = <2>; + }; + }; }; -- 2.34.1
[PATCH V5 1/1] remoteproc: qcom: q6v5: Get crash reason from specific SMEM partition
From: Vignesh Viswanathan q6v5 fatal and watchdog IRQ handlers always retrieves the crash reason information from SMEM global partition (QCOM_SMEM_HOST_ANY). For some targets like IPQ9574 and IPQ5332, crash reason information is present in target specific partition due to which the crash reason is not printed in the current implementation. Add support to pass crash_reason_partition along with crash_reason_item number in qcom_q6v5_init call and use the same to get the crash information from SMEM in fatal and watchdog IRQ handlers. While at it, rename all instances of "crash_reason_smem" with "crash_reason_item" as this reflects the proper meaning. Signed-off-by: Vignesh Viswanathan --- This change depends on WCSS secure PIL driver [1]. Changes in V5: Added changes to WCSS secure PIL driver. Changes in V4: Rename all instances of crash_reason_smem to crash_reason_item Changes in V3: Updated commit message. Changes in V2: Addressed comments in V1. [1] https://patchwork.kernel.org/project/linux-arm-msm/cover/20240820085517.435566-1-quic_gokul...@quicinc.com/ drivers/remoteproc/qcom_q6v5.c | 10 ++-- drivers/remoteproc/qcom_q6v5.h | 6 ++- drivers/remoteproc/qcom_q6v5_adsp.c | 17 +++--- drivers/remoteproc/qcom_q6v5_mss.c | 5 +- drivers/remoteproc/qcom_q6v5_pas.c | 69 + drivers/remoteproc/qcom_q6v5_wcss.c | 12 +++-- drivers/remoteproc/qcom_q6v5_wcss_sec.c | 3 +- 7 files changed, 66 insertions(+), 56 deletions(-) diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c index 4ee5e67a9f03..4d801ba3c791 100644 --- a/drivers/remoteproc/qcom_q6v5.c +++ b/drivers/remoteproc/qcom_q6v5.c @@ -100,7 +100,7 @@ static irqreturn_t q6v5_wdog_interrupt(int irq, void *data) return IRQ_HANDLED; } - msg = qcom_smem_get(QCOM_SMEM_HOST_ANY, q6v5->crash_reason, &len); + msg = qcom_smem_get(q6v5->crash_reason_partition, q6v5->crash_reason_item, &len); if (!IS_ERR(msg) && len > 0 && msg[0]) dev_err(q6v5->dev, "watchdog received: %s\n", msg); else @@ -121,7 +121,7 @@ static irqreturn_t q6v5_fatal_interrupt(int irq, void *data) if (!q6v5->running) return IRQ_HANDLED; - msg = qcom_smem_get(QCOM_SMEM_HOST_ANY, q6v5->crash_reason, &len); + msg = qcom_smem_get(q6v5->crash_reason_partition, q6v5->crash_reason_item, &len); if (!IS_ERR(msg) && len > 0 && msg[0]) dev_err(q6v5->dev, "fatal error received: %s\n", msg); else @@ -244,14 +244,16 @@ EXPORT_SYMBOL_GPL(qcom_q6v5_panic); * Return: 0 on success, negative errno on failure */ int qcom_q6v5_init(struct qcom_q6v5 *q6v5, struct platform_device *pdev, - struct rproc *rproc, int crash_reason, const char *load_state, + struct rproc *rproc, int crash_reason_partition, + int crash_reason_item, const char *load_state, void (*handover)(struct qcom_q6v5 *q6v5)) { int ret; q6v5->rproc = rproc; q6v5->dev = &pdev->dev; - q6v5->crash_reason = crash_reason; + q6v5->crash_reason_partition = crash_reason_partition; + q6v5->crash_reason_item = crash_reason_item; q6v5->handover = handover; init_completion(&q6v5->start_done); diff --git a/drivers/remoteproc/qcom_q6v5.h b/drivers/remoteproc/qcom_q6v5.h index 5a859c41896e..7cdfa21fd9e4 100644 --- a/drivers/remoteproc/qcom_q6v5.h +++ b/drivers/remoteproc/qcom_q6v5.h @@ -34,7 +34,8 @@ struct qcom_q6v5 { struct completion start_done; struct completion stop_done; - int crash_reason; + int crash_reason_partition; + int crash_reason_item; bool running; @@ -43,7 +44,8 @@ struct qcom_q6v5 { }; int qcom_q6v5_init(struct qcom_q6v5 *q6v5, struct platform_device *pdev, - struct rproc *rproc, int crash_reason, const char *load_state, + 
struct rproc *rproc, int crash_reason_partition, + int crash_reason_item, const char *load_state, void (*handover)(struct qcom_q6v5 *q6v5)); void qcom_q6v5_deinit(struct qcom_q6v5 *q6v5); diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c b/drivers/remoteproc/qcom_q6v5_adsp.c index 572dcb0f055b..174b0e213b12 100644 --- a/drivers/remoteproc/qcom_q6v5_adsp.c +++ b/drivers/remoteproc/qcom_q6v5_adsp.c @@ -60,7 +60,7 @@ #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0 struct adsp_pil_data { - int crash_reason_smem; + int crash_reason_item; const char *firmware_name; const char *ssr_name; @@ -97,7 +97,7 @@ struct qcom_adsp { struct regmap *halt_map; unsigned int halt_lpass; - int crash_reason_smem; + int crash_reason_item; const char *info_name; struct completion start_done; @@ -721,8 +721,9 @@ static int adsp_probe(struct platform_device *pdev) if (ret) g
Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown
Hi Jan, On 19-08-2024 22:17, Jan Kiszka wrote: From: Jan Kiszka When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed first. When core 0 should then be stopped before its removal, it will find core1->rproc as NULL already and crashes. Happens on rmmod e.g. Did you check this on top of -next-20240820 tag? There was a series[0] which was merged recently which fixed this condition. I don't see this issue when trying on top of -next-20240820 tag. [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/ Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs") CC: sta...@vger.kernel.org Signed-off-by: Jan Kiszka --- There might be one more because I can still make this driver crash after an operator error. Were error scenarios tested at all? Can you point out what is this issue more specifically, and I can take this up then. drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c index eb09d2e9b32a..9ebd7a34e638 100644 --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc) /* do not allow core 0 to stop before core 1 */ core1 = list_last_entry(&cluster->cores, struct k3_r5_core, elem); - if (core != core1 && core1->rproc->state != RPROC_OFFLINE) { + if (core != core1 && core1->rproc && + core1->rproc->state != RPROC_OFFLINE) { dev_err(dev, "%s: can not stop core 0 before core 1\n", __func__); ret = -EPERM;
Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown
On 20.08.24 11:30, Beleswar Prasad Padhi wrote: > Hi Jan, > > On 19-08-2024 22:17, Jan Kiszka wrote: >> From: Jan Kiszka >> >> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed >> first. When core 0 should then be stopped before its removal, it will >> find core1->rproc as NULL already and crashes. Happens on rmmod e.g. > > > Did you check this on top of -next-20240820 tag? There was a series[0] > which was merged recently which fixed this condition. I don't see this > issue when trying on top of -next-20240820 tag. > [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/ > I didn't try those yet, I was on 6.11-rcX. But from reading them quickly, I'm not seeing the two issues I found directly addressed there. >> >> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power >> up before core0 via sysfs") >> CC: sta...@vger.kernel.org >> Signed-off-by: Jan Kiszka >> --- >> >> There might be one more because I can still make this driver crash >> after an operator error. Were error scenarios tested at all? > > > Can you point out what is this issue more specifically, and I can take > this up then. Try starting core1 before core0, and then again - system will hang or crash while trying to wipe ATCM. I do not understand this problem yet - same VA is used, and I cannot see where/how the region should have been unmapped in between. Jan > >> >> drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c >> b/drivers/remoteproc/ti_k3_r5_remoteproc.c >> index eb09d2e9b32a..9ebd7a34e638 100644 >> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c >> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c >> @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc) >> /* do not allow core 0 to stop before core 1 */ >> core1 = list_last_entry(&cluster->cores, struct k3_r5_core, >> elem); >> - if (core != core1 && core1->rproc->state != RPROC_OFFLINE) { >> + if (core != core1 && core1->rproc && >> + core1->rproc->state != RPROC_OFFLINE) { >> dev_err(dev, "%s: can not stop core 0 before core 1\n", >> __func__); >> ret = -EPERM; -- Siemens AG, Technology Linux Expert Center
Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown
On 20-08-2024 15:09, Jan Kiszka wrote: On 20.08.24 11:30, Beleswar Prasad Padhi wrote: Hi Jan, On 19-08-2024 22:17, Jan Kiszka wrote: From: Jan Kiszka When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed first. When core 0 should then be stopped before its removal, it will find core1->rproc as NULL already and crashes. Happens on rmmod e.g. Did you check this on top of -next-20240820 tag? There was a series[0] which was merged recently which fixed this condition. I don't see this issue when trying on top of -next-20240820 tag. [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/ I didn't try those yet, I was on 6.11-rcX. But from reading them quickly, I'm not seeing the two issues I found directly addressed there. Check the comment by Andrew Davis[0], that addresses the above issue. [0]: https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/ Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs") CC: sta...@vger.kernel.org Signed-off-by: Jan Kiszka --- There might be one more because I can still make this driver crash after an operator error. Were error scenarios tested at all? Can you point out what is this issue more specifically, and I can take this up then. Try starting core1 before core0, and then again - system will hang or If you are trying to stop and then start the cores from sysfs, that is not yet supported. The hang is thus expected. crash while trying to wipe ATCM. I do not understand this problem yet - same VA is used, and I cannot see where/how the region should have been unmapped in between. Jan drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c index eb09d2e9b32a..9ebd7a34e638 100644 --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc) /* do not allow core 0 to stop before core 1 */ core1 = list_last_entry(&cluster->cores, struct k3_r5_core, elem); - if (core != core1 && core1->rproc->state != RPROC_OFFLINE) { + if (core != core1 && core1->rproc && + core1->rproc->state != RPROC_OFFLINE) { dev_err(dev, "%s: can not stop core 0 before core 1\n", __func__); ret = -EPERM;
Re: [PATCH RFC v2] mmc/virtio: Add virtio MMC driver for QEMU emulation
On Sun, 7 Jul 2024 at 18:06, Mikhail Krasheninnikov wrote: > > Introduce a new virtio MMC driver to enable virtio SD/MMC card > emulation with QEMU. This driver allows emulating MMC cards in > virtual environments, enhancing functionality and testing > capabilities within QEMU. > > Link to the QEMU patch: > https://lists.nongnu.org/archive/html/qemu-block/2024-07/msg00221.html > > No changes to existing dependencies or documentation. > > Signed-off-by: Mikhail Krasheninnikov > CC: Matwey Kornilov > CC: Ulf Hansson > CC: linux-...@vger.kernel.org > CC: "Michael S. Tsirkin" > CC: Jason Wang > CC: Paolo Bonzini > CC: Stefan Hajnoczi > CC: Xuan Zhuo > CC: virtualizat...@lists.linux-foundation.org Sorry for the delay. Can you please re-spin/post this to add Adrian Hunter, the sdhci maintainer. Moreover, it would be nice to get some from the virtio folkz to review this too. My comments at this point are mostly nitpicks. Please see below. > > --- > Changes from v1: > - Add MAINTAINERS entry > - Refactor includes > - Change CPU endian format to little endian for device communication > - Move structs that belonged to uapi > - Validate multiple fields > - Introduce mutexes for safe request handling > - Call virtio_device_ready before adding host > - Fix removal of the device > > MAINTAINERS | 15 ++ > drivers/mmc/host/Kconfig | 14 ++ > drivers/mmc/host/Makefile | 2 + > drivers/mmc/host/virtio-sdhci.c | 258 ++ > drivers/mmc/host/virtio-sdhci.h | 40 + > include/uapi/linux/virtio-sdhci.h | 39 + Please rename to sdhci-virtio.* and update the corresponding prefix of the function names accordingly. > include/uapi/linux/virtio_ids.h | 1 + > 7 files changed, 369 insertions(+) > create mode 100644 drivers/mmc/host/virtio-sdhci.c > create mode 100644 drivers/mmc/host/virtio-sdhci.h > create mode 100644 include/uapi/linux/virtio-sdhci.h > > diff --git a/MAINTAINERS b/MAINTAINERS > index dd5de540ec0b..be86156cd66c 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -22907,6 +22907,21 @@ S: Maintained > F: drivers/nvdimm/nd_virtio.c > F: drivers/nvdimm/virtio_pmem.c > > +VIRTIO SDHCI DRIVER > +M: Mikhail Krasheninnikov > +M: "Michael S. Tsirkin" > +M: Jason Wang > +M: Paolo Bonzini > +M: Stefan Hajnoczi > +M: Xuan Zhuo > +L: virtualizat...@lists.linux-foundation.org > +L: linux-...@vger.kernel.org > +L: oasis-vir...@connectedcommunity.org > +S: Maintained > +F: drivers/mmc/host/virtio-sdhci.c > +F: drivers/mmc/host/virtio-sdhci.h > +F: include/uapi/linux/virtio-sdhci.h > + > VIRTIO SOUND DRIVER > M: Anton Yakovlev > M: "Michael S. Tsirkin" > diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig > index 554e67103c1a..d62d669ee117 100644 > --- a/drivers/mmc/host/Kconfig > +++ b/drivers/mmc/host/Kconfig > @@ -1069,3 +1069,17 @@ config MMC_LITEX > module will be called litex_mmc. > > If unsure, say N. > + > +config SDHCI_VIRTIO > + tristate "VirtIO SDHCI Host Controller support" > + depends on VIRTIO > + help > + This enables support for the Virtio SDHCI driver, which allows the > + kernel to interact with SD/MMC devices over Virtio. Virtio is a > + virtualization standard for network and disk device drivers, > + providing a common API for virtualized environments. > + > + Enable this option if you are running the kernel in a virtualized > + environment and need SD/MMC support via Virtio. > + > + If unsure, say N. 
> \ No newline at end of file > diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile > index a693fa3d3f1c..f9b05a07c6db 100644 > --- a/drivers/mmc/host/Makefile > +++ b/drivers/mmc/host/Makefile > @@ -108,3 +108,5 @@ endif > > obj-$(CONFIG_MMC_SDHCI_XENON) += sdhci-xenon-driver.o > sdhci-xenon-driver-y += sdhci-xenon.o sdhci-xenon-phy.o > + > +obj-$(CONFIG_SDHCI_VIRTIO) += virtio-sdhci.o > \ No newline at end of file > diff --git a/drivers/mmc/host/virtio-sdhci.c b/drivers/mmc/host/virtio-sdhci.c > new file mode 100644 > index ..1a637ab5e010 > --- /dev/null > +++ b/drivers/mmc/host/virtio-sdhci.c > @@ -0,0 +1,258 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * VirtIO SD/MMC driver > + * > + * Author: Mikhail Krasheninnikov > + */ > + > +#include "virtio-sdhci.h" > +#include "linux/mmc/host.h" > +#include "linux/virtio.h" > +#include "linux/virtio_config.h" > +#include "linux/completion.h" > +#include "uapi/linux/virtio-sdhci.h" > + > +struct virtio_sdhci_host { > + struct virtio_device *vdev; > + struct mmc_host *mmc; > + struct virtqueue *vq; > + struct mmc_request *current_request; > + > + struct virtio_mmc_request virtio_request; > + struct virtio_mmc_response virtio_response; > + > + struct completion request_han
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote: > On Mon, 19 Aug 2024 12:02:44 -0400 > Steven Rostedt wrote: > > > On Tue, 20 Aug 2024 00:56:49 +0900 > > Masami Hiramatsu (Google) wrote: > > > > > > > > > > > We may need to add "noinline" or something to make sure those functions > > > > don't get inlined for LTO. > > > > > > Yeah, we need such option at least for function call test. > > > > Could you add the noinline, and if it fixes the issue send a patch? > > I found the target function already has "noinline". I tried to add noinline > to the testing function (callsite), but it also did not work. > I think "noinline" is for the compiler, but LTO is done by the linker. If LTO is breaking noinline, then that has much larger implications for noinstr code and similar, and means that LTO is unsound... > I found a discussion similar to this problem, but it seems very hacky. > > https://groups.google.com/g/llvm-dev/c/6baI9LISmSU/m/uEeY_CRbBQAJ?pli=1 Thanks for the link! Mark.
[PATCH v2] remoteproc: k3-r5: Delay notification of wakeup event
From: Udit Kumar Few times, core1 was scheduled to boot first before core0, which leads to error: 'k3_r5_rproc_start: can not start core 1 before core 0'. This was happening due to some scheduling between prepare and start callback. The probe function waits for event, which is getting triggered by prepare callback. To avoid above condition move event trigger to start instead of prepare callback. Fixes: 61f6f68447ab ("remoteproc: k3-r5: Wait for core0 power-up before powering up core1") Signed-off-by: Udit Kumar [ Applied wakeup event trigger only for Split-Mode booted rprocs ] Signed-off-by: Beleswar Padhi --- v2: Changelog: * Mathieu 1) Rebased changes on top of -next-20240820 tag. Link to v1: https://lore.kernel.org/all/20240809060132.308642-1-b-pa...@ti.com/ drivers/remoteproc/ti_k3_r5_remoteproc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c index 8a63a9360c0f..e61e53381abc 100644 --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c @@ -469,8 +469,6 @@ static int k3_r5_rproc_prepare(struct rproc *rproc) ret); return ret; } - core->released_from_reset = true; - wake_up_interruptible(&cluster->core_transition); /* * Newer IP revisions like on J7200 SoCs support h/w auto-initialization @@ -587,6 +585,9 @@ static int k3_r5_rproc_start(struct rproc *rproc) ret = k3_r5_core_run(core); if (ret) return ret; + + core->released_from_reset = true; + wake_up_interruptible(&cluster->core_transition); } return 0; -- 2.34.1
Re: [PATCH 0/4] Add new driver for WCSS secure PIL loading
On 20/08/2024 10:55, Gokul Sriram Palanisamy wrote:
> This series depends on q6 clock removal series [1].

How? So this cannot be tested and merged?

That's the second patchset today with some totally bogus dependencies.
People, stop it.

Best regards,
Krzysztof
Re: [PATCH 1/2] dt-bindings: remoteproc: qcom: document hexagon based WCSS secure PIL
On Tue, Aug 20, 2024 at 02:25:14PM +0530, Gokul Sriram Palanisamy wrote: > From: Manikanta Mylavarapu > > Add new binding document for hexagon based WCSS secure PIL remoteproc. > IPQ5332, IPQ9574 follows secure PIL remoteproc. > > Signed-off-by: Manikanta Mylavarapu > Signed-off-by: Gokul Sriram Palanisamy > --- > .../remoteproc/qcom,wcss-sec-pil.yaml | 125 ++ > 1 file changed, 125 insertions(+) > create mode 100644 > Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml > > diff --git > a/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml > b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml > new file mode 100644 > index ..c69401b6cec1 > --- /dev/null > +++ b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml > @@ -0,0 +1,125 @@ > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/remoteproc/qcom,wcss-sec-pil.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Qualcomm WCSS Secure Peripheral Image Loader ... > + > +maintainers: > + - Manikanta Mylavarapu > + > +description: > + WCSS Secure Peripheral Image Loader loads firmware and power up QDSP6 What is WCSS? Don't add random acronyms without any explanation. > + remoteproc's on the Qualcomm IPQ9574, IPQ5332 SoC. > + > +properties: > + compatible: > +enum: > + - qcom,ipq5332-wcss-sec-pil > + - qcom,ipq9574-wcss-sec-pil > + > + reg: > +maxItems: 1 > + > + firmware-name: > +$ref: /schemas/types.yaml#/definitions/string > +description: Firmware name for the Hexagon core No, look how other bindings are doing it. It looks like you develop patches on some old kernel, so please start from scratch on newest kernel. > + > + interrupts: > +items: > + - description: Watchdog interrupt > + - description: Fatal interrupt > + - description: Ready interrupt > + - description: Handover interrupt > + - description: Stop acknowledge interrupt > + > + interrupt-names: > +items: > + - const: wdog > + - const: fatal > + - const: ready > + - const: handover > + - const: stop-ack > + > + clocks: > +items: > + - description: IM SLEEP clock What is IM? Explain all acronyms. What is SLEEP? > + > + clock-names: > +items: > + - const: im_sleep sleep? Are there different sleep clocks here? > + > + qcom,smem-states: > +$ref: /schemas/types.yaml#/definitions/phandle-array > +description: States used by the AP to signal the remote processor > +items: > + - description: Shutdown Q6 > + - description: Stop Q6 > + Do not introduce other order. First stop, then shutdown. > + qcom,smem-state-names: > +description: > + Names of the states used by the AP to signal the remote processor > +items: > + - const: shutdown > + - const: stop The same. > + > + memory-region: > +items: > + - description: Q6 reserved region > + > + glink-edge: > +$ref: /schemas/remoteproc/qcom,glink-edge.yaml# > +description: > + Qualcomm G-Link subnode which represents communication edge, channels > + and devices related to the Modem. > +unevaluatedProperties: false > + > +required: > + - compatible > + - firmware-name > + - reg > + - interrupts > + - interrupt-names > + - qcom,smem-states > + - qcom,smem-state-names > + - memory-region Keep the same order as in properties. > + > +additionalProperties: false > + > +examples: > + - | > +#include > +#include > +q6v5_wcss: remoteproc@d10 { Drop unused label. 
> + compatible = "qcom,ipq5332-wcss-sec-pil"; > + reg = <0xd10 0x4040>; > + firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt"; > + interrupts-extended = <&intc GIC_SPI 291 IRQ_TYPE_EDGE_RISING>, > +<&wcss_smp2p_in 0 IRQ_TYPE_NONE>, > +<&wcss_smp2p_in 1 IRQ_TYPE_NONE>, Best regards, Krzysztof
Re: [PATCH 3/4] arm64: dts: qcom: ipq5332: add nodes to bringup q6
On Tue, Aug 20, 2024 at 02:25:16PM +0530, Gokul Sriram Palanisamy wrote: > From: Manikanta Mylavarapu > > Enable nodes required for q6 remoteproc bring up. > > Signed-off-by: Manikanta Mylavarapu > Signed-off-by: Gokul Sriram Palanisamy > --- > arch/arm64/boot/dts/qcom/ipq5332.dtsi | 62 +++ > 1 file changed, 62 insertions(+) > > diff --git a/arch/arm64/boot/dts/qcom/ipq5332.dtsi > b/arch/arm64/boot/dts/qcom/ipq5332.dtsi > index 0a74ed4f72cc..ec93e7b64b9e 100644 > --- a/arch/arm64/boot/dts/qcom/ipq5332.dtsi > +++ b/arch/arm64/boot/dts/qcom/ipq5332.dtsi > @@ -145,6 +145,11 @@ smem@4a80 { > > hwlocks = <&tcsr_mutex 3>; > }; > + > + q6_region: wcnss@4a90 { Why here it is wcnss... > + reg = <0x0 0x4a90 0x0 0x2b0>; > + no-map; > + }; > }; > > soc@0 { > @@ -476,6 +481,39 @@ frame@b128000 { > status = "disabled"; > }; > }; > + > + q6v5_wcss: remoteproc@d10 { but everywhere else is wcss? > + compatible = "qcom,ipq5332-wcss-sec-pil"; > + reg = <0xd10 0x4040>; > + firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt"; It's one firmware independent of board? Best regards, Krzysztof
Re: [PATCH 2/2] remoteproc: qcom: add hexagon based WCSS secure PIL driver
On Tue, Aug 20, 2024 at 02:25:15PM +0530, Gokul Sriram Palanisamy wrote: > From: Vignesh Viswanathan > > Add support to bring up hexagon based WCSS secure PIL remoteproc. > IPQ5332, IPQ9574 supports secure PIL remoteproc. > > Signed-off-by: Vignesh Viswanathan > Signed-off-by: Manikanta Mylavarapu > Signed-off-by: Gokul Sriram Palanisamy > +static int wcss_sec_dump_segments(struct rproc *rproc, > + const struct firmware *fw) > +{ > + struct device *dev = rproc->dev.parent; > + struct reserved_mem *rmem = NULL; > + struct device_node *node; > + int num_segs, index = 0; > + int ret; > + > + /* Parse through additional reserved memory regions for the rproc > + * and add them to the coredump segments > + */ > + num_segs = of_count_phandle_with_args(dev->of_node, > + "memory-region", NULL); > + while (index < num_segs) { > + node = of_parse_phandle(dev->of_node, > + "memory-region", index); > + if (!node) > + return -EINVAL; > + > + rmem = of_reserved_mem_lookup(node); > + if (!rmem) { > + dev_err(dev, "unable to acquire memory-region index %d > num_segs %d\n", > + index, num_segs); Leaking refcnt. > + return -EINVAL; > + } > + > + of_node_put(node); > + > + dev_dbg(dev, "Adding segment 0x%pa size 0x%pa", > + &rmem->base, &rmem->size); > + ret = rproc_coredump_add_custom_segment(rproc, > + rmem->base, > + rmem->size, > + wcss_sec_copy_segment, > + NULL); > + if (ret) > + return ret; > + > + index++; > + } > + > + return 0; > +} > + > +static const struct rproc_ops wcss_sec_ops = { > + .start = wcss_sec_start, > + .stop = wcss_sec_stop, > + .da_to_va = wcss_sec_da_to_va, > + .load = wcss_sec_load, > + .get_boot_addr = rproc_elf_get_boot_addr, > + .panic = wcss_sec_panic, > + .parse_fw = wcss_sec_dump_segments, > +}; > + > +static int wcss_sec_alloc_memory_region(struct wcss_sec *wcss) > +{ > + struct reserved_mem *rmem = NULL; > + struct device_node *node; > + struct device *dev = wcss->dev; > + > + node = of_parse_phandle(dev->of_node, "memory-region", 0); > + if (node) { > + rmem = of_reserved_mem_lookup(node); > + } else { No, that's over complicated. Just if (!node) { error handling }. > + dev_err(dev, "can't find phandle memory-region\n"); > + return -EINVAL; > + } > + > + of_node_put(node); > + > + if (!rmem) { > + dev_err(dev, "unable to acquire memory-region\n"); > + return -EINVAL; > + } > + > + wcss->mem_phys = rmem->base; > + wcss->mem_reloc = rmem->base; > + wcss->mem_size = rmem->size; > + wcss->mem_region = devm_ioremap_wc(dev, wcss->mem_phys, wcss->mem_size); > + if (!wcss->mem_region) { > + dev_err(dev, "unable to map memory region: %pa+%pa\n", > + &rmem->base, &rmem->size); > + return -ENOMEM; > + } > + > + return 0; > +} > + ... > +static int wcss_sec_ipq5332_init_clk(struct wcss_sec *wcss) > +{ > + int ret; > + struct device *dev = wcss->dev; > + > + wcss->im_sleep = devm_clk_get(wcss->dev, "im_sleep"); > + if (IS_ERR(wcss->im_sleep)) { > + ret = PTR_ERR(wcss->im_sleep); > + if (ret != -EPROBE_DEFER) > + dev_err(dev, "failed to get im_sleep clock"); Syntax is return dev_err_probe. > + return ret; > + } > + > + ret = clk_prepare_enable(wcss->im_sleep); > + if (ret) { > + dev_err(dev, "could not enable im_sleep clk\n"); > + return ret; Just use devm_clk_get_enabled. > + } > + > + return 0; Best regards, Krzysztof
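[Editor's note] As a concrete illustration of the last two review comments (dev_err_probe() and devm_clk_get_enabled()), the clock init could collapse to something like the sketch below. This is only a suggested shape based on the review, not the author's follow-up patch; it assumes the struct wcss_sec layout from the posted driver.

/* Hedged sketch of wcss_sec_ipq5332_init_clk() reworked per the review */
static int wcss_sec_ipq5332_init_clk(struct wcss_sec *wcss)
{
	struct device *dev = wcss->dev;

	/*
	 * devm_clk_get_enabled() gets and prepare-enables the clock in one
	 * call, and disables/puts it automatically when the driver detaches,
	 * so the explicit clk_prepare_enable() and error unwinding go away.
	 */
	wcss->im_sleep = devm_clk_get_enabled(dev, "im_sleep");
	if (IS_ERR(wcss->im_sleep))
		return dev_err_probe(dev, PTR_ERR(wcss->im_sleep),
				     "failed to get im_sleep clock\n");

	return 0;
}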
[PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release
From: Tomas Glozar When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD), NULL pointer dereference can be triggered by sending consequent SIGINT and SIGTERM signals to the workload process. That then causes timerlat_fd_release to be called twice in a row, and the second time, hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL dereference. This can be reproduced using rtla: ``` $ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \ kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done [1] 1675 [1]+ Aborted (SIGTERM) rtla timerlat top -u -q [1] 1688 client_loop: send disconnect: Broken pipe ``` triggering the bug: ``` BUG: kernel NULL pointer dereference, address: 0010 Oops: Oops: [#1] PREEMPT SMP NOPTI CPU: 6 PID: 1679 Comm: timerlatu/6 Kdump: loaded Not tainted 6.10.0-rc2+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:hrtimer_active+0xd/0x50 RSP: 0018:a86641567cc0 EFLAGS: 00010286 RAX: 0002e2c0 RBX: 994c6bf2e2c8 RCX: 994b0911ac18 RDX: RSI: 994b02f10700 RDI: 994c6bf2e2c8 RBP: 994c6bf2e340 R08: 994b158f7400 R09: 994b0911ac18 R10: 0010 R11: 994b00d40f00 R12: 994c6bf2e2c8 R13: 994b02049b20 R14: 994b011806c0 R15: FS: () GS:994c6bf0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0010 CR3: 000139020006 CR4: 00770ef0 PKRU: 5554 Call Trace: ? __die+0x24/0x70 ? page_fault_oops+0x75/0x170 ? mt_destroy_walk.isra.0+0x2b3/0x320 ? exc_page_fault+0x70/0x160 ? asm_exc_page_fault+0x26/0x30 ? hrtimer_active+0xd/0x50 hrtimer_cancel+0x15/0x40 timerlat_fd_release+0x48/0xe0 __fput+0xed/0x2c0 task_work_run+0x59/0x90 do_exit+0x275/0x4b0 do_group_exit+0x30/0x80 get_signal+0x917/0x960 ? vfs_read+0xb7/0x340 arch_do_signal_or_restart+0x29/0xf0 ? syscall_exit_to_user_mode+0x70/0x1f0 ? syscall_exit_work+0xf3/0x120 syscall_exit_to_user_mode+0x1a0/0x1f0 do_syscall_64+0x89/0x160 ? clear_bhb_loop+0x25/0x80 ? clear_bhb_loop+0x25/0x80 ? clear_bhb_loop+0x25/0x80 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f75790fd9ec ... ``` Fix the NULL pointer dereference by checking tlat_var->kthread for zero first in timerlat_fd_release, before calling hrtimer_cancel. tlat_var->kthread is always non-zero unless the entire tlat_var is zero, since it is set to the TID of the userspace workload in timerlat_fd_open under a mutex. Cc: sta...@vger.kernel.org Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Co-developed-by: Luis Claudio R. Goncalves Signed-off-by: Luis Claudio R. 
Goncalves Signed-off-by: Tomas Glozar --- kernel/trace/trace_osnoise.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c index 66a871553d4a..6d2b39da4dce 100644 --- a/kernel/trace/trace_osnoise.c +++ b/kernel/trace/trace_osnoise.c @@ -2572,6 +2572,7 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) struct osnoise_variables *osn_var; struct timerlat_variables *tlat_var; long cpu = (long) file->private_data; + int ret = 0; migrate_disable(); mutex_lock(&interface_lock); @@ -2579,6 +2580,12 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu); tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu); + if (!tlat_var->kthread) { + /* the fd has been closed already */ + ret = -EBADF; + goto out; + } + hrtimer_cancel(&tlat_var->timer); memset(tlat_var, 0, sizeof(*tlat_var)); @@ -2593,9 +2600,10 @@ static int timerlat_fd_release(struct inode *inode, struct file *file) osn_var->kthread = NULL; } +out: mutex_unlock(&interface_lock); migrate_enable(); - return 0; + return ret; } #endif -- 2.46.0
[PATCH 1/1] uprobe: fix comment of uprobe_apply()
Depending on the argument 'add', uprobe_apply() may be registering or unregistering a probe. The current comment misses the description of the registration. Signed-off-by: Zhen Lei --- kernel/events/uprobes.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 73cc47708679f0c..c9de255e56e777f 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1201,8 +1201,8 @@ int uprobe_register_refctr(struct inode *inode, loff_t offset, EXPORT_SYMBOL_GPL(uprobe_register_refctr); /* - * uprobe_apply - unregister an already registered probe. - * @inode: the file in which the probe has to be removed. + * uprobe_apply - register a probe or unregister an already registered probe. + * @inode: the file in which the probe has to be placed or removed. * @offset: offset from the start of the file. * @uc: consumer which wants to add more or remove some breakpoints * @add: add or remove the breakpoints -- 2.34.1
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, 20 Aug 2024 11:48:07 +0100
Mark Rutland wrote:

> > I found the target function already has "noinline". I tried to add noinline
> > to the testing function (callsite), but it also did not work.
> > I think "noinline" is for the compiler, but LTO is done by the linker.
>
> If LTO is breaking noinline, then that has much larger implications for
> noinstr code and similar, and means that LTO is unsound...

Hmm, doesn't noinstr place the code in a separate section? I wonder if we
create a separate section for the test function that LTO is inlining, if it
will prevent it from being inlined.

That is, noinline tells the compiler not to inline, but LTO happens after the
compiler is done and may inline functions in the same section. But the linker
does see separate sections and I don't think it will try to inline that code
into another section.

But I could be wrong. ;-)

-- Steve
Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown
On 20.08.24 11:48, Beleswar Prasad Padhi wrote: > > On 20-08-2024 15:09, Jan Kiszka wrote: >> On 20.08.24 11:30, Beleswar Prasad Padhi wrote: >>> Hi Jan, >>> >>> On 19-08-2024 22:17, Jan Kiszka wrote: >>>> From: Jan Kiszka >>>> >>>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed >>>> first. When core 0 should then be stopped before its removal, it will >>>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g. >>> >>> Did you check this on top of -next-20240820 tag? There was a series[0] >>> which was merged recently which fixed this condition. I don't see this >>> issue when trying on top of -next-20240820 tag. >>> [0]: >>> https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/ >>> >> I didn't try those yet, I was on 6.11-rcX. But from reading them >> quickly, I'm not seeing the two issues I found directly addressed there. > > Check the comment by Andrew Davis[0], that addresses the above issue. > > [0]: > https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/ > OK, then someone still needs to update his patch accordingly. >> >>>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power >>>> up before core0 via sysfs") >>>> CC: sta...@vger.kernel.org >>>> Signed-off-by: Jan Kiszka >>>> --- >>>> >>>> There might be one more because I can still make this driver crash >>>> after an operator error. Were error scenarios tested at all? >>> >>> Can you point out what is this issue more specifically, and I can take >>> this up then. >> Try starting core1 before core0, and then again - system will hang or > If you are trying to stop and then start the cores from sysfs, that is > not yet supported. The hang is thus expected. What? Then the driver is broken, even more. Why wasn't it fully implemented? Jan -- Siemens AG, Technology Linux Expert Center
Re: [PATCH] virtio_pmem: Check device status before requesting flush
Philip Chen wrote:
> On Mon, Aug 19, 2024 at 2:56 PM Ira Weiny wrote:
> >
> > Philip Chen wrote:
> > > If a pmem device is in a bad status, the driver side could wait for
> > > host ack forever in virtio_pmem_flush(), causing the system to hang.
> >
> > I assume this was supposed to be v2 and you resent this as a proper v2
> > with a change list from v1?
>
> Ah...yes, I'll fix it and re-send it as a v2 patch.

Wait didn't you already do that? Wasn't this v2?

https://lore.kernel.org/all/20240815010337.2334245-1-philipc...@chromium.org/

Ira
Re: [PATCH 1/1] uprobe: fix comment of uprobe_apply()
On 08/20, Zhen Lei wrote:
>
> Depending on the argument 'add', uprobe_apply() may be registering or
> unregistering a probe.

...

> /*
> - * uprobe_apply - unregister an already registered probe.
> - * @inode: the file in which the probe has to be removed.
> + * uprobe_apply - register a probe or unregister an already registered probe.

Not really. See the commit 3c83a9ad0295eb63bd ("uprobes: make
uprobe_register() return struct uprobe *") in tip/perf/core which changed
this description

	* uprobe_apply - add or remove the breakpoints according to @uc->filter

still looks confusing, yes...

Oleg.
Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown
On 20-08-2024 19:50, Jan Kiszka wrote: On 20.08.24 11:48, Beleswar Prasad Padhi wrote: On 20-08-2024 15:09, Jan Kiszka wrote: On 20.08.24 11:30, Beleswar Prasad Padhi wrote: Hi Jan, On 19-08-2024 22:17, Jan Kiszka wrote: From: Jan Kiszka When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed first. When core 0 should then be stopped before its removal, it will find core1->rproc as NULL already and crashes. Happens on rmmod e.g. Did you check this on top of -next-20240820 tag? There was a series[0] which was merged recently which fixed this condition. I don't see this issue when trying on top of -next-20240820 tag. [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/ I didn't try those yet, I was on 6.11-rcX. But from reading them quickly, I'm not seeing the two issues I found directly addressed there. Check the comment by Andrew Davis[0], that addresses the above issue. [0]: https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/ OK, then someone still needs to update his patch accordingly. That comment was addressed in the v4 series revision[1] and was merged to linux-next, available with tag -next-20240820. Request you to please check if the issue persists with -next-20240820 tag. I checked myself, and was not able to reproduce. [1]: https://lore.kernel.org/all/Zr9nbWnADDB+ZOlg@p14s/ Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs") CC: sta...@vger.kernel.org Signed-off-by: Jan Kiszka --- There might be one more because I can still make this driver crash after an operator error. Were error scenarios tested at all? Can you point out what is this issue more specifically, and I can take this up then. Try starting core1 before core0, and then again - system will hang or If you are trying to stop and then start the cores from sysfs, that is not yet supported. The hang is thus expected. What? Then the driver is broken, even more. Why wasn't it fully implemented? The driver is capable of starting a core and stopping it all well. The problem is, when we stop a core from sysfs (without resetting the SoC itself), the remotecore is powered off, but its resources are not relinquished. So when we start it back, there could be some memory corruptions. This feature of "graceful shutdown" of remotecores is almost implemented and will be posted to this driver soon. Request you to try out after that. With graceful shutdown feature, this will be the flow: 1. We issue a core stop operation from sysfs. 2. The remoteproc driver sends a special "SHUTDOWN" mailbox message to the remotecore. 3. The remotecore relinquishes all of its acquired resources through Device Manager Firmware and sends an ACK back. 4. The remotecore enters WFI state and then is resetted through Host core. 5. Then, if we try to do the core start operation from sysfs, core should be up as expected. Thanks, Beleswar Jan
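[Editor's note] To make the five-step "graceful shutdown" flow above a little more concrete, a stop path built around it could look roughly like the sketch below. This is purely illustrative: the RP_MBOX_SHUTDOWN value, the shutdown_ack completion and the 3 s timeout are assumptions, and the real feature depends on Device Manager firmware support that has not been posted yet.

#include <linux/completion.h>
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/mailbox_client.h>

/* trimmed-down stand-in for struct k3_r5_rproc, just what the sketch needs */
struct k3_r5_rproc_sketch {
	struct mbox_chan *mbox;			/* real field in the driver */
	struct completion shutdown_ack;		/* assumed: completed by the mbox rx callback */
};

#define RP_MBOX_SHUTDOWN	0xFF000008	/* assumed message ID */

static int k3_r5_request_graceful_shutdown(struct k3_r5_rproc_sketch *kproc)
{
	unsigned long to = msecs_to_jiffies(3000);	/* assumed timeout */
	int ret;

	/* step 2: ask the remote core to relinquish its resources */
	ret = mbox_send_message(kproc->mbox, (void *)(uintptr_t)RP_MBOX_SHUTDOWN);
	if (ret < 0)
		return ret;

	/* step 3: wait for the ACK the core sends once resources are released */
	if (!wait_for_completion_timeout(&kproc->shutdown_ack, to))
		return -ETIMEDOUT;

	/* step 4: the core is now in WFI; the usual reset/stop path can run */
	return 0;
}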
Re: [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime (with timeout)
On 08/19, Andrii Nakryiko wrote: > > On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov wrote: > > > > On 08/12, Andrii Nakryiko wrote: > > > > > > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take > > > uretprobe-specific SRCU lock and keep it active as kernel transfers > > > control back to user space. > > ... > > > include/linux/uprobes.h | 49 ++- > > > kernel/events/uprobes.c | 294 ++-- > > > 2 files changed, 301 insertions(+), 42 deletions(-) > > > > Oh. To be honest I don't like this patch. > > > > I would like to know what other reviewers think, but to me it adds too many > > complications that I can't even fully understand... > > Which parts? The atomic xchg() and cmpxchg() parts? What exactly do > you feel like you don't fully understand? Heh, everything looks too complex for me ;) Say, hprobe_expire(). It is also called by ri_timer() in softirq context, right? And it does /* We lost the race, undo our refcount bump. It can drop to zero. */ put_uprobe(uprobe); How so? If the refcount goes to zero, put_uprobe() does mutex_lock(), but it must not sleep in softirq context. Or, prepare_uretprobe() which does rcu_assign_pointer(utask->return_instances, ri); if (!timer_pending(&utask->ri_timer)) mod_timer(&utask->ri_timer, ...); Suppose that the timer was pending and it was fired right before rcu_assign_pointer(). What guarantees that prepare_uretprobe() will see timer_pending() == false? rcu_assign_pointer()->smp_store_release() is a one-way barrier. This timer_pending() check may appear to happen before rcu_assign_pointer() completes. In this (yes, theoretical) case ri_timer() can miss the new return_instance, while prepare_uretprobe() can miss the necessary mod_timer(). I think this needs another mb() in between. And I can't convince myself hprobe_expire() is correct... OK, I don't fully understand the logic, but why data_race(READ_ONCE(hprobe->leased)) ? READ_ONCE() should be enough in this case? > > As I have already mentioned in the previous discussions, we can simply kill > > utask->active_uprobe. And utask->auprobe. > > I don't have anything against that, in principle, but let's benchmark > and test that thoroughly. I'm a bit uneasy about the possibility that > some arch-specific code will do container_of() on this arch_uprobe in > order to get to uprobe, Well, struct uprobe is not "exported", the arch-specific code can't do this. Oleg.
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, Aug 20, 2024 at 3:48 AM Mark Rutland wrote: > > On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote: > > On Mon, 19 Aug 2024 12:02:44 -0400 > > Steven Rostedt wrote: > > > > > On Tue, 20 Aug 2024 00:56:49 +0900 > > > Masami Hiramatsu (Google) wrote: > > > > > > > > > > > > > > We may need to add "noinline" or something to make sure those > > > > > functions > > > > > don't get inlined for LTO. > > > > > > > > Yeah, we need such option at least for function call test. > > > > > > Could you add the noinline, and if it fixes the issue send a patch? > > > > I found the target function already has "noinline". I tried to add noinline > > to the testing function (callsite), but it also did not work. > > I think "noinline" is for the compiler, but LTO is done by the linker. > > If LTO is breaking noinline, then that has much larger implications for > noinstr code and similar, and means that LTO is unsound... The noinline attribute is preserved in LLVM IR, so it should continue to work with LTO. Which function are we talking about here? Are you sure the function was inlined instead of being dropped completely? Does marking the function __used help? Sami
Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown
On 20-08-2024 20:29, Beleswar Prasad Padhi wrote: On 20-08-2024 19:50, Jan Kiszka wrote: On 20.08.24 11:48, Beleswar Prasad Padhi wrote: On 20-08-2024 15:09, Jan Kiszka wrote: On 20.08.24 11:30, Beleswar Prasad Padhi wrote: Hi Jan, On 19-08-2024 22:17, Jan Kiszka wrote: From: Jan Kiszka When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed first. When core 0 should then be stopped before its removal, it will find core1->rproc as NULL already and crashes. Happens on rmmod e.g. Did you check this on top of -next-20240820 tag? There was a series[0] which was merged recently which fixed this condition. I don't see this issue when trying on top of -next-20240820 tag. [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/ I didn't try those yet, I was on 6.11-rcX. But from reading them quickly, I'm not seeing the two issues I found directly addressed there. Check the comment by Andrew Davis[0], that addresses the above issue. [0]: https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/ OK, then someone still needs to update his patch accordingly. That comment was addressed in the v4 series revision[1] and was merged to linux-next, available with tag -next-20240820. Request you to please check if the issue persists with -next-20240820 tag. I checked myself, and was not able to reproduce. [1]: https://lore.kernel.org/all/Zr9nbWnADDB+ZOlg@p14s/ Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs") CC: sta...@vger.kernel.org Signed-off-by: Jan Kiszka --- There might be one more because I can still make this driver crash after an operator error. Were error scenarios tested at all? Can you point out what is this issue more specifically, and I can take this up then. Try starting core1 before core0, and then again - system will hang or If you are trying to stop and then start the cores from sysfs, that is not yet supported. The hang is thus expected. What? Then the driver is broken, even more. Why wasn't it fully implemented? Just wanted to point out that this "graceful shutdown" feature is majorly dependent on the Device Manager Firmware(point 3) and minimal changes to the remoteproc driver (point 2 and 4). Thus, as soon as Firmware is capable, we will send out the patches for this feature. The driver is capable of starting a core and stopping it all well. The problem is, when we stop a core from sysfs (without resetting the SoC itself), the remotecore is powered off, but its resources are not relinquished. So when we start it back, there could be some memory corruptions. This feature of "graceful shutdown" of remotecores is almost implemented and will be posted to this driver soon. Request you to try out after that. With graceful shutdown feature, this will be the flow: 1. We issue a core stop operation from sysfs. 2. The remoteproc driver sends a special "SHUTDOWN" mailbox message to the remotecore. 3. The remotecore relinquishes all of its acquired resources through Device Manager Firmware and sends an ACK back. 4. The remotecore enters WFI state and then is resetted through Host core. 5. Then, if we try to do the core start operation from sysfs, core should be up as expected. Thanks, Beleswar Jan
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, 20 Aug 2024 08:10:42 -0700 Sami Tolvanen wrote: > On Tue, Aug 20, 2024 at 3:48 AM Mark Rutland wrote: > > > > On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote: > > > On Mon, 19 Aug 2024 12:02:44 -0400 > > > Steven Rostedt wrote: > > > > > > > On Tue, 20 Aug 2024 00:56:49 +0900 > > > > Masami Hiramatsu (Google) wrote: > > > > > > > > > > > > > > > > > We may need to add "noinline" or something to make sure those > > > > > > functions > > > > > > don't get inlined for LTO. > > > > > > > > > > Yeah, we need such option at least for function call test. > > > > > > > > Could you add the noinline, and if it fixes the issue send a patch? > > > > > > I found the target function already has "noinline". I tried to add > > > noinline > > > to the testing function (callsite), but it also did not work. > > > I think "noinline" is for the compiler, but LTO is done by the linker. > > > > If LTO is breaking noinline, then that has much larger implications for > > noinstr code and similar, and means that LTO is unsound... > > The noinline attribute is preserved in LLVM IR, so it should continue > to work with LTO. Which function are we talking about here? Are you > sure the function was inlined instead of being dropped completely? > Does marking the function __used help? Hmm, maybe it was dropped, as the functions are basically just stubs. Masami, can you add __used to the functions in trace_selftests_dynamic.c and see if that fixes it? -- Steve
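For reference, a minimal sketch of the change being suggested for the selftest stubs, combining noinline (do not fold the body into callers) with __used (do not let LTO drop the standalone symbol even if it appears unreferenced). This is illustrative only; the actual contents of kernel/trace/trace_selftest_dynamic.c may differ:

#include "trace.h"

noinline __used int DYN_FTRACE_TEST_NAME(void)
{
	/* stub target for the dynamic ftrace filter selftest */
	return 0;
}

noinline __used int DYN_FTRACE_TEST_NAME2(void)
{
	/* second stub target used by the filter selftests */
	return 0;
}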
Re: [PATCH v6 2/4] kbuild, kconfig: generate offset range data for builtin modules
On Sun, Aug 18, 2024 at 03:19:36PM +0900, Masahiro Yamada wrote: > On Fri, Aug 16, 2024 at 12:04???AM Kris Van Hees > wrote: > > > The subject should be: > "kbuild: generate offset range data for builtin modules" > > > (Drop ", kconfig") Thank you - applied. > > > > Create file module.builtin.ranges that can be used to find where > > built-in modules are located by their addresses. This will be useful for > > tracing tools to find what functions are for various built-in modules. > > > > The offset range data for builtin modules is generated using: > > - modules.builtin: associates object files with module names > > - vmlinux.map: provides load order of sections and offset of first member > > per section > > - vmlinux.o.map: provides offset of object file content per section > > - .*.cmd: build cmd file with KBUILD_MODFILE and KBUILD_MODNAME > > > I do not see "KBUILD_MODNAME" in the code. > It only checks "KUILD_MODFILE". Ah yes, that was a leftover from the earlier implementation. Updated. > > > > The generated data will look like: > > > > .text - = _text > > .text baf0-cb10 amd_uncore > > .text 0009bd10-0009c8e0 iosf_mbi > > ... > > .text 008e6660-008e9630 snd_soc_wcd_mbhc > > .text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x > > > > It is good to note that multiple module names appear > in one line, but the instance (snd_soc_wcd933*) no longer > occurs since 11b0b802f8e38d48ca74d520028add81263f003e. > > > I recommend to replace the output snippet with: > > > .text 00b9f080-00ba011a intel_skl_int3472_discrete > .text 00ba0120-00ba03c0 intel_skl_int3472_discrete intel_skl_int3472_tps68470 > .text 00ba03c0-00ba08d6 intel_skl_int3472_tps68470 > > > This still happens when CONFIG_INTEL_SKL_INT3472=y. Applied. Thanks for pointing this out - I didn't noticed that the original case was no longer present. > > .text 008ea610-008ea780 snd_soc_wcd9335 > > ... > > .data - = _sdata > > .data f020-f680 amd_uncore > > > > For each ELF section, it lists the offset of the first symbol. This can > > be used to determine the base address of the section at runtime. > > > > Next, it lists (in strict ascending order) offset ranges in that section > > that cover the symbols of one or more builtin modules. Multiple ranges > > can apply to a single module, and ranges can be shared between modules. > > > > The CONFIG_BUILTIN_MODULE_RANGES option controls whether offset range data > > is generated for kernel modules that are built into the kernel image. > > > > How it works: > > > > 1. The modules.builtin file is parsed to obtain a list of built-in > > module names and their associated object names (the .ko file that > > the module would be in if it were a loadable module, hereafter > > referred to as ). This object name can be used to > > identify objects in the kernel compile because any C or assembler > > code that ends up into a built-in module will have the option > > -DKBUILD_MODFILE= present in its build command, and those > > can be found in the ..cmd file in the kernel build tree. > > > > If an object is part of multiple modules, they will all be listed > > in the KBUILD_MODFILE option argument. > > > > This allows us to conclusively determine whether an object in the > > kernel build belong to any modules, and which. > > > > 2. The vmlinux.map is parsed next to determine the base address of each > > top level section so that all addresses into the section can be > > turned into offsets. 
This makes it possible to handle sections > > getting loaded at different addresses at system boot. > > > > We also determine an 'anchor' symbol at the beginning of each > > section to make it possible to calculate the true base address of > > a section at runtime (i.e. symbol address - symbol offset). > > > > We collect start addresses of sections that are included in the top > > level section. This is used when vmlinux is linked using vmlinux.o, > > because in that case, we need to look at the vmlinux.o linker map to > > know what object a symbol is found in. > > > > And finally, we process each symbol that is listed in vmlinux.map > > (or vmlinux.o.map) based on the following structure: > > > > vmlinux linked from vmlinux.a: > > > > vmlinux.map: > > > > -- might be same as top level section) > > -- built-in association known > > -- belongs to module(s) object belongs to > > ... > > > > vmlinux linked from vmlinux.o: > > > > vmlinux.map: > > > > -- might be same as top level section) > > vmlinux.o -- need to use vmlinux.o.map > > -- ignored > > ... > > > > vmlinux.o.map: > > > > -- built-in association known > >
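As an illustration of how a tracing tool might consume the generated file, here is a minimal user-space sketch (not part of the series; the struct and function names below are invented) that maps an address to its built-in module(s) using a parsed range and the section base recovered from the anchor symbol:

#include <stddef.h>
#include <stdint.h>

/*
 * One parsed line of modules.builtin.ranges, e.g.
 *   .text baf0-cb10 amd_uncore
 * start/end are offsets into the named section, not absolute addresses.
 */
struct builtin_range {
	uint64_t start;
	uint64_t end;
	const char *modules;	/* space-separated module name(s) */
};

/*
 * The section base is recovered at runtime from the section's anchor line,
 * i.e. base = runtime address of the anchor symbol minus its recorded offset.
 */
static const char *builtin_module_for(uint64_t addr, uint64_t section_base,
				      const struct builtin_range *ranges,
				      int nranges)
{
	uint64_t off = addr - section_base;
	int i;

	for (i = 0; i < nranges; i++) {
		if (off >= ranges[i].start && off < ranges[i].end)
			return ranges[i].modules;
	}

	return NULL;	/* not covered by any built-in module */
}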
Re: [PATCH v6 3/4] scripts: add verifier script for builtin module range data
On Sun, Aug 18, 2024 at 03:40:36PM +0900, Masahiro Yamada wrote: > On Fri, Aug 16, 2024 at 12:04???AM Kris Van Hees > wrote: > > > > The modules.builtin.ranges offset range data for builtin modules is > > generated at compile time based on the list of built-in modules and > > the vmlinux.map and vmlinux.o.map linker maps. This data can be used > > to determine whether a symbol at a particular address belongs to > > module code that was configured to be compiled into the kernel proper > > as a built-in module (rather than as a standalone module). > > > > This patch adds a script that uses the generated modules.builtin.ranges > > data to annotate the symbols in the System.map with module names if > > their address falls within a range that belongs to one or more built-in > > modules. > > > > It then processes the vmlinux.map (and if needed, vmlinux.o.map) to > > verify the annotation: > > > > - For each top-level section: > > - For each object in the section: > > - Determine whether the object is part of a built-in module > > (using modules.builtin and the .*.cmd file used to compile > >the object as suggested in [0]) > > - For each symbol in that object, verify that the built-in > > module association (or lack thereof) matches the annotation > > given to the symbol. > > > > Signed-off-by: Kris Van Hees > > Reviewed-by: Nick Alcock > > Reviewed-by: Alan Maguire > > --- > > Changes since v5: > > - Added optional 6th argument to specify kernel build directory. > > - Report error and exit if .*.o.cmd files cannot be read. > > > > Changes since v4: > > - New patch in the series > > --- > > scripts/verify_builtin_ranges.awk | 365 ++ > > 1 file changed, 365 insertions(+) > > create mode 100755 scripts/verify_builtin_ranges.awk > > > > diff --git a/scripts/verify_builtin_ranges.awk > > b/scripts/verify_builtin_ranges.awk > > new file mode 100755 > > index ..b82cf0a0fbeb > > --- /dev/null > > +++ b/scripts/verify_builtin_ranges.awk > > @@ -0,0 +1,365 @@ > > +#!/usr/bin/gawk -f > > +# SPDX-License-Identifier: GPL-2.0 > > +# verify_builtin_ranges.awk: Verify address range data for builtin modules > > +# Written by Kris Van Hees > > +# > > +# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \ > > +# modules.builtin vmlinux.map > > vmlinux.o.map \ > > +# [ ] > > +# > > + > > +# Return the module name(s) (if any) associated with the given object. > > +# > > +# If we have seen this object before, return information from the cache. > > +# Otherwise, retrieve it from the corresponding .cmd file. > > +# > > +function get_module_info(fn, mod, obj, mfn, s) { > > + if (fn in omod) > > + return omod[fn]; > > + > > + if (match(fn, /\/[^/]+$/) == 0) > > + return ""; > > + > > + obj = fn; > > + mod = ""; > > + mfn = ""; > > + fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) > > ".cmd"; > > + if (getline s > + if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) { > > + mfn = substr(s, RSTART + 16, RLENGTH - 16); > > + gsub(/['"]/, "", mfn); > > + > > + mod = mfn; > > + gsub(/([^/ ]*\/)+/, "", mod); > > + gsub(/-/, "_", mod); > > + } > > + } else { > > + print "ERROR: Failed to read: " fn "\n\n" \ > > + " Invalid kernel build directory (" kdir ")\n" \ > > + " or its content does not match " ARGV[1] > > >"/dev/stderr"; > > + close(fn); > > + total = 0; > > + exit(1); > > + } > > + close(fn); > > + > > + # A single module (common case) also reflects objects that are not > > part > > + # of a module. Some of those objects have names that are also a > > module > > + # name (e.g. core). 
We check the associated module file name, and > > if > > + # they do not match, the object is not part of a module. > > + if (mod !~ / /) { > > + if (!(mod in mods)) > > + return ""; > > + if (mods[mod] != mfn) > > + return ""; > > + } > > + > > + # At this point, mod is a single (valid) module name, or a list of > > + # module names (that do not need validation). > > + omod[obj] = mod; > > + close(fn); > > + > > + return mod; > > +} > > > > > > This code is copy-paste from scripts/generate_builtin_ranges.awk > So, my comments in 2/4 can apply to this patch, too. Yes, and I will apply the same changes to the verifier script. > Instead of adding a separate script, > we could add a "verify mode" option. > > > scripts/generate_builtin_ranges.awk --verify ... >
[PATCH v2] virtio_pmem: Check device status before requesting flush
If a pmem device is in a bad status, the driver side could wait for host ack forever in virtio_pmem_flush(), causing the system to hang. So add a status check in the beginning of virtio_pmem_flush() to return early if the device is not activated. Signed-off-by: Philip Chen --- v2: - Remove change id from the patch description - Add more details to the patch description drivers/nvdimm/nd_virtio.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c index 35c8fbbba10e..97addba06539 100644 --- a/drivers/nvdimm/nd_virtio.c +++ b/drivers/nvdimm/nd_virtio.c @@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region) unsigned long flags; int err, err1; + /* +* Don't bother to submit the request to the device if the device is +* not acticated. +*/ + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) { + dev_info(&vdev->dev, "virtio pmem device needs a reset\n"); + return -EIO; + } + might_sleep(); req_data = kmalloc(sizeof(*req_data), GFP_KERNEL); if (!req_data) -- 2.46.0.184.g6999bdac58-goog
Re: [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime (with timeout)
On Tue, Aug 20, 2024 at 8:06 AM Oleg Nesterov wrote: > > On 08/19, Andrii Nakryiko wrote: > > > > On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov wrote: > > > > > > On 08/12, Andrii Nakryiko wrote: > > > > > > > > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take > > > > uretprobe-specific SRCU lock and keep it active as kernel transfers > > > > control back to user space. > > > ... > > > > include/linux/uprobes.h | 49 ++- > > > > kernel/events/uprobes.c | 294 ++-- > > > > 2 files changed, 301 insertions(+), 42 deletions(-) > > > > > > Oh. To be honest I don't like this patch. > > > > > > I would like to know what other reviewers think, but to me it adds too > > > many > > > complications that I can't even fully understand... > > > > Which parts? The atomic xchg() and cmpxchg() parts? What exactly do > > you feel like you don't fully understand? > > Heh, everything looks too complex for me ;) Well, the best code is no code. But I'm not doing this just for fun, so I'm happy with the simplest solution *that works*. > > Say, hprobe_expire(). It is also called by ri_timer() in softirq context, > right? And it does > > /* We lost the race, undo our refcount bump. It can drop to zero. */ > put_uprobe(uprobe); > > How so? If the refcount goes to zero, put_uprobe() does mutex_lock(), > but it must not sleep in softirq context. > Now we are talking about specific issues, thank you! It's hard to discuss "I don't like". Yes, I think you are right and it is certainly a problem with put_uprobe() that it can't be called from softirq (just having to remember that is error-prone, as is evidenced by me forgetting about this issue). It's easy enough to solve. We can either schedule work from timer thread (and that will solve this particular issue only), or we can teach put_uprobe() to schedule work item if it drops refcount to zero from softirq and other restricted contexts. I vote for making put_uprobe() flexible in this regard, add work_struct to uprobe, and schedule all this to be done in the work queue callback. WDYT? > > Or, prepare_uretprobe() which does > > rcu_assign_pointer(utask->return_instances, ri); > > if (!timer_pending(&utask->ri_timer)) > mod_timer(&utask->ri_timer, ...); > > Suppose that the timer was pending and it was fired right before > rcu_assign_pointer(). What guarantees that prepare_uretprobe() will see > timer_pending() == false? > > rcu_assign_pointer()->smp_store_release() is a one-way barrier. This > timer_pending() check may appear to happen before rcu_assign_pointer() > completes. > > In this (yes, theoretical) case ri_timer() can miss the new return_instance, > while prepare_uretprobe() can miss the necessary mod_timer(). I think this > needs another mb() in between. > Ok, that's fair. I felt like this pattern might be a bit problematic, but I also felt like it's good to have to ensure that we do occasionally have a race between timer callback and uretprobe, even if uretprobe returns very quickly. (and I did confirm we get those races and they seem to be handled fine, i.e., I saw uprobes being "expired" into refcounted ones from ri_timer) But the really simple way to solve this is to do unconditional mod_timer(), so I can do just that to keep this less tricky. Would you be ok with that? > > And I can't convince myself hprobe_expire() is correct... OK, I don't > fully understand the logic, but why data_race(READ_ONCE(hprobe->leased)) ? > READ_ONCE() should be enough in this case? you mean why data_race() annotation? 
To appease KCSAN, given that we modify hprobe->leased with xchg/cmpxchg, but read here with READ_ONCE(). Maybe I'm overthinking it, not sure. There is a reason why this is an RFC ;) > > > > > As I have already mentioned in the previous discussions, we can simply > > > kill > > > utask->active_uprobe. And utask->auprobe. > > > > I don't have anything against that, in principle, but let's benchmark > > and test that thoroughly. I'm a bit uneasy about the possibility that > > some arch-specific code will do container_of() on this arch_uprobe in > > order to get to uprobe, > > Well, struct uprobe is not "exported", the arch-specific code can't do this. > Ah, good point, that's great. But as I said, uretprobe is actually *the common* use case, not single-stepped uprobe. Still not very happy about that memcpy() and assumption that it's cheap, but that's minor. But no matter what we do for single-stepped one, uretprobe needs some solution. (and if that solution works for uretprobe, why wouldn't it work for single-step?...) > Oleg. >
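A minimal sketch of the deferred-free idea proposed above (the free_work field and the uprobe_free() helper are hypothetical names for illustration, not the existing uprobes code):

#include <linux/kernel.h>
#include <linux/refcount.h>
#include <linux/sched.h>
#include <linux/workqueue.h>

struct uprobe {
	refcount_t		ref;
	struct work_struct	free_work;	/* hypothetical new field */
	/* ... existing members ... */
};

/* Stand-in for the existing teardown path, which takes a mutex and may sleep. */
static void uprobe_free(struct uprobe *uprobe)
{
	/* remove from the uprobes tree, free resources, kfree(uprobe), ... */
}

static void uprobe_free_deferred(struct work_struct *work)
{
	struct uprobe *uprobe = container_of(work, struct uprobe, free_work);

	uprobe_free(uprobe);
}

static void put_uprobe(struct uprobe *uprobe)
{
	if (!refcount_dec_and_test(&uprobe->ref))
		return;

	if (in_task()) {
		uprobe_free(uprobe);	/* process context: safe to sleep */
	} else {
		/* softirq (e.g. ri_timer) or other atomic context */
		INIT_WORK(&uprobe->free_work, uprobe_free_deferred);
		schedule_work(&uprobe->free_work);
	}
}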
Re: [PATCH v2 16/19] gendwarfksyms: Add support for reserved structure fields
On Mon, Aug 19, 2024 at 10:17 PM Benno Lossin wrote: > > On 19.08.24 21:38, Sami Tolvanen wrote: > > > > This definitely looks cleaner than unions in Rust, but how would this > > scheme be visible in DWARF? You might also need to expand the annotation > > to allow replacing one reserved field with multiple smaller ones without > > using structs. > > Hmm that's a good question, I have no idea how DWARF works. The way you > do it in this patch is just by the name of the field, right? Correct, it just looks at the name of the union fields. > If Rust's DWARF output contains exact types names (I just checked this, > I *think* that this is the case, but I have never used/seen DWARF > before), we might be able to just create a `KAbiReserved` type > that you search for instead of the attribute. The usage would then be > like this: > > #[repr(C)] > pub struct Struct1 { > a: u64, > _reserved: KAbiReserved<(), u64>, > } > > And then when adding a new field, you would do this: > > #[repr(C)] > pub struct Struct1 { > a: u64, > b: KAbiReserved, > } > > /* Struct2 as above */ > > The way `KAbiReserved` is implemented is via a `union` (maybe a bit > ironic, considering what I said in my other replies, but in this case, > we would provide a safe abstraction over this `union`, thus avoiding > exposing users of this type to `unsafe`): > > #[repr(C)] > pub union KAbiReserved { > value: T, > _reserved: R, > } I like this approach even better, assuming any remaining issues with ownership etc. can be sorted out. This would also look identical to the C version in DWARF if you rename _reserved in the union to __kabi_reserved. Of course, we can always change gendwarfksyms to support a different scheme for Rust code if a better solution comes along later. Sami
Re: [PATCH v2] virtio_pmem: Check device status before requesting flush
On 8/20/24 10:22 AM, Philip Chen wrote: > If a pmem device is in a bad status, the driver side could wait for > host ack forever in virtio_pmem_flush(), causing the system to hang. > > So add a status check in the beginning of virtio_pmem_flush() to return > early if the device is not activated. > > Signed-off-by: Philip Chen > --- > > v2: > - Remove change id from the patch description > - Add more details to the patch description > > drivers/nvdimm/nd_virtio.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c > index 35c8fbbba10e..97addba06539 100644 > --- a/drivers/nvdimm/nd_virtio.c > +++ b/drivers/nvdimm/nd_virtio.c > @@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region) > unsigned long flags; > int err, err1; > > + /* > + * Don't bother to submit the request to the device if the device is > + * not acticated. s/acticated/activated/ > + */ > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) { > + dev_info(&vdev->dev, "virtio pmem device needs a reset\n"); > + return -EIO; > + } > + > might_sleep(); > req_data = kmalloc(sizeof(*req_data), GFP_KERNEL); > if (!req_data)
Re: [PATCH v2 16/19] gendwarfksyms: Add support for reserved structure fields
> > The way `KAbiReserved` is implemented is via a `union` (maybe a bit > > ironic, considering what I said in my other replies, but in this case, > > we would provide a safe abstraction over this `union`, thus avoiding > > exposing users of this type to `unsafe`): > > > > #[repr(C)] > > pub union KAbiReserved { > > value: T, > > _reserved: R, > > } > > I like this approach even better, assuming any remaining issues with > ownership etc. can be sorted out. This would also look identical to > the C version in DWARF if you rename _reserved in the union to > __kabi_reserved. Of course, we can always change gendwarfksyms to > support a different scheme for Rust code if a better solution comes > along later. > > Sami Agreement here - this seems like a good approach to representing reserved in Rust code. A few minor adjustments we discussed off-list which aren't required for gendwarfksyms to know about: 1. Types being added to reserved fields have to be `Copy`, e.g. they must be `!Drop`. 2. Types being added to reserved fields must be legal to be represented by all zeroes. 3. Reserved fields need to be initialized to zero before having their union set to the provided value when constructing them. 4. It may be helpful to have delegating trait implementations to avoid the couple places where autoderef won't handle the conversion. While I think this is the right solution, esp. since it can share a representation with C, I wanted to call out one minor shortfall - a reserved field can only be replaced by one type. We could still indicate a replacement by two fields the same as in C, by using a tuple which will look like an anonymous struct. The limitation will be that if two or more new fields were introduced, we'd need to edit the patches accessing them to do foo.x.y and foo.x.z for their accesses instead of simply foo.y and foo.z - the autoref trick only works for a single type.
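For comparison, the C-side reserved-field pattern being mirrored here looks roughly like the following sketch (the __kabi_reserved naming follows the discussion above; the exact field-naming convention is whatever the gendwarfksyms series defines):

#include <linux/types.h>

/* Before: an explicitly reserved slot that is part of the ABI layout. */
struct example_v1 {
	u64 a;
	u64 __kabi_reserved_1;
};

/*
 * After: the reserved slot is consumed by a new member via an anonymous
 * union, so the size and member offsets (and therefore the DWARF-derived
 * symbol version) stay the same.
 */
struct example_v2 {
	u64 a;
	union {
		u64 b;			/* the new field */
		u64 __kabi_reserved_1;	/* the original placeholder */
	};
};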
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, 20 Aug 2024 08:10:42 -0700 Sami Tolvanen wrote: > On Tue, Aug 20, 2024 at 3:48 AM Mark Rutland wrote: > > > > On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote: > > > On Mon, 19 Aug 2024 12:02:44 -0400 > > > Steven Rostedt wrote: > > > > > > > On Tue, 20 Aug 2024 00:56:49 +0900 > > > > Masami Hiramatsu (Google) wrote: > > > > > > > > > > > > > > > > > We may need to add "noinline" or something to make sure those > > > > > > functions > > > > > > don't get inlined for LTO. > > > > > > > > > > Yeah, we need such option at least for function call test. > > > > > > > > Could you add the noinline, and if it fixes the issue send a patch? > > > > > > I found the target function already has "noinline". I tried to add > > > noinline > > > to the testing function (callsite), but it also did not work. > > > I think "noinline" is for the compiler, but LTO is done by the linker. > > > > If LTO is breaking noinline, then that has much larger implications for > > noinstr code and similar, and means that LTO is unsound... > > The noinline attribute is preserved in LLVM IR, so it should continue > to work with LTO. Which function are we talking about here? Are you > sure the function was inlined instead of being dropped completely? > Does marking the function __used help? We are talking about trace_selftest_startup_dynamic_tracing() in kernel/trace/trace_selftest.c. The callee is func() which is actually DYN_FTRACE_TEST_NAME() in kernel/trace/trace_selftest_dynamic.c. That function passed as pointer (but the compiler can embed it by constant propagation.) Does the noinline attribute prevent embedding callsite too? I mean extern callee() noinline callee() { ... } caller() { callee() // (*) } In this case, does noinline prevent LTO to embed the callee at the callsite(*) or prevent LTO remove the callee() symbol? Thank you, > > Sami > -- Masami Hiramatsu (Google)
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Wed, 21 Aug 2024 07:05:39 +0900 Masami Hiramatsu (Google) wrote: > Does the noinline attribute prevent embedding callsite too? I mean > > extern callee() > > noinline callee() > { > ... > } > > caller() > { > callee() // (*) > } > > In this case, does noinline prevent LTO to embed the callee at the callsite(*) > or prevent LTO remove the callee() symbol? > Even though we have it passed as a parameter, I think the compiler and linker is smart enough to see that and notice its use, and that the function passed in is a nop, which doesn't break the flow. Can you add the __used and see if it fixes it? -- Steve
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, 20 Aug 2024 18:11:09 -0400 Steven Rostedt wrote: > On Wed, 21 Aug 2024 07:05:39 +0900 > Masami Hiramatsu (Google) wrote: > > > > Does the noinline attribute prevent embedding callsite too? I mean > > > > extern callee() > > > > noinline callee() > > { > > ... > > } > > > > caller() > > { > > callee() // (*) > > } > > > > In this case, does noinline prevent LTO to embed the callee at the > > callsite(*) > > or prevent LTO remove the callee() symbol? > > > > Even though we have it passed as a parameter, I think the compiler and > linker is smart enough to see that and notice its use, and that the > function passed in is a nop, which doesn't break the flow. > > Can you add the __used and see if it fixes it? Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does not change, the test still fails. Hmm, what about makes the caller (trace_selftest_startup_dynamic_tracing()) called via a function pointer? In that case, wouldn't it be subject to constant propagetion? Let me try. Thanks, > > -- Steve -- Masami Hiramatsu (Google)
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Wed, 21 Aug 2024 08:43:51 +0900 Masami Hiramatsu (Google) wrote: > > Can you add the __used and see if it fixes it? > > Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does > not change, the test still fails. OK, now that sounds like a bug in LTO itself. -- Steve
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Tue, 20 Aug 2024 19:49:14 -0400 Steven Rostedt wrote: > On Wed, 21 Aug 2024 08:43:51 +0900 > Masami Hiramatsu (Google) wrote: > > > > Can you add the __used and see if it fixes it? > > > > Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does > > not change, the test still fails. > > OK, now that sounds like a bug in LTO itself. Hmm, I think __used only guarantees that the function is kept as a standalone symbol; the callsite can still be optimized. That means the __used function's code may be duplicated: one copy is embedded at the callsite, while another copy is kept as a function that can be called from outside. The "noinline" attribute, however, is expected to prevent embedding at the callsite (because that is what "inlining" means), so it does look like a bug. Thank you, > > -- Steve -- Masami Hiramatsu (Google)
Re: [BUG] tracing: dynamic ftrace selftest detected failures
On Wed, 21 Aug 2024 08:43:51 +0900 Masami Hiramatsu (Google) wrote: > On Tue, 20 Aug 2024 18:11:09 -0400 > Steven Rostedt wrote: > > > On Wed, 21 Aug 2024 07:05:39 +0900 > > Masami Hiramatsu (Google) wrote: > > > > > > > Does the noinline attribute prevent embedding callsite too? I mean > > > > > > extern callee() > > > > > > noinline callee() > > > { > > > ... > > > } > > > > > > caller() > > > { > > > callee() // (*) > > > } > > > > > > In this case, does noinline prevent LTO to embed the callee at the > > > callsite(*) > > > or prevent LTO remove the callee() symbol? > > > > > > > Even though we have it passed as a parameter, I think the compiler and > > linker is smart enough to see that and notice its use, and that the > > function passed in is a nop, which doesn't break the flow. > > > > Can you add the __used and see if it fixes it? > > Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does > not change, the test still fails. Hmm, what about makes the caller > (trace_selftest_startup_dynamic_tracing()) called via a function pointer? > In that case, wouldn't it be subject to constant propagetion? > > Let me try. OK, it is succeeded! Calling `caller` via global function pointer makes it run as we expected. It passed the dynamic_ftrace test, but other tests still fails. Those need to be called via function pointer too. [1.851324] Testing dynamic ftrace: PASSED [2.083329] Testing dynamic ftrace ops #1: [2.173751] (0 0 0 0 0) FAILED! [2.182337] [ cut here ] [2.183323] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:2143 run_tracer_sel0 [2.184323] Modules linked in: Anyway, here is what I did. diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c index 97f1e4bc47dc..9663bc777888 100644 --- a/kernel/trace/trace_selftest.c +++ b/kernel/trace/trace_selftest.c @@ -353,9 +353,10 @@ static int trace_selftest_ops(struct trace_array *tr, int cnt) } /* Test dynamic code modification and ftrace filters */ -static int trace_selftest_startup_dynamic_tracing(struct tracer *trace, - struct trace_array *tr, - int (*func)(void)) +static int noinline +trace_selftest_startup_dynamic_tracing(struct tracer *trace, + struct trace_array *tr, + int (*func)(void)) { int save_ftrace_enabled = ftrace_enabled; unsigned long count; @@ -569,10 +570,22 @@ trace_selftest_function_recursion(void) return ret; } #else -# define trace_selftest_startup_dynamic_tracing(trace, tr, func) ({ 0; }) +static int trace_selftest_startup_dynamic_tracing(struct tracer *trace, + struct trace_array *tr, + int (*func)(void)) +{ + if (!trace || !tr || !func) + return -EINVAL; + return 0; +} # define trace_selftest_function_recursion() ({ 0; }) #endif /* CONFIG_DYNAMIC_FTRACE */ +int (*global_trace_selftest_startup_dynamic_tracing)(struct tracer *trace, + struct trace_array *tr, + int (*func)(void)) + = trace_selftest_startup_dynamic_tracing; + static enum { TRACE_SELFTEST_REGS_START, TRACE_SELFTEST_REGS_FOUND, @@ -732,7 +745,7 @@ trace_selftest_startup_function(struct tracer *trace, struct trace_array *tr) goto out; } - ret = trace_selftest_startup_dynamic_tracing(trace, tr, + ret = global_trace_selftest_startup_dynamic_tracing(trace, tr, DYN_FTRACE_TEST_NAME); if (ret) goto out; -- Masami Hiramatsu (Google)
[syzbot] [perf?] KASAN: slab-use-after-free Read in uprobe_unregister
Hello, syzbot found the following issue on: HEAD commit:367b5c3d53e5 Add linux-next specific files for 20240816 git tree: linux-next console+strace: https://syzkaller.appspot.com/x/log.txt?x=109d46a398 kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 dashboard link: https://syzkaller.appspot.com/bug?extid=b32fa80fe75273685f9c compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1099b7d398 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15c513e598 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz kernel image: https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+b32fa80fe75273685...@syzkaller.appspotmail.com R10: 0001 R11: 0246 R12: R13: R14: 0001 R15: 0001 == BUG: KASAN: slab-use-after-free in consumer_del kernel/events/uprobes.c:772 [inline] BUG: KASAN: slab-use-after-free in uprobe_unregister+0x99/0x220 kernel/events/uprobes.c:1097 Read of size 8 at addr 888022b9eb40 by task syz-executor227/5226 CPU: 0 UID: 0 PID: 5226 Comm: syz-executor227 Not tainted 6.11.0-rc3-next-20240816-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 Call Trace: __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 consumer_del kernel/events/uprobes.c:772 [inline] uprobe_unregister+0x99/0x220 kernel/events/uprobes.c:1097 bpf_uprobe_unregister kernel/trace/bpf_trace.c:3119 [inline] bpf_uprobe_multi_link_release+0x8c/0x1b0 kernel/trace/bpf_trace.c:3127 bpf_link_free+0xf5/0x250 kernel/bpf/syscall.c:2998 bpf_link_put_direct kernel/bpf/syscall.c:3038 [inline] bpf_link_release+0x7b/0x90 kernel/bpf/syscall.c:3045 __fput+0x24b/0x890 fs/file_table.c:424 task_work_run+0x24f/0x310 kernel/task_work.c:228 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0xa2f/0x28e0 kernel/exit.c:939 do_group_exit+0x207/0x2c0 kernel/exit.c:1088 __do_sys_exit_group kernel/exit.c:1099 [inline] __se_sys_exit_group kernel/exit.c:1097 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 x64_sys_call+0x26a8/0x26b0 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f409a857c79 Code: Unable to access opcode bytes at 0x7f409a857c4f. 
RSP: 002b:7fff689d0b48 EFLAGS: 0246 ORIG_RAX: 00e7 RAX: ffda RBX: RCX: 7f409a857c79 RDX: 003c RSI: 00e7 RDI: RBP: 7f409a8cb370 R08: ffb8 R09: 00a0 R10: 0001 R11: 0246 R12: 7f409a8cb370 R13: R14: 7f409a8cbdc0 R15: 7f409a821340 Allocated by task 5226: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 poison_kmalloc_redzone mm/kasan/common.c:377 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394 kasan_kmalloc include/linux/kasan.h:260 [inline] __do_kmalloc_node mm/slub.c:4202 [inline] __kmalloc_node_noprof+0x22a/0x440 mm/slub.c:4208 __kvmalloc_node_noprof+0x72/0x1b0 mm/util.c:658 kvmalloc_array_node_noprof include/linux/slab.h:835 [inline] bpf_uprobe_multi_link_attach+0x43e/0xb90 kernel/trace/bpf_trace.c:3369 link_create+0x6e9/0x870 kernel/bpf/syscall.c:5236 __sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5697 __do_sys_bpf kernel/bpf/syscall.c:5734 [inline] __se_sys_bpf kernel/bpf/syscall.c:5732 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5732 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 5226: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2295 [inline] slab_free mm/slub.c:4517 [inline] kfree+0x196/0x3e0 mm/slub.c:4665 bpf_uprobe_multi_link_attach+0xa7d/0xb90 kernel/trace/bpf_trace.c:3431 link_create+0x6e9/0x870 kernel/bpf/syscall.c:5236 __sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5697 __do_sys
Re: [PATCH 1/1] uprobe: fix comment of uprobe_apply()
On 2024/8/20 22:30, Oleg Nesterov wrote: > On 08/20, Zhen Lei wrote: >> >> Depending on the argument 'add', uprobe_apply() may be registering or >> unregistering a probe. > > ... > >> /* >> - * uprobe_apply - unregister an already registered probe. >> - * @inode: the file in which the probe has to be removed. >> + * uprobe_apply - register a probe or unregister an already registered >> probe. > > Not really. > > See the commit 3c83a9ad0295eb63bd ("uprobes: make uprobe_register() return > struct uprobe *") > in tip/perf/core which changed this description > > * uprobe_apply - add or remove the breakpoints according to @uc->filter > > still looks confusing, yes... OK, I got it. I mistakenly thought the comment was based on register_for_each_vma. It seems necessary to rename 'register_for_each_vma' to 'apply_for_each_vma', or some other more appropriate name. > > Oleg. > > > . > -- Regards, Zhen Lei
[PATCH v16 00/16] Add Cgroup support for SGX EPC memory
SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments, e.g., support for pod level control in a Kubernates cluster on a VM or bare-metal host [1,2]. This patchset implements the support for sgx_epc memory within the misc cgroup controller. A user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per cgroup. The implementation reports current usage and events of reaching the limit per cgroup as well as the total system capacity. Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques and pages can be swapped out of the EPC to their backing store, which are normal system memory allocated via shmem and accounted by the memory controller. Similar to per-cgroup reclamation done by the memory controller, the EPC misc controller needs to implement a per-cgroup EPC reclaiming process: when the EPC usage of a cgroup reaches its hard limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out some EPC pages within the same cgroup to make room for new allocations. For that, this implementation tracks reclaimable EPC pages in a separate LRU list in each cgroup, and below are more details and justification of this design. Track EPC pages in per-cgroup LRUs (from Dave) -- tl;dr: A cgroup hitting its limit should be as similar as possible to the system running out of EPC memory. The only two choices to implement that are nasty changes the existing LRU scanning algorithm, or to add new LRUs. The result: Add a new LRU for each cgroup and scans those instead. Replace the existing global cgroup with the root cgroup's LRU (only when this new support is compiled in, obviously). The existing EPC memory management aims to be a miniature version of the core VM where EPC memory can be overcommitted and reclaimed. EPC allocations can wait for reclaim. The alternative to waiting would have been to send a signal and let the enclave die. This series attempts to implement that same logic for cgroups, for the same reasons: it's preferable to wait for memory to become available and let reclaim happen than to do things that are fatal to enclaves. There is currently a global reclaimable page SGX LRU list. That list (and the existing scanning algorithm) is essentially useless for doing reclaim when a cgroup hits its limit because the cgroup's pages are scattered around that LRU. It is unspeakably inefficient to scan a linked list with millions of entries for what could be dozens of pages from a cgroup that needs reclaim. Even if unspeakably slow reclaim was accepted, the existing scanning algorithm only picks a few pages off the head of the global LRU. It would either need to hold the list locks for unreasonable amounts of time, or be taught to scan the list in pieces, which has its own challenges. Unreclaimable Enclave Pages --- There are a variety of page types for enclaves, each serving different purposes [5]. Although the SGX architecture supports swapping for all types, some special pages, e.g., Version Array(VA) and Secure Enclave Control Structure (SECS)[5], holds meta data of reclaimed pages and enclaves. That makes reclamation of such pages more intricate to manage. The SGX driver global reclaimer currently does not swap out VA pages. 
It only swaps the SECS page of an enclave when all other associated pages have been swapped out. The cgroup reclaimer follows the same approach and does not track those in per-cgroup LRUs and considers them as unreclaimable pages. The allocation of these pages is counted towards the usage of a specific cgroup and is subject to the cgroup's set EPC limits. Earlier versions of this series implemented forced enclave-killing to reclaim VA and SECS pages. That was designed to enforce the 'max' limit, particularly in scenarios where a user or administrator reduces this limit post-launch of enclaves. However, subsequent discussions [3, 4] indicated that such preemptive enforcement is not necessary for the misc-controllers. Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed, and the limit is only enforced at the time of new EPC allocation request. When a cgroup hits its limit but nothing left in the LRUs of the subtree, i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC within that cgroup will result in an 'ENOMEM'. Unreclaimable Guest VM EPC Pages The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats those as unreclaimable and returns ENOMEM when its limit is hit and nothing reclaimable left within the cgroup. The virtual EPC driver trans
[PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums
Replace boolean parameters for 'reclaim' in the function sgx_alloc_epc_page() and its callers with an enum. Also opportunistically remove non-static declaration of __sgx_alloc_epc_page() and a typo Signed-off-by: Haitao Huang Suggested-by: Jarkko Sakkinen Suggested-by: Dave Hansen Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- arch/x86/kernel/cpu/sgx/encl.c | 12 ++-- arch/x86/kernel/cpu/sgx/encl.h | 4 ++-- arch/x86/kernel/cpu/sgx/ioctl.c | 10 +- arch/x86/kernel/cpu/sgx/main.c | 14 +++--- arch/x86/kernel/cpu/sgx/sgx.h | 13 +++-- arch/x86/kernel/cpu/sgx/virt.c | 2 +- 6 files changed, 32 insertions(+), 23 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..f474179b6f77 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page, struct sgx_epc_page *epc_page; int ret; - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) return epc_page; @@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma, goto err_out_unlock; } - epc_page = sgx_alloc_epc_page(encl_page, false); + epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM); if (IS_ERR(epc_page)) { if (PTR_ERR(epc_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; goto err_out_unlock; } - va_page = sgx_encl_grow(encl, false); + va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM); if (IS_ERR(va_page)) { if (PTR_ERR(va_page) == -EBUSY) vmret = VM_FAULT_NOPAGE; @@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) /** * sgx_alloc_va_page() - Allocate a Version Array (VA) page - * @reclaim: Reclaim EPC pages directly if none available. Enclave - * mutex should not be held if this is set. + * @reclaim: Whether reclaim EPC pages directly if none available. Enclave + * mutex should not be held for SGX_DO_RECLAIM. * * Allocate a free EPC page and convert it to a Version Array (VA) page. 
* @@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr) * a VA page, * -errno otherwise */ -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim) +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim) { struct sgx_epc_page *epc_page; int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index f94ff14c9486..fe15ade02ca1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, unsigned long offset, u64 secinfo_flags); void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr); -struct sgx_epc_page *sgx_alloc_va_page(bool reclaim); +struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim); unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page); void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset); bool sgx_va_page_full(struct sgx_va_page *va_page); void sgx_encl_free_epc_page(struct sgx_epc_page *page); struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, unsigned long addr); -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim); +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim); void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page); #endif /* _X86_ENCL_H */ diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c index b65ab214bdf5..793a0ba2cb16 100644 --- a/arch/x86/kernel/cpu/sgx/ioctl.c +++ b/arch/x86/kernel/cpu/sgx/ioctl.c @@ -17,7 +17,7 @@ #include "encl.h" #include "encls.h" -struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim) +struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim reclaim) { struct sgx_va_page *va_page = NULL; void *err; @@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) struct file *backing; long ret; - va_page = sgx_encl_grow(encl, true); + va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM); if (IS_ERR(va_page)) return PTR_ERR(va_page); else if (va_page) @@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs) encl->backing = backing; - secs_epc = sgx_alloc_epc_page(&encl->secs, true); + secs_epc = sgx_alloc_ep
[PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events
From: Kristen Carlson Accardi The misc cgroup controller (subsystem) currently does not perform resource type specific action for Cgroups Subsystem State (CSS) events: the 'css_alloc' event when a cgroup is created and the 'css_free' event when a cgroup is destroyed. Define callbacks for those events and allow resource providers to register the callbacks per resource type as needed. This will be utilized later by the EPC misc cgroup support implemented in the SGX driver. The SGX callbacks will need to access the 'struct misc_cg'. Pass 'struct misc_cg' to the callbacks but not the 'struct misc_res' because the latter doesn't have a pointer pointing back to 'struct misc_cg'. Link: https://lore.kernel.org/lkml/op.2kdw36otwjv...@hhuan26-mobl.amr.corp.intel.com/ Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Tested-by: Jarkko Sakkinen --- V16: - Don't call the ops if capacity is zero. (Kai) - Simplify the last paragraph of the commit message. (Kai) V15: - Style fixes (Jarkko, Kai) - _misc_cg_res_free() takes the last index. Only call ->free() for those res types with ->alloc() successfully called. (Ridong) V12: - Add comments in commit to clarify reason to pass in misc_cg, not misc_res. (Kai) - Remove unlikely (Kai) V8: - Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko) V7: - Make ops one per resource type and store them in array (Michal) - Rename the ops struct to misc_res_ops, and enforce the constraints of required callback functions (Jarkko) - Moved addition of priv field to patch 4 where it was used first. (Jarkko) V6: - Create ops struct for per resource callbacks (Jarkko) - Drop max_write callback (Dave, Michal) - Style fixes (Kai) --- include/linux/misc_cgroup.h | 11 + kernel/cgroup/misc.c| 88 ++--- 2 files changed, 93 insertions(+), 6 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 49eef10c8e59..e5159770a68e 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -28,6 +28,16 @@ struct misc_cg; #include +/** + * struct misc_res_ops: per resource type callback ops. + * @alloc: invoked for resource specific initialization when cgroup is allocated. + * @free: invoked for resource specific cleanup when cgroup is deallocated. + */ +struct misc_res_ops { + int (*alloc)(struct misc_cg *cg); + void (*free)(struct misc_cg *cg); +}; + /** * struct misc_res: Per cgroup per misc type resource * @max: Maximum limit on the resource. @@ -62,6 +72,7 @@ struct misc_cg { u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount); void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount); diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 0e26068995a6..f3ce896d2c21 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -39,6 +39,9 @@ static struct misc_cg root_cg; */ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; +/* Resource type specific operations */ +static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; + /** * parent_misc() - Get the parent of the passed misc cgroup. * @cgroup: cgroup whose parent needs to be fetched. 
@@ -105,6 +108,41 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity) } EXPORT_SYMBOL_GPL(misc_cg_set_capacity); +/** + * misc_cg_set_ops() - register resource specific operations. + * @type: Type of the misc res. + * @ops: Operations for the given type. + * + * The callbacks in @ops will not be invoked if the capacity of @type is 0. + * + * Context: Any context. + * Return: + * * %0 - Successfully registered the operations. + * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks. + */ +int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops) +{ + if (!valid_type(type)) + return -EINVAL; + + if (!ops) + return -EINVAL; + + if (!ops->alloc) { + pr_err("%s: alloc missing\n", __func__); + return -EINVAL; + } + + if (!ops->free) { + pr_err("%s: free missing\n", __func__); + return -EINVAL; + } + + misc_res_ops[type] = ops; + return 0; +} +EXPORT_SYMBOL_GPL(misc_cg_set_ops); + /** * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup. * @type: Misc res type in misc cg to cancel the charge from. @@ -439,6 +477,36 @@ static struct cftype misc_cg_files[] = { {} }; +static inline void _misc_cg_res_free(struct misc_cg *cg, enum misc_res_type last) +{ + enum misc_res_type i; + + for (i = 0; i <= last; i++) +
[PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver
From: Kristen Carlson Accardi The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches its or ancestor's limit. This requires a walk from the current cgroup up to the root similar to misc_cg_try_charge(). Export misc_cg_parent() to enable this walk. The SGX driver also needs start a global level reclamation from the root. Export misc_cg_root() for the SGX driver to access. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V6: - Make commit messages more concise and split the original patch into two(Kai) --- include/linux/misc_cgroup.h | 24 kernel/cgroup/misc.c| 21 - 2 files changed, 32 insertions(+), 13 deletions(-) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index e5159770a68e..75c711333ad4 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -70,6 +70,7 @@ struct misc_cg { struct misc_res res[MISC_CG_RES_TYPES]; }; +struct misc_cg *misc_cg_root(void); u64 misc_cg_res_total_usage(enum misc_res_type type); int misc_cg_set_capacity(enum misc_res_type type, u64 capacity); int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops); @@ -90,6 +91,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css) return css ? container_of(css, struct misc_cg, css) : NULL; } +/** + * misc_cg_parent() - Get the parent of the passed misc cgroup. + * @cgroup: cgroup whose parent needs to be fetched. + * + * Context: Any context. + * Return: + * * struct misc_cg* - Parent of the @cgroup. + * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + */ +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup) +{ + return cgroup ? css_misc(cgroup->css.parent) : NULL; +} + /* * get_current_misc_cg() - Find and get the misc cgroup of the current task. * @@ -114,6 +129,15 @@ static inline void put_misc_cg(struct misc_cg *cg) } #else /* !CONFIG_CGROUP_MISC */ +static inline struct misc_cg *misc_cg_root(void) +{ + return NULL; +} + +static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg) +{ + return NULL; +} static inline u64 misc_cg_res_total_usage(enum misc_res_type type) { diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index f3ce896d2c21..6cf1f0899f4e 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES]; static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES]; /** - * parent_misc() - Get the parent of the passed misc cgroup. - * @cgroup: cgroup whose parent needs to be fetched. - * - * Context: Any context. - * Return: - * * struct misc_cg* - Parent of the @cgroup. - * * %NULL - If @cgroup is null or the passed cgroup does not have a parent. + * misc_cg_root() - Return the root misc cgroup. */ -static struct misc_cg *parent_misc(struct misc_cg *cgroup) +struct misc_cg *misc_cg_root(void) { - return cgroup ? css_misc(cgroup->css.parent) : NULL; + return &root_cg; } +EXPORT_SYMBOL_GPL(misc_cg_root); /** * valid_type() - Check if @type is valid or not. 
@@ -177,7 +172,7 @@ static void misc_cg_event(enum misc_res_type type, struct misc_cg *cg) atomic64_inc(&cg->res[type].events_local); cgroup_file_notify(&cg->events_local_file); - for (; parent_misc(cg); cg = parent_misc(cg)) { + for (; misc_cg_parent(cg); cg = misc_cg_parent(cg)) { atomic64_inc(&cg->res[type].events); cgroup_file_notify(&cg->events_file); } @@ -212,7 +207,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!amount) return 0; - for (i = cg; i; i = parent_misc(i)) { + for (i = cg; i; i = misc_cg_parent(i)) { res = &i->res[type]; new_usage = atomic64_add_return(amount, &res->usage); @@ -228,7 +223,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount) err_charge: misc_cg_event(type, i); - for (j = cg; j != i; j = parent_misc(j)) + for (j = cg; j != i; j = misc_cg_parent(j)) misc_cg_cancel_charge(type, j, amount); misc_cg_cancel_charge(type, i, amount); return ret; @@ -250,7 +245,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount) if (!(amount && valid_type(type) && cg)) return; - for (i = cg; i; i = parent_misc(i)) + for (i = cg; i; i = misc_cg_parent(i)) misc_cg_cancel_charge(type, i, amount); } EXPORT_SYMBOL_GPL(misc_cg_uncharge); -- 2.43.0
[PATCH v16 04/16] cgroup/misc: Add SGX EPC resource type
From: Kristen Carlson Accardi Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type for the misc controller. Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V12: - Remove CONFIG_CGROUP_SGX_EPC (Jarkko) V6: - Split the original patch into this and the preceding one (Kai) --- include/linux/misc_cgroup.h | 4 kernel/cgroup/misc.c| 4 2 files changed, 8 insertions(+) diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h index 75c711333ad4..b4119869b0d1 100644 --- a/include/linux/misc_cgroup.h +++ b/include/linux/misc_cgroup.h @@ -17,6 +17,10 @@ enum misc_res_type { MISC_CG_RES_SEV, /** @MISC_CG_RES_SEV_ES: AMD SEV-ES ASIDs resource */ MISC_CG_RES_SEV_ES, +#endif +#ifdef CONFIG_X86_SGX + /** @MISC_CG_RES_SGX_EPC: SGX EPC memory resource */ + MISC_CG_RES_SGX_EPC, #endif /** @MISC_CG_RES_TYPES: count of enum misc_res_type constants */ MISC_CG_RES_TYPES diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c index 6cf1f0899f4e..300af1ee20fc 100644 --- a/kernel/cgroup/misc.c +++ b/kernel/cgroup/misc.c @@ -24,6 +24,10 @@ static const char *const misc_res_name[] = { /* AMD SEV-ES ASIDs resource */ "sev_es", #endif +#ifdef CONFIG_X86_SGX + /* Intel SGX EPC memory bytes */ + "sgx_epc", +#endif }; /* Root misc cgroup */ -- 2.43.0
[PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality
From: Kristen Carlson Accardi SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used to limit or account for SGX EPC memory, which is a desirable feature in some environments. For instance, within a Kubernetes environment, while a user may specify a particular EPC quota for a pod, the orchestrator requires a mechanism to enforce that the pod's actual runtime EPC usage does not exceed the allocated quota. Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to limit and track EPC allocations per cgroup. Earlier patches have added the "sgx_epc" resource type in the misc cgroup subsystem. Add basic support in SGX driver as the "sgx_epc" resource provider: - Set "capacity" of EPC by calling misc_cg_set_capacity() - Update EPC usage counter, "current", by calling charge and uncharge APIs for EPC allocation and deallocation, respectively. - Setup sgx_epc resource type specific callbacks, which perform initialization and cleanup during cgroup allocation and deallocation, respectively. With these changes, the misc cgroup controller enables users to set a hard limit for EPC usage in the "misc.max" interface file. It reports current usage in "misc.current", the total EPC memory available in "misc.capacity", and the number of times EPC usage reached the max limit in "misc.events". For now, the EPC cgroup simply blocks additional EPC allocation in sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are still tracked in the global active list, only reclaimed by the global reclaimer when the total free page count is lower than a threshold. Later patches will reorganize the tracking and reclamation code in the global reclaimer and implement per-cgroup tracking and reclaiming. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Tejun Heo Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V16: - Proper handling for failures during init (Kai) - Register ops and capacity at the end when SGX is ready to handle callbacks. V15: - Declare __init for sgx_cgroup_init() (Jarkko) - Disable SGX when sgx_cgroup_init() fails (Jarkko) V13: - Remove unneeded includes. (Kai) V12: - Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko) V11: - Update copyright and format better (Kai) - Create wrappers to remove #ifdefs in c file. (Kai) - Remove unneeded comments (Kai) V10: - Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. (Jarkko) - Use enums instead of booleans for the parameters. (Dave, Jarkko) V8: - Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko) - Remove extra space, '_INTEL'. 
(Jarkko) V7: - Use a static for root cgroup (Kai) - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai) - Correct check for charge API return (Kai) - Start initialization in SGX device driver init (Kai) - Remove unneeded BUG_ON (Kai) - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai) V6: - Split the original large patch"Limit process EPC usage with misc cgroup controller" and restructure it (Kai) --- arch/x86/kernel/cpu/sgx/Makefile | 1 + arch/x86/kernel/cpu/sgx/epc_cgroup.c | 93 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 78 +++ arch/x86/kernel/cpu/sgx/main.c | 44 +++-- arch/x86/kernel/cpu/sgx/sgx.h| 24 +++ include/linux/misc_cgroup.h | 2 + 6 files changed, 238 insertions(+), 4 deletions(-) create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile index 9c1656779b2a..081cb424575e 100644 --- a/arch/x86/kernel/cpu/sgx/Makefile +++ b/arch/x86/kernel/cpu/sgx/Makefile @@ -4,3 +4,4 @@ obj-y += \ ioctl.o \ main.o obj-$(CONFIG_X86_SGX_KVM) += virt.o +obj-$(CONFIG_CGROUP_MISC) += epc_cgroup.o diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c new file mode 100644 index ..0e422fef02bb --- /dev/null +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2022-2024 Intel Corporation. */ + +#include +#include "epc_cgroup.h" + +/* The root SGX EPC cgroup */ +static struct sgx_cgroup sgx_cg_root; + +/** + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page + * + * @sgx_cg:The EPC cgroup to be charged for the page. + * Return: + * * %0 - If successfully charged. + * * -errno - for failures. + */ +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +{ + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, P
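To make the charge/uncharge flow described above concrete, the following is a condensed sketch, not the literal patch: sgx_get_current_cg(), sgx_put_cg(), sgx_cgroup_try_charge() and sgx_cgroup_uncharge() are names used by this series, while the control flow and the sgx_cg field assignment are simplified assumptions here.

/* Condensed sketch of EPC allocation charging the current task's cgroup. */
static struct sgx_epc_page *example_sgx_alloc_epc_page(void)
{
        struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
        struct sgx_epc_page *page;

        if (sgx_cgroup_try_charge(sgx_cg)) {
                sgx_put_cg(sgx_cg);
                return ERR_PTR(-ENOMEM);        /* usage would exceed misc.max */
        }

        page = __sgx_alloc_epc_page();
        if (IS_ERR(page)) {
                sgx_cgroup_uncharge(sgx_cg);    /* roll the charge back */
                sgx_put_cg(sgx_cg);
                return page;
        }

        page->sgx_cg = sgx_cg;  /* dropped and uncharged when the page is freed */
        return page;
}

The key invariant is the one stated in the commit message: every successful allocation is matched by exactly one charge and every free by exactly one uncharge, so misc.current always reflects the pages a cgroup currently holds.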
[PATCH v16 06/16] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
From: Sean Christopherson Introduce a data structure to wrap the existing reclaimable list and its spinlock. Each cgroup later will have one instance of this structure to track EPC pages allocated for processes associated with the same cgroup. Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages from the reclaimable list in this structure when its usage reaches near its limit. Use this structure to encapsulate the LRU list and its lock used by the global reclaimer. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V15: - Add comment for spinlock. (Jarkko, Kai) V6: - removed introduction to unreclaimables in commit message. V4: - Removed unneeded comments for the spinlock and the non-reclaimables. (Kai, Jarkko) - Revised the commit to add introduction comments for unreclaimables and multiple LRU lists.(Kai) - Reordered the patches: delay all changes for unreclaimables to later, and this one becomes the first change in the SGX subsystem. V3: - Removed the helper functions and revised commit messages. --- arch/x86/kernel/cpu/sgx/main.c | 39 +- arch/x86/kernel/cpu/sgx/sgx.h | 16 ++ 2 files changed, 36 insertions(+), 19 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 0fda964c0a7c..45832c9a1e1b 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space); /* * These variables are part of the state of the reclaimer, and must be accessed - * with sgx_reclaimer_lock acquired. + * with sgx_global_lru.lock acquired. 
*/ -static LIST_HEAD(sgx_active_page_list); -static DEFINE_SPINLOCK(sgx_reclaimer_lock); +static struct sgx_epc_lru_list sgx_global_lru; static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); for (i = 0; i < SGX_NR_TO_SCAN; i++) { - if (list_empty(&sgx_active_page_list)) + epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, + struct sgx_epc_page, list); + if (!epc_page) break; - epc_page = list_first_entry(&sgx_active_page_list, - struct sgx_epc_page, list); list_del_init(&epc_page->list); encl_page = epc_page->owner; @@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_reclaimer_lock); + spin_unlock(&sgx_global_lru.lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void) continue; skip: - spin_lock(&sgx_reclaimer_lock); - list_add_tail(&epc_page->list, &sgx_active_page_list); - spin_unlock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); + list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); + spin_unlock(&sgx_global_lru.lock); kref_put(&encl_page->encl->refcount, sgx_encl_release); @@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void) static bool sgx_should_reclaim(unsigned long watermark) { return atomic_long_read(&sgx_nr_free_pages) < watermark && - !list_empty(&sgx_active_page_list); + !list_empty(&sgx_global_lru.reclaimable); } /* @@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void) ksgxd_tsk = tsk; + sgx_lru_init(&sgx_global_lru); + return true; } @@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void) */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(&page->list, &sgx_active_page_list); - spin_unlock(&sgx_reclaimer_lock); + list_add_tail(&page->list, &sgx_global_lru.reclaimable); + spin_unlock(&sgx_global_lru.lock); } /** @@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_reclaimer_lock); + spin_lock(&sgx_global_lru.lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. */ if (list_empty(&
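For reference, based on how the new type is used in the hunks above (sgx_global_lru.lock, sgx_global_lru.reclaimable and the sgx_lru_init() call), the encapsulating structure presumably looks roughly like the sketch below; the real definition lives in the sgx.h part of the patch, which is not shown here.

/* Sketch inferred from usage in this patch, not the literal definition. */
struct sgx_epc_lru_list {
        /* Protects concurrent access to the list below. */
        spinlock_t lock;
        struct list_head reclaimable;
};

static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
{
        spin_lock_init(&lru->lock);
        INIT_LIST_HEAD(&lru->reclaimable);
}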
[PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU
To support the per-cgroup reclamation, each cgroup will have its own "per-cgroup LRU" and EPC pages will be in its owner cgroup's LRU instead of the global LRU. Abstract the code that is directly working with the global LRU into functions reusable with per-cgroup LRUs. Currently the basic reclamation procedure, sgx_reclaim_pages() directly reclaims pages from the global LRU. Change it to take in an LRU. Note the global EPC reclamation will still be needed when the total EPC usage reaches the system capacity while usages of some cgroups are below their respective limits. Create a separate wrapper for the global reclamation, sgx_reclaim_pages_global(), passing in the global LRU to the new sgx_reclaim_pages() now. Later it will be revised to reclaim from multiple LRUs from all EPC cgroups instead of a single global LRU. Wrap the existing emptiness check of the global LRU with a helper so that it can be changed later to work with multiple LRUs when per-cgroup LRU comes to play. Also the per-cgroup EPC reclaim and global EPC reclaim will have different check on whether they should be done. Rename the existing sgx_should_reclaim() to sgx_should_reclaim_global() to separate the two cases. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Signed-off-by: Haitao Huang --- v16: - Regroup all abstraction related to global LRU usage to this patch from different patches in previous version. Position this before adding per-cgroup reclaim. (Kai) V13: - Rename sgx_can_reclaim() to sgx_can_reclaim_global() and sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai) V10: - Add comments for the new function. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 63 ++ 1 file changed, 41 insertions(+), 22 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 5d5f6baac9c8..47dfba6f45af 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -37,6 +37,21 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc return &sgx_global_lru; } +/* + * Check if there is any reclaimable page at global level. + */ +static inline bool sgx_can_reclaim_global(void) +{ + /* +* Now all EPC pages are still tracked in the @sgx_global_lru, so only +* check @sgx_global_lru. +* +* When EPC pages are tracked in the actual per-cgroup LRUs, +* replace with sgx_cgroup_lru_empty(misc_cg_root()). +*/ + return !list_empty(&sgx_global_lru.reclaimable); +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -287,10 +302,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, } /* - * Take a fixed number of pages from the head of the active page pool and - * reclaim them to the enclave's private shmem files. Skip the pages, which have - * been accessed since the last scan. Move those pages to the tail of active - * page pool so that the pages get scanned in LRU like fashion. + * Take a fixed number of pages from the head of a given LRU and reclaim them to + * the enclave's private shmem files. Skip the pages, which have been accessed + * since the last scan. Move those pages to the tail of the list so that the + * pages get scanned in LRU like fashion. * * Batch process a chunk of pages (at the moment 16) in order to degrade amount * of IPI's and ETRACK's potentially required. 
sgx_encl_ewb() does degrade a bit @@ -299,7 +314,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page, * problematic as it would increase the lock contention too much, which would * halt forward progress. */ -static void sgx_reclaim_pages(void) +static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru) { struct sgx_epc_page *chunk[SGX_NR_TO_SCAN]; struct sgx_backing backing[SGX_NR_TO_SCAN]; @@ -310,10 +325,9 @@ static void sgx_reclaim_pages(void) int ret; int i; - spin_lock(&sgx_global_lru.lock); + spin_lock(&lru->lock); for (i = 0; i < SGX_NR_TO_SCAN; i++) { - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, - struct sgx_epc_page, list); + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list); if (!epc_page) break; @@ -328,7 +342,7 @@ static void sgx_reclaim_pages(void) */ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); for (i = 0; i < cnt; i++) { epc_page = chunk[i]; @@ -351,9 +365,9 @@ static void sgx_reclaim_pages(void)
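Putting the renames and wrappers of this patch together, the global helpers named in the log presumably reduce to something close to the sketch below; sgx_nr_free_pages and sgx_can_reclaim_global() already appear in the hunks, but the exact bodies are an assumption.

/* Sketch of the wrappers named in the log; exact bodies may differ. */
static void sgx_reclaim_pages_global(void)
{
        sgx_reclaim_pages(&sgx_global_lru);
}

static bool sgx_should_reclaim_global(unsigned long watermark)
{
        return atomic_long_read(&sgx_nr_free_pages) < watermark &&
               sgx_can_reclaim_global();
}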
[PATCH v16 07/16] x86/sgx: Abstract tracking reclaimable pages in LRU
From: Kristen Carlson Accardi The SGX driver tracks reclaimable EPC pages by adding a newly allocated page into the global LRU list in sgx_mark_page_reclaimable(), and doing the opposite in sgx_unmark_page_reclaimable(). To support SGX EPC cgroup, the SGX driver will need to maintain an LRU list for each cgroup, and each newly allocated EPC page will need to be added to the LRU of associated cgroup, not always the global LRU list. When sgx_mark_page_reclaimable() is called, the cgroup that the newly allocated EPC page belongs to is already known, i.e., it has been set to the 'struct sgx_epc_page'. Add a helper, sgx_epc_page_lru(), to return the LRU that the EPC page should be added to for the given EPC page. Currently it just returns the global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the helper function to get the LRU from the EPC page instead of referring to the global LRU directly. This allows EPC page being able to be tracked in "per-cgroup" LRU when that becomes ready. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Reviewed-by: Kai Huang Tested-by: Jarkko Sakkinen --- V13: - Revise commit log (Kai) - Rename sgx_lru_list() to sgx_epc_page_lru() V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/main.c | 30 ++ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 45832c9a1e1b..5d5f6baac9c8 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space); */ static struct sgx_epc_lru_list sgx_global_lru; +static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page) +{ + return &sgx_global_lru; +} + static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); /* Nodes with one or more EPC sections. */ @@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void) } /** - * sgx_mark_page_reclaimable() - Mark a page as reclaimable + * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in a LRU. * @page: EPC page - * - * Mark a page as reclaimable and add it to the active page list. Pages - * are automatically removed from the active list when freed. */ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page); + + spin_lock(&lru->lock); page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED; - list_add_tail(&page->list, &sgx_global_lru.reclaimable); - spin_unlock(&sgx_global_lru.lock); + list_add_tail(&page->list, &lru->reclaimable); + spin_unlock(&lru->lock); } /** - * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list + * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU * @page: EPC page * - * Clear the reclaimable flag and remove the page from the active page list. + * Clear the reclaimable flag if set and remove the page from its LRU. * * Return: * 0 on success, @@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page) */ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page) { - spin_lock(&sgx_global_lru.lock); + struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page); + + spin_lock(&lru->lock); if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) { /* The page is being reclaimed. 
*/ if (list_empty(&page->list)) { - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); return -EBUSY; } list_del(&page->list); page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; } - spin_unlock(&sgx_global_lru.lock); + spin_unlock(&lru->lock); return 0; } -- 2.43.0
[PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup
Currently in the EPC page allocation, the kernel simply fails the allocation when the current EPC cgroup fails to charge due to its usage reaching limit. This is not ideal. When that happens, a better way is to reclaim EPC page(s) from the current EPC cgroup to reduce its usage so the new allocation can succeed. Currently, all EPC pages are tracked in a single global LRU, and the "global EPC reclamation" supports the following 3 cases: 1) On-demand asynchronous reclamation: For allocation requests that can not wait for reclamation but can be retried, an asynchronous reclamation is triggered, in which the global reclaimer, ksgxd, keeps reclaiming EPC pages until the free page count is above a minimal threshold. 2) On-demand synchronous reclamation: For allocations that can wait for reclamation, the EPC page allocator, sgx_alloc_epc_page() reclaims EPC page(s) immediately until at least one free page is available for allocation. 3) Preemptive reclamation: For some allocation requests, e.g., allocation for reloading a reclaimed page to change its permissions or page type, the kernel invokes sgx_reclaim_direct() to preemptively reclaim EPC page(s) as a best effort to minimize on-demand reclamation for subsequent allocations. Similarly, a "per-cgroup reclamation" is needed to support the above 3 cases as well: 1) For on-demand asynchronous reclamation, a per-cgroup reclamation needs to be invoked to maintain a minimal difference between the usage and the limit for each cgroup, analogous to the minimal free page threshold maintained by the global reclaimer. 2) For on-demand synchronous reclamation, sgx_cgroup_try_charge() needs to invoke the per-cgroup reclamation until the cgroup usage become at least one page lower than its limit. 3) For preemptive reclamation, sgx_reclaim_direct() needs to invoke the per-cgroup reclamation to minimize per-cgroup on-demand reclamation for subsequent allocations. To support the per-cgroup reclamation, introduce a "per-cgroup LRU" to track all EPC pages belong to the owner cgroup to utilize the existing sgx_reclaim_pages(). Currently, the global reclamation treats all EPC pages equally as it scans all EPC pages in FIFO order in the global LRU. The "per-cgroup reclamation" needs to somehow achieve the same fairness of all EPC pages that are tracked in the multiple LRUs of the given cgroup and all the descendants to reflect the nature of the cgroup. The idea is to achieve such fairness by scanning "all EPC cgroups" of the subtree (the given cgroup and all the descendants) equally in turns, and in the scan to each cgroup, apply the existing sgx_reclaim_pages() to its LRU. This basic flow is encapsulated in a new function, sgx_cgroup_reclaim_pages(). Export sgx_reclaim_pages() for use in sgx_cgroup_reclaim_pages(). And modify sgx_reclaim_pages() to return the number of pages scanned so sgx_cgroup_reclaim_pages() can track scanning progress and determine whether enough scanning is done or to continue the scanning for next descendant. Whenever reclaiming in a subtree of a given root is needed, start the scanning from the next descendant where scanning was stopped at last time. To keep track of the next descendant cgroup to scan, add a new field, next_cg, in the sgx_cgroup struct. Create an iterator function, sgx_cgroup_next_get(), atomically returns a valid reference of the descendant for next round of scanning and advances @next_cg to next valid descendant in a preorder walk. 
This iterator function is used in sgx_cgroup_reclaim_pages() to iterate descendants for scanning. Separately also advances @next_cg to next valid descendant when the cgroup referenced by @next_cg is to be freed. Add support for on-demand synchronous reclamation in sgx_cgroup_try_charge(), applying sgx_cgroup_reclaim_pages() iteratively until cgroup usage is lower than its limit. Later patches will reuse sgx_cgroup_reclaim_pages() to add support for asynchronous and preemptive reclamation. Note all reclaimable EPC pages are still tracked in the global LRU thus no per-cgroup reclamation is actually active at the moment: -ENOMEM is returned by __sgx_cgroup_try_charge() when LRUs are empty. Per-cgroup tracking and reclamation will be turned on in the end after all necessary infrastructure is in place. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Signed-off-by: Haitao Huang --- V16: - Revise commit log to define reclamation requirement and the design more clearly. (Kai) - Revise sgx_cgroup_reclaim_pages() to scan cgroups in subtree more fairly, track next_cg in each sgx_cgroup and add helpers that is used to iterate descendant in sgx_cgroup_reclaim_pages().(Kai) V14: - Allow sgx_cgroup_reclaim_pages() to continue from previous tree-walk. It takes in a 'start' node and returns the 'next' node for the caller to use as the new '
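Conceptually, the on-demand synchronous path described above boils down to a charge-or-reclaim retry loop like the sketch below. This is an outline only: the helper names come from this series, but the exact signatures (for example the charge_mm argument introduced by a later patch) and the error handling differ in the real code, and the use of current->mm here is purely for illustration.

/* Conceptual outline of charging with on-demand per-cgroup reclamation. */
static int example_try_charge_with_reclaim(struct sgx_cgroup *sgx_cg)
{
        for (;;) {
                if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE))
                        return 0;       /* charged: usage is below the limit */

                if (sgx_cgroup_lru_empty(sgx_cg->cg))
                        return -ENOMEM; /* nothing reclaimable in this subtree */

                /* Scan a batch from this cgroup and its descendants, then retry. */
                sgx_cgroup_reclaim_pages(sgx_cg, current->mm, SGX_NR_TO_SCAN);
        }
}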
[PATCH v16 10/16] x86/sgx: Implement async reclamation for cgroup
From: Kristen Carlson Accardi In cases EPC pages need be allocated during a page fault and the cgroup usage is near its limit, an asynchronous reclamation needs to be triggered to avoid blocking the page fault handling. To keep implementation simple, use a workqueue instead of kthread to schedule the asynchronous reclamation work. Add corresponding work item and function definitions for EPC cgroup. In sgx_cgroup_try_charge(), if caller does not allow synchronous reclamation, queue an asynchronous work into the workqueue. The current global reclaimer, ksgxd, maintains a threshold for the minimal free EPC pages to avoid thrashing when allocating EPC pages. Similarly, only reclaiming EPC pages from the current cgroup when its usage actually reaches limit could also cause thrashing. To avoid that, define a similar "per-cgroup usage threshold" and actively trigger asynchronous per-cgroup EPC reclamation when the usage reaches the threshold after try_charge() is successful. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Tested-by: Jarkko Sakkinen --- V16: - Destroy workqueue in sgx_cgroup_deinit() - Reuse the new sgx_cgroup_reclaim_pages() to scan at least SGX_NR_TO_SCAN pages for each round async reclaim. - Revise commit message. (Kai) V15: - Disable SGX when sgx_cgroup_init() fails instead of BUG_ON() (Jarkko) - Reset capacity to zero when sgx_cgroup_init() fails. (Kai) V13: - Revert to BUG_ON() in case of workq allocation failure in init and only alloc if misc is enabled. V11: - Print error instead of WARN (Kai) - Add check for need to queue an async reclamation before returning from try_charge(), do so if needed. This is to be consistent with global reclaimer to minimize thrashing during allocation time. V10: - Split asynchronous flow in separate patch. (Kai) - Consider cgroup disabled when the workqueue allocation fail during init. (Kai) - Abstract out sgx_cgroup_should_reclaim(). V9: - Add comments for static variables. (Jarkko) V8: - Remove alignment for substructure variables. (Jarkko) V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 115 ++- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 + arch/x86/kernel/cpu/sgx/main.c | 5 +- 3 files changed, 122 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index ce28efd20a15..c40f601d06d9 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -4,9 +4,37 @@ #include #include "epc_cgroup.h" +/* + * The minimal free pages, or the minimal margin between limit and usage + * maintained by per-cgroup reclaimer. + * + * Set this to the low threshold used by the global reclaimer, ksgxd. + */ +#define SGX_CG_MIN_FREE_PAGE (SGX_NR_LOW_PAGES) + +/* + * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal + * free pages would barely leave any page for use, causing excessive reclamation + * and thrashing. + * + * Define the following limit, below which cgroup does not maintain the minimal + * free page threshold. Set this to quadruple of the minimal so at least 75% + * pages used without being reclaimed. + */ +#define SGX_CG_LOW_LIMIT (SGX_CG_MIN_FREE_PAGE * 4) + /* The root SGX EPC cgroup */ static struct sgx_cgroup sgx_cg_root; +/* + * The work queue that reclaims EPC pages in the background for cgroups. 
+ * + * A cgroup schedules a work item into this queue to reclaim pages within the + * same cgroup when its usage limit is reached and synchronous reclamation is not + * an option, i.e., in a page fault handler. + */ +static struct workqueue_struct *sgx_cg_wq; + /* * Return the next descendant in a preorder walk, given a root, @root and a * cgroup, @cg, to start the walk from. Return @root if no descendant left for @@ -100,6 +128,34 @@ static inline struct sgx_cgroup *sgx_cgroup_next_get(struct sgx_cgroup *root) return p; } +static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg) +{ + return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE; +} + +static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg) +{ + return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE; +} + +/* + * Get the lower bound of limits of a cgroup and its ancestors. Used in + * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is + * close to its limit or its ancestors' hence reclamation is needed. + */ +static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg) +{ + struct misc_cg *i = sgx_cg->cg; + u64 m = U64_MAX; + + while (i) { + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max)); + i = misc_cg_parent(i); + } + + return m / PAGE_SIZE; +
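The per-cgroup usage threshold described above can be pictured with the sketch below. The helpers sgx_cgroup_max_pages_to_root() and sgx_cgroup_page_counter_read() and the two constants are taken from the hunks above; the exact body of the check is an assumption.

/* Sketch of the "should this cgroup reclaim in the background?" check. */
static bool example_sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
{
        u64 cur, max;

        max = sgx_cgroup_max_pages_to_root(sgx_cg);

        /* Keep a small free margin, unless the limit itself is already tiny. */
        if (max > SGX_CG_LOW_LIMIT)
                max -= SGX_CG_MIN_FREE_PAGE;

        cur = sgx_cgroup_page_counter_read(sgx_cg);

        return cur >= max;
}

When such a check fires, either after a successful try_charge() close to the limit or when a page fault cannot reclaim synchronously, the cgroup's work item is queued on sgx_cg_wq so the actual reclamation happens outside the fault path.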
[PATCH v16 11/16] x86/sgx: Charge mem_cgroup for per-cgroup reclamation
Enclave Page Cache(EPC) memory can be swapped out to regular system memory, and the consumed memory should be charged to a proper mem_cgroup. Currently the selection of mem_cgroup to charge is done in sgx_encl_get_mem_cgroup(). But it considers all contexts other than the ksgxd thread are user processes. With the new EPC cgroup implementation, the swapping can also happen in EPC cgroup work-queue threads. In those cases, it improperly selects the root mem_cgroup to charge for the RAM usage. Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take an additional argument to explicitly specify the mm struct to charge for allocations. Callers from background kthreads not associated with a charging mm struct would set it to NULL, while callers in user process contexts set it to current->mm. Internally, it handles the case when the charging mm given is NULL, by searching for an mm struct from enclave's mm_list. Signed-off-by: Haitao Huang Reported-by: Mikko Ylinen Reviewed-by: Kai Huang Reviewed-by: Jarkko Sakkinen Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V10: - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko) V9: - Reduce number of if statements. (Tim) V8: - Limit text paragraphs to 80 characters wide. (Jarkko) --- arch/x86/kernel/cpu/sgx/encl.c | 29 ++-- arch/x86/kernel/cpu/sgx/encl.h | 3 +-- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 9 ++--- arch/x86/kernel/cpu/sgx/main.c | 29 +--- arch/x86/kernel/cpu/sgx/sgx.h| 2 +- 5 files changed, 36 insertions(+), 36 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index f474179b6f77..7b77dad41daf 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde } /* - * When called from ksgxd, returns the mem_cgroup of a struct mm stored - * in the enclave's mm_list. When not called from ksgxd, just returns - * the mem_cgroup of the current task. + * Find the mem_cgroup to charge for memory allocated on behalf of an enclave. + * + * Used in sgx_encl_alloc_backing() for backing store allocation. + * + * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup + * of a struct mm stored in the enclave's mm_list. */ -static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) +static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl, + struct mm_struct *charge_mm) { struct mem_cgroup *memcg = NULL; struct sgx_encl_mm *encl_mm; int idx; - /* -* If called from normal task context, return the mem_cgroup -* of the current task's mm. The remainder of the handling is for -* ksgxd. -*/ - if (!current_is_ksgxd()) - return get_mem_cgroup_from_mm(current->mm); +/* Use the charge_mm if given. */ + if (charge_mm) + return get_mem_cgroup_from_mm(charge_mm); /* * Search the enclave's mm_list to find an mm associated with @@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * @encl: an enclave pointer * @page_index:enclave page index * @backing: data for accessing backing storage for the page + * @charge_mm: the mm to charge for the allocation * - * When called from ksgxd, sets the active memcg from one of the + * When charge_mm is NULL, sets the active memcg from one of the * mms in the enclave's mm_list prior to any backing page allocation, * in order to ensure that shmem page allocations are charged to the * enclave. 
Create a backing page for loading data back into an EPC page with @@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl) * -errno otherwise. */ int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index, - struct sgx_backing *backing) + struct sgx_backing *backing, struct mm_struct *charge_mm) { - struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl); + struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm); struct mem_cgroup *memcg = set_active_memcg(encl_memcg); int ret; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index fe15ade02ca1..5ce9d108290f 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr, int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, unsigned long end, unsigned long vm_flags); -bool current_is_ksgxd(void); void sgx_encl_release(struct kref *ref); int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_
[PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups
With EPC cgroups, the global reclamation function, sgx_reclaim_pages_global(), can no longer apply to the global LRU as pages are now in per-cgroup LRUs. Create a wrapper, sgx_cgroup_reclaim_global() to invoke sgx_cgroup_reclaim_pages() passing in the root cgroup. Call this wrapper from sgx_reclaim_pages_global() when cgroup is enabled. The wrapper will scan and attempt to reclaim SGX_NR_TO_SCAN pages just like the current global reclaim. Note this simple implementation doesn't _exactly_ mimic the current global EPC reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN pages): in rare cases when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs with each being smaller than SGX_NR_TO_SCAN pages. A more precise way to mimic the current global EPC reclaim would be to have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one batch. But this is unnecessarily complicated at this stage to address such rare cases. Signed-off-by: Haitao Huang --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 12 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 3 +++ arch/x86/kernel/cpu/sgx/main.c | 7 +++ 3 files changed, 22 insertions(+) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index ae65cac858f8..23a61689e0d9 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -240,6 +240,18 @@ static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg) return (cur >= max); } +/** + * sgx_cgroup_reclaim_pages_global() - Perform one round of global reclamation. + * + * @charge_mm: The mm to be charged for the backing store of reclaimed pages. + * + * Try to scan and attempt reclamation from root cgroup for %SGX_NR_TO_SCAN pages. + */ +void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) +{ + sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN); +} + /* * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is * at/near its maximum capacity. diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index 37364bdb797f..c0390111e28c 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -36,6 +36,8 @@ static inline void __init sgx_cgroup_deinit(void) { } static inline void __init sgx_cgroup_register(void) { } +static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { } + #else /* CONFIG_CGROUP_MISC */ struct sgx_cgroup { @@ -87,6 +89,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim); void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg); +void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm); int __init sgx_cgroup_init(void); void __init sgx_cgroup_register(void); void __init sgx_cgroup_deinit(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index fa00ed5e884d..d00cb012838b 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -411,6 +411,13 @@ static bool sgx_should_reclaim_global(unsigned long watermark) static void sgx_reclaim_pages_global(struct mm_struct *charge_mm) { + /* +* Now all EPC pages are still tracked in the @sgx_global_lru. +* Still reclaim from it. 
+* +* When EPC pages are tracked in the actual per-cgroup LRUs, +* sgx_cgroup_reclaim_pages_global() will be called. +*/ sgx_reclaim_pages(&sgx_global_lru, charge_mm); } -- 2.43.0
[PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups
sgx_reclaim_direct() was introduced to preemptively reclaim some pages as the best effort to avoid on-demand reclamation that can stall forward progress in some situations, e.g., allocating pages to load previously reclaimed page to perform EDMM operations on [1]. Currently when the global usage is close to the capacity, sgx_reclaim_direct() makes one invocation to sgx_reclaim_pages_global() but does not guarantee there are free pages available for later allocations to succeed. In other words, the only goal here is to reduce the chance of on-demand reclamation at allocation time. In cases of allocation failure, the caller, the EDMM ioctl()'s, would return -EAGAIN to user space and let the user space to decide whether to retry or not. With EPC cgroups enabled, usage of a cgroup can also reach its limit (usually much lower than capacity) and trigger per-cgroup reclamation. Implement a similar strategy to reduce the chance of on-demand per-cgroup reclamation for this use case. Create a wrapper, sgx_cgroup_reclaim_direct(), to perform a preemptive reclamation at cgroup level, and have sgx_reclaim_direct() call it when EPC cgroup is enabled. [1] https://lore.kernel.org/all/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.cha...@intel.com/T/#u Signed-off-by: Haitao Huang --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 15 +++ arch/x86/kernel/cpu/sgx/epc_cgroup.h | 3 +++ arch/x86/kernel/cpu/sgx/main.c | 4 3 files changed, 22 insertions(+) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index 23a61689e0d9..b7d60b2d878d 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -252,6 +252,21 @@ void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN); } +/** + * sgx_cgroup_reclaim_direct() - Preemptive reclamation. + * + * Scan and attempt to reclaim %SGX_NR_TO_SCAN as best effort to allow caller + * make quick progress. + */ +void sgx_cgroup_reclaim_direct(void) +{ + struct sgx_cgroup *sgx_cg = sgx_get_current_cg(); + + if (sgx_cgroup_should_reclaim(sgx_cg)) + sgx_cgroup_reclaim_pages(sgx_cg, current->mm, SGX_NR_TO_SCAN); + sgx_put_cg(sgx_cg); +} + /* * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is * at/near its maximum capacity. 
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index c0390111e28c..cf2b946d993e 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -38,6 +38,8 @@ static inline void __init sgx_cgroup_register(void) { } static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { } +static inline void sgx_cgroup_reclaim_direct(void) { } + #else /* CONFIG_CGROUP_MISC */ struct sgx_cgroup { @@ -90,6 +92,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim); void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg); void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm); +void sgx_cgroup_reclaim_direct(void); int __init sgx_cgroup_init(void); void __init sgx_cgroup_register(void); void __init sgx_cgroup_deinit(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index d00cb012838b..9a8f91ebd21b 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -428,6 +428,10 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm) */ void sgx_reclaim_direct(void) { + /* Reduce chance of per-cgroup reclamation for later allocation */ + sgx_cgroup_reclaim_direct(); + + /* Reduce chance of the global reclamation for later allocation */ if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES)) sgx_reclaim_pages_global(current->mm); } -- 2.43.0
[PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation
From: Kristen Carlson Accardi Previous patches have implemented all infrastructure needed for per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC pages are still tracked in the global LRU as sgx_epc_page_lru() always returns reference to the global LRU. Change sgx_epc_page_lru() to return the LRU of the cgroup in which the given EPC page is allocated. Update sgx_can_reclaim_global(), to check emptiness of LRUs of all cgroups, and update sgx_reclaim_pages_global(), to utilize sgx_cgroup_reclaim_pages_global(), when EPC cgroup is enabled. With these changes, the global reclamation and per-cgroup reclamation both work properly with all pages tracked in per-cgroup LRUs. Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen --- V16: - Separated out the global and direct reclamation to earlier patch.(Kai) V14: - Update global reclamation to use the new sgx_cgroup_reclaim_pages() to iterate cgroups at lower level if the top cgroups are too busy. V13: - Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai) V12: - Remove CONFIG_CGROUP_SGX_EPC, conditional compile SGX Cgroup for CONFIGCONFIG_CGROUPMISC. (Jarkko) V11: - Reword the comments for global reclamation for allocation failure after passing cgroup charging. (Kai) - Add stub functions to remove ifdefs in c file (Kai) - Add more detailed comments to clarify each page belongs to one cgroup, or the root. (Kai) V10: - Add comment to clarify each page belongs to one cgroup, or the root by default. (Kai) - Merge the changes that expose sgx_cgroup_* functions to this patch. - Add changes for sgx_reclaim_direct() that was missed previously. V7: - Split this out from the big patch, #10 in V6. (Dave, Kai) --- arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +- arch/x86/kernel/cpu/sgx/epc_cgroup.h | 6 arch/x86/kernel/cpu/sgx/main.c | 45 ++-- 3 files changed, 36 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c index b7d60b2d878d..c3f0c7bc13c6 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c @@ -162,7 +162,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg) * * Return: %true if all cgroups under the specified root have empty LRU lists. 
*/ -static bool sgx_cgroup_lru_empty(struct misc_cg *root) +bool sgx_cgroup_lru_empty(struct misc_cg *root) { struct cgroup_subsys_state *css_root; struct cgroup_subsys_state *pos; diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h index cf2b946d993e..cd957cf38204 100644 --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h @@ -27,6 +27,11 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_recl static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { } +static inline bool sgx_cgroup_lru_empty(struct misc_cg *root) +{ + return true; +} + static inline int __init sgx_cgroup_init(void) { return 0; @@ -91,6 +96,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg) int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim); void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg); +bool sgx_cgroup_lru_empty(struct misc_cg *root); void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm); void sgx_cgroup_reclaim_direct(void); int __init sgx_cgroup_init(void); diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c index 9a8f91ebd21b..2a23a10d882e 100644 --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@ -32,9 +32,30 @@ static DEFINE_XARRAY(sgx_epc_address_space); */ static struct sgx_epc_lru_list sgx_global_lru; +/* + * Get the per-cgroup or global LRU list that tracks the given reclaimable page. + */ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page) { +#ifdef CONFIG_CGROUP_MISC + /* +* epc_page->sgx_cg here is never NULL during a reclaimable epc_page's +* life between sgx_alloc_epc_page() and sgx_free_epc_page(): +* +* In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from +* sgx_get_current_cg() which is the misc cgroup of the current task, or +* the root by default even if the misc cgroup is disabled by kernel +* command line. +* +* epc_page->sgx_cg is only unset by sgx_free_epc_page(). +* +* This function is never used before sgx_alloc_epc_page() or after +* sgx_free_epc_page(). +*/ + return &epc_page->sgx_cg->lru; +#else return &sgx_global_lru; +#endif } /* @@ -42,14 +63,10 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc
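With pages now tracked in per-cgroup LRUs, the global helpers quoted in earlier patches are expected to take roughly the shape below. This sketch combines the changelog notes and comments above (IS_ENABLED(CONFIG_CGROUP_MISC), sgx_cgroup_lru_empty(misc_cg_root()), sgx_cgroup_reclaim_pages_global()); it is not the literal hunk.

/* Sketch of the cgroup-aware global emptiness check and reclaim entry point. */
static bool example_can_reclaim_global(void)
{
        if (IS_ENABLED(CONFIG_CGROUP_MISC))
                return !sgx_cgroup_lru_empty(misc_cg_root());

        return !list_empty(&sgx_global_lru.reclaimable);
}

static void example_reclaim_pages_global(struct mm_struct *charge_mm)
{
        if (IS_ENABLED(CONFIG_CGROUP_MISC))
                sgx_cgroup_reclaim_pages_global(charge_mm);
        else
                sgx_reclaim_pages(&sgx_global_lru, charge_mm);
}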
[PATCH v16 15/16] Docs/x86/sgx: Add description for cgroup support
From: Sean Christopherson Add initial documentation of how to regulate the distribution of SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup controller. Signed-off-by: Sean Christopherson Co-developed-by: Kristen Carlson Accardi Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Cc: Sean Christopherson Reviewed-by: Bagas Sanjaya Reviewed-by: Jarkko Sakkinen Acked-by: Kai Huang Tested-by: Mikko Ylinen Tested-by: Jarkko Sakkinen --- V8: - Limit text width to 80 characters to be consistent. V6: - Remove mentioning of VMM specific behavior on handling SIGBUS - Remove statement of forced reclamation, add statement to specify ENOMEM returned when no reclamation possible. - Added statements on the non-preemptive nature for the max limit - Dropped Reviewed-by tag because of changes V4: - Fix indentation (Randy) - Change misc.events file to be read-only - Fix a typo for 'subsystem' - Add behavior when VMM overcommit EPC with a cgroup (Mikko) --- Documentation/arch/x86/sgx.rst | 83 ++ 1 file changed, 83 insertions(+) diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst index d90796adc2ec..c537e6a9aa65 100644 --- a/Documentation/arch/x86/sgx.rst +++ b/Documentation/arch/x86/sgx.rst @@ -300,3 +300,86 @@ to expected failures and handle them as follows: first call. It indicates a bug in the kernel or the userspace client if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has a return code other than 0. + + +Cgroup Support +== + +The "sgx_epc" resource within the Miscellaneous cgroup controller regulates +distribution of SGX EPC memory, which is a subset of system RAM that is used to +provide SGX-enabled applications with protected memory, and is otherwise +inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be +read/written outside of an SGX enclave. + +Although current systems implement EPC by stealing memory from RAM, for all +intents and purposes the EPC is independent from normal system memory, e.g. must +be reserved at boot from RAM and cannot be converted between EPC and normal +memory while the system is running. The EPC is managed by the SGX subsystem and +is not accounted by the memory controller. Note that this is true only for EPC +memory itself, i.e. normal memory allocations related to SGX and EPC memory, +e.g. the backing memory for evicted EPC pages, are accounted, limited and +protected by the memory controller. + +Much like normal system memory, EPC memory can be overcommitted via virtual +memory techniques and pages can be swapped out of the EPC to their backing store +(normal system memory allocated via shmem). The SGX EPC subsystem is analogous +to the memory subsystem, and it implements limit and protection models for EPC +memory. + +SGX EPC Interface Files +--- + +For a generic description of the Miscellaneous controller interface files, +please see Documentation/admin-guide/cgroup-v2.rst + +All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If +a value which is not PAGE_SIZE aligned is written, the actual value used by the +controller will be rounded down to the closest PAGE_SIZE multiple. + + misc.capacity +A read-only flat-keyed file shown only in the root cgroup. The sgx_epc +resource will show the total amount of EPC memory available on the +platform. + + misc.current +A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc +resource will show the current active EPC memory usage of the cgroup and +its descendants. 
EPC pages that are swapped out to backing RAM are not +included in the current count. + + misc.max +A read-write single value file which exists on non-root cgroups. The +sgx_epc resource will show the EPC usage hard limit. The default is +"max". + +If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for +page fault handling, will be blocked until EPC can be reclaimed from the +cgroup. If there are no pages left that are reclaimable within the same +group, the kernel returns ENOMEM. + +The EPC pages allocated for a guest VM by the virtual EPC driver are not +reclaimable by the host kernel. In case the guest cgroup's limit is +reached and no reclaimable pages left in the same cgroup, the virtual +EPC driver returns SIGBUS to the user space process to indicate failure +on new EPC allocation requests. + +The misc.max limit is non-preemptive. If a user writes a limit lower +than the current usage to this file, the cgroup will not preemptively +deallocate pages currently in use, and will only start blocking the next +allocation and reclaiming EPC at that time. + + misc.events +A read-only flat-keyed file which exists on non-root c
[PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing
With different cgroups, the script starts one or multiple concurrent SGX selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed test case, which loads an enclave of EPC size equal to the EPC capacity available on the platform. The script checks results against the expectation set for each cgroup and reports success or failure. The script creates 3 different cgroups at the beginning with following expectations: 1) small - intentionally small enough to fail the test loading an enclave of size equal to the capacity. 2) large - large enough to run up to 4 concurrent tests but fail some if more than 4 concurrent tests are run. The script starts 4 expecting at least one test to pass, and then starts 5 expecting at least one test to fail. 3) larger - limit is the same as the capacity, large enough to run lots of concurrent tests. The script starts 8 of them and expects all pass. Then it reruns the same test with one process randomly killed and usage checked to be zero after all processes exit. The script also includes a test with low mem_cg limit and large sgx_epc limit to verify that the RAM used for per-cgroup reclamation is charged to a proper mem_cg. For this test, it turns off swapping before start, and turns swapping back on afterwards. Add README to document how to run the tests. Signed-off-by: Haitao Huang Reviewed-by: Jarkko Sakkinen Tested-by: Jarkko Sakkinen --- V13: - More improvement on handling error cases and style fixes. - Add settings file for custom timeout V12: - Integrate the scripts to the "run_tests" target. (Jarkko) V11: - Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko) - Drop support for cgroup v1 and simplify. (Michal, Jarkko) - Add documentation for functions. (Jarkko) - Turn off swapping before memcontrol tests and back on after - Format and style fixes, name for hard coded values V7: - Added memcontrol test. V5: - Added script with automatic results checking, remove the interactive script. - The script can run independent from the series below. 
--- tools/testing/selftests/sgx/Makefile | 3 +- tools/testing/selftests/sgx/README| 109 +++ tools/testing/selftests/sgx/ash_cgexec.sh | 16 + tools/testing/selftests/sgx/config| 4 + .../selftests/sgx/run_epc_cg_selftests.sh | 294 ++ tools/testing/selftests/sgx/settings | 2 + .../selftests/sgx/watch_misc_for_tests.sh | 11 + 7 files changed, 438 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/sgx/README create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh create mode 100644 tools/testing/selftests/sgx/config create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh create mode 100644 tools/testing/selftests/sgx/settings create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh diff --git a/tools/testing/selftests/sgx/Makefile b/tools/testing/selftests/sgx/Makefile index 03b5e13b872b..3e673b8ace3f 100644 --- a/tools/testing/selftests/sgx/Makefile +++ b/tools/testing/selftests/sgx/Makefile @@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none ifeq ($(CAN_BUILD_X86_64), 1) TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx -TEST_FILES := $(OUTPUT)/test_encl.elf +TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh +TEST_PROGS := run_epc_cg_selftests.sh all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf endif diff --git a/tools/testing/selftests/sgx/README b/tools/testing/selftests/sgx/README new file mode 100644 index ..f84406bf29a4 --- /dev/null +++ b/tools/testing/selftests/sgx/README @@ -0,0 +1,109 @@ +SGX selftests + +The SGX selftests include a C program (test_sgx) that covers basic user space +facing APIs and a shell script (run_epc_cg_selftests.sh) testing the SGX misc +cgroup. The SGX cgroup test script requires root privileges and runs a +specific test case of test_sgx in different cgroups configured by the +script. More details about the cgroup test can be found below. + +All SGX selftests can run with or without the kselftest framework. + +WITH KSELFTEST FRAMEWORK +=== + +BUILD +- + +Build executable file "test_sgx" from the top-level directory of the kernel source: + $ make -C tools/testing/selftests TARGETS=sgx + +RUN +--- + +Run all SGX tests as sudo or root since the cgroup tests need to configure cgroup +limits in files under /sys/fs/cgroup. + + $ sudo make -C tools/testing/selftests/sgx run_tests + +Without sudo, SGX cgroup tests will be skipped. + +On platforms with a large Enclave Page Cache (EPC) and/or fewer CPU cores, you may +need to adjust the timeout in 'settings' to avoid timeouts. + +More details about the kselftest framework can be found in +Documentation/dev-tools/kselftest.rst. + +WITHOUT KSELFTEST FRAMEWORK +=== + +BUILD +- + +Build executable file "test_sgx" from this +directory (tools/testing/selftests/sgx/): + + $ make + +RUN +--- + +Run all non-cgrou
Re: [PATCH] virtio_pmem: Check device status before requesting flush
Hi, On Tue, Aug 20, 2024 at 7:23 AM Ira Weiny wrote: > > Philip Chen wrote: > > On Mon, Aug 19, 2024 at 2:56 PM Ira Weiny wrote: > > > > > > Philip Chen wrote: > > > > If a pmem device is in a bad status, the driver side could wait for > > > > host ack forever in virtio_pmem_flush(), causing the system to hang. > > > > > > I assume this was supposed to be v2 and you resent this as a proper v2 > > > with a change list from v1? > > Ah...yes, I'll fix it and re-send it as a v2 patch. > > Wait didn't you already do that? Wasn't this v2? Yes, but somehow the patch didn't go to my inbox. (Maybe it's because there is no code change between v1 and v2?) So I resent another v2 (with some minor change to the comment): https://lore.kernel.org/all/20240820172256.903251-1-philipc...@chromium.org/ Please take a look. > > https://lore.kernel.org/all/20240815010337.2334245-1-philipc...@chromium.org/ > > Ira
Re: [PATCH v2] virtio_pmem: Check device status before requesting flush
Hi, On Tue, Aug 20, 2024 at 1:01 PM Dave Jiang wrote: > > > > On 8/20/24 10:22 AM, Philip Chen wrote: > > If a pmem device is in a bad status, the driver side could wait for > > host ack forever in virtio_pmem_flush(), causing the system to hang. > > > > So add a status check in the beginning of virtio_pmem_flush() to return > > early if the device is not activated. > > > > Signed-off-by: Philip Chen > > --- > > > > v2: > > - Remove change id from the patch description > > - Add more details to the patch description > > > > drivers/nvdimm/nd_virtio.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c > > index 35c8fbbba10e..97addba06539 100644 > > --- a/drivers/nvdimm/nd_virtio.c > > +++ b/drivers/nvdimm/nd_virtio.c > > @@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region) > > unsigned long flags; > > int err, err1; > > > > + /* > > + * Don't bother to submit the request to the device if the device is > > + * not acticated. > > s/acticated/activated/ Thanks for the review. I'll fix this typo in v3. In addition to this typo, does anyone have any other concerns? > > > + */ > > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) { > > + dev_info(&vdev->dev, "virtio pmem device needs a reset\n"); > > + return -EIO; > > + } > > + > > might_sleep(); > > req_data = kmalloc(sizeof(*req_data), GFP_KERNEL); > > if (!req_data)
[PATCH v7 4/4] module: add install target for modules.builtin.ranges
When CONFIG_BUILTIN_MODULE_RANGES is enabled, the modules.builtin.ranges file should be installed in the module install location. Signed-off-by: Kris Van Hees Reviewed-by: Nick Alcock --- Changes since v3: - Only install modules.builtin.ranges if CONFIG_BUILTIN_MODULE_RANGES=y --- scripts/Makefile.modinst | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst index 0afd75472679..c38bf63a33be 100644 --- a/scripts/Makefile.modinst +++ b/scripts/Makefile.modinst @@ -30,10 +30,12 @@ $(MODLIB)/modules.order: modules.order FORCE quiet_cmd_install_modorder = INSTALL $@ cmd_install_modorder = sed 's:^\(.*\)\.o$$:kernel/\1.ko:' $< > $@ -# Install modules.builtin(.modinfo) even when CONFIG_MODULES is disabled. +# Install modules.builtin(.modinfo,.ranges) even when CONFIG_MODULES is disabled. install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo) -$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo): $(MODLIB)/%: % FORCE +install-$(CONFIG_BUILTIN_MODULE_RANGES) += $(MODLIB)/modules.builtin.ranges + +$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo modules.builtin.ranges): $(MODLIB)/%: % FORCE $(call cmd,install) endif -- 2.45.2
[PATCH v7 1/4] kbuild: add mod(name,file)_flags to assembler flags for module objects
In order to create, at build time, the modules.builtin.ranges file that contains the address ranges of all built-in modules, there needs to be a way to identify what code is compiled into modules. To identify what code is compiled into modules during a kernel build, one can look for the presence of the -DKBUILD_MODFILE and -DKBUILD_MODNAME options in the compile command lines. A simple grep in .*.cmd files for those options is sufficient for this. Unfortunately, these options are only passed when compiling C source files. Various modules also include objects built from assembler source, and these options are not passed in that case. Adding $(modfile_flags) to modkern_aflags (similar to modkern_cflags), and adding $(modname_flags) to a_flags (similar to c_flags) makes it possible to identify which objects are compiled into modules for both C and assembler source files. Signed-off-by: Kris Van Hees Reviewed-by: Steven Rostedt (Google) --- scripts/Makefile.lib | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib index fe3668dc4954..170f462537a8 100644 --- a/scripts/Makefile.lib +++ b/scripts/Makefile.lib @@ -238,7 +238,7 @@ modkern_rustflags = \ modkern_aflags = $(if $(part-of-module), \ $(KBUILD_AFLAGS_MODULE) $(AFLAGS_MODULE), \ - $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL)) + $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL) $(modfile_flags)) c_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \ -include $(srctree)/include/linux/compiler_types.h \ @@ -248,7 +248,7 @@ c_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \ rust_flags = $(_rust_flags) $(modkern_rustflags) @$(objtree)/include/generated/rustc_cfg a_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \ -$(_a_flags) $(modkern_aflags) +$(_a_flags) $(modkern_aflags) $(modname_flags) cpp_flags = -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \ $(_cpp_flags) -- 2.45.2
[PATCH v7 3/4] scripts: add verifier script for builtin module range data
The modules.builtin.ranges offset range data for builtin modules is generated at compile time based on the list of built-in modules and the vmlinux.map and vmlinux.o.map linker maps. This data can be used to determine whether a symbol at a particular address belongs to module code that was configured to be compiled into the kernel proper as a built-in module (rather than as a standalone module). This patch adds a script that uses the generated modules.builtin.ranges data to annotate the symbols in the System.map with module names if their address falls within a range that belongs to one or more built-in modules. It then processes the vmlinux.map (and if needed, vmlinux.o.map) to verify the annotation: - For each top-level section: - For each object in the section: - Determine whether the object is part of a built-in module (using modules.builtin and the .*.cmd file used to compile the object as suggested in [0]) - For each symbol in that object, verify that the built-in module association (or lack thereof) matches the annotation given to the symbol. Signed-off-by: Kris Van Hees Reviewed-by: Nick Alcock Reviewed-by: Alan Maguire --- Changes since v6: - Applied Masahiro Yamada's suggestions to the AWK script. Changes since v5: - Added optional 6th argument to specify kernel build directory. - Report error and exit if .*.o.cmd files cannot be read. Changes since v4: - New patch in the series --- scripts/verify_builtin_ranges.awk | 356 ++ 1 file changed, 356 insertions(+) create mode 100755 scripts/verify_builtin_ranges.awk diff --git a/scripts/verify_builtin_ranges.awk b/scripts/verify_builtin_ranges.awk new file mode 100755 index 000000000000..93f66e9a8802 --- /dev/null +++ b/scripts/verify_builtin_ranges.awk @@ -0,0 +1,356 @@ +#!/usr/bin/gawk -f +# SPDX-License-Identifier: GPL-2.0 +# verify_builtin_ranges.awk: Verify address range data for builtin modules +# Written by Kris Van Hees +# +# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \ +# modules.builtin vmlinux.map vmlinux.o.map \ +# [ <build directory> ] +# +# Return the module name(s) (if any) associated with the given object. +# +# If we have seen this object before, return information from the cache. +# Otherwise, retrieve it from the corresponding .cmd file. +# +function get_module_info(fn, mod, obj, s) { + if (fn in omod) + return omod[fn]; + + if (match(fn, /\/[^/]+$/) == 0) + return ""; + + obj = fn; + mod = ""; + fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd"; + if (getline s <fn == 1) { + if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) { + mod = substr(s, RSTART + 16, RLENGTH - 16); + gsub(/['"]/, "", mod); + } + } else { + print "ERROR: Failed to read: " fn "\n\n" \ + " Invalid kernel build directory (" kdir ")\n" \ + " or its content does not match " ARGV[1] >"/dev/stderr"; + close(fn); + total = 0; + exit(1); + } + close(fn); + + # A single module (common case) also reflects objects that are not part + # of a module. Some of those objects have names that are also a module + # name (e.g. core). We check the associated module file name, and if + # they do not match, the object is not part of a module. + if (mod !~ / /) { + if (!(mod in mods)) + mod = ""; + } + + gsub(/([^/ ]*\/)+/, "", mod); + gsub(/-/, "_", mod); + + # At this point, mod is a single (valid) module name, or a list of + # module names (that do not need validation). + omod[obj] = mod; + close(fn); + + return mod; +} + +# Return a representative integer value for a given hexadecimal address. 
+# +# Since all kernel addresses fall within the same memory region, we can safely +# strip off the first 6 hex digits before performing the hex-to-dec conversion, +# thereby avoiding integer overflows. +# +function addr2val(val) { + sub(/^0x/, "", val); + if (length(val) == 16) + val = substr(val, 5); + return strtonum("0x" val); +} + +# Determine the kernel build directory to use (default is .). +# +BEGIN { + if (ARGC > 6) { + kdir = ARGV[ARGC - 1]; + ARGV[ARGC - 1] = ""; + } else + kdir = "."; +} + +# (1) Load the built-in module address range data. +# +ARGIND == 1 { + ranges[FNR] = $0; + rcnt++; + next; +} + +# (2) Annotate System.map symbols with module names. +# +ARGIND == 2 { + addr = addr2val($1); + name = $3; + + while (addr >= mod_eaddr) { + if (sect_symb) { + if (sect_symb != name) +
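A small aside on the addr2val() helper shown above: gawk stores numbers as IEEE doubles, so converting a full 64-bit address string can lose precision, and dropping the leading hex digits is safe because all addresses in one map share the same upper bits. That reading of the "avoiding integer overflows" comment is my interpretation, not something the patch states. The Python rendering below only documents the transformation (Python integers are arbitrary precision, so the trimming is not strictly needed there).

# Hedged Python rendering of addr2val() from the script above. The awk
# version keeps only the low 12 hex digits of a 16-digit address before
# converting; this sketch mirrors that, purely for illustration.
def addr2val(val: str) -> int:
    if val.startswith("0x"):
        val = val[2:]
    if len(val) == 16:
        val = val[4:]          # drop the leading hex digits, keep the low 12
    return int(val, 16)

# Example: a typical x86_64 kernel text address.
assert addr2val("0xffffffff81000000") == int("ffff81000000", 16)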
[PATCH v7 2/4] kbuild: generate offset range data for builtin modules
Create the modules.builtin.ranges file that can be used to find where built-in modules are located by their addresses. This will be useful for tracing tools to find which built-in module the functions at various addresses belong to. The offset range data for builtin modules is generated using: - modules.builtin: associates object files with module names - vmlinux.map: provides load order of sections and offset of first member per section - vmlinux.o.map: provides offset of object file content per section - .*.cmd: build cmd file with KBUILD_MODFILE The generated data will look like: .text 00000000-00000000 = _text .text 0000baf0-0000cb10 amd_uncore .text 0009bd10-0009c8e0 iosf_mbi ... .text 00b9f080-00ba011a intel_skl_int3472_discrete .text 00ba0120-00ba03c0 intel_skl_int3472_discrete intel_skl_int3472_tps68470 .text 00ba03c0-00ba08d6 intel_skl_int3472_tps68470 ... .data 00000000-00000000 = _sdata .data 0000f020-0000f680 amd_uncore For each ELF section, it lists the offset of the first symbol. This can be used to determine the base address of the section at runtime. Next, it lists (in strict ascending order) offset ranges in that section that cover the symbols of one or more builtin modules. Multiple ranges can apply to a single module, and ranges can be shared between modules. The CONFIG_BUILTIN_MODULE_RANGES option controls whether offset range data is generated for kernel modules that are built into the kernel image. How it works: 1. The modules.builtin file is parsed to obtain a list of built-in module names and their associated object names (the .ko file that the module would be in if it were a loadable module, hereafter referred to as <modfile>). This object name can be used to identify objects in the kernel compile because any C or assembler code that ends up in a built-in module will have the option -DKBUILD_MODFILE=<modfile> present in its build command, and those can be found in the .<object>.cmd file in the kernel build tree. If an object is part of multiple modules, they will all be listed in the KBUILD_MODFILE option argument. This allows us to conclusively determine whether an object in the kernel build belongs to any modules, and which ones. 2. The vmlinux.map is parsed next to determine the base address of each top level section so that all addresses into the section can be turned into offsets. This makes it possible to handle sections getting loaded at different addresses at system boot. We also determine an 'anchor' symbol at the beginning of each section to make it possible to calculate the true base address of a section at runtime (i.e. symbol address - symbol offset). We collect start addresses of sections that are included in the top level section. This is used when vmlinux is linked using vmlinux.o, because in that case, we need to look at the vmlinux.o linker map to know what object a symbol is found in. And finally, we process each symbol that is listed in vmlinux.map (or vmlinux.o.map) based on the following structure: vmlinux linked from vmlinux.a: vmlinux.map: <top level section> <included section> -- might be same as top level section <object> -- built-in association known <symbol> -- belongs to module(s) object belongs to ... vmlinux linked from vmlinux.o: vmlinux.map: <top level section> <included section> -- might be same as top level section vmlinux.o -- need to use vmlinux.o.map <symbol> -- ignored ... vmlinux.o.map: <section> <object> -- built-in association known <symbol> -- belongs to module(s) object belongs to ... 3. 
As sections, objects, and symbols are processed, offset ranges are constructed in a straightforward way: - If the symbol belongs to one or more built-in modules: - If we were working on the same module(s), extend the range to include this object - If we were working on another module(s), close that range, and start the new one - If the symbol does not belong to any built-in modules: - If we were working on a module(s) range, close that range Signed-off-by: Kris Van Hees Reviewed-by: Nick Alcock Reviewed-by: Alan Maguire Reviewed-by: Steven Rostedt (Google) --- Changes since v6: - Applied Masahiro Yamada's suggestions (Kconfig, makefile, script). Changes since v5: - Removed unnecessary compatibility info from option description. Changes since v4: - Improved commit description to explain the why and how. - Documented dependency on GNU AWK for CONFIG_BUILTIN_MODULE_RANGES. - Improved comments in generate_builtin_ranges.awk - Improved logic in generate_builtin_ranges.awk to handle incorrect object size information in linker maps Changes since v3: - Consolidated patches 2 through 5 into a single patch - Move
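For readers who want to see how a consumer might use the generated file, below is a hedged Python sketch (not part of this series) that follows the format described above: it reads System.map to resolve each section's anchor symbol, turns the per-section offset ranges into absolute address ranges, and looks up which built-in module(s) cover a given address. File names, function names, and the example symbol are illustrative assumptions.

#!/usr/bin/env python3
# Illustrative consumer of modules.builtin.ranges, assuming the format shown
# in the commit message above: per section, an anchor line
# "<sect> <off>-<off> = <symbol>" followed by range lines
# "<sect> <start>-<end> <module> [<module>...]".
def load_system_map(path):
    syms = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3:
                syms[fields[2]] = int(fields[0], 16)
    return syms

def load_ranges(path, syms):
    base = {}      # section name -> absolute load address
    ranges = []    # (absolute start, absolute end, [module names])
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            sect = parts[0]
            start, end = (int(x, 16) for x in parts[1].split("-"))
            if parts[2] == "=":
                # Anchor line: section base = anchor symbol address - offset.
                base[sect] = syms[parts[3]] - start
            elif sect in base:
                ranges.append((base[sect] + start, base[sect] + end, parts[2:]))
    return ranges

def builtin_modules_for(ranges, addr):
    return [mods for lo, hi, mods in ranges if lo <= addr < hi]

if __name__ == "__main__":
    syms = load_system_map("System.map")
    ranges = load_ranges("modules.builtin.ranges", syms)
    # Example lookup with an arbitrary symbol; any kernel address works here.
    addr = syms.get("iosf_mbi_read", 0)
    print(hex(addr), builtin_modules_for(ranges, addr))

Run against a matching System.map and modules.builtin.ranges pair, a lookup like this reports which built-in module(s), if any, own the code at a given trace address.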
Re: [RFC 7/7] vhost: Add new UAPI to support changing Kthread mode
On Mon, Aug 19, 2024 at 05:27:33PM +0800, Cindy Lu wrote: > Add a new UAPI to support setting the vhost device to > use kthread mode. The user space application needs to use > VHOST_SET_USE_KTHREAD to set the mode. This setting must > be set before VHOST_SET_OWNER is set. Usage of an API is a complete kernel internal detail that has absolutely no business being exposed to applications. What is the application visible behavior that the API use is the proxy for?
Re: [syzbot] [perf?] KASAN: slab-use-after-free Read in uprobe_unregister
#syz dup: [syzbot] [perf?] KASAN: slab-use-after-free Read in __uprobe_unregister syzbot https://lore.kernel.org/all/382d39061f59f...@google.com/ Hopefully fixed by [PATCH tip/perf/core] bpf: fix use-after-free in bpf_uprobe_multi_link_attach() https://lore.kernel.org/all/20240813152524.ga7...@redhat.com/ On 08/20, syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:367b5c3d53e5 Add linux-next specific files for 20240816 > git tree: linux-next > console+strace: https://syzkaller.appspot.com/x/log.txt?x=109d46a398 > kernel config: https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467 > dashboard link: https://syzkaller.appspot.com/bug?extid=b32fa80fe75273685f9c > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) > 2.40 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1099b7d398 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=15c513e598 > > Downloadable assets: > disk image: > https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz > vmlinux: > https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz > kernel image: > https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+b32fa80fe75273685...@syzkaller.appspotmail.com > > R10: 0001 R11: 0246 R12: > R13: R14: 0001 R15: 0001 > > == > BUG: KASAN: slab-use-after-free in consumer_del kernel/events/uprobes.c:772 > [inline] > BUG: KASAN: slab-use-after-free in uprobe_unregister+0x99/0x220 > kernel/events/uprobes.c:1097 > Read of size 8 at addr 888022b9eb40 by task syz-executor227/5226 > > CPU: 0 UID: 0 PID: 5226 Comm: syz-executor227 Not tainted > 6.11.0-rc3-next-20240816-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 08/06/2024 > Call Trace: > > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:377 [inline] > print_report+0x169/0x550 mm/kasan/report.c:488 > kasan_report+0x143/0x180 mm/kasan/report.c:601 > consumer_del kernel/events/uprobes.c:772 [inline] > uprobe_unregister+0x99/0x220 kernel/events/uprobes.c:1097 > bpf_uprobe_unregister kernel/trace/bpf_trace.c:3119 [inline] > bpf_uprobe_multi_link_release+0x8c/0x1b0 kernel/trace/bpf_trace.c:3127 > bpf_link_free+0xf5/0x250 kernel/bpf/syscall.c:2998 > bpf_link_put_direct kernel/bpf/syscall.c:3038 [inline] > bpf_link_release+0x7b/0x90 kernel/bpf/syscall.c:3045 > __fput+0x24b/0x890 fs/file_table.c:424 > task_work_run+0x24f/0x310 kernel/task_work.c:228 > exit_task_work include/linux/task_work.h:40 [inline] > do_exit+0xa2f/0x28e0 kernel/exit.c:939 > do_group_exit+0x207/0x2c0 kernel/exit.c:1088 > __do_sys_exit_group kernel/exit.c:1099 [inline] > __se_sys_exit_group kernel/exit.c:1097 [inline] > __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097 > x64_sys_call+0x26a8/0x26b0 arch/x86/include/generated/asm/syscalls_64.h:232 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7f409a857c79 > Code: Unable to access opcode bytes at 0x7f409a857c4f. 
> RSP: 002b:7fff689d0b48 EFLAGS: 0246 ORIG_RAX: 00e7 > RAX: ffda RBX: RCX: 7f409a857c79 > RDX: 003c RSI: 00e7 RDI: > RBP: 7f409a8cb370 R08: ffb8 R09: 00a0 > R10: 0001 R11: 0246 R12: 7f409a8cb370 > R13: R14: 7f409a8cbdc0 R15: 7f409a821340 > > > Allocated by task 5226: > kasan_save_stack mm/kasan/common.c:47 [inline] > kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 > poison_kmalloc_redzone mm/kasan/common.c:377 [inline] > __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394 > kasan_kmalloc include/linux/kasan.h:260 [inline] > __do_kmalloc_node mm/slub.c:4202 [inline] > __kmalloc_node_noprof+0x22a/0x440 mm/slub.c:4208 > __kvmalloc_node_noprof+0x72/0x1b0 mm/util.c:658 > kvmalloc_array_node_noprof include/linux/slab.h:835 [inline] > bpf_uprobe_multi_link_attach+0x43e/0xb90 kernel/trace/bpf_trace.c:3369 > link_create+0x6e9/0x870 kernel/bpf/syscall.c:5236 > __sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5697 > __do_sys_bpf kernel/bpf/syscall.c:5734 [inline] > __se_sys_bpf kernel/bpf/syscall.c:5732 [inline] > __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5732 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Freed by task 5226: > kasan_save_stack mm/kasan/common.c:47 [inline] > kasan_save
Re: [syzbot] [perf?] KASAN: slab-use-after-free Read in uprobe_unregister
> #syz dup: [syzbot] [perf?] KASAN: slab-use-after-free Read in > __uprobe_unregister syzbot can't find the dup bug > > https://lore.kernel.org/all/382d39061f59f...@google.com/ > > Hopefully fixed by > [PATCH tip/perf/core] bpf: fix use-after-free in > bpf_uprobe_multi_link_attach() > https://lore.kernel.org/all/20240813152524.ga7...@redhat.com/
Re: [PATCH] remoteproc: k3-r5: Fix error handling when power-up failed
On 19-08-2024 20:54, Jan Kiszka wrote: > From: Jan Kiszka > > By simply bailing out, the driver was violating its rule and internal > assumptions that either both or no rproc should be initialized. Using device lifecycle managed functions to register the rproc (devm_rproc_add()), bailing out with an error code will work. > E.g., this could cause the first core to be available but not the second one, > leading to crashes on its shutdown later on while trying to dereference that > second instance. > > Fixes: 61f6f68447ab ("remoteproc: k3-r5: Wait for core0 power-up before powering up core1") > Signed-off-by: Jan Kiszka > --- > drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c > index 39a47540c590..eb09d2e9b32a 100644 > --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c > +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c > @@ -1332,7 +1332,7 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev) > dev_err(dev, "Timed out waiting for %s core to power up!\n", rproc->name); > - return ret; > + goto err_powerup; > } > } > @@ -1348,6 +1348,7 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev) > } > } > +err_powerup: > rproc_del(rproc); Please use devm_rproc_add() to avoid having to do rproc_del() manually here. > err_add: > k3_r5_reserved_mem_exit(kproc);