[PATCH 0/4] Add new driver for WCSS secure PIL loading

2024-08-20 Thread Gokul Sriram Palanisamy
This series depends on q6 clock removal series [1].

- Secure PIL firmware consists of signed, split images that only
  TrustZone (TZ) can authenticate and load. The Linux kernel sends a
  request to TZ to authenticate and load the PIL images.

- When secure PIL support was added to the existing WCSS PIL driver
  earlier in [2], Bjorn suggested not overloading the existing WCSS
  rproc driver and instead posting a new driver for secure PIL alone.
  This series adds that new secure PIL driver.

[1] https://patchwork.kernel.org/project/linux-arm-msm/cover/20240820055618.267554-1-quic_gokul...@quicinc.com/

[2] https://patchwork.kernel.org/project/linux-arm-msm/patch/1611984013-10201-3-git-send-email-gokul...@codeaurora.org/

Manikanta Mylavarapu (3):
  dt-bindings: remoteproc: qcom: document hexagon based WCSS secure PIL
  arm64: dts: qcom: ipq5332: add nodes to bringup q6
  arm64: dts: qcom: ipq9574: add nodes to bring up q6

Vignesh Viswanathan (1):
  remoteproc: qcom: add hexagon based WCSS secure PIL driver

 .../remoteproc/qcom,wcss-sec-pil.yaml | 125 ++
 arch/arm64/boot/dts/qcom/ipq5332.dtsi |  62 +++
 arch/arm64/boot/dts/qcom/ipq9574.dtsi |  58 +++
 drivers/remoteproc/Kconfig|  22 ++
 drivers/remoteproc/Makefile   |   1 +
 drivers/remoteproc/qcom_q6v5_wcss_sec.c   | 360 ++
 6 files changed, 628 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
 create mode 100644 drivers/remoteproc/qcom_q6v5_wcss_sec.c

-- 
2.34.1




[PATCH 1/2] dt-bindings: remoteproc: qcom: document hexagon based WCSS secure PIL

2024-08-20 Thread Gokul Sriram Palanisamy
From: Manikanta Mylavarapu 

Add a new binding document for the Hexagon-based WCSS secure PIL
remoteproc. IPQ5332 and IPQ9574 follow the secure PIL scheme.

Signed-off-by: Manikanta Mylavarapu 
Signed-off-by: Gokul Sriram Palanisamy 
---
 .../remoteproc/qcom,wcss-sec-pil.yaml | 125 ++
 1 file changed, 125 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml

diff --git 
a/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml 
b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
new file mode 100644
index ..c69401b6cec1
--- /dev/null
+++ b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
@@ -0,0 +1,125 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/remoteproc/qcom,wcss-sec-pil.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm WCSS Secure Peripheral Image Loader
+
+maintainers:
+  - Manikanta Mylavarapu 
+
+description:
+  The WCSS Secure Peripheral Image Loader loads firmware and powers up
+  the QDSP6 remote processors on the Qualcomm IPQ5332 and IPQ9574 SoCs.
+
+properties:
+  compatible:
+    enum:
+      - qcom,ipq5332-wcss-sec-pil
+      - qcom,ipq9574-wcss-sec-pil
+
+  reg:
+    maxItems: 1
+
+  firmware-name:
+    $ref: /schemas/types.yaml#/definitions/string
+    description: Firmware name for the Hexagon core
+
+  interrupts:
+    items:
+      - description: Watchdog interrupt
+      - description: Fatal interrupt
+      - description: Ready interrupt
+      - description: Handover interrupt
+      - description: Stop acknowledge interrupt
+
+  interrupt-names:
+    items:
+      - const: wdog
+      - const: fatal
+      - const: ready
+      - const: handover
+      - const: stop-ack
+
+  clocks:
+    items:
+      - description: IM SLEEP clock
+
+  clock-names:
+    items:
+      - const: im_sleep
+
+  qcom,smem-states:
+    $ref: /schemas/types.yaml#/definitions/phandle-array
+    description: States used by the AP to signal the remote processor
+    items:
+      - description: Shutdown Q6
+      - description: Stop Q6
+
+  qcom,smem-state-names:
+    description:
+      Names of the states used by the AP to signal the remote processor
+    items:
+      - const: shutdown
+      - const: stop
+
+  memory-region:
+    items:
+      - description: Q6 reserved region
+
+  glink-edge:
+    $ref: /schemas/remoteproc/qcom,glink-edge.yaml#
+    description:
+      Qualcomm G-Link subnode which represents the communication edge,
+      channels and devices related to the WCSS.
+    unevaluatedProperties: false
+
+required:
+  - compatible
+  - firmware-name
+  - reg
+  - interrupts
+  - interrupt-names
+  - qcom,smem-states
+  - qcom,smem-state-names
+  - memory-region
+
+additionalProperties: false
+
+examples:
+  - |
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/clock/qcom,ipq5332-gcc.h>
+    q6v5_wcss: remoteproc@d10 {
+      compatible = "qcom,ipq5332-wcss-sec-pil";
+      reg = <0xd10 0x4040>;
+      firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt";
+      interrupts-extended = <&intc GIC_SPI 291 IRQ_TYPE_EDGE_RISING>,
+                            <&wcss_smp2p_in 0 IRQ_TYPE_NONE>,
+                            <&wcss_smp2p_in 1 IRQ_TYPE_NONE>,
+                            <&wcss_smp2p_in 2 IRQ_TYPE_NONE>,
+                            <&wcss_smp2p_in 3 IRQ_TYPE_NONE>;
+      interrupt-names = "wdog",
+                        "fatal",
+                        "ready",
+                        "handover",
+                        "stop-ack";
+
+      clocks = <&gcc GCC_IM_SLEEP_CLK>;
+      clock-names = "im_sleep";
+
+      qcom,smem-states = <&wcss_smp2p_out 0>,
+                         <&wcss_smp2p_out 1>;
+      qcom,smem-state-names = "shutdown",
+                              "stop";
+
+      memory-region = <&q6_region>;
+
+      glink-edge {
+        interrupts = ;
+        label = "rtr";
+        qcom,remote-pid = <1>;
+        mboxes = <&apcs_glb 8>;
+      };
+    };
-- 
2.34.1




[PATCH 2/2] remoteproc: qcom: add hexagon based WCSS secure PIL driver

2024-08-20 Thread Gokul Sriram Palanisamy
From: Vignesh Viswanathan 

Add support to bring up the Hexagon-based WCSS secure PIL remoteproc.
IPQ5332 and IPQ9574 support secure PIL remoteproc.

Signed-off-by: Vignesh Viswanathan 
Signed-off-by: Manikanta Mylavarapu 
Signed-off-by: Gokul Sriram Palanisamy 
---
 drivers/remoteproc/Kconfig  |  22 ++
 drivers/remoteproc/Makefile |   1 +
 drivers/remoteproc/qcom_q6v5_wcss_sec.c | 363 
 3 files changed, 386 insertions(+)
 create mode 100644 drivers/remoteproc/qcom_q6v5_wcss_sec.c

diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
index dda2ada215b7..368405825d83 100644
--- a/drivers/remoteproc/Kconfig
+++ b/drivers/remoteproc/Kconfig
@@ -255,6 +255,28 @@ config QCOM_Q6V5_WCSS
  Hexagon V5 based WCSS remote processors on e.g. IPQ8074.  This is
  a non-TrustZone wireless subsystem.
 
+config QCOM_Q6V5_WCSS_SEC
+   tristate "Qualcomm Hexagon based WCSS Secure Peripheral Image Loader"
+   depends on OF && ARCH_QCOM
+   depends on QCOM_SMEM
+   depends on RPMSG_QCOM_SMD || RPMSG_QCOM_SMD=n
+   depends on RPMSG_QCOM_GLINK_SMEM || RPMSG_QCOM_GLINK_SMEM=n
+   depends on QCOM_SYSMON || QCOM_SYSMON=n
+   depends on RPMSG_QCOM_GLINK || RPMSG_QCOM_GLINK=n
+   depends on QCOM_AOSS_QMP || QCOM_AOSS_QMP=n
+   select QCOM_MDT_LOADER
+   select QCOM_PIL_INFO
+   select QCOM_Q6V5_COMMON
+   select QCOM_RPROC_COMMON
+   select QCOM_SCM
+   help
+ Say y here to support the Qualcomm Secure Peripheral Image Loader
+ for the Hexagon based remote processors on e.g. IPQ5332.
+
+ This is a TrustZone wireless subsystem. The firmware is
+ verified and booted with the help of the Peripheral Authentication
+ System (PAS) in TrustZone.
+
 config QCOM_SYSMON
tristate "Qualcomm sysmon driver"
depends on RPMSG
diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
index 91314a9b43ce..54530d7fa510 100644
--- a/drivers/remoteproc/Makefile
+++ b/drivers/remoteproc/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_QCOM_Q6V5_ADSP)  += qcom_q6v5_adsp.o
 obj-$(CONFIG_QCOM_Q6V5_MSS)	+= qcom_q6v5_mss.o
 obj-$(CONFIG_QCOM_Q6V5_PAS)	+= qcom_q6v5_pas.o
 obj-$(CONFIG_QCOM_Q6V5_WCSS)   += qcom_q6v5_wcss.o
+obj-$(CONFIG_QCOM_Q6V5_WCSS_SEC)   += qcom_q6v5_wcss_sec.o
 obj-$(CONFIG_QCOM_SYSMON)  += qcom_sysmon.o
 obj-$(CONFIG_QCOM_WCNSS_PIL)   += qcom_wcnss_pil.o
 qcom_wcnss_pil-y   += qcom_wcnss.o
diff --git a/drivers/remoteproc/qcom_q6v5_wcss_sec.c 
b/drivers/remoteproc/qcom_q6v5_wcss_sec.c
new file mode 100644
index ..8c47cf35a00b
--- /dev/null
+++ b/drivers/remoteproc/qcom_q6v5_wcss_sec.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2016-2018 Linaro Ltd.
+ * Copyright (C) 2014 Sony Mobile Communications AB
+ * Copyright (c) 2012-2018, 2024 The Linux Foundation. All rights reserved.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "qcom_common.h"
+#include "qcom_q6v5.h"
+
+#include "remoteproc_internal.h"
+
+#define WCSS_CRASH_REASON  421
+
+#define WCSS_PAS_ID		0x6
+#define MPD_WCSS_PAS_ID		0xD
+
+struct wcss_sec {
+   struct device *dev;
+   struct qcom_rproc_glink glink_subdev;
+   struct qcom_rproc_ssr ssr_subdev;
+   struct qcom_q6v5 q6;
+   phys_addr_t mem_phys;
+   phys_addr_t mem_reloc;
+   void *mem_region;
+   size_t mem_size;
+   const struct wcss_data *desc;
+   const char *fw_name;
+
+   struct clk *im_sleep;
+};
+
+struct wcss_data {
+   u32 pasid;
+   const struct rproc_ops *ops;
+   bool need_auto_boot;
+   int (*init_clk)(struct wcss_sec *wcss);
+};
+
+static int wcss_sec_start(struct rproc *rproc)
+{
+   struct wcss_sec *wcss = rproc->priv;
+   struct device *dev = wcss->dev;
+   const struct wcss_data *desc = of_device_get_match_data(dev);
+   int ret;
+
+   qcom_q6v5_prepare(&wcss->q6);
+
+   ret = qcom_scm_pas_auth_and_reset(desc->pasid);
+   if (ret) {
+   dev_err(dev, "wcss_reset failed\n");
+   return ret;
+   }
+
+   ret = qcom_q6v5_wait_for_start(&wcss->q6, 5 * HZ);
+   if (ret == -ETIMEDOUT)
+   dev_err(dev, "start timed out\n");
+
+   return ret;
+}
+
+static int wcss_sec_stop(struct rproc *rproc)
+{
+   struct wcss_sec *wcss = rproc->priv;
+   struct device *dev = wcss->dev;
+   const struct wcss_data *desc = of_device_get_match_data(dev);
+   int ret;
+
+   ret = qcom_scm_pas_shutdown(desc->pasid);
+   if (ret) {
+   dev_err(dev, "not able to shutdown\n");
+   return ret;
+   }
+   qcom_q6v5_unprepare(&wcss->q6);
+
+   return 0;
+}

[PATCH 3/4] arm64: dts: qcom: ipq5332: add nodes to bringup q6

2024-08-20 Thread Gokul Sriram Palanisamy
From: Manikanta Mylavarapu 

Add the nodes required for Q6 remoteproc bring-up.

Signed-off-by: Manikanta Mylavarapu 
Signed-off-by: Gokul Sriram Palanisamy 
---
 arch/arm64/boot/dts/qcom/ipq5332.dtsi | 62 +++
 1 file changed, 62 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/ipq5332.dtsi 
b/arch/arm64/boot/dts/qcom/ipq5332.dtsi
index 0a74ed4f72cc..ec93e7b64b9e 100644
--- a/arch/arm64/boot/dts/qcom/ipq5332.dtsi
+++ b/arch/arm64/boot/dts/qcom/ipq5332.dtsi
@@ -145,6 +145,11 @@ smem@4a80 {
 
hwlocks = <&tcsr_mutex 3>;
};
+
+   q6_region: wcnss@4a90 {
+   reg = <0x0 0x4a90 0x0 0x2b0>;
+   no-map;
+   };
};
 
soc@0 {
@@ -476,6 +481,39 @@ frame@b128000 {
status = "disabled";
};
};
+
+   q6v5_wcss: remoteproc@d10 {
+   compatible = "qcom,ipq5332-wcss-sec-pil";
+   reg = <0xd10 0x4040>;
+   firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt";
+   interrupts-extended = <&intc GIC_SPI 421 IRQ_TYPE_EDGE_RISING>,
+ <&wcss_smp2p_in 0 IRQ_TYPE_NONE>,
+ <&wcss_smp2p_in 1 IRQ_TYPE_NONE>,
+ <&wcss_smp2p_in 2 IRQ_TYPE_NONE>,
+ <&wcss_smp2p_in 3 IRQ_TYPE_NONE>;
+   interrupt-names = "wdog",
+ "fatal",
+ "ready",
+ "handover",
+ "stop-ack";
+
+   clocks = <&gcc GCC_IM_SLEEP_CLK>;
+   clock-names = "im_sleep";
+
+   qcom,smem-states = <&wcss_smp2p_out 0>,
+  <&wcss_smp2p_out 1>;
+   qcom,smem-state-names = "shutdown",
+   "stop";
+
+   memory-region = <&q6_region>;
+
+   glink-edge {
+   interrupts = ;
+   label = "rtr";
+   qcom,remote-pid = <1>;
+   mboxes = <&apcs_glb 8>;
+   };
+   };
};
 
timer {
@@ -485,4 +523,28 @@ timer {
 ,
 ;
};
+
+   wcss: wcss-smp2p {
+   compatible = "qcom,smp2p";
+   qcom,smem = <435>, <428>;
+
+   interrupt-parent = <&intc>;
+   interrupts = ;
+
+   mboxes = <&apcs_glb 9>;
+
+   qcom,local-pid = <0>;
+   qcom,remote-pid = <1>;
+
+   wcss_smp2p_out: master-kernel {
+   qcom,entry-name = "master-kernel";
+   #qcom,smem-state-cells = <1>;
+   };
+
+   wcss_smp2p_in: slave-kernel {
+   qcom,entry-name = "slave-kernel";
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   };
+   };
 };
-- 
2.34.1




[PATCH 4/4] arm64: dts: qcom: ipq9574: add nodes to bring up q6

2024-08-20 Thread Gokul Sriram Palanisamy
From: Manikanta Mylavarapu 

Add the nodes required for Q6 remoteproc bring-up.

Signed-off-by: Manikanta Mylavarapu 
Signed-off-by: Gokul Sriram Palanisamy 
---
 arch/arm64/boot/dts/qcom/ipq9574.dtsi | 58 +++
 1 file changed, 58 insertions(+)

diff --git a/arch/arm64/boot/dts/qcom/ipq9574.dtsi 
b/arch/arm64/boot/dts/qcom/ipq9574.dtsi
index 48dfafea46a7..bca3a50d21d7 100644
--- a/arch/arm64/boot/dts/qcom/ipq9574.dtsi
+++ b/arch/arm64/boot/dts/qcom/ipq9574.dtsi
@@ -213,6 +213,11 @@ smem@4aa0 {
hwlocks = <&tcsr_mutex 3>;
no-map;
};
+
+   q6_region: wcnss@4ab0 {
+   reg = <0x0 0x4ab0 0x0 0x2b0>;
+   no-map;
+   };
};
 
soc: soc@0 {
@@ -756,6 +761,35 @@ frame@b128000 {
status = "disabled";
};
};
+
+   q6v5_wcss: remoteproc@cd0 {
+   compatible = "qcom,ipq9574-wcss-sec-pil";
+   reg = <0x0cd0 0x4040>;
+   firmware-name = "ath11k/IPQ9574/hw1.0/q6_fw.mdt";
+   interrupts-extended = <&intc GIC_SPI 325 IRQ_TYPE_EDGE_RISING>,
+ <&wcss_smp2p_in 0 IRQ_TYPE_NONE>,
+ <&wcss_smp2p_in 1 IRQ_TYPE_NONE>,
+ <&wcss_smp2p_in 2 IRQ_TYPE_NONE>,
+ <&wcss_smp2p_in 3 IRQ_TYPE_NONE>;
+   interrupt-names = "wdog",
+ "fatal",
+ "ready",
+ "handover",
+ "stop-ack";
+
+   qcom,smem-states = <&wcss_smp2p_out 0>,
+  <&wcss_smp2p_out 1>;
+   qcom,smem-state-names = "shutdown",
+   "stop";
+   memory-region = <&q6_region>;
+
+   glink-edge {
+   interrupts = ;
+   label = "rtr";
+   qcom,remote-pid = <1>;
+   mboxes = <&apcs_glb 8>;
+   };
+   };
};
 
thermal-zones {
@@ -987,4 +1021,28 @@ timer {
 ,
 ;
};
+
+   wcss: wcss-smp2p {
+   compatible = "qcom,smp2p";
+   qcom,smem = <435>, <428>;
+
+   interrupt-parent = <&intc>;
+   interrupts = ;
+
+   mboxes = <&apcs_glb 9>;
+
+   qcom,local-pid = <0>;
+   qcom,remote-pid = <1>;
+
+   wcss_smp2p_out: master-kernel {
+   qcom,entry-name = "master-kernel";
+   #qcom,smem-state-cells = <1>;
+   };
+
+   wcss_smp2p_in: slave-kernel {
+   qcom,entry-name = "slave-kernel";
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   };
+   };
 };
-- 
2.34.1




[PATCH V5 1/1] remoteproc: qcom: q6v5: Get crash reason from specific SMEM partition

2024-08-20 Thread Gokul Sriram Palanisamy
From: Vignesh Viswanathan 

The q6v5 fatal and watchdog IRQ handlers always retrieve the crash
reason information from the SMEM global partition (QCOM_SMEM_HOST_ANY).

For some targets, like IPQ9574 and IPQ5332, the crash reason information
is present in a target-specific partition, so the crash reason is not
printed by the current implementation.

Add support to pass crash_reason_partition along with the
crash_reason_item number in the qcom_q6v5_init() call, and use them to
get the crash information from SMEM in the fatal and watchdog IRQ
handlers.

While at it, rename all instances of "crash_reason_smem" to
"crash_reason_item", as this reflects the proper meaning.

Signed-off-by: Vignesh Viswanathan 
---

This change depends on WCSS secure PIL driver [1].

Changes in V5: Added changes to WCSS secure PIL driver.

Changes in V4: Rename all instances of crash_reason_smem to
crash_reason_item

Changes in V3: Updated commit message.

Changes in V2: Addressed comments in V1.

[1] https://patchwork.kernel.org/project/linux-arm-msm/cover/20240820085517.435566-1-quic_gokul...@quicinc.com/

 drivers/remoteproc/qcom_q6v5.c  | 10 ++--
 drivers/remoteproc/qcom_q6v5.h  |  6 ++-
 drivers/remoteproc/qcom_q6v5_adsp.c | 17 +++---
 drivers/remoteproc/qcom_q6v5_mss.c  |  5 +-
 drivers/remoteproc/qcom_q6v5_pas.c  | 69 +
 drivers/remoteproc/qcom_q6v5_wcss.c | 12 +++--
 drivers/remoteproc/qcom_q6v5_wcss_sec.c |  3 +-
 7 files changed, 66 insertions(+), 56 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
index 4ee5e67a9f03..4d801ba3c791 100644
--- a/drivers/remoteproc/qcom_q6v5.c
+++ b/drivers/remoteproc/qcom_q6v5.c
@@ -100,7 +100,7 @@ static irqreturn_t q6v5_wdog_interrupt(int irq, void *data)
return IRQ_HANDLED;
}
 
-   msg = qcom_smem_get(QCOM_SMEM_HOST_ANY, q6v5->crash_reason, &len);
+   msg = qcom_smem_get(q6v5->crash_reason_partition, q6v5->crash_reason_item, &len);
if (!IS_ERR(msg) && len > 0 && msg[0])
dev_err(q6v5->dev, "watchdog received: %s\n", msg);
else
@@ -121,7 +121,7 @@ static irqreturn_t q6v5_fatal_interrupt(int irq, void *data)
if (!q6v5->running)
return IRQ_HANDLED;
 
-   msg = qcom_smem_get(QCOM_SMEM_HOST_ANY, q6v5->crash_reason, &len);
+   msg = qcom_smem_get(q6v5->crash_reason_partition, q6v5->crash_reason_item, &len);
if (!IS_ERR(msg) && len > 0 && msg[0])
dev_err(q6v5->dev, "fatal error received: %s\n", msg);
else
@@ -244,14 +244,16 @@ EXPORT_SYMBOL_GPL(qcom_q6v5_panic);
  * Return: 0 on success, negative errno on failure
  */
 int qcom_q6v5_init(struct qcom_q6v5 *q6v5, struct platform_device *pdev,
-  struct rproc *rproc, int crash_reason, const char *load_state,
+  struct rproc *rproc, int crash_reason_partition,
+  int crash_reason_item, const char *load_state,
   void (*handover)(struct qcom_q6v5 *q6v5))
 {
int ret;
 
q6v5->rproc = rproc;
q6v5->dev = &pdev->dev;
-   q6v5->crash_reason = crash_reason;
+   q6v5->crash_reason_partition = crash_reason_partition;
+   q6v5->crash_reason_item = crash_reason_item;
q6v5->handover = handover;
 
init_completion(&q6v5->start_done);
diff --git a/drivers/remoteproc/qcom_q6v5.h b/drivers/remoteproc/qcom_q6v5.h
index 5a859c41896e..7cdfa21fd9e4 100644
--- a/drivers/remoteproc/qcom_q6v5.h
+++ b/drivers/remoteproc/qcom_q6v5.h
@@ -34,7 +34,8 @@ struct qcom_q6v5 {
struct completion start_done;
struct completion stop_done;
 
-   int crash_reason;
+   int crash_reason_partition;
+   int crash_reason_item;
 
bool running;
 
@@ -43,7 +44,8 @@ struct qcom_q6v5 {
 };
 
 int qcom_q6v5_init(struct qcom_q6v5 *q6v5, struct platform_device *pdev,
-  struct rproc *rproc, int crash_reason, const char *load_state,
+  struct rproc *rproc, int crash_reason_partition,
+  int crash_reason_item, const char *load_state,
   void (*handover)(struct qcom_q6v5 *q6v5));
 void qcom_q6v5_deinit(struct qcom_q6v5 *q6v5);
 
diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c 
b/drivers/remoteproc/qcom_q6v5_adsp.c
index 572dcb0f055b..174b0e213b12 100644
--- a/drivers/remoteproc/qcom_q6v5_adsp.c
+++ b/drivers/remoteproc/qcom_q6v5_adsp.c
@@ -60,7 +60,7 @@
 #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0
 
 struct adsp_pil_data {
-   int crash_reason_smem;
+   int crash_reason_item;
const char *firmware_name;
 
const char *ssr_name;
@@ -97,7 +97,7 @@ struct qcom_adsp {
struct regmap *halt_map;
unsigned int halt_lpass;
 
-   int crash_reason_smem;
+   int crash_reason_item;
const char *info_name;
 
struct completion start_done;
@@ -721,8 +721,9 @@ static int adsp_probe(struct platform_device *pdev)
if (ret)
g

Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown

2024-08-20 Thread Beleswar Prasad Padhi

Hi Jan,

On 19-08-2024 22:17, Jan Kiszka wrote:

From: Jan Kiszka 

When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
first. When core 0 should then be stopped before its removal, it will
find core1->rproc as NULL already and crashes. Happens on rmmod e.g.



Did you check this on top of -next-20240820 tag? There was a series[0] 
which was merged recently which fixed this condition. I don't see this 
issue when trying on top of -next-20240820 tag.

[0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/



Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power up before core0 via sysfs")
CC: sta...@vger.kernel.org
Signed-off-by: Jan Kiszka 
---

There might be one more because I can still make this driver crash
after an operator error. Were error scenarios tested at all?



Can you point out what is this issue more specifically, and I can take 
this up then.




  drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c 
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index eb09d2e9b32a..9ebd7a34e638 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
/* do not allow core 0 to stop before core 1 */
core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
elem);
-   if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
+   if (core != core1 && core1->rproc &&
+   core1->rproc->state != RPROC_OFFLINE) {
dev_err(dev, "%s: can not stop core 0 before core 1\n",
__func__);
ret = -EPERM;




Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown

2024-08-20 Thread Jan Kiszka
On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
> Hi Jan,
> 
> On 19-08-2024 22:17, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>> first. When core 0 should then be stopped before its removal, it will
>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
> 
> 
> Did you check this on top of -next-20240820 tag? There was a series[0]
> which was merged recently which fixed this condition. I don't see this
> issue when trying on top of -next-20240820 tag.
> [0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/
> 

I didn't try those yet, I was on 6.11-rcX. But from reading them
quickly, I'm not seeing the two issues I found directly addressed there.

>>
>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>> up before core0 via sysfs")
>> CC: sta...@vger.kernel.org
>> Signed-off-by: Jan Kiszka 
>> ---
>>
>> There might be one more because I can still make this driver crash
>> after an operator error. Were error scenarios tested at all?
> 
> 
> Can you point out what is this issue more specifically, and I can take
> this up then.

Try starting core1 before core0, and then again - system will hang or
crash while trying to wipe ATCM. I do not understand this problem yet -
same VA is used, and I cannot see where/how the region should have been
unmapped in between.

Jan

> 
>>
>>   drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> b/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> index eb09d2e9b32a..9ebd7a34e638 100644
>> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
>> @@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
>>   /* do not allow core 0 to stop before core 1 */
>>   core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
>>   elem);
>> -    if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
>> +    if (core != core1 && core1->rproc &&
>> +    core1->rproc->state != RPROC_OFFLINE) {
>>   dev_err(dev, "%s: can not stop core 0 before core 1\n",
>>   __func__);
>>   ret = -EPERM;

-- 
Siemens AG, Technology
Linux Expert Center




Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown

2024-08-20 Thread Beleswar Prasad Padhi



On 20-08-2024 15:09, Jan Kiszka wrote:

On 20.08.24 11:30, Beleswar Prasad Padhi wrote:

Hi Jan,

On 19-08-2024 22:17, Jan Kiszka wrote:

From: Jan Kiszka 

When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
first. When core 0 should then be stopped before its removal, it will
find core1->rproc as NULL already and crashes. Happens on rmmod e.g.


Did you check this on top of -next-20240820 tag? There was a series[0]
which was merged recently which fixed this condition. I don't see this
issue when trying on top of -next-20240820 tag.
[0]: https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/


I didn't try those yet, I was on 6.11-rcX. But from reading them
quickly, I'm not seeing the two issues I found directly addressed there.


Check the comment by Andrew Davis[0], that addresses the above issue.

[0]: https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/





Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
up before core0 via sysfs")
CC: sta...@vger.kernel.org
Signed-off-by: Jan Kiszka 
---

There might be one more because I can still make this driver crash
after an operator error. Were error scenarios tested at all?


Can you point out what is this issue more specifically, and I can take
this up then.

Try starting core1 before core0, and then again - system will hang or
If you are trying to stop and then start the cores from sysfs, that is 
not yet supported. The hang is thus expected.

crash while trying to wipe ATCM. I do not understand this problem yet -
same VA is used, and I cannot see where/how the region should have been
unmapped in between.

Jan


   drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c
b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index eb09d2e9b32a..9ebd7a34e638 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -646,7 +646,8 @@ static int k3_r5_rproc_stop(struct rproc *rproc)
   /* do not allow core 0 to stop before core 1 */
   core1 = list_last_entry(&cluster->cores, struct k3_r5_core,
   elem);
-    if (core != core1 && core1->rproc->state != RPROC_OFFLINE) {
+    if (core != core1 && core1->rproc &&
+    core1->rproc->state != RPROC_OFFLINE) {
   dev_err(dev, "%s: can not stop core 0 before core 1\n",
   __func__);
   ret = -EPERM;




Re: [PATCH RFC v2] mmc/virtio: Add virtio MMC driver for QEMU emulation

2024-08-20 Thread Ulf Hansson
On Sun, 7 Jul 2024 at 18:06, Mikhail Krasheninnikov
 wrote:
>
> Introduce a new virtio MMC driver to enable virtio SD/MMC card
> emulation with QEMU. This driver allows emulating MMC cards in
> virtual environments, enhancing functionality and testing
> capabilities within QEMU.
>
> Link to the QEMU patch:
> https://lists.nongnu.org/archive/html/qemu-block/2024-07/msg00221.html
>
> No changes to existing dependencies or documentation.
>
> Signed-off-by: Mikhail Krasheninnikov 
> CC: Matwey Kornilov 
> CC: Ulf Hansson 
> CC: linux-...@vger.kernel.org
> CC: "Michael S. Tsirkin" 
> CC: Jason Wang 
> CC: Paolo Bonzini 
> CC: Stefan Hajnoczi 
> CC: Xuan Zhuo 
> CC: virtualizat...@lists.linux-foundation.org

Sorry for the delay. Can you please re-spin/post this to add Adrian
Hunter, the sdhci maintainer.

Moreover, it would be nice to get someone from the virtio folks to review
this too. My comments at this point are mostly nitpicks. Please see
below.

>
> ---
> Changes from v1:
>  - Add MAINTAINERS entry
>  - Refactor includes
>  - Change CPU endian format to little endian for device communication
>  - Move structs that belonged to uapi
>  - Validate multiple fields
>  - Introduce mutexes for safe request handling
>  - Call virtio_device_ready before adding host
>  - Fix removal of the device
>
>  MAINTAINERS   |  15 ++
>  drivers/mmc/host/Kconfig  |  14 ++
>  drivers/mmc/host/Makefile |   2 +
>  drivers/mmc/host/virtio-sdhci.c   | 258 ++
>  drivers/mmc/host/virtio-sdhci.h   |  40 +
>  include/uapi/linux/virtio-sdhci.h |  39 +

Please rename to sdhci-virtio.* and update the corresponding prefix of
the function names accordingly.

>  include/uapi/linux/virtio_ids.h   |   1 +
>  7 files changed, 369 insertions(+)
>  create mode 100644 drivers/mmc/host/virtio-sdhci.c
>  create mode 100644 drivers/mmc/host/virtio-sdhci.h
>  create mode 100644 include/uapi/linux/virtio-sdhci.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dd5de540ec0b..be86156cd66c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22907,6 +22907,21 @@ S: Maintained
>  F: drivers/nvdimm/nd_virtio.c
>  F: drivers/nvdimm/virtio_pmem.c
>
> +VIRTIO SDHCI DRIVER
> +M: Mikhail Krasheninnikov 
> +M: "Michael S. Tsirkin" 
> +M: Jason Wang 
> +M: Paolo Bonzini 
> +M: Stefan Hajnoczi 
> +M: Xuan Zhuo 
> +L: virtualizat...@lists.linux-foundation.org
> +L: linux-...@vger.kernel.org
> +L: oasis-vir...@connectedcommunity.org
> +S: Maintained
> +F: drivers/mmc/host/virtio-sdhci.c
> +F: drivers/mmc/host/virtio-sdhci.h
> +F: include/uapi/linux/virtio-sdhci.h
> +
>  VIRTIO SOUND DRIVER
>  M: Anton Yakovlev 
>  M: "Michael S. Tsirkin" 
> diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
> index 554e67103c1a..d62d669ee117 100644
> --- a/drivers/mmc/host/Kconfig
> +++ b/drivers/mmc/host/Kconfig
> @@ -1069,3 +1069,17 @@ config MMC_LITEX
>   module will be called litex_mmc.
>
>   If unsure, say N.
> +
> +config SDHCI_VIRTIO
> +   tristate "VirtIO SDHCI Host Controller support"
> +   depends on VIRTIO
> +   help
> + This enables support for the Virtio SDHCI driver, which allows the
> + kernel to interact with SD/MMC devices over Virtio. Virtio is a
> + virtualization standard for network and disk device drivers,
> + providing a common API for virtualized environments.
> +
> + Enable this option if you are running the kernel in a virtualized
> + environment and need SD/MMC support via Virtio.
> +
> + If unsure, say N.
> \ No newline at end of file
> diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
> index a693fa3d3f1c..f9b05a07c6db 100644
> --- a/drivers/mmc/host/Makefile
> +++ b/drivers/mmc/host/Makefile
> @@ -108,3 +108,5 @@ endif
>
>  obj-$(CONFIG_MMC_SDHCI_XENON)  += sdhci-xenon-driver.o
>  sdhci-xenon-driver-y   += sdhci-xenon.o sdhci-xenon-phy.o
> +
> +obj-$(CONFIG_SDHCI_VIRTIO) += virtio-sdhci.o
> \ No newline at end of file
> diff --git a/drivers/mmc/host/virtio-sdhci.c b/drivers/mmc/host/virtio-sdhci.c
> new file mode 100644
> index ..1a637ab5e010
> --- /dev/null
> +++ b/drivers/mmc/host/virtio-sdhci.c
> @@ -0,0 +1,258 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + *  VirtIO SD/MMC driver
> + *
> + *  Author: Mikhail Krasheninnikov 
> + */
> +
> +#include "virtio-sdhci.h"
> +#include "linux/mmc/host.h"
> +#include "linux/virtio.h"
> +#include "linux/virtio_config.h"
> +#include "linux/completion.h"
> +#include "uapi/linux/virtio-sdhci.h"
> +
> +struct virtio_sdhci_host {
> +   struct virtio_device *vdev;
> +   struct mmc_host *mmc;
> +   struct virtqueue *vq;
> +   struct mmc_request *current_request;
> +
> +   struct virtio_mmc_request virtio_request;
> +   struct virtio_mmc_response virtio_response;
> +
> +   struct completion request_han

Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Mark Rutland
On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote:
> On Mon, 19 Aug 2024 12:02:44 -0400
> Steven Rostedt  wrote:
> 
> > On Tue, 20 Aug 2024 00:56:49 +0900
> > Masami Hiramatsu (Google)  wrote:
> > > 
> > > > 
> > > > We may need to add "noinline" or something to make sure those functions
> > > > don't get inlined for LTO.  
> > > 
> > > Yeah, we need such option at least for function call test.
> > 
> > Could you add the noinline, and if it fixes the issue send a patch?
> 
> I found the target function already has "noinline". I tried to add noinline
> to the testing function (callsite), but it also did not work.
> I think "noinline" is for the compiler, but LTO is done by the linker.

If LTO is breaking noinline, then that has much larger implications for
noinstr code and similar, and means that LTO is unsound...

> I found a discussion similar to this problem, but it seems very hacky.
> 
> https://groups.google.com/g/llvm-dev/c/6baI9LISmSU/m/uEeY_CRbBQAJ?pli=1

Thanks for the link!

Mark.



[PATCH v2] remoteproc: k3-r5: Delay notification of wakeup event

2024-08-20 Thread Beleswar Padhi
From: Udit Kumar 

Occasionally, core1 was scheduled to boot before core0, which leads to
the error:

'k3_r5_rproc_start: can not start core 1 before core 0'.

This was happening due to scheduling between the prepare and start
callbacks. The probe function waits for an event that is triggered by
the prepare callback. To avoid the above condition, trigger the event
from the start callback instead of the prepare callback.

Fixes: 61f6f68447ab ("remoteproc: k3-r5: Wait for core0 power-up before powering up core1")
Signed-off-by: Udit Kumar 
[ Applied wakeup event trigger only for Split-Mode booted rprocs ]
Signed-off-by: Beleswar Padhi 
---
v2: Changelog:
* Mathieu
1) Rebased changes on top of -next-20240820 tag.

Link to v1:
https://lore.kernel.org/all/20240809060132.308642-1-b-pa...@ti.com/

 drivers/remoteproc/ti_k3_r5_remoteproc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
index 8a63a9360c0f..e61e53381abc 100644
--- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
+++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
@@ -469,8 +469,6 @@ static int k3_r5_rproc_prepare(struct rproc *rproc)
ret);
return ret;
}
-   core->released_from_reset = true;
-   wake_up_interruptible(&cluster->core_transition);
 
/*
 * Newer IP revisions like on J7200 SoCs support h/w auto-initialization
@@ -587,6 +585,9 @@ static int k3_r5_rproc_start(struct rproc *rproc)
ret = k3_r5_core_run(core);
if (ret)
return ret;
+
+   core->released_from_reset = true;
+   wake_up_interruptible(&cluster->core_transition);
}
 
return 0;
-- 
2.34.1




Re: [PATCH 0/4] Add new driver for WCSS secure PIL loading

2024-08-20 Thread Krzysztof Kozlowski
On 20/08/2024 10:55, Gokul Sriram Palanisamy wrote:
> This series depends on q6 clock removal series [1].

How? So this cannot be tested and merged?

That's the second patchset today with some totally bogus dependencies.
People, stop it.


Best regards,
Krzysztof




Re: [PATCH 1/2] dt-bindings: remoteproc: qcom: document hexagon based WCSS secure PIL

2024-08-20 Thread Krzysztof Kozlowski
On Tue, Aug 20, 2024 at 02:25:14PM +0530, Gokul Sriram Palanisamy wrote:
> From: Manikanta Mylavarapu 
> 
> Add new binding document for hexagon based WCSS secure PIL remoteproc.
> IPQ5332, IPQ9574 follows secure PIL remoteproc.
> 
> Signed-off-by: Manikanta Mylavarapu 
> Signed-off-by: Gokul Sriram Palanisamy 
> ---
>  .../remoteproc/qcom,wcss-sec-pil.yaml | 125 ++
>  1 file changed, 125 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
> 
> diff --git a/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
> new file mode 100644
> index ..c69401b6cec1
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/remoteproc/qcom,wcss-sec-pil.yaml
> @@ -0,0 +1,125 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/remoteproc/qcom,wcss-sec-pil.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: Qualcomm WCSS Secure Peripheral Image Loader

...

> +
> +maintainers:
> +  - Manikanta Mylavarapu 
> +
> +description:
> +  WCSS Secure Peripheral Image Loader loads firmware and power up QDSP6

What is WCSS? Don't add random acronyms without any explanation.

> +  remoteproc's on the Qualcomm IPQ9574, IPQ5332 SoC.
> +
> +properties:
> +  compatible:
> +enum:
> +  - qcom,ipq5332-wcss-sec-pil
> +  - qcom,ipq9574-wcss-sec-pil
> +
> +  reg:
> +maxItems: 1
> +
> +  firmware-name:
> +$ref: /schemas/types.yaml#/definitions/string
> +description: Firmware name for the Hexagon core

No, look how other bindings are doing it.

It looks like you develop patches on some old kernel, so please start
from scratch on newest kernel.

> +
> +  interrupts:
> +items:
> +  - description: Watchdog interrupt
> +  - description: Fatal interrupt
> +  - description: Ready interrupt
> +  - description: Handover interrupt
> +  - description: Stop acknowledge interrupt
> +
> +  interrupt-names:
> +items:
> +  - const: wdog
> +  - const: fatal
> +  - const: ready
> +  - const: handover
> +  - const: stop-ack
> +
> +  clocks:
> +items:
> +  - description: IM SLEEP clock

What is IM? Explain all acronyms.

What is SLEEP?

> +
> +  clock-names:
> +items:
> +  - const: im_sleep

sleep? Are there different sleep clocks here?

> +
> +  qcom,smem-states:
> +$ref: /schemas/types.yaml#/definitions/phandle-array
> +description: States used by the AP to signal the remote processor
> +items:
> +  - description: Shutdown Q6
> +  - description: Stop Q6
> +

Do not introduce other order. First stop, then shutdown.

> +  qcom,smem-state-names:
> +description:
> +  Names of the states used by the AP to signal the remote processor
> +items:
> +  - const: shutdown
> +  - const: stop

The same.

> +
> +  memory-region:
> +items:
> +  - description: Q6 reserved region
> +
> +  glink-edge:
> +$ref: /schemas/remoteproc/qcom,glink-edge.yaml#
> +description:
> +  Qualcomm G-Link subnode which represents communication edge, channels
> +  and devices related to the Modem.
> +unevaluatedProperties: false
> +
> +required:
> +  - compatible
> +  - firmware-name
> +  - reg
> +  - interrupts
> +  - interrupt-names
> +  - qcom,smem-states
> +  - qcom,smem-state-names
> +  - memory-region

Keep the same order as in properties.

> +
> +additionalProperties: false
> +
> +examples:
> +  - |
> +#include 
> +#include 
> +q6v5_wcss: remoteproc@d10 {

Drop unused label.

> +  compatible = "qcom,ipq5332-wcss-sec-pil";
> +  reg = <0xd10 0x4040>;
> +  firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt";
> +  interrupts-extended = <&intc GIC_SPI 291 IRQ_TYPE_EDGE_RISING>,
> +<&wcss_smp2p_in 0 IRQ_TYPE_NONE>,
> +<&wcss_smp2p_in 1 IRQ_TYPE_NONE>,

Best regards,
Krzysztof
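
Taking the state-ordering and firmware-name comments together, the affected properties would be reshaped roughly as below — a sketch following conventions in existing remoteproc bindings, not a validated schema:

```yaml
  firmware-name:
    maxItems: 1
    description: Firmware name for the Hexagon core

  qcom,smem-states:
    $ref: /schemas/types.yaml#/definitions/phandle-array
    description: States used by the AP to signal the remote processor
    items:
      - description: Stop Q6
      - description: Shutdown Q6

  qcom,smem-state-names:
    description:
      Names of the states used by the AP to signal the remote processor
    items:
      - const: stop
      - const: shutdown
```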




Re: [PATCH 3/4] arm64: dts: qcom: ipq5332: add nodes to bringup q6

2024-08-20 Thread Krzysztof Kozlowski
On Tue, Aug 20, 2024 at 02:25:16PM +0530, Gokul Sriram Palanisamy wrote:
> From: Manikanta Mylavarapu 
> 
> Enable nodes required for q6 remoteproc bring up.
> 
> Signed-off-by: Manikanta Mylavarapu 
> Signed-off-by: Gokul Sriram Palanisamy 
> ---
>  arch/arm64/boot/dts/qcom/ipq5332.dtsi | 62 +++
>  1 file changed, 62 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/ipq5332.dtsi b/arch/arm64/boot/dts/qcom/ipq5332.dtsi
> index 0a74ed4f72cc..ec93e7b64b9e 100644
> --- a/arch/arm64/boot/dts/qcom/ipq5332.dtsi
> +++ b/arch/arm64/boot/dts/qcom/ipq5332.dtsi
> @@ -145,6 +145,11 @@ smem@4a80 {
>  
>   hwlocks = <&tcsr_mutex 3>;
>   };
> +
> + q6_region: wcnss@4a90 {

Why here it is wcnss...

> + reg = <0x0 0x4a90 0x0 0x2b0>;
> + no-map;
> + };
>   };
>  
>   soc@0 {
> @@ -476,6 +481,39 @@ frame@b128000 {
>   status = "disabled";
>   };
>   };
> +
> + q6v5_wcss: remoteproc@d10 {

but everywhere else is wcss?

> + compatible = "qcom,ipq5332-wcss-sec-pil";
> + reg = <0xd10 0x4040>;
> + firmware-name = "ath12k/IPQ5332/hw1.0/q6_fw0.mdt";

It's one firmware independent of board?

Best regards,
Krzysztof




Re: [PATCH 2/2] remoteproc: qcom: add hexagon based WCSS secure PIL driver

2024-08-20 Thread Krzysztof Kozlowski
On Tue, Aug 20, 2024 at 02:25:15PM +0530, Gokul Sriram Palanisamy wrote:
> From: Vignesh Viswanathan 
> 
> Add support to bring up hexagon based WCSS secure PIL remoteproc.
> IPQ5332, IPQ9574 supports secure PIL remoteproc.
> 
> Signed-off-by: Vignesh Viswanathan 
> Signed-off-by: Manikanta Mylavarapu 
> Signed-off-by: Gokul Sriram Palanisamy 
> +static int wcss_sec_dump_segments(struct rproc *rproc,
> +   const struct firmware *fw)
> +{
> + struct device *dev = rproc->dev.parent;
> + struct reserved_mem *rmem = NULL;
> + struct device_node *node;
> + int num_segs, index = 0;
> + int ret;
> +
> + /* Parse through additional reserved memory regions for the rproc
> +  * and add them to the coredump segments
> +  */
> + num_segs = of_count_phandle_with_args(dev->of_node,
> +   "memory-region", NULL);
> + while (index < num_segs) {
> + node = of_parse_phandle(dev->of_node,
> + "memory-region", index);
> + if (!node)
> + return -EINVAL;
> +
> + rmem = of_reserved_mem_lookup(node);
> + if (!rmem) {
> + dev_err(dev, "unable to acquire memory-region index %d num_segs %d\n",
> + index, num_segs);

Leaking refcnt.

> + return -EINVAL;
> + }
> +
> + of_node_put(node);
> +
> + dev_dbg(dev, "Adding segment 0x%pa size 0x%pa",
> + &rmem->base, &rmem->size);
> + ret = rproc_coredump_add_custom_segment(rproc,
> + rmem->base,
> + rmem->size,
> + wcss_sec_copy_segment,
> + NULL);
> + if (ret)
> + return ret;
> +
> + index++;
> + }
> +
> + return 0;
> +}
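
On the refcount leak flagged above: the early return skips the of_node_put() that balances of_parse_phandle(). A sketch of the corrected error path (kernel fragment, not compiled here):

```c
		rmem = of_reserved_mem_lookup(node);
		if (!rmem) {
			dev_err(dev, "unable to acquire memory-region index %d num_segs %d\n",
				index, num_segs);
			of_node_put(node);	/* drop the ref from of_parse_phandle() */
			return -EINVAL;
		}

		of_node_put(node);
```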
> +
> +static const struct rproc_ops wcss_sec_ops = {
> + .start = wcss_sec_start,
> + .stop = wcss_sec_stop,
> + .da_to_va = wcss_sec_da_to_va,
> + .load = wcss_sec_load,
> + .get_boot_addr = rproc_elf_get_boot_addr,
> + .panic = wcss_sec_panic,
> + .parse_fw = wcss_sec_dump_segments,
> +};
> +
> +static int wcss_sec_alloc_memory_region(struct wcss_sec *wcss)
> +{
> + struct reserved_mem *rmem = NULL;
> + struct device_node *node;
> + struct device *dev = wcss->dev;
> +
> + node = of_parse_phandle(dev->of_node, "memory-region", 0);
> + if (node) {
> + rmem = of_reserved_mem_lookup(node);
> + } else {

No, that's over complicated.

Just if (!node) { error handling }.

> + dev_err(dev, "can't find phandle memory-region\n");
> + return -EINVAL;
> + }
> +
> + of_node_put(node);
> +
> + if (!rmem) {
> + dev_err(dev, "unable to acquire memory-region\n");
> + return -EINVAL;
> + }
> +
> + wcss->mem_phys = rmem->base;
> + wcss->mem_reloc = rmem->base;
> + wcss->mem_size = rmem->size;
> + wcss->mem_region = devm_ioremap_wc(dev, wcss->mem_phys, wcss->mem_size);
> + if (!wcss->mem_region) {
> + dev_err(dev, "unable to map memory region: %pa+%pa\n",
> + &rmem->base, &rmem->size);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +

...

> +static int wcss_sec_ipq5332_init_clk(struct wcss_sec *wcss)
> +{
> + int ret;
> + struct device *dev = wcss->dev;
> +
> + wcss->im_sleep = devm_clk_get(wcss->dev, "im_sleep");
> + if (IS_ERR(wcss->im_sleep)) {
> + ret = PTR_ERR(wcss->im_sleep);
> + if (ret != -EPROBE_DEFER)
> + dev_err(dev, "failed to get im_sleep clock");

Syntax is return dev_err_probe.

> + return ret;
> + }
> +
> + ret = clk_prepare_enable(wcss->im_sleep);
> + if (ret) {
> + dev_err(dev, "could not enable im_sleep clk\n");
> + return ret;

Just use devm_clk_get_enabled.

> + }
> +
> + return 0;

Best regards,
Krzysztof
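
A sketch combining the two suggestions above — devm_clk_get_enabled() to fold get+enable into one managed call, and dev_err_probe() for the error path (kernel fragment, assuming the clock has no other users in the driver):

```c
static int wcss_sec_ipq5332_init_clk(struct wcss_sec *wcss)
{
	wcss->im_sleep = devm_clk_get_enabled(wcss->dev, "im_sleep");
	if (IS_ERR(wcss->im_sleep))
		return dev_err_probe(wcss->dev, PTR_ERR(wcss->im_sleep),
				     "failed to get im_sleep clock\n");

	return 0;
}
```

dev_err_probe() already suppresses the message for -EPROBE_DEFER, so the explicit check in the original can go away entirely.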




[PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release

2024-08-20 Thread tglozar
From: Tomas Glozar 

When running timerlat with a userspace workload (NO_OSNOISE_WORKLOAD),
NULL pointer dereference can be triggered by sending consequent SIGINT
and SIGTERM signals to the workload process. That then causes
timerlat_fd_release to be called twice in a row, and the second time,
hrtimer_cancel is called on a zeroed hrtimer struct, causing the NULL
dereference.

This can be reproduced using rtla:
```
$ while true; do rtla timerlat top -u -q & PID=$!; sleep 5; \
 kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done
[1] 1675
[1]+  Aborted (SIGTERM)  rtla timerlat top -u -q
[1] 1688
client_loop: send disconnect: Broken pipe
```
triggering the bug:
```
BUG: kernel NULL pointer dereference, address: 0010
Oops: Oops:  [#1] PREEMPT SMP NOPTI
CPU: 6 PID: 1679 Comm: timerlatu/6 Kdump: loaded Not tainted
6.10.0-rc2+ #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
RIP: 0010:hrtimer_active+0xd/0x50
RSP: 0018:a86641567cc0 EFLAGS: 00010286
RAX: 0002e2c0 RBX: 994c6bf2e2c8 RCX: 994b0911ac18
RDX:  RSI: 994b02f10700 RDI: 994c6bf2e2c8
RBP: 994c6bf2e340 R08: 994b158f7400 R09: 994b0911ac18
R10: 0010 R11: 994b00d40f00 R12: 994c6bf2e2c8
R13: 994b02049b20 R14: 994b011806c0 R15: 
FS:  () GS:994c6bf0()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0010 CR3: 000139020006 CR4: 00770ef0
PKRU: 5554
Call Trace:
  
  ? __die+0x24/0x70
  ? page_fault_oops+0x75/0x170
  ? mt_destroy_walk.isra.0+0x2b3/0x320
  ? exc_page_fault+0x70/0x160
  ? asm_exc_page_fault+0x26/0x30
  ? hrtimer_active+0xd/0x50
  hrtimer_cancel+0x15/0x40
  timerlat_fd_release+0x48/0xe0
  __fput+0xed/0x2c0
  task_work_run+0x59/0x90
  do_exit+0x275/0x4b0
  do_group_exit+0x30/0x80
  get_signal+0x917/0x960
  ? vfs_read+0xb7/0x340
  arch_do_signal_or_restart+0x29/0xf0
  ? syscall_exit_to_user_mode+0x70/0x1f0
  ? syscall_exit_work+0xf3/0x120
  syscall_exit_to_user_mode+0x1a0/0x1f0
  do_syscall_64+0x89/0x160
  ? clear_bhb_loop+0x25/0x80
  ? clear_bhb_loop+0x25/0x80
  ? clear_bhb_loop+0x25/0x80
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f75790fd9ec
...
  
```

Fix the NULL pointer dereference by checking tlat_var->kthread for zero
first in timerlat_fd_release, before calling hrtimer_cancel.
tlat_var->kthread is always non-zero unless the entire tlat_var is zero,
since it is set to the TID of the userspace workload in timerlat_fd_open
under a mutex.

Cc: sta...@vger.kernel.org
Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface")
Co-developed-by: Luis Claudio R. Goncalves 
Signed-off-by: Luis Claudio R. Goncalves 
Signed-off-by: Tomas Glozar 
---
 kernel/trace/trace_osnoise.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a..6d2b39da4dce 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -2572,6 +2572,7 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
struct osnoise_variables *osn_var;
struct timerlat_variables *tlat_var;
long cpu = (long) file->private_data;
+   int ret = 0;
 
migrate_disable();
mutex_lock(&interface_lock);
@@ -2579,6 +2580,12 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
 
+   if (!tlat_var->kthread) {
+   /* the fd has been closed already */
+   ret = -EBADF;
+   goto out;
+   }
+
hrtimer_cancel(&tlat_var->timer);
memset(tlat_var, 0, sizeof(*tlat_var));
 
@@ -2593,9 +2600,10 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
osn_var->kthread = NULL;
}
 
+out:
mutex_unlock(&interface_lock);
migrate_enable();
-   return 0;
+   return ret;
 }
 #endif
 
-- 
2.46.0




[PATCH 1/1] uprobe: fix comment of uprobe_apply()

2024-08-20 Thread Zhen Lei
Depending on the argument 'add', uprobe_apply() may be registering or
unregistering a probe. The current comment misses the description of the
registration.

Signed-off-by: Zhen Lei 
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 73cc47708679f0c..c9de255e56e777f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1201,8 +1201,8 @@ int uprobe_register_refctr(struct inode *inode, loff_t offset,
 EXPORT_SYMBOL_GPL(uprobe_register_refctr);
 
 /*
- * uprobe_apply - unregister an already registered probe.
- * @inode: the file in which the probe has to be removed.
+ * uprobe_apply - register a probe or unregister an already registered probe.
+ * @inode: the file in which the probe has to be placed or removed.
  * @offset: offset from the start of the file.
  * @uc: consumer which wants to add more or remove some breakpoints
  * @add: add or remove the breakpoints
-- 
2.34.1




Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Steven Rostedt
On Tue, 20 Aug 2024 11:48:07 +0100
Mark Rutland  wrote:

> > I found the target function already has "noinline". I tried to add noinline
> > to the testing function (callsite), but it also did not work.
> > I think "noinline" is for the compiler, but LTO is done by the linker.  
> 
> If LTO is breaking noinline, then that has much larger implications for
> noinstr code and similar, and means that LTO is unsound...

Hmm, doesn't noinstr place the code in a separate section?

I wonder if we create a separate section for the test function that LTO is
inlining, if it will prevent it from being inlined. That is, noinline tells
the compiler not to inline, but LTO happens after the compiler is done and
may inline functions in the same section. But the linker does see separate
sections and I don't think it will try to inline that code into another
section. But I could be wrong. ;-)

-- Steve



Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown

2024-08-20 Thread Jan Kiszka
On 20.08.24 11:48, Beleswar Prasad Padhi wrote:
> 
> On 20-08-2024 15:09, Jan Kiszka wrote:
>> On 20.08.24 11:30, Beleswar Prasad Padhi wrote:
>>> Hi Jan,
>>>
>>> On 19-08-2024 22:17, Jan Kiszka wrote:
>>>> From: Jan Kiszka 
>>>>
>>>> When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
>>>> first. When core 0 should then be stopped before its removal, it will
>>>> find core1->rproc as NULL already and crashes. Happens on rmmod e.g.
>>>
>>> Did you check this on top of -next-20240820 tag? There was a series[0]
>>> which was merged recently which fixed this condition. I don't see this
>>> issue when trying on top of -next-20240820 tag.
>>> [0]:
>>> https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/
>>>
>> I didn't try those yet, I was on 6.11-rcX. But from reading them
>> quickly, I'm not seeing the two issues I found directly addressed there.
> 
> Check the comment by Andrew Davis[0], that addresses the above issue.
> 
> [0]:
> https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/
> 

OK, then someone still needs to update his patch accordingly.

>>
>>>> Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
>>>> up before core0 via sysfs")
>>>> CC: sta...@vger.kernel.org
>>>> Signed-off-by: Jan Kiszka 
>>>> ---
>>>>
>>>> There might be one more because I can still make this driver crash
>>>> after an operator error. Were error scenarios tested at all?
>>>
>>> Can you point out what is this issue more specifically, and I can take
>>> this up then.
>> Try starting core1 before core0, and then again - system will hang or
> If you are trying to stop and then start the cores from sysfs, that is
> not yet supported. The hang is thus expected.

What? Then the driver is broken, even more. Why wasn't it fully implemented?

Jan

-- 
Siemens AG, Technology
Linux Expert Center




Re: [PATCH] virtio_pmem: Check device status before requesting flush

2024-08-20 Thread Ira Weiny
Philip Chen wrote:
> On Mon, Aug 19, 2024 at 2:56 PM Ira Weiny  wrote:
> >
> > Philip Chen wrote:
> > > If a pmem device is in a bad status, the driver side could wait for
> > > host ack forever in virtio_pmem_flush(), causing the system to hang.
> >
> > I assume this was supposed to be v2 and you resent this as a proper v2
> > with a change list from v1?
> Ah...yes, I'll fix it and re-send it as a v2 patch.

Wait didn't you already do that?  Wasn't this v2?

https://lore.kernel.org/all/20240815010337.2334245-1-philipc...@chromium.org/

Ira



Re: [PATCH 1/1] uprobe: fix comment of uprobe_apply()

2024-08-20 Thread Oleg Nesterov
On 08/20, Zhen Lei wrote:
>
> Depending on the argument 'add', uprobe_apply() may be registering or
> unregistering a probe.

...

>  /*
> - * uprobe_apply - unregister an already registered probe.
> - * @inode: the file in which the probe has to be removed.
> + * uprobe_apply - register a probe or unregister an already registered probe.

Not really.

See the commit 3c83a9ad0295eb63bd ("uprobes: make uprobe_register() return 
struct uprobe *")
in tip/perf/core which changed this description

* uprobe_apply - add or remove the breakpoints according to @uc->filter

still looks confusing, yes...

Oleg.




Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown

2024-08-20 Thread Beleswar Prasad Padhi



On 20-08-2024 19:50, Jan Kiszka wrote:

On 20.08.24 11:48, Beleswar Prasad Padhi wrote:

On 20-08-2024 15:09, Jan Kiszka wrote:

On 20.08.24 11:30, Beleswar Prasad Padhi wrote:

Hi Jan,

On 19-08-2024 22:17, Jan Kiszka wrote:

From: Jan Kiszka 

When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
first. When core 0 should then be stopped before its removal, it will
find core1->rproc as NULL already and crashes. Happens on rmmod e.g.

Did you check this on top of -next-20240820 tag? There was a series[0]
which was merged recently which fixed this condition. I don't see this
issue when trying on top of -next-20240820 tag.
[0]:
https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/


I didn't try those yet, I was on 6.11-rcX. But from reading them
quickly, I'm not seeing the two issues I found directly addressed there.

Check the comment by Andrew Davis[0], that addresses the above issue.

[0]:
https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/


OK, then someone still needs to update his patch accordingly.

That comment was addressed in the v4 series revision[1] and was merged 
to linux-next, available with tag -next-20240820. Request you to please 
check if the issue persists with -next-20240820 tag. I checked myself, 
and was not able to reproduce.

[1]: https://lore.kernel.org/all/Zr9nbWnADDB+ZOlg@p14s/



Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
up before core0 via sysfs")
CC: sta...@vger.kernel.org
Signed-off-by: Jan Kiszka 
---

There might be one more because I can still make this driver crash
after an operator error. Were error scenarios tested at all?

Can you point out what is this issue more specifically, and I can take
this up then.

Try starting core1 before core0, and then again - system will hang or

If you are trying to stop and then start the cores from sysfs, that is
not yet supported. The hang is thus expected.

What? Then the driver is broken, even more. Why wasn't it fully implemented?

The driver is capable of starting a core and stopping it all well. The 
problem is, when we stop a core from sysfs (without resetting the SoC 
itself), the remotecore is powered off, but its resources are not 
relinquished. So when we start it back, there could be some memory 
corruptions. This feature of "graceful shutdown" of remotecores is 
almost implemented and will be posted to this driver soon. Request you 
to try out after that.


With graceful shutdown feature, this will be the flow:
1. We issue a core stop operation from sysfs.
2. The remoteproc driver sends a special "SHUTDOWN" mailbox message to 
the remotecore.
3. The remotecore relinquishes all of its acquired resources through 
Device Manager Firmware and sends an ACK back.

4. The remotecore enters WFI state and then is reset through the Host core.
5. Then, if we try to do the core start operation from sysfs, core 
should be up as expected.


Thanks,
Beleswar


Jan





Re: [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime (with timeout)

2024-08-20 Thread Oleg Nesterov
On 08/19, Andrii Nakryiko wrote:
>
> On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov  wrote:
> >
> > On 08/12, Andrii Nakryiko wrote:
> > >
> > > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take
> > > uretprobe-specific SRCU lock and keep it active as kernel transfers
> > > control back to user space.
> > ...
> > >  include/linux/uprobes.h |  49 ++-
> > >  kernel/events/uprobes.c | 294 ++--
> > >  2 files changed, 301 insertions(+), 42 deletions(-)
> >
> > Oh. To be honest I don't like this patch.
> >
> > I would like to know what other reviewers think, but to me it adds too many
> > complications that I can't even fully understand...
>
> Which parts? The atomic xchg() and cmpxchg() parts? What exactly do
> you feel like you don't fully understand?

Heh, everything looks too complex for me ;)

Say, hprobe_expire(). It is also called by ri_timer() in softirq context,
right? And it does

/* We lost the race, undo our refcount bump. It can drop to zero. */
put_uprobe(uprobe);

How so? If the refcount goes to zero, put_uprobe() does mutex_lock(),
but it must not sleep in softirq context.


Or, prepare_uretprobe() which does

rcu_assign_pointer(utask->return_instances, ri);

if (!timer_pending(&utask->ri_timer))
mod_timer(&utask->ri_timer, ...);

Suppose that the timer was pending and it was fired right before
rcu_assign_pointer(). What guarantees that prepare_uretprobe() will see
timer_pending() == false?

rcu_assign_pointer()->smp_store_release() is a one-way barrier. This
timer_pending() check may appear to happen before rcu_assign_pointer()
completes.

In this (yes, theoretical) case ri_timer() can miss the new return_instance,
while prepare_uretprobe() can miss the necessary mod_timer(). I think this
needs another mb() in between.
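
A hedged sketch of that suggestion — a full barrier between the publish and the timer_pending() check (field names follow the quoted code; the timeout constant is illustrative):

```c
	rcu_assign_pointer(utask->return_instances, ri);

	/*
	 * rcu_assign_pointer() is only a store-release: later loads such
	 * as timer_pending() may still be reordered before it.  A full
	 * barrier pairs this publish with the reads in ri_timer().
	 */
	smp_mb();

	if (!timer_pending(&utask->ri_timer))
		mod_timer(&utask->ri_timer, jiffies + RI_TIMEOUT);
```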


And I can't convince myself hprobe_expire() is correct... OK, I don't
fully understand the logic, but why data_race(READ_ONCE(hprobe->leased)) ?
READ_ONCE() should be enough in this case?


> > As I have already mentioned in the previous discussions, we can simply kill
> > utask->active_uprobe. And utask->auprobe.
>
> I don't have anything against that, in principle, but let's benchmark
> and test that thoroughly. I'm a bit uneasy about the possibility that
> some arch-specific code will do container_of() on this arch_uprobe in
> order to get to uprobe,

Well, struct uprobe is not "exported", the arch-specific code can't do this.

Oleg.




Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Sami Tolvanen
On Tue, Aug 20, 2024 at 3:48 AM Mark Rutland  wrote:
>
> On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote:
> > On Mon, 19 Aug 2024 12:02:44 -0400
> > Steven Rostedt  wrote:
> >
> > > On Tue, 20 Aug 2024 00:56:49 +0900
> > > Masami Hiramatsu (Google)  wrote:
> > > >
> > > > >
> > > > > We may need to add "noinline" or something to make sure those 
> > > > > functions
> > > > > don't get inlined for LTO.
> > > >
> > > > Yeah, we need such option at least for function call test.
> > >
> > > Could you add the noinline, and if it fixes the issue send a patch?
> >
> > I found the target function already has "noinline". I tried to add noinline
> > to the testing function (callsite), but it also did not work.
> > I think "noinline" is for the compiler, but LTO is done by the linker.
>
> If LTO is breaking noinline, then that has much larger implications for
> noinstr code and similar, and means that LTO is unsound...

The noinline attribute is preserved in LLVM IR, so it should continue
to work with LTO. Which function are we talking about here? Are you
sure the function was inlined instead of being dropped completely?
Does marking the function __used help?

Sami



Re: [PATCH] remoteproc: k3-r5: Fix driver shutdown

2024-08-20 Thread Beleswar Prasad Padhi



On 20-08-2024 20:29, Beleswar Prasad Padhi wrote:


On 20-08-2024 19:50, Jan Kiszka wrote:

On 20.08.24 11:48, Beleswar Prasad Padhi wrote:

On 20-08-2024 15:09, Jan Kiszka wrote:

On 20.08.24 11:30, Beleswar Prasad Padhi wrote:

Hi Jan,

On 19-08-2024 22:17, Jan Kiszka wrote:

From: Jan Kiszka 

When k3_r5_cluster_rproc_exit is run, core 1 is shutdown and removed
first. When core 0 should then be stopped before its removal, it will
find core1->rproc as NULL already and crashes. Happens on rmmod e.g.

Did you check this on top of -next-20240820 tag? There was a series[0]
which was merged recently which fixed this condition. I don't see this
issue when trying on top of -next-20240820 tag.
[0]:
https://lore.kernel.org/all/20240808074127.2688131-1-b-pa...@ti.com/

I didn't try those yet, I was on 6.11-rcX. But from reading them
quickly, I'm not seeing the two issues I found directly addressed there.

Check the comment by Andrew Davis[0], that addresses the above issue.

[0]:
https://lore.kernel.org/all/0bba5293-a55d-4f13-887c-272a54d6e...@ti.com/ 




OK, then someone still needs to update his patch accordingly.

That comment was addressed in the v4 series revision[1] and was merged 
to linux-next, available with tag -next-20240820. Request you to 
please check if the issue persists with -next-20240820 tag. I checked 
myself, and was not able to reproduce.

[1]: https://lore.kernel.org/all/Zr9nbWnADDB+ZOlg@p14s/



Fixes: 3c8a9066d584 ("remoteproc: k3-r5: Do not allow core1 to power
up before core0 via sysfs")
CC: sta...@vger.kernel.org
Signed-off-by: Jan Kiszka 
---

There might be one more because I can still make this driver crash
after an operator error. Were error scenarios tested at all?

Can you point out what is this issue more specifically, and I can take
this up then.

Try starting core1 before core0, and then again - system will hang or

If you are trying to stop and then start the cores from sysfs, that is
not yet supported. The hang is thus expected.

What? Then the driver is broken, even more. Why wasn't it fully implemented?



Just wanted to point out that this "graceful shutdown" feature is 
majorly dependent on the Device Manager Firmware(point 3) and minimal 
changes to the remoteproc driver (point 2 and 4). Thus, as soon as 
Firmware is capable, we will send out the patches for this feature.


The driver is capable of starting a core and stopping it all well. The 
problem is, when we stop a core from sysfs (without resetting the SoC 
itself), the remotecore is powered off, but its resources are not 
relinquished. So when we start it back, there could be some memory 
corruptions. This feature of "graceful shutdown" of remotecores is 
almost implemented and will be posted to this driver soon. Request you 
to try out after that.


With graceful shutdown feature, this will be the flow:
1. We issue a core stop operation from sysfs.
2. The remoteproc driver sends a special "SHUTDOWN" mailbox message to 
the remotecore.
3. The remotecore relinquishes all of its acquired resources through 
Device Manager Firmware and sends an ACK back.
4. The remotecore enters WFI state and then is reset through the Host core.
5. Then, if we try to do the core start operation from sysfs, core 
should be up as expected.


Thanks,
Beleswar


Jan





Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Steven Rostedt
On Tue, 20 Aug 2024 08:10:42 -0700
Sami Tolvanen  wrote:

> On Tue, Aug 20, 2024 at 3:48 AM Mark Rutland  wrote:
> >
> > On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote:  
> > > On Mon, 19 Aug 2024 12:02:44 -0400
> > > Steven Rostedt  wrote:
> > >  
> > > > On Tue, 20 Aug 2024 00:56:49 +0900
> > > > Masami Hiramatsu (Google)  wrote:  
> > > > >  
> > > > > >
> > > > > > We may need to add "noinline" or something to make sure those 
> > > > > > functions
> > > > > > don't get inlined for LTO.  
> > > > >
> > > > > Yeah, we need such option at least for function call test.  
> > > >
> > > > Could you add the noinline, and if it fixes the issue send a patch?  
> > >
> > > I found the target function already has "noinline". I tried to add 
> > > noinline
> > > to the testing function (callsite), but it also did not work.
> > > I think "noinline" is for the compiler, but LTO is done by the linker.  
> >
> > If LTO is breaking noinline, then that has much larger implications for
> > noinstr code and similar, and means that LTO is unsound...  
> 
> The noinline attribute is preserved in LLVM IR, so it should continue
> to work with LTO. Which function are we talking about here? Are you
> sure the function was inlined instead of being dropped completely?
> Does marking the function __used help?

Hmm, maybe it was dropped, as the functions are basically just stubs.

Masami, can you add __used to the functions in trace_selftests_dynamic.c
and see if that fixes it?

-- Steve




Re: [PATCH v6 2/4] kbuild, kconfig: generate offset range data for builtin modules

2024-08-20 Thread Kris Van Hees
On Sun, Aug 18, 2024 at 03:19:36PM +0900, Masahiro Yamada wrote:
> On Fri, Aug 16, 2024 at 12:04 AM Kris Van Hees  wrote:
> 
> 
> The subject should be:
> "kbuild: generate offset range data for builtin modules"
> 
> 
> (Drop ", kconfig")

Thank you - applied.

> >
> > Create file module.builtin.ranges that can be used to find where
> > built-in modules are located by their addresses. This will be useful for
> > tracing tools to find what functions are for various built-in modules.
> >
> > The offset range data for builtin modules is generated using:
> >  - modules.builtin: associates object files with module names
> >  - vmlinux.map: provides load order of sections and offset of first member
> > per section
> >  - vmlinux.o.map: provides offset of object file content per section
> >  - .*.cmd: build cmd file with KBUILD_MODFILE and KBUILD_MODNAME
> 
> 
> I do not see "KBUILD_MODNAME" in the code.
> It only checks "KUILD_MODFILE".

Ah yes, that was a leftover from the earlier implementation.  Updated.

> >
> > The generated data will look like:
> >
> > .text - = _text
> > .text baf0-cb10 amd_uncore
> > .text 0009bd10-0009c8e0 iosf_mbi
> > ...
> > .text 008e6660-008e9630 snd_soc_wcd_mbhc
> > .text 008e9630-008ea610 snd_soc_wcd9335 snd_soc_wcd934x snd_soc_wcd938x
> 
> 
> 
> It is good to note that multiple module names appear
> in one line, but the instance (snd_soc_wcd933*) no longer
> occurs since 11b0b802f8e38d48ca74d520028add81263f003e.
> 
> 
> I recommend to replace the output snippet with:
> 
> 
> .text 00b9f080-00ba011a intel_skl_int3472_discrete
> .text 00ba0120-00ba03c0 intel_skl_int3472_discrete intel_skl_int3472_tps68470
> .text 00ba03c0-00ba08d6 intel_skl_int3472_tps68470
> 
> 
> This still happens when CONFIG_INTEL_SKL_INT3472=y.

Applied.  Thanks for pointing this out - I hadn't noticed that the original
case was no longer present.

> > .text 008ea610-008ea780 snd_soc_wcd9335
> > ...
> > .data - = _sdata
> > .data f020-f680 amd_uncore
> >
> > For each ELF section, it lists the offset of the first symbol.  This can
> > be used to determine the base address of the section at runtime.
> >
> > Next, it lists (in strict ascending order) offset ranges in that section
> > that cover the symbols of one or more builtin modules.  Multiple ranges
> > can apply to a single module, and ranges can be shared between modules.
> >
> > The CONFIG_BUILTIN_MODULE_RANGES option controls whether offset range data
> > is generated for kernel modules that are built into the kernel image.
> >
> > How it works:
> >
> >   1. The modules.builtin file is parsed to obtain a list of built-in
> >  module names and their associated object names (the .ko file that
> >  the module would be in if it were a loadable module, hereafter
> >  referred to as ).  This object name can be used to
> >  identify objects in the kernel compile because any C or assembler
> >  code that ends up into a built-in module will have the option
> >  -DKBUILD_MODFILE= present in its build command, and those
> >  can be found in the ..cmd file in the kernel build tree.
> >
> >  If an object is part of multiple modules, they will all be listed
> >  in the KBUILD_MODFILE option argument.
> >
> >  This allows us to conclusively determine whether an object in the
> >  kernel build belongs to any modules, and which.
> >
> >  2. The vmlinux.map is parsed next to determine the base address of each
> > top level section so that all addresses into the section can be
> > turned into offsets.  This makes it possible to handle sections
> > getting loaded at different addresses at system boot.
> >
> > We also determine an 'anchor' symbol at the beginning of each
> > section to make it possible to calculate the true base address of
> > a section at runtime (i.e. symbol address - symbol offset).
> >
> > We collect start addresses of sections that are included in the top
> > level section.  This is used when vmlinux is linked using vmlinux.o,
> > because in that case, we need to look at the vmlinux.o linker map to
> > know what object a symbol is found in.
> >
> > And finally, we process each symbol that is listed in vmlinux.map
> > (or vmlinux.o.map) based on the following structure:
> >
> > vmlinux linked from vmlinux.a:
> >
> >   vmlinux.map:
> >     <section>
> >       <subsection>        -- might be same as top level section
> >         <object>          -- built-in association known
> >           <symbol>        -- belongs to module(s) object belongs to
> >       ...
> >
> > vmlinux linked from vmlinux.o:
> >
> >   vmlinux.map:
> >     <section>
> >       <subsection>        -- might be same as top level section
> >       vmlinux.o           -- need to use vmlinux.o.map
> >         <symbol>          -- ignored
> >       ...
> >
> >   vmlinux.o.map:
> >     <section>
> >       <object>            -- built-in association known
> >         <symbol>
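As a hedged sketch of how a consumer such as a tracing tool might use the
generated data (the struct layout, function name, and base address below are
invented for illustration; only the two offset ranges are taken from the
example output quoted above): derive the section's runtime base from the
anchor symbol, convert the probed address to an offset, and scan the sorted
ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One parsed line of modules.builtin.ranges (hypothetical layout). */
struct mod_range {
	uint64_t start, end;	/* offsets within the section */
	const char *modules;	/* space-separated module name(s) */
};

/* Two .text ranges from the example output in the commit message. */
static const struct mod_range text_ranges[] = {
	{ 0x0000baf0, 0x0000cb10, "amd_uncore" },
	{ 0x0009bd10, 0x0009c8e0, "iosf_mbi" },
};

/*
 * Map an address to built-in module name(s).  sect_base is the runtime
 * base of the section, i.e. anchor symbol address minus anchor offset.
 */
static const char *module_at(uint64_t addr, uint64_t sect_base)
{
	uint64_t off = addr - sect_base;

	for (size_t i = 0; i < sizeof(text_ranges) / sizeof(text_ranges[0]); i++) {
		if (off >= text_ranges[i].start && off < text_ranges[i].end)
			return text_ranges[i].modules;
	}
	return NULL;	/* address is not in any built-in module */
}
```

A real consumer would of course load the ranges from the file and binary-search
them rather than hard-coding a linear scan.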

Re: [PATCH v6 3/4] scripts: add verifier script for builtin module range data

2024-08-20 Thread Kris Van Hees
On Sun, Aug 18, 2024 at 03:40:36PM +0900, Masahiro Yamada wrote:
> On Fri, Aug 16, 2024 at 12:04 AM Kris Van Hees  
> wrote:
> >
> > The modules.builtin.ranges offset range data for builtin modules is
> > generated at compile time based on the list of built-in modules and
> > the vmlinux.map and vmlinux.o.map linker maps.  This data can be used
> > to determine whether a symbol at a particular address belongs to
> > module code that was configured to be compiled into the kernel proper
> > as a built-in module (rather than as a standalone module).
> >
> > This patch adds a script that uses the generated modules.builtin.ranges
> > data to annotate the symbols in the System.map with module names if
> > their address falls within a range that belongs to one or more built-in
> > modules.
> >
> > It then processes the vmlinux.map (and if needed, vmlinux.o.map) to
> > verify the annotation:
> >
> >   - For each top-level section:
> >  - For each object in the section:
> > - Determine whether the object is part of a built-in module
> >   (using modules.builtin and the .*.cmd file used to compile
> >the object as suggested in [0])
> > - For each symbol in that object, verify that the built-in
> >   module association (or lack thereof) matches the annotation
> >   given to the symbol.
> >
> > Signed-off-by: Kris Van Hees 
> > Reviewed-by: Nick Alcock 
> > Reviewed-by: Alan Maguire 
> > ---
> > Changes since v5:
> >  - Added optional 6th argument to specify kernel build directory.
> >  - Report error and exit if .*.o.cmd files cannot be read.
> >
> > Changes since v4:
> >  - New patch in the series
> > ---
> >  scripts/verify_builtin_ranges.awk | 365 ++
> >  1 file changed, 365 insertions(+)
> >  create mode 100755 scripts/verify_builtin_ranges.awk
> >
> > diff --git a/scripts/verify_builtin_ranges.awk 
> > b/scripts/verify_builtin_ranges.awk
> > new file mode 100755
> > index ..b82cf0a0fbeb
> > --- /dev/null
> > +++ b/scripts/verify_builtin_ranges.awk
> > @@ -0,0 +1,365 @@
> > +#!/usr/bin/gawk -f
> > +# SPDX-License-Identifier: GPL-2.0
> > +# verify_builtin_ranges.awk: Verify address range data for builtin modules
> > +# Written by Kris Van Hees 
> > +#
> > +# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \
> > +#   modules.builtin vmlinux.map vmlinux.o.map \
> > +#   [ <kernel-build-dir> ]
> > +#
> > +
> > +# Return the module name(s) (if any) associated with the given object.
> > +#
> > +# If we have seen this object before, return information from the cache.
> > +# Otherwise, retrieve it from the corresponding .cmd file.
> > +#
> > +function get_module_info(fn, mod, obj, mfn, s) {
> > +   if (fn in omod)
> > +   return omod[fn];
> > +
> > +   if (match(fn, /\/[^/]+$/) == 0)
> > +   return "";
> > +
> > +   obj = fn;
> > +   mod = "";
> > +   mfn = "";
> > +   fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) 
> > ".cmd";
> > +   if ((getline s <fn) > 0) {
> > +   if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) {
> > +   mfn = substr(s, RSTART + 16, RLENGTH - 16);
> > +   gsub(/['"]/, "", mfn);
> > +
> > +   mod = mfn;
> > +   gsub(/([^/ ]*\/)+/, "", mod);
> > +   gsub(/-/, "_", mod);
> > +   }
> > +   } else {
> > +   print "ERROR: Failed to read: " fn "\n\n" \
> > + "  Invalid kernel build directory (" kdir ")\n" \
> > + "  or its content does not match " ARGV[1] 
> > >"/dev/stderr";
> > +   close(fn);
> > +   total = 0;
> > +   exit(1);
> > +   }
> > +   close(fn);
> > +
> > +   # A single module (common case) also reflects objects that are not 
> > part
> > +   # of a module.  Some of those objects have names that are also a 
> > module
> > +   # name (e.g. core).  We check the associated module file name, and 
> > if
> > +   # they do not match, the object is not part of a module.
> > +   if (mod !~ / /) {
> > +   if (!(mod in mods))
> > +   return "";
> > +   if (mods[mod] != mfn)
> > +   return "";
> > +   }
> > +
> > +   # At this point, mod is a single (valid) module name, or a list of
> > +   # module names (that do not need validation).
> > +   omod[obj] = mod;
> > +   close(fn);
> > +
> > +   return mod;
> > +}
> >
> 
> 
> 
> This code is copy-paste from scripts/generate_builtin_ranges.awk
> So, my comments in 2/4 can apply to this patch, too.

Yes, and I will apply the same changes to the verifier script.

> Instead of adding a separate script,
> we could add a "verify mode" option.
> 
> 
>  scripts/generate_builtin_ranges.awk --verify ...
> 
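The module-name derivation in get_module_info() above (strip the directory
part of the KBUILD_MODFILE path, map '-' to '_') is compact enough to restate
outside awk. A hedged C equivalent for the single-module case (function name
invented; the space-separated multi-module case is not handled here):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Derive a module name from a KBUILD_MODFILE path, mirroring the awk:
 * drop everything up to the last '/', then map '-' to '_'.
 */
static void modname_from_modfile(const char *mfn, char *out, size_t outsz)
{
	const char *base = strrchr(mfn, '/');
	size_t i;

	base = base ? base + 1 : mfn;
	for (i = 0; i + 1 < outsz && base[i]; i++)
		out[i] = (base[i] == '-') ? '_' : base[i];
	out[i] = '\0';
}
```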

[PATCH v2] virtio_pmem: Check device status before requesting flush

2024-08-20 Thread Philip Chen
If a pmem device is in a bad status, the driver side could wait for
host ack forever in virtio_pmem_flush(), causing the system to hang.

So add a status check in the beginning of virtio_pmem_flush() to return
early if the device is not activated.

Signed-off-by: Philip Chen 
---

v2:
- Remove change id from the patch description
- Add more details to the patch description

 drivers/nvdimm/nd_virtio.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
index 35c8fbbba10e..97addba06539 100644
--- a/drivers/nvdimm/nd_virtio.c
+++ b/drivers/nvdimm/nd_virtio.c
@@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
unsigned long flags;
int err, err1;
 
+   /*
+* Don't bother to submit the request to the device if the device is
+* not acticated.
+*/
+   if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) {
+   dev_info(&vdev->dev, "virtio pmem device needs a reset\n");
+   return -EIO;
+   }
+
might_sleep();
req_data = kmalloc(sizeof(*req_data), GFP_KERNEL);
if (!req_data)
-- 
2.46.0.184.g6999bdac58-goog
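The added check is just a bit test on the virtio device status byte. A
user-space sketch of the same predicate (the helper name is invented; the
status bit values match the virtio specification):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_ACKNOWLEDGE	1
#define VIRTIO_CONFIG_S_DRIVER		2
#define VIRTIO_CONFIG_S_DRIVER_OK	4
#define VIRTIO_CONFIG_S_NEEDS_RESET	0x40

/* Mirror of the new early-return condition in virtio_pmem_flush(). */
static bool pmem_device_needs_reset(uint8_t status)
{
	return status & VIRTIO_CONFIG_S_NEEDS_RESET;
}
```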




Re: [PATCH RFC v3 09/13] uprobes: SRCU-protect uretprobe lifetime (with timeout)

2024-08-20 Thread Andrii Nakryiko
On Tue, Aug 20, 2024 at 8:06 AM Oleg Nesterov  wrote:
>
> On 08/19, Andrii Nakryiko wrote:
> >
> > On Mon, Aug 19, 2024 at 6:41 AM Oleg Nesterov  wrote:
> > >
> > > On 08/12, Andrii Nakryiko wrote:
> > > >
> > > > Avoid taking refcount on uprobe in prepare_uretprobe(), instead take
> > > > uretprobe-specific SRCU lock and keep it active as kernel transfers
> > > > control back to user space.
> > > ...
> > > >  include/linux/uprobes.h |  49 ++-
> > > >  kernel/events/uprobes.c | 294 ++--
> > > >  2 files changed, 301 insertions(+), 42 deletions(-)
> > >
> > > Oh. To be honest I don't like this patch.
> > >
> > > I would like to know what other reviewers think, but to me it adds too 
> > > many
> > > complications that I can't even fully understand...
> >
> > Which parts? The atomic xchg() and cmpxchg() parts? What exactly do
> > you feel like you don't fully understand?
>
> Heh, everything looks too complex for me ;)

Well, the best code is no code. But I'm not doing this just for fun,
so I'm happy with the simplest solution *that works*.

>
> Say, hprobe_expire(). It is also called by ri_timer() in softirq context,
> right? And it does
>
> /* We lost the race, undo our refcount bump. It can drop to zero. */
> put_uprobe(uprobe);
>
> How so? If the refcount goes to zero, put_uprobe() does mutex_lock(),
> but it must not sleep in softirq context.
>

Now we are talking about specific issues, thank you! It's hard to
discuss "I don't like".

Yes, I think you are right and it is certainly a problem with
put_uprobe() that it can't be called from softirq (just having to
remember that is error-prone, as is evidenced by me forgetting about
this issue).

It's easy enough to solve. We can either schedule work from timer
thread (and that will solve this particular issue only), or we can
teach put_uprobe() to schedule work item if it drops refcount to zero
from softirq and other restricted contexts.

I vote for making put_uprobe() flexible in this regard, add
work_struct to uprobe, and schedule all this to be done in the work
queue callback. WDYT?
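As a rough user-space model of that idea (names invented, no claim about what
an eventual kernel patch would look like): the last reference dropper either
frees directly or, in a context that cannot sleep, defers the teardown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int refcount;
	bool freed_inline;	/* freed directly (process context) */
	bool freed_deferred;	/* handed off to a work item */
};

/*
 * Model of a context-aware put: only the dropper of the last reference
 * does teardown, and from atomic context it defers instead of sleeping
 * (in the kernel this branch would be a schedule_work() call).
 */
static void obj_put(struct obj *o, bool in_atomic_ctx)
{
	if (atomic_fetch_sub(&o->refcount, 1) != 1)
		return;			/* not the last reference */
	if (in_atomic_ctx)
		o->freed_deferred = true;
	else
		o->freed_inline = true;
}
```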

>
> Or, prepare_uretprobe() which does
>
> rcu_assign_pointer(utask->return_instances, ri);
>
> if (!timer_pending(&utask->ri_timer))
> mod_timer(&utask->ri_timer, ...);
>
> Suppose that the timer was pending and it was fired right before
> rcu_assign_pointer(). What guarantees that prepare_uretprobe() will see
> timer_pending() == false?
>
> rcu_assign_pointer()->smp_store_release() is a one-way barrier. This
> timer_pending() check may appear to happen before rcu_assign_pointer()
> completes.
>
> In this (yes, theoretical) case ri_timer() can miss the new return_instance,
> while prepare_uretprobe() can miss the necessary mod_timer(). I think this
> needs another mb() in between.
>

Ok, that's fair. I felt like this pattern might be a bit problematic,
but I also felt like it's good to have to ensure that we do
occasionally have a race between timer callback and uretprobe, even if
uretprobe returns very quickly. (and I did confirm we get those races
and they seem to be handled fine, i.e., I saw uprobes being "expired"
into refcounted ones from ri_timer)

But the really simple way to solve this is to do unconditional
mod_timer(), so I can do just that to keep this less tricky. Would you
be ok with that?

>
> And I can't convince myself hprobe_expire() is correct... OK, I don't
> fully understand the logic, but why data_race(READ_ONCE(hprobe->leased)) ?
> READ_ONCE() should be enough in this case?

you mean why data_race() annotation? To appease KCSAN, given that we
modify hprobe->leased with xchg/cmpxchg, but read here with
READ_ONCE(). Maybe I'm overthinking it, not sure. There is a reason
why this is an RFC ;)

>
>
> > > As I have already mentioned in the previous discussions, we can simply 
> > > kill
> > > utask->active_uprobe. And utask->auprobe.
> >
> > I don't have anything against that, in principle, but let's benchmark
> > and test that thoroughly. I'm a bit uneasy about the possibility that
> > some arch-specific code will do container_of() on this arch_uprobe in
> > order to get to uprobe,
>
> Well, struct uprobe is not "exported", the arch-specific code can't do this.
>

Ah, good point, that's great. But as I said, uretprobe is actually
*the common* use case, not single-stepped uprobe. Still not very happy
about that memcpy() and assumption that it's cheap, but that's minor.

But no matter what we do for single-stepped one, uretprobe needs some
solution. (and if that solution works for uretprobe, why wouldn't it
work for single-step?...)

> Oleg.
>



Re: [PATCH v2 16/19] gendwarfksyms: Add support for reserved structure fields

2024-08-20 Thread Sami Tolvanen
On Mon, Aug 19, 2024 at 10:17 PM Benno Lossin  wrote:
>
> On 19.08.24 21:38, Sami Tolvanen wrote:
> >
> > This definitely looks cleaner than unions in Rust, but how would this
> > scheme be visible in DWARF? You might also need to expand the annotation
> > to allow replacing one reserved field with multiple smaller ones without
> > using structs.
>
> Hmm that's a good question, I have no idea how DWARF works. The way you
> do it in this patch is just by the name of the field, right?

Correct, it just looks at the name of the union fields.

> If Rust's DWARF output contains exact types names (I just checked this,
> I *think* that this is the case, but I have never used/seen DWARF
> before), we might be able to just create a `KAbiReserved` type
> that you search for instead of the attribute. The usage would then be
> like this:
>
> #[repr(C)]
> pub struct Struct1 {
> a: u64,
> _reserved: KAbiReserved<(), u64>,
> }
>
> And then when adding a new field, you would do this:
>
> #[repr(C)]
> pub struct Struct1 {
> a: u64,
> b: KAbiReserved<u64, u64>,
> }
>
> /* Struct2 as above */
>
> The way `KAbiReserved` is implemented is via a `union` (maybe a bit
> ironic, considering what I said in my other replies, but in this case,
> we would provide a safe abstraction over this `union`, thus avoiding
> exposing users of this type to `unsafe`):
>
> #[repr(C)]
> pub union KAbiReserved<T, R> {
> value: T,
> _reserved: R,
> }

I like this approach even better, assuming any remaining issues with
ownership etc. can be sorted out. This would also look identical to
the C version in DWARF if you rename _reserved in the union to
__kabi_reserved. Of course, we can always change gendwarfksyms to
support a different scheme for Rust code if a better solution comes
along later.

Sami
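For reference, the C-side shape this mirrors in DWARF is an anonymous union
pairing the new field with the old reserved slot. A hedged sketch (struct and
field names invented; per the discussion above, the __kabi_reserved member
name is what gendwarfksyms keys on):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Before: a reserved slot in the original layout. */
struct s_v1 {
	uint64_t a;
	uint64_t __kabi_reserved_1;
};

/*
 * After: the slot is replaced by a new field, with the old reserved
 * member kept in an anonymous union so the layout (and therefore the
 * computed symbol version) is unchanged.
 */
struct s_v2 {
	uint64_t a;
	union {
		uint32_t b;		/* new, smaller field */
		uint64_t __kabi_reserved_1;
	};
};
```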



Re: [PATCH v2] virtio_pmem: Check device status before requesting flush

2024-08-20 Thread Dave Jiang



On 8/20/24 10:22 AM, Philip Chen wrote:
> If a pmem device is in a bad status, the driver side could wait for
> host ack forever in virtio_pmem_flush(), causing the system to hang.
> 
> So add a status check in the beginning of virtio_pmem_flush() to return
> early if the device is not activated.
> 
> Signed-off-by: Philip Chen 
> ---
> 
> v2:
> - Remove change id from the patch description
> - Add more details to the patch description
> 
>  drivers/nvdimm/nd_virtio.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> index 35c8fbbba10e..97addba06539 100644
> --- a/drivers/nvdimm/nd_virtio.c
> +++ b/drivers/nvdimm/nd_virtio.c
> @@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
>   unsigned long flags;
>   int err, err1;
>  
> + /*
> +  * Don't bother to submit the request to the device if the device is
> +  * not acticated.

s/acticated/activated/

> +  */
> + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) {
> + dev_info(&vdev->dev, "virtio pmem device needs a reset\n");
> + return -EIO;
> + }
> +
>   might_sleep();
>   req_data = kmalloc(sizeof(*req_data), GFP_KERNEL);
>   if (!req_data)



Re: [PATCH v2 16/19] gendwarfksyms: Add support for reserved structure fields

2024-08-20 Thread Matthew Maurer
> > The way `KAbiReserved` is implemented is via a `union` (maybe a bit
> > ironic, considering what I said in my other replies, but in this case,
> > we would provide a safe abstraction over this `union`, thus avoiding
> > exposing users of this type to `unsafe`):
> >
> > #[repr(C)]
> > pub union KAbiReserved<T, R> {
> > value: T,
> > _reserved: R,
> > }
>
> I like this approach even better, assuming any remaining issues with
> ownership etc. can be sorted out. This would also look identical to
> the C version in DWARF if you rename _reserved in the union to
> __kabi_reserved. Of course, we can always change gendwarfksyms to
> support a different scheme for Rust code if a better solution comes
> along later.
>
> Sami

Agreement here - this seems like a good approach to representing
reserved in Rust code. A few minor adjustments we discussed off-list
which aren't required for gendwarfksyms to know about:
1. Types being added to reserved fields have to be `Copy`, e.g. they
must be `!Drop`.
2. Types being added to reserved fields must be legal to be
represented by all zeroes.
3. Reserved fields need to be initialized to zero before having their
union set to the provided value when constructing them.
4. It may be helpful to have delegating trait implementations to avoid
the couple places where autoderef won't handle the conversion.

While I think this is the right solution, esp. since it can share a
representation with C, I wanted to call out one minor shortfall - a
reserved field can only be replaced by one type. We could still
indicate a replacement by two fields the same as in C, by using a
tuple which will look like an anonymous struct. The limitation will be
that if two or more new fields were introduced, we'd need to edit the
patches accessing them to do foo.x.y and foo.x.z for their accesses
instead of simply foo.y and foo.z - the autoderef trick only works for a
single type.



Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Google
On Tue, 20 Aug 2024 08:10:42 -0700
Sami Tolvanen  wrote:

> On Tue, Aug 20, 2024 at 3:48 AM Mark Rutland  wrote:
> >
> > On Tue, Aug 20, 2024 at 10:03:30AM +0900, Masami Hiramatsu wrote:
> > > On Mon, 19 Aug 2024 12:02:44 -0400
> > > Steven Rostedt  wrote:
> > >
> > > > On Tue, 20 Aug 2024 00:56:49 +0900
> > > > Masami Hiramatsu (Google)  wrote:
> > > > >
> > > > > >
> > > > > > We may need to add "noinline" or something to make sure those 
> > > > > > functions
> > > > > > don't get inlined for LTO.
> > > > >
> > > > > Yeah, we need such option at least for function call test.
> > > >
> > > > Could you add the noinline, and if it fixes the issue send a patch?
> > >
> > > I found the target function already has "noinline". I tried to add 
> > > noinline
> > > to the testing function (callsite), but it also did not work.
> > > I think "noinline" is for the compiler, but LTO is done by the linker.
> >
> > If LTO is breaking noinline, then that has much larger implications for
> > noinstr code and similar, and means that LTO is unsound...
> 
> The noinline attribute is preserved in LLVM IR, so it should continue
> to work with LTO. Which function are we talking about here? Are you
> sure the function was inlined instead of being dropped completely?
> Does marking the function __used help?

We are talking about trace_selftest_startup_dynamic_tracing() in
kernel/trace/trace_selftest.c. The callee is func(), which is actually
DYN_FTRACE_TEST_NAME() in kernel/trace/trace_selftest_dynamic.c.
That function is passed as a pointer (but the compiler can embed it by
constant propagation).

Does the noinline attribute prevent embedding at the callsite too? I mean

extern callee()

noinline callee()
{
...
}

caller()
{
callee() // (*)
}

In this case, does noinline prevent LTO from embedding the callee at the
callsite (*), or prevent LTO from removing the callee() symbol?

Thank you,

> 
> Sami
> 


-- 
Masami Hiramatsu (Google) 



Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Steven Rostedt
On Wed, 21 Aug 2024 07:05:39 +0900
Masami Hiramatsu (Google)  wrote:


> Does the noinline attribute prevent embedding callsite too? I mean
> 
> extern callee()
> 
> noinline callee()
> {
> ...
> }
> 
> caller()
> {
>   callee() // (*)
> }
> 
> In this case, does noinline prevent LTO to embed the callee at the callsite(*)
> or prevent LTO remove the callee() symbol?
>

Even though we have it passed as a parameter, I think the compiler and
linker are smart enough to see that and notice its use, and that the
function passed in is a nop, which doesn't break the flow.

Can you add the __used and see if it fixes it?

-- Steve



Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Google
On Tue, 20 Aug 2024 18:11:09 -0400
Steven Rostedt  wrote:

> On Wed, 21 Aug 2024 07:05:39 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> 
> > Does the noinline attribute prevent embedding callsite too? I mean
> > 
> > extern callee()
> > 
> > noinline callee()
> > {
> > ...
> > }
> > 
> > caller()
> > {
> > callee() // (*)
> > }
> > 
> > In this case, does noinline prevent LTO to embed the callee at the 
> > callsite(*)
> > or prevent LTO remove the callee() symbol?
> >
> 
> Even though we have it passed as a parameter, I think the compiler and
> linker is smart enough to see that and notice its use, and that the
> function passed in is a nop, which doesn't break the flow.
> 
> Can you add the __used and see if it fixes it?

Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does
not change anything; the test still fails. Hmm, what about making the caller
(trace_selftest_startup_dynamic_tracing()) called via a function pointer?
In that case, wouldn't it escape constant propagation?

Let me try.

Thanks,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Steven Rostedt
On Wed, 21 Aug 2024 08:43:51 +0900
Masami Hiramatsu (Google)  wrote:

> > Can you add the __used and see if it fixes it?  
> 
> Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does
> not change, the test still fails.

OK, now that sounds like a bug in LTO itself.

-- Steve



Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Google
On Tue, 20 Aug 2024 19:49:14 -0400
Steven Rostedt  wrote:

> On Wed, 21 Aug 2024 08:43:51 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > Can you add the __used and see if it fixes it?  
> > 
> > Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does
> > not change, the test still fails.
> 
> OK, now that sounds like a bug in LTO itself.

Hmm, I think __used just preserves the function as a function. But the
callsite can still be optimized. This means the __used function's code is
duplicated: one copy is embedded at the callsite, while another copy is kept
as a function which can be used from outside. But the "noinline" attribute
is expected to prevent embedding at the callsite (because that is what
"inlined" means), so it looks like a bug.

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: [BUG] tracing: dynamic ftrace selftest detected failures

2024-08-20 Thread Google
On Wed, 21 Aug 2024 08:43:51 +0900
Masami Hiramatsu (Google)  wrote:

> On Tue, 20 Aug 2024 18:11:09 -0400
> Steven Rostedt  wrote:
> 
> > On Wed, 21 Aug 2024 07:05:39 +0900
> > Masami Hiramatsu (Google)  wrote:
> > 
> > 
> > > Does the noinline attribute prevent embedding callsite too? I mean
> > > 
> > > extern callee()
> > > 
> > > noinline callee()
> > > {
> > > ...
> > > }
> > > 
> > > caller()
> > > {
> > >   callee() // (*)
> > > }
> > > 
> > > In this case, does noinline prevent LTO to embed the callee at the 
> > > callsite(*)
> > > or prevent LTO remove the callee() symbol?
> > >
> > 
> > Even though we have it passed as a parameter, I think the compiler and
> > linker is smart enough to see that and notice its use, and that the
> > function passed in is a nop, which doesn't break the flow.
> > 
> > Can you add the __used and see if it fixes it?
> 
> Adding __used to DYN_FTRACE_TEST_NAME() and DYN_FTRACE_TEST_NAME2() does
> not change, the test still fails. Hmm, what about makes the caller
> (trace_selftest_startup_dynamic_tracing()) called via a function pointer?
> In that case, wouldn't it be subject to constant propagation?
> 
> Let me try.

OK, it succeeded! Calling the caller via a global function pointer makes
it run as expected. It passed the dynamic_ftrace test, but the other tests
still fail. Those need to be called via function pointers too.
 
[1.851324] Testing dynamic ftrace: PASSED
[2.083329] Testing dynamic ftrace ops #1:
[2.173751] (0 0 0 0 0) FAILED!
[2.182337] [ cut here ]
[2.183323] WARNING: CPU: 0 PID: 1 at kernel/trace/trace.c:2143 run_tracer_sel0
[2.184323] Modules linked in:
Anyway, here is what I did.

diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 97f1e4bc47dc..9663bc777888 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -353,9 +353,10 @@ static int trace_selftest_ops(struct trace_array *tr, int 
cnt)
 }
 
 /* Test dynamic code modification and ftrace filters */
-static int trace_selftest_startup_dynamic_tracing(struct tracer *trace,
- struct trace_array *tr,
- int (*func)(void))
+static int noinline
+trace_selftest_startup_dynamic_tracing(struct tracer *trace,
+   struct trace_array *tr,
+   int (*func)(void))
 {
int save_ftrace_enabled = ftrace_enabled;
unsigned long count;
@@ -569,10 +570,22 @@ trace_selftest_function_recursion(void)
return ret;
 }
 #else
-# define trace_selftest_startup_dynamic_tracing(trace, tr, func) ({ 0; })
+static int trace_selftest_startup_dynamic_tracing(struct tracer *trace,
+   struct trace_array *tr,
+   int (*func)(void))
+{
+   if (!trace || !tr || !func)
+   return -EINVAL;
+   return 0;
+}
 # define trace_selftest_function_recursion() ({ 0; })
 #endif /* CONFIG_DYNAMIC_FTRACE */
 
+int (*global_trace_selftest_startup_dynamic_tracing)(struct tracer *trace,
+   struct trace_array *tr,
+   int (*func)(void))
+   = trace_selftest_startup_dynamic_tracing;
+
 static enum {
TRACE_SELFTEST_REGS_START,
TRACE_SELFTEST_REGS_FOUND,
@@ -732,7 +745,7 @@ trace_selftest_startup_function(struct tracer *trace, 
struct trace_array *tr)
goto out;
}
 
-   ret = trace_selftest_startup_dynamic_tracing(trace, tr,
+   ret = global_trace_selftest_startup_dynamic_tracing(trace, tr,
 DYN_FTRACE_TEST_NAME);
if (ret)
goto out;

-- 
Masami Hiramatsu (Google) 
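The trick in the diff, in miniature: route the call through a global, mutable
function pointer, so the optimizer cannot assume which callee is invoked and
therefore cannot constant-propagate or inline it (a hedged user-space sketch;
names invented):

```c
#include <assert.h>

static int callee(void)
{
	return 42;
}

/*
 * A non-static, mutable pointer: because another translation unit could
 * reassign it, the compiler must emit a real indirect call rather than
 * folding it into a direct (inlinable) call to callee().
 */
int (*indirect_callee)(void) = callee;

static int caller(void)
{
	return indirect_callee();
}
```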



[syzbot] [perf?] KASAN: slab-use-after-free Read in uprobe_unregister

2024-08-20 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:367b5c3d53e5 Add linux-next specific files for 20240816
git tree:   linux-next
console+strace: https://syzkaller.appspot.com/x/log.txt?x=109d46a398
kernel config:  https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
dashboard link: https://syzkaller.appspot.com/bug?extid=b32fa80fe75273685f9c
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1099b7d398
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15c513e598

Downloadable assets:
disk image: 
https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
vmlinux: 
https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
kernel image: 
https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+b32fa80fe75273685...@syzkaller.appspotmail.com

R10: 0001 R11: 0246 R12: 
R13:  R14: 0001 R15: 0001
 
==
BUG: KASAN: slab-use-after-free in consumer_del kernel/events/uprobes.c:772 
[inline]
BUG: KASAN: slab-use-after-free in uprobe_unregister+0x99/0x220 
kernel/events/uprobes.c:1097
Read of size 8 at addr 888022b9eb40 by task syz-executor227/5226

CPU: 0 UID: 0 PID: 5226 Comm: syz-executor227 Not tainted 
6.11.0-rc3-next-20240816-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
08/06/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 consumer_del kernel/events/uprobes.c:772 [inline]
 uprobe_unregister+0x99/0x220 kernel/events/uprobes.c:1097
 bpf_uprobe_unregister kernel/trace/bpf_trace.c:3119 [inline]
 bpf_uprobe_multi_link_release+0x8c/0x1b0 kernel/trace/bpf_trace.c:3127
 bpf_link_free+0xf5/0x250 kernel/bpf/syscall.c:2998
 bpf_link_put_direct kernel/bpf/syscall.c:3038 [inline]
 bpf_link_release+0x7b/0x90 kernel/bpf/syscall.c:3045
 __fput+0x24b/0x890 fs/file_table.c:424
 task_work_run+0x24f/0x310 kernel/task_work.c:228
 exit_task_work include/linux/task_work.h:40 [inline]
 do_exit+0xa2f/0x28e0 kernel/exit.c:939
 do_group_exit+0x207/0x2c0 kernel/exit.c:1088
 __do_sys_exit_group kernel/exit.c:1099 [inline]
 __se_sys_exit_group kernel/exit.c:1097 [inline]
 __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
 x64_sys_call+0x26a8/0x26b0 arch/x86/include/generated/asm/syscalls_64.h:232
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f409a857c79
Code: Unable to access opcode bytes at 0x7f409a857c4f.
RSP: 002b:7fff689d0b48 EFLAGS: 0246 ORIG_RAX: 00e7
RAX: ffda RBX:  RCX: 7f409a857c79
RDX: 003c RSI: 00e7 RDI: 
RBP: 7f409a8cb370 R08: ffb8 R09: 00a0
R10: 0001 R11: 0246 R12: 7f409a8cb370
R13:  R14: 7f409a8cbdc0 R15: 7f409a821340
 </TASK>

Allocated by task 5226:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __do_kmalloc_node mm/slub.c:4202 [inline]
 __kmalloc_node_noprof+0x22a/0x440 mm/slub.c:4208
 __kvmalloc_node_noprof+0x72/0x1b0 mm/util.c:658
 kvmalloc_array_node_noprof include/linux/slab.h:835 [inline]
 bpf_uprobe_multi_link_attach+0x43e/0xb90 kernel/trace/bpf_trace.c:3369
 link_create+0x6e9/0x870 kernel/bpf/syscall.c:5236
 __sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5697
 __do_sys_bpf kernel/bpf/syscall.c:5734 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5732 [inline]
 __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5732
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5226:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
 poison_slab_object mm/kasan/common.c:247 [inline]
 __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
 kasan_slab_free include/linux/kasan.h:233 [inline]
 slab_free_hook mm/slub.c:2295 [inline]
 slab_free mm/slub.c:4517 [inline]
 kfree+0x196/0x3e0 mm/slub.c:4665
 bpf_uprobe_multi_link_attach+0xa7d/0xb90 kernel/trace/bpf_trace.c:3431
 link_create+0x6e9/0x870 kernel/bpf/syscall.c:5236
 __sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5697
 __do_sys

Re: [PATCH 1/1] uprobe: fix comment of uprobe_apply()

2024-08-20 Thread Leizhen (ThunderTown)



On 2024/8/20 22:30, Oleg Nesterov wrote:
> On 08/20, Zhen Lei wrote:
>>
>> Depending on the argument 'add', uprobe_apply() may be registering or
>> unregistering a probe.
> 
> ...
> 
>>  /*
>> - * uprobe_apply - unregister an already registered probe.
>> - * @inode: the file in which the probe has to be removed.
>> + * uprobe_apply - register a probe or unregister an already registered 
>> probe.
> 
> Not really.
> 
> See the commit 3c83a9ad0295eb63bd ("uprobes: make uprobe_register() return 
> struct uprobe *")
> in tip/perf/core which changed this description
> 
>   * uprobe_apply - add or remove the breakpoints according to @uc->filter
> 
> still looks confusing, yes...

OK, I got it. I mistakenly thought the comment was based on 
register_for_each_vma.
It seems necessary to rename 'register_for_each_vma' to 'apply_for_each_vma',
or some other more appropriate name.

> 
> Oleg.

-- 
Regards,
  Zhen Lei



[PATCH v16 00/16] Add Cgroup support for SGX EPC memory

2024-08-20 Thread Haitao Huang
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod-level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].
 
This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
 
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which is normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
 
For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design. 
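The per-cgroup limit described above uses the standard misc controller interface files; a sketch of the workflow from a cgroup v2 mount (the cgroup name and the example limit value are illustrative):

```shell
# Enable the misc controller for child cgroups (cgroup v2 hierarchy).
echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup and cap its SGX EPC usage at 1 GiB (value in bytes).
mkdir /sys/fs/cgroup/sgx_pod
echo "sgx_epc 1073741824" > /sys/fs/cgroup/sgx_pod/misc.max

# Observe total capacity, current usage, and limit-hit events.
cat /sys/fs/cgroup/misc.capacity
cat /sys/fs/cgroup/sgx_pod/misc.current
cat /sys/fs/cgroup/sgx_pod/misc.events
```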
 
Track EPC pages in per-cgroup LRUs (from Dave)
--
 
tl;dr: A cgroup hitting its limit should behave as similarly as possible to
the system running out of EPC memory. The only two choices to implement that
are nasty changes to the existing LRU scanning algorithm, or adding new LRUs.
The result: add a new LRU for each cgroup and scan those instead. Replace
the existing global LRU with the root cgroup's LRU (only when this new
support is compiled in, obviously).
 
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
 
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
 
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
 
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
 
Unreclaimable Enclave Pages
---
 
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver's global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach: it does
not track those pages in per-cgroup LRUs and considers them unreclaimable.
The allocation of these pages is still counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.
 
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
after enclaves have launched. However, subsequent discussions [3, 4]
indicated that such preemptive enforcement is not necessary for the
misc-controllers. Therefore, reclaiming SECS/VA pages by force-killing
enclaves was removed, and the limit is only enforced at the time of a new
EPC allocation request. When a cgroup hits its limit but nothing is left in
the LRUs of the subtree, i.e., there is nothing to reclaim in the cgroup,
any new attempt to allocate EPC within that cgroup will result in 'ENOMEM'.
 
Unreclaimable Guest VM EPC Pages

 
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable is left within the cgroup. The virtual EPC driver trans

[PATCH v16 01/16] x86/sgx: Replace boolean parameters with enums

2024-08-20 Thread Haitao Huang
Replace the boolean 'reclaim' parameters in the function
sgx_alloc_epc_page() and its callers with an enum.

Also opportunistically remove the non-static declaration of
__sgx_alloc_epc_page() and fix a typo.

Signed-off-by: Haitao Huang 
Suggested-by: Jarkko Sakkinen 
Suggested-by: Dave Hansen 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 12 ++--
 arch/x86/kernel/cpu/sgx/encl.h  |  4 ++--
 arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
 arch/x86/kernel/cpu/sgx/main.c  | 14 +++---
 arch/x86/kernel/cpu/sgx/sgx.h   | 13 +++--
 arch/x86/kernel/cpu/sgx/virt.c  |  2 +-
 6 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..f474179b6f77 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -217,7 +217,7 @@ static struct sgx_epc_page *sgx_encl_eldu(struct 
sgx_encl_page *encl_page,
struct sgx_epc_page *epc_page;
int ret;
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page))
return epc_page;
 
@@ -359,14 +359,14 @@ static vm_fault_t sgx_encl_eaug_page(struct 
vm_area_struct *vma,
goto err_out_unlock;
}
 
-   epc_page = sgx_alloc_epc_page(encl_page, false);
+   epc_page = sgx_alloc_epc_page(encl_page, SGX_NO_RECLAIM);
if (IS_ERR(epc_page)) {
if (PTR_ERR(epc_page) == -EBUSY)
vmret =  VM_FAULT_NOPAGE;
goto err_out_unlock;
}
 
-   va_page = sgx_encl_grow(encl, false);
+   va_page = sgx_encl_grow(encl, SGX_NO_RECLAIM);
if (IS_ERR(va_page)) {
if (PTR_ERR(va_page) == -EBUSY)
vmret = VM_FAULT_NOPAGE;
@@ -1232,8 +1232,8 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
 
 /**
  * sgx_alloc_va_page() - Allocate a Version Array (VA) page
- * @reclaim: Reclaim EPC pages directly if none available. Enclave
- *   mutex should not be held if this is set.
+ * @reclaim: Whether reclaim EPC pages directly if none available. Enclave
+ *   mutex should not be held for SGX_DO_RECLAIM.
  *
  * Allocate a free EPC page and convert it to a Version Array (VA) page.
  *
@@ -1241,7 +1241,7 @@ void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned 
long addr)
  *   a VA page,
  *   -errno otherwise
  */
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim)
 {
struct sgx_epc_page *epc_page;
int ret;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fe15ade02ca1 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -116,14 +116,14 @@ struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl 
*encl,
  unsigned long offset,
  u64 secinfo_flags);
 void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
-struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+struct sgx_epc_page *sgx_alloc_va_page(enum sgx_reclaim reclaim);
 unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
 void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
 bool sgx_va_page_full(struct sgx_va_page *va_page);
 void sgx_encl_free_epc_page(struct sgx_epc_page *page);
 struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
 unsigned long addr);
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim);
 void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
 
 #endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..793a0ba2cb16 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -17,7 +17,7 @@
 #include "encl.h"
 #include "encls.h"
 
-struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, enum sgx_reclaim 
reclaim)
 {
struct sgx_va_page *va_page = NULL;
void *err;
@@ -64,7 +64,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
struct file *backing;
long ret;
 
-   va_page = sgx_encl_grow(encl, true);
+   va_page = sgx_encl_grow(encl, SGX_DO_RECLAIM);
if (IS_ERR(va_page))
return PTR_ERR(va_page);
else if (va_page)
@@ -83,7 +83,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct 
sgx_secs *secs)
 
encl->backing = backing;
 
-   secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+   secs_epc = sgx_alloc_epc_page(&encl->secs, SGX_DO_RECLAIM);

[PATCH v16 02/16] cgroup/misc: Add per resource callbacks for CSS events

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

The misc cgroup controller (subsystem) currently does not perform
resource-type-specific actions for Cgroup Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

The SGX callbacks will need to access the 'struct misc_cg'.  Pass
'struct misc_cg' to the callbacks but not 'struct misc_res', because
the latter does not have a pointer back to 'struct misc_cg'.

Link: 
https://lore.kernel.org/lkml/op.2kdw36otwjv...@hhuan26-mobl.amr.corp.intel.com/

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Tested-by: Jarkko Sakkinen 
---
V16:
- Don't call the ops if capacity is zero. (Kai)
- Simplify the last paragraph of the commit message. (Kai)

V15:
- Style fixes (Jarkko, Kai)
- _misc_cg_res_free() takes the last index. Only call ->free() for those res 
types with ->alloc() successfully called. (Ridong)

V12:
- Add comments in commit to clarify reason to pass in misc_cg, not
misc_res. (Kai)
- Remove unlikely (Kai)

V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of 
required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
 include/linux/misc_cgroup.h | 11 +
 kernel/cgroup/misc.c| 88 ++---
 2 files changed, 93 insertions(+), 6 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 49eef10c8e59..e5159770a68e 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -28,6 +28,16 @@ struct misc_cg;
 
 #include 
 
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is 
allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+   int (*alloc)(struct misc_cg *cg);
+   void (*free)(struct misc_cg *cg);
+};
+
 /**
  * struct misc_res: Per cgroup per misc type resource
  * @max: Maximum limit on the resource.
@@ -62,6 +72,7 @@ struct misc_cg {
 
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
 int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 
amount);
 void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
 
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 0e26068995a6..f3ce896d2c21 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
  */
 static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
 /**
  * parent_misc() - Get the parent of the passed misc cgroup.
  * @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,41 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 
capacity)
 }
 EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
 
+/**
+ * misc_cg_set_ops() - register resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * The callbacks in @ops will not be invoked if the capacity of @type is 0.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required 
callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+   if (!valid_type(type))
+   return -EINVAL;
+
+   if (!ops)
+   return -EINVAL;
+
+   if (!ops->alloc) {
+   pr_err("%s: alloc missing\n", __func__);
+   return -EINVAL;
+   }
+
+   if (!ops->free) {
+   pr_err("%s: free missing\n", __func__);
+   return -EINVAL;
+   }
+
+   misc_res_ops[type] = ops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
 /**
  * misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
  * @type: Misc res type in misc cg to cancel the charge from.
@@ -439,6 +477,36 @@ static struct cftype misc_cg_files[] = {
{}
 };
 
+static inline void _misc_cg_res_free(struct misc_cg *cg, enum misc_res_type 
last)
+{
+   enum misc_res_type i;
+
+   for (i = 0; i <= last; i++)
+   

[PATCH v16 03/16] cgroup/misc: Export APIs for SGX driver

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its or an ancestor's limit. This requires a walk from the current cgroup up
to the root, similar to misc_cg_try_charge(). Export misc_cg_parent() to
enable this walk.

The SGX driver also needs to start a global-level reclamation from the
root. Export misc_cg_root() for the SGX driver to access.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V6:
- Make commit messages more concise and split the original patch into two(Kai)
---
 include/linux/misc_cgroup.h | 24 
 kernel/cgroup/misc.c| 21 -
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e5159770a68e..75c711333ad4 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -70,6 +70,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
 };
 
+struct misc_cg *misc_cg_root(void);
 u64 misc_cg_res_total_usage(enum misc_res_type type);
 int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
 int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
@@ -90,6 +91,20 @@ static inline struct misc_cg *css_misc(struct 
cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
 }
 
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
 /*
  * get_current_misc_cg() - Find and get the misc cgroup of the current task.
  *
@@ -114,6 +129,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
 }
 
 #else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+   return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+   return NULL;
+}
 
 static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
 {
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index f3ce896d2c21..6cf1f0899f4e 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -43,18 +43,13 @@ static u64 misc_res_capacity[MISC_CG_RES_TYPES];
 static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
 
 /**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
  */
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
 {
-   return cgroup ? css_misc(cgroup->css.parent) : NULL;
+   return &root_cg;
 }
+EXPORT_SYMBOL_GPL(misc_cg_root);
 
 /**
  * valid_type() - Check if @type is valid or not.
@@ -177,7 +172,7 @@ static void misc_cg_event(enum misc_res_type type, struct 
misc_cg *cg)
atomic64_inc(&cg->res[type].events_local);
cgroup_file_notify(&cg->events_local_file);
 
-   for (; parent_misc(cg); cg = parent_misc(cg)) {
+   for (; misc_cg_parent(cg); cg = misc_cg_parent(cg)) {
atomic64_inc(&cg->res[type].events);
cgroup_file_notify(&cg->events_file);
}
@@ -212,7 +207,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!amount)
return 0;
 
-   for (i = cg; i; i = parent_misc(i)) {
+   for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];
 
new_usage = atomic64_add_return(amount, &res->usage);
@@ -228,7 +223,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
 err_charge:
misc_cg_event(type, i);
 
-   for (j = cg; j != i; j = parent_misc(j))
+   for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -250,7 +245,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct 
misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
 
-   for (i = cg; i; i = parent_misc(i))
+   for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
 }
 EXPORT_SYMBOL_GPL(misc_cg_uncharge);
-- 
2.43.0




[PATCH v16 04/16] cgroup/misc: Add SGX EPC resource type

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

Add SGX EPC memory, MISC_CG_RES_SGX_EPC, to be a valid resource type
for the misc controller.

Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V12:
- Remove CONFIG_CGROUP_SGX_EPC (Jarkko)

V6:
- Split the original patch into this and the preceding one (Kai)
---
 include/linux/misc_cgroup.h | 4 
 kernel/cgroup/misc.c| 4 
 2 files changed, 8 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 75c711333ad4..b4119869b0d1 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -17,6 +17,10 @@ enum misc_res_type {
MISC_CG_RES_SEV,
/** @MISC_CG_RES_SEV_ES: AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
+#endif
+#ifdef CONFIG_X86_SGX
+   /** @MISC_CG_RES_SGX_EPC: SGX EPC memory resource */
+   MISC_CG_RES_SGX_EPC,
 #endif
/** @MISC_CG_RES_TYPES: count of enum misc_res_type constants */
MISC_CG_RES_TYPES
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 6cf1f0899f4e..300af1ee20fc 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -24,6 +24,10 @@ static const char *const misc_res_name[] = {
/* AMD SEV-ES ASIDs resource */
"sev_es",
 #endif
+#ifdef CONFIG_X86_SGX
+   /* Intel SGX EPC memory bytes */
+   "sgx_epc",
+#endif
 };
 
 /* Root misc cgroup */
-- 
2.43.0




[PATCH v16 05/16] x86/sgx: Implement basic EPC misc cgroup functionality

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
instance, within a Kubernetes environment, while a user may specify a
particular EPC quota for a pod, the orchestrator requires a mechanism to
enforce that the pod's actual runtime EPC usage does not exceed the
allocated quota.

Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:

- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Set up sgx_epc resource-type-specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.

With these changes, the misc cgroup controller enables users to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list and are only reclaimed by the
global reclaimer when the total free page count falls below a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Tejun Heo 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V16:
- Proper handling for failures during init (Kai)
- Register ops and capacity at the end when SGX is ready to handle
  callbacks.

V15:
- Declare __init for sgx_cgroup_init() (Jarkko)
- Disable SGX when sgx_cgroup_init() fails (Jarkko)

V13:
- Remove unneeded includes. (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC and make sgx cgroup implementation
conditionally compiled with CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Update copyright and format better (Kai)
- Create wrappers to remove #ifdefs in c file. (Kai)
- Remove unneeded comments (Kai)

V10:
- Shorten function, variable, struct names, s/sgx_epc_cgroup/sgx_cgroup. 
(Jarkko)
- Use enums instead of booleans for the parameters. (Dave, Jarkko)

V8:
- Remove null checks for epc_cg in try_charge()/uncharge(). (Jarkko)
- Remove extra space, '_INTEL'. (Jarkko)

V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split  sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)

V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller"  and restructure it (Kai)
---
 arch/x86/kernel/cpu/sgx/Makefile |  1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 93 
 arch/x86/kernel/cpu/sgx/epc_cgroup.h | 78 +++
 arch/x86/kernel/cpu/sgx/main.c   | 44 +++--
 arch/x86/kernel/cpu/sgx/sgx.h| 24 +++
 include/linux/misc_cgroup.h  |  2 +
 6 files changed, 238 insertions(+), 4 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
 create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..081cb424575e 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
 obj-$(CONFIG_X86_SGX_KVM)  += virt.o
+obj-$(CONFIG_CGROUP_MISC)  += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index ..0e422fef02bb
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,93 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2022-2024 Intel Corporation. */
+
+#include
+#include "epc_cgroup.h"
+
+/* The root SGX EPC cgroup */
+static struct sgx_cgroup sgx_cg_root;
+
+/**
+ * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @sgx_cg:The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+{
+   return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, P

[PATCH v16 06/16] x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list

2024-08-20 Thread Haitao Huang
From: Sean Christopherson 

Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage approaches
its limit.

Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Cc: Sean Christopherson 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V15:
- Add comment for spinlock. (Jarkko, Kai)

V6:
- removed introduction to unreclaimables in commit message.

V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.

V3:
- Removed the helper functions and revised commit messages.
---
 arch/x86/kernel/cpu/sgx/main.c | 39 +-
 arch/x86/kernel/cpu/sgx/sgx.h  | 16 ++
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 0fda964c0a7c..45832c9a1e1b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
 
 /*
  * These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
  */
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
 
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-   if (list_empty(&sgx_active_page_list))
+   epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+   struct sgx_epc_page, list);
+   if (!epc_page)
break;
 
-   epc_page = list_first_entry(&sgx_active_page_list,
-   struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
 
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_unlock(&sgx_global_lru.lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
 
 skip:
-   spin_lock(&sgx_reclaimer_lock);
-   list_add_tail(&epc_page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
+   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 
kref_put(&encl_page->encl->refcount, sgx_encl_release);
 
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
 static bool sgx_should_reclaim(unsigned long watermark)
 {
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
-  !list_empty(&sgx_active_page_list);
+  !list_empty(&sgx_global_lru.reclaimable);
 }
 
 /*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
 
ksgxd_tsk = tsk;
 
+   sgx_lru_init(&sgx_global_lru);
+
return true;
 }
 
@@ -507,10 +508,10 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_active_page_list);
-   spin_unlock(&sgx_reclaimer_lock);
+   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+   spin_unlock(&sgx_global_lru.lock);
 }
 
 /**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_reclaimer_lock);
+   spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&

[PATCH v16 08/16] x86/sgx: Encapsulate uses of the global LRU

2024-08-20 Thread Haitao Huang
To support the per-cgroup reclamation, each cgroup will have its own
"per-cgroup LRU" and EPC pages will be in its owner cgroup's LRU instead
of the global LRU. Abstract the code that is directly working with the
global LRU into functions reusable with per-cgroup LRUs.

Currently the basic reclamation procedure, sgx_reclaim_pages(), directly
reclaims pages from the global LRU. Change it to take in an LRU.

Note the global EPC reclamation will still be needed when the total EPC
usage reaches the system capacity while usages of some cgroups are below
their respective limits. Create a separate wrapper for the global
reclamation, sgx_reclaim_pages_global(), passing in the global LRU to
the new sgx_reclaim_pages() now. Later it will be revised to reclaim
from multiple LRUs from all EPC cgroups instead of a single global LRU.

Wrap the existing emptiness check of the global LRU with a helper so
that it can be changed later to work with multiple LRUs when per-cgroup
LRU comes to play.

Also the per-cgroup EPC reclaim and global EPC reclaim will have
different checks on whether they should be done.  Rename the existing
sgx_should_reclaim() to sgx_should_reclaim_global() to separate the two
cases.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Signed-off-by: Haitao Huang 
---
v16:
- Regroup all abstraction related to global LRU usage to this patch from
  different patches in previous version. Position this before adding
per-cgroup reclaim. (Kai)

V13:
- Rename sgx_can_reclaim() to sgx_can_reclaim_global() and
sgx_should_reclaim() to sgx_should_reclaim_global(). (Kai)

V10:
- Add comments for the new function. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 63 ++
 1 file changed, 41 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 5d5f6baac9c8..47dfba6f45af 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,21 @@ static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page)
return &sgx_global_lru;
 }
 
+/*
+ * Check if there is any reclaimable page at global level.
+ */
+static inline bool sgx_can_reclaim_global(void)
+{
+   /*
+* Now all EPC pages are still tracked in the @sgx_global_lru, so only
+* check @sgx_global_lru.
+*
+* When EPC pages are tracked in the actual per-cgroup LRUs,
+* replace with sgx_cgroup_lru_empty(misc_cg_root()).
+*/
+   return !list_empty(&sgx_global_lru.reclaimable);
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -287,10 +302,10 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 }
 
 /*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+ * Take a fixed number of pages from the head of a given LRU and reclaim them to
+ * the enclave's private shmem files. Skip the pages, which have been accessed
+ * since the last scan. Move those pages to the tail of the list so that the
+ * pages get scanned in LRU like fashion.
  *
  * Batch process a chunk of pages (at the moment 16) in order to degrade amount
 * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
@@ -299,7 +314,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
  * problematic as it would increase the lock contention too much, which would
  * halt forward progress.
  */
-static void sgx_reclaim_pages(void)
+static void sgx_reclaim_pages(struct sgx_epc_lru_list *lru)
 {
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +325,9 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
 
-   spin_lock(&sgx_global_lru.lock);
+   spin_lock(&lru->lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
-   epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
-   struct sgx_epc_page, list);
+   epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;
 
@@ -328,7 +342,7 @@ static void sgx_reclaim_pages(void)
 */
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
 
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -351,9 +365,9 @@ static void sgx_reclaim_pages(void)
  

[PATCH v16 07/16] x86/sgx: Abstract tracking reclaimable pages in LRU

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

The SGX driver tracks reclaimable EPC pages by adding a newly allocated
page into the global LRU list in sgx_mark_page_reclaimable(), and doing
the opposite in sgx_unmark_page_reclaimable().

To support SGX EPC cgroup, the SGX driver will need to maintain an LRU
list for each cgroup, and each newly allocated EPC page will need to be
added to the LRU of associated cgroup, not always the global LRU list.

When sgx_mark_page_reclaimable() is called, the cgroup that the newly
allocated EPC page belongs to is already known, i.e., it has been set to
the 'struct sgx_epc_page'.

Add a helper, sgx_epc_page_lru(), to return the LRU that the EPC page
should be added to for the given EPC page.  Currently it just returns
the global LRU. Change sgx_{mark|unmark}_page_reclaimable() to use the
helper function to get the LRU from the EPC page instead of referring to
the global LRU directly.

This allows EPC pages to be tracked in a "per-cgroup" LRU when that
becomes ready.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Reviewed-by: Kai Huang 
Tested-by: Jarkko Sakkinen 
---
V13:
- Revise commit log (Kai)
- Rename sgx_lru_list() to sgx_epc_page_lru()

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/main.c | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 45832c9a1e1b..5d5f6baac9c8 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,11 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page *epc_page)
+{
+   return &sgx_global_lru;
+}
+
 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
 
 /* Nodes with one or more EPC sections. */
@@ -500,25 +505,24 @@ static struct sgx_epc_page *__sgx_alloc_epc_page(void)
 }
 
 /**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and track it in an LRU.
  * @page:  EPC page
- *
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
  */
 void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+   spin_lock(&lru->lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
-   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
-   spin_unlock(&sgx_global_lru.lock);
+   list_add_tail(&page->list, &lru->reclaimable);
+   spin_unlock(&lru->lock);
 }
 
 /**
- * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * sgx_unmark_page_reclaimable() - Remove a page from its tracking LRU
  * @page:  EPC page
  *
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
  *
  * Return:
  *   0 on success,
@@ -526,18 +530,20 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
  */
 int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
 {
-   spin_lock(&sgx_global_lru.lock);
+   struct sgx_epc_lru_list *lru = sgx_epc_page_lru(page);
+
+   spin_lock(&lru->lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
return -EBUSY;
}
 
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
-   spin_unlock(&sgx_global_lru.lock);
+   spin_unlock(&lru->lock);
 
return 0;
 }
-- 
2.43.0




[PATCH v16 09/16] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-08-20 Thread Haitao Huang
Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching its limit.  This is not ideal.  When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup to reduce its usage
so the new allocation can succeed.

Currently, all EPC pages are tracked in a single global LRU, and the
"global EPC reclamation" supports the following 3 cases:

1) On-demand asynchronous reclamation: For allocation requests that can
   not wait for reclamation but can be retried, an asynchronous
   reclamation is triggered, in which the global reclaimer, ksgxd, keeps
   reclaiming EPC pages until the free page count is above a minimal
   threshold.

2) On-demand synchronous reclamation: For allocations that can wait for
   reclamation, the EPC page allocator, sgx_alloc_epc_page() reclaims
   EPC page(s) immediately until at least one free page is available for
   allocation.

3) Preemptive reclamation: For some allocation requests, e.g.,
   allocation for reloading a reclaimed page to change its permissions
   or page type, the kernel invokes sgx_reclaim_direct() to preemptively
   reclaim EPC page(s) as a best effort to minimize on-demand
   reclamation for subsequent allocations.

Similarly, a "per-cgroup reclamation" is needed to support the above 3
cases as well:

1) For on-demand asynchronous reclamation, a per-cgroup reclamation
   needs to be invoked to maintain a minimal difference between the
   usage and the limit for each cgroup, analogous to the minimal free
   page threshold maintained by the global reclaimer.

2) For on-demand synchronous reclamation, sgx_cgroup_try_charge() needs
   to invoke the per-cgroup reclamation until the cgroup usage becomes
   at least one page lower than its limit.

3) For preemptive reclamation, sgx_reclaim_direct() needs to invoke the
   per-cgroup reclamation to minimize per-cgroup on-demand reclamation
   for subsequent allocations.

To support the per-cgroup reclamation, introduce a "per-cgroup LRU" to
track all EPC pages belonging to the owner cgroup, so the existing
sgx_reclaim_pages() can be utilized.

Currently, the global reclamation treats all EPC pages equally as it
scans all EPC pages in FIFO order in the global LRU.  The "per-cgroup
reclamation" needs to achieve the same fairness across all EPC pages
tracked in the multiple LRUs of the given cgroup and all its
descendants, to reflect the hierarchical nature of cgroups.

The idea is to achieve such fairness by scanning "all EPC cgroups" of
the subtree (the given cgroup and all the descendants) equally in turns,
and in the scan to each cgroup, apply the existing sgx_reclaim_pages()
to its LRU. This basic flow is encapsulated in a new function,
sgx_cgroup_reclaim_pages().

Export sgx_reclaim_pages() for use in sgx_cgroup_reclaim_pages(). And
modify sgx_reclaim_pages() to return the number of pages scanned so
sgx_cgroup_reclaim_pages() can track scanning progress and determine
whether enough scanning is done or to continue the scanning for next
descendant.

Whenever reclaiming in a subtree of a given root is needed, start the
scanning from the next descendant where scanning was stopped at last
time.  To keep track of the next descendant cgroup to scan, add a new
field, next_cg, in the sgx_cgroup struct.  Create an iterator function,
sgx_cgroup_next_get(), which atomically returns a valid reference to the
descendant for the next round of scanning and advances @next_cg to the
next valid descendant in a preorder walk. This iterator function is used
in sgx_cgroup_reclaim_pages() to iterate descendants for scanning.
Separately, @next_cg is also advanced to the next valid descendant when
the cgroup it references is about to be freed.

Add support for on-demand synchronous reclamation in
sgx_cgroup_try_charge(), applying sgx_cgroup_reclaim_pages() iteratively
until cgroup usage is lower than its limit.

Later patches will reuse sgx_cgroup_reclaim_pages() to add support for
asynchronous and preemptive reclamation.

Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment: -ENOMEM is
returned by __sgx_cgroup_try_charge() when LRUs are empty. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Signed-off-by: Haitao Huang 
---
V16:
- Revise commit log to define reclamation requirement and the design more 
clearly. (Kai)
- Revise sgx_cgroup_reclaim_pages() to scan cgroups in subtree more
  fairly; track next_cg in each sgx_cgroup and add helpers that are used
  to iterate descendants in sgx_cgroup_reclaim_pages(). (Kai)

V14:
- Allow sgx_cgroup_reclaim_pages() to continue from previous tree-walk.
It takes in a 'start' node and returns the 'next' node for the caller to
use as the new '

[PATCH v16 10/16] x86/sgx: Implement async reclamation for cgroup

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

In cases where EPC pages need to be allocated during a page fault and the cgroup
usage is near its limit, an asynchronous reclamation needs to be
triggered to avoid blocking the page fault handling.

To keep implementation simple, use a workqueue instead of kthread to
schedule the asynchronous reclamation work. Add corresponding work item and
function definitions for EPC cgroup.

In sgx_cgroup_try_charge(), if caller does not allow synchronous
reclamation, queue an asynchronous work into the workqueue.

The current global reclaimer, ksgxd, maintains a threshold for the
minimal free EPC pages to avoid thrashing when allocating EPC pages.
Similarly, only reclaiming EPC pages from the current cgroup when its
usage actually reaches its limit could also cause thrashing.  To avoid that,
define a similar "per-cgroup usage threshold" and actively trigger
asynchronous per-cgroup EPC reclamation when the usage reaches the
threshold after try_charge() is successful.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---
V16:
- Destroy workqueue in sgx_cgroup_deinit()
- Reuse the new sgx_cgroup_reclaim_pages() to scan at least
  SGX_NR_TO_SCAN pages for each round of async reclaim.
- Revise commit message. (Kai)

V15:
- Disable SGX when sgx_cgroup_init() fails instead of BUG_ON() (Jarkko)
- Reset capacity to zero when sgx_cgroup_init() fails. (Kai)
V13:
- Revert to BUG_ON() in case of workq allocation failure in init and
only alloc if misc is enabled.

V11:
- Print error instead of WARN (Kai)
- Add check for need to queue an async reclamation before returning from
try_charge(), do so if needed. This is to be consistent with global
reclaimer to minimize thrashing during allocation time.

V10:
- Split asynchronous flow in separate patch. (Kai)
- Consider cgroup disabled when the workqueue allocation fails during
init. (Kai)
- Abstract out sgx_cgroup_should_reclaim().

V9:
- Add comments for static variables. (Jarkko)

V8:
- Remove alignment for substructure variables. (Jarkko)

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 115 ++-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |   4 +
 arch/x86/kernel/cpu/sgx/main.c   |   5 +-
 3 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index ce28efd20a15..c40f601d06d9 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -4,9 +4,37 @@
 #include
 #include "epc_cgroup.h"
 
+/*
+ * The minimal free pages, or the minimal margin between limit and usage
+ * maintained by per-cgroup reclaimer.
+ *
+ * Set this to the low threshold used by the global reclaimer, ksgxd.
+ */
+#define SGX_CG_MIN_FREE_PAGE   (SGX_NR_LOW_PAGES)
+
+/*
+ * If the cgroup limit is close to SGX_CG_MIN_FREE_PAGE, maintaining the minimal
+ * free pages would barely leave any page for use, causing excessive reclamation
+ * and thrashing.
+ *
+ * Define the following limit, below which cgroup does not maintain the minimal
+ * free page threshold. Set this to quadruple of the minimal so at least 75% of
+ * pages can be used without being reclaimed.
+ */
+#define SGX_CG_LOW_LIMIT   (SGX_CG_MIN_FREE_PAGE * 4)
+
 /* The root SGX EPC cgroup */
 static struct sgx_cgroup sgx_cg_root;
 
+/*
+ * The work queue that reclaims EPC pages in the background for cgroups.
+ *
+ * A cgroup schedules a work item into this queue to reclaim pages within the
+ * same cgroup when its usage limit is reached and synchronous reclamation is 
not
+ * an option, i.e., in a page fault handler.
+ */
+static struct workqueue_struct *sgx_cg_wq;
+
 /*
  * Return the next descendant in a preorder walk, given a root, @root and a
  * cgroup, @cg, to start the walk from. Return @root if no descendant left for
@@ -100,6 +128,34 @@ static inline struct sgx_cgroup *sgx_cgroup_next_get(struct sgx_cgroup *root)
return p;
 }
 
+static inline u64 sgx_cgroup_page_counter_read(struct sgx_cgroup *sgx_cg)
+{
+   return atomic64_read(&sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_cgroup_max_pages(struct sgx_cgroup *sgx_cg)
+{
+   return READ_ONCE(sgx_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_cgroup_should_reclaim() to determine if EPC usage of a cgroup is
+ * close to its limit or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_cgroup_max_pages_to_root(struct sgx_cgroup *sgx_cg)
+{
+   struct misc_cg *i = sgx_cg->cg;
+   u64 m = U64_MAX;
+
+   while (i) {
+   m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+   i = misc_cg_parent(i);
+   }
+
+   return m / PAGE_SIZE;
+

[PATCH v16 11/16] x86/sgx: Charge mem_cgroup for per-cgroup reclamation

2024-08-20 Thread Haitao Huang
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(). But it considers all contexts other than the
ksgxd thread are user processes. With the new EPC cgroup implementation,
the swapping can also happen in EPC cgroup work-queue threads. In those
cases, it improperly selects the root mem_cgroup to charge for the RAM
usage.

Remove current_is_ksgxd() and change sgx_encl_get_mem_cgroup() to take
an additional argument to explicitly specify the mm struct to charge for
allocations. Callers from background kthreads not associated with a
charging mm struct would set it to NULL, while callers in user process
contexts set it to current->mm.

Internally, it handles the case when the charging mm given is NULL, by
searching for an mm struct from enclave's mm_list.

Signed-off-by: Haitao Huang 
Reported-by: Mikko Ylinen 
Reviewed-by: Kai Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V10:
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)

V9:
- Reduce number of if statements. (Tim)

V8:
- Limit text paragraphs to 80 characters wide. (Jarkko)
---
 arch/x86/kernel/cpu/sgx/encl.c   | 29 ++--
 arch/x86/kernel/cpu/sgx/encl.h   |  3 +--
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  9 ++---
 arch/x86/kernel/cpu/sgx/main.c   | 29 +---
 arch/x86/kernel/cpu/sgx/sgx.h|  2 +-
 5 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index f474179b6f77..7b77dad41daf 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,23 +993,23 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index
 }
 
 /*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Find the mem_cgroup to charge for memory allocated on behalf of an enclave.
+ *
+ * Used in sgx_encl_alloc_backing() for backing store allocation.
+ *
+ * Return the mem_cgroup of the given charge_mm. Otherwise return the mem_cgroup
+ * of a struct mm stored in the enclave's mm_list.
  */
-static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl,
+ struct mm_struct *charge_mm)
 {
struct mem_cgroup *memcg = NULL;
struct sgx_encl_mm *encl_mm;
int idx;
 
-   /*
-* If called from normal task context, return the mem_cgroup
-* of the current task's mm. The remainder of the handling is for
-* ksgxd.
-*/
-   if (!current_is_ksgxd())
-   return get_mem_cgroup_from_mm(current->mm);
+   /* Use the charge_mm if given. */
+   if (charge_mm)
+   return get_mem_cgroup_from_mm(charge_mm);
 
/*
 * Search the enclave's mm_list to find an mm associated with
@@ -1047,8 +1047,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
  * @encl:  an enclave pointer
  * @page_index:enclave page index
  * @backing:   data for accessing backing storage for the page
+ * @charge_mm: the mm to charge for the allocation
  *
- * When called from ksgxd, sets the active memcg from one of the
+ * When charge_mm is NULL, sets the active memcg from one of the
  * mms in the enclave's mm_list prior to any backing page allocation,
  * in order to ensure that shmem page allocations are charged to the
  * enclave.  Create a backing page for loading data back into an EPC page with
@@ -1060,9 +1061,9 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
  *   -errno otherwise.
  */
 int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
-  struct sgx_backing *backing)
+  struct sgx_backing *backing, struct mm_struct *charge_mm)
 {
-   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+   struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl, charge_mm);
struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
int ret;
 
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index fe15ade02ca1..5ce9d108290f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
 int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
 unsigned long end, unsigned long vm_flags);
 
-bool current_is_ksgxd(void);
 void sgx_encl_release(struct kref *ref);
 int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_

[PATCH v16 12/16] x86/sgx: Revise global reclamation for EPC cgroups

2024-08-20 Thread Haitao Huang
With EPC cgroups, the global reclamation function,
sgx_reclaim_pages_global(), can no longer apply to the global LRU as
pages are now in per-cgroup LRUs.

Create a wrapper, sgx_cgroup_reclaim_pages_global(), to invoke
sgx_cgroup_reclaim_pages() passing in the root cgroup. Call this wrapper
from sgx_reclaim_pages_global() when cgroup is enabled. The wrapper will
scan and attempt to reclaim SGX_NR_TO_SCAN pages just like the current
global reclaim.

Note this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batch
of SGX_NR_TO_SCAN pages): in rare cases when LRUs have less than
SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will
be split into smaller batches _across_ multiple LRUs with each being
smaller than SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch.  But this is unnecessarily complicated at
this stage to address such rare cases.

Signed-off-by: Haitao Huang 
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 12 
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
 arch/x86/kernel/cpu/sgx/main.c   |  7 +++
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index ae65cac858f8..23a61689e0d9 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -240,6 +240,18 @@ static bool sgx_cgroup_should_reclaim(struct sgx_cgroup *sgx_cg)
return (cur >= max);
 }
 
+/**
+ * sgx_cgroup_reclaim_pages_global() - Perform one round of global reclamation.
+ *
+ * @charge_mm: The mm to be charged for the backing store of reclaimed pages.
+ *
+ * Try to scan and attempt reclamation from root cgroup for %SGX_NR_TO_SCAN pages.
+ */
+void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
+{
+   sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
+}
+
 /*
  * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
  * at/near its maximum capacity.
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 37364bdb797f..c0390111e28c 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -36,6 +36,8 @@ static inline void __init sgx_cgroup_deinit(void) { }
 
 static inline void __init sgx_cgroup_register(void) { }
 
+static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { }
+
 #else /* CONFIG_CGROUP_MISC */
 
 struct sgx_cgroup {
@@ -87,6 +89,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
 int __init sgx_cgroup_init(void);
 void __init sgx_cgroup_register(void);
 void __init sgx_cgroup_deinit(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index fa00ed5e884d..d00cb012838b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -411,6 +411,13 @@ static bool sgx_should_reclaim_global(unsigned long watermark)
 
 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
 {
+   /*
+* Now all EPC pages are still tracked in the @sgx_global_lru.
+* Still reclaim from it.
+*
+* When EPC pages are tracked in the actual per-cgroup LRUs,
+* sgx_cgroup_reclaim_pages_global() will be called.
+*/
sgx_reclaim_pages(&sgx_global_lru, charge_mm);
 }
 
-- 
2.43.0




[PATCH v16 13/16] x86/sgx: implement direct reclamation for cgroups

2024-08-20 Thread Haitao Huang
sgx_reclaim_direct() was introduced to preemptively reclaim some pages
as the best effort to avoid on-demand reclamation that can stall forward
progress in some situations, e.g., allocating pages to load previously
reclaimed page to perform EDMM operations on [1].

Currently when the global usage is close to the capacity,
sgx_reclaim_direct() makes one invocation to sgx_reclaim_pages_global()
but does not guarantee there are free pages available for later
allocations to succeed. In other words, the only goal here is to reduce
the chance of on-demand reclamation at allocation time. In cases of
allocation failure, the callers, the EDMM ioctl()s, would return -EAGAIN
to user space and let user space decide whether to retry or not.

With EPC cgroups enabled, usage of a cgroup can also reach its limit
(usually much lower than capacity) and trigger per-cgroup reclamation.
Implement a similar strategy to reduce the chance of on-demand
per-cgroup reclamation for this use case.

Create a wrapper, sgx_cgroup_reclaim_direct(), to perform a preemptive
reclamation at cgroup level, and have sgx_reclaim_direct() call it when
EPC cgroup is enabled.

[1] 
https://lore.kernel.org/all/a0d8f037c4a075d56bf79f432438412985f7ff7a.1652137848.git.reinette.cha...@intel.com/T/#u

Signed-off-by: Haitao Huang 
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 15 +++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  3 +++
 arch/x86/kernel/cpu/sgx/main.c   |  4 
 3 files changed, 22 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 23a61689e0d9..b7d60b2d878d 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -252,6 +252,21 @@ void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm)
sgx_cgroup_reclaim_pages(&sgx_cg_root, charge_mm, SGX_NR_TO_SCAN);
 }
 
+/**
+ * sgx_cgroup_reclaim_direct() - Preemptive reclamation.
+ *
+ * Scan and attempt to reclaim %SGX_NR_TO_SCAN pages as a best effort to allow
+ * the caller to make quick progress.
+ */
+void sgx_cgroup_reclaim_direct(void)
+{
+   struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+
+   if (sgx_cgroup_should_reclaim(sgx_cg))
+   sgx_cgroup_reclaim_pages(sgx_cg, current->mm, SGX_NR_TO_SCAN);
+   sgx_put_cg(sgx_cg);
+}
+
 /*
  * Asynchronous work flow to reclaim pages from the cgroup when the cgroup is
  * at/near its maximum capacity.
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index c0390111e28c..cf2b946d993e 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -38,6 +38,8 @@ static inline void __init sgx_cgroup_register(void) { }
 
 static inline void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm) { }
 
+static inline void sgx_cgroup_reclaim_direct(void) { }
+
 #else /* CONFIG_CGROUP_MISC */
 
 struct sgx_cgroup {
@@ -90,6 +92,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
 void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
+void sgx_cgroup_reclaim_direct(void);
 int __init sgx_cgroup_init(void);
 void __init sgx_cgroup_register(void);
 void __init sgx_cgroup_deinit(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d00cb012838b..9a8f91ebd21b 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -428,6 +428,10 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
  */
 void sgx_reclaim_direct(void)
 {
+   /* Reduce chance of per-cgroup reclamation for later allocation */
+   sgx_cgroup_reclaim_direct();
+
+   /* Reduce chance of the global reclamation for later allocation */
if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
sgx_reclaim_pages_global(current->mm);
 }
-- 
2.43.0




[PATCH v16 14/16] x86/sgx: Turn on per-cgroup EPC reclamation

2024-08-20 Thread Haitao Huang
From: Kristen Carlson Accardi 

Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_epc_page_lru() always
returns a reference to the global LRU.

Change sgx_epc_page_lru() to return the LRU of the cgroup in which the
given EPC page is allocated.

Update sgx_can_reclaim_global(), to check emptiness of LRUs of all
cgroups, and update sgx_reclaim_pages_global(), to utilize
sgx_cgroup_reclaim_pages_global(), when EPC cgroup is enabled.

With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
---
V16:
- Separated out the global and direct reclamation to earlier patch.(Kai)

V14:
- Update global reclamation to use the new sgx_cgroup_reclaim_pages() to
iterate cgroups at lower level if the top cgroups are too busy.

V13:
- Use IS_ENABLED(CONFIG_CGROUP_MISC) in sgx_can_reclaim_global(). (Kai)

V12:
- Remove CONFIG_CGROUP_SGX_EPC, conditional compile SGX Cgroup for
CONFIG_CGROUP_MISC. (Jarkko)

V11:
- Reword the comments for global reclamation for allocation failure
after passing cgroup charging. (Kai)
- Add stub functions to remove ifdefs in c file (Kai)
- Add more detailed comments to clarify each page belongs to one cgroup, or the
root. (Kai)

V10:
- Add comment to clarify each page belongs to one cgroup, or the root by
default. (Kai)
- Merge the changes that expose sgx_cgroup_* functions to this patch.
- Add changes for sgx_reclaim_direct() that was missed previously.

V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
 arch/x86/kernel/cpu/sgx/epc_cgroup.c |  2 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  6 
 arch/x86/kernel/cpu/sgx/main.c   | 45 ++--
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index b7d60b2d878d..c3f0c7bc13c6 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -162,7 +162,7 @@ static inline u64 sgx_cgroup_max_pages_to_root(struct 
sgx_cgroup *sgx_cg)
  *
  * Return: %true if all cgroups under the specified root have empty LRU lists.
  */
-static bool sgx_cgroup_lru_empty(struct misc_cg *root)
+bool sgx_cgroup_lru_empty(struct misc_cg *root)
 {
struct cgroup_subsys_state *css_root;
struct cgroup_subsys_state *pos;
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h 
b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index cf2b946d993e..cd957cf38204 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -27,6 +27,11 @@ static inline int sgx_cgroup_try_charge(struct sgx_cgroup 
*sgx_cg, enum sgx_recl
 
 static inline void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg) { }
 
+static inline bool sgx_cgroup_lru_empty(struct misc_cg *root)
+{
+   return true;
+}
+
 static inline int __init sgx_cgroup_init(void)
 {
return 0;
@@ -91,6 +96,7 @@ static inline void sgx_put_cg(struct sgx_cgroup *sgx_cg)
 
 int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim);
 void sgx_cgroup_uncharge(struct sgx_cgroup *sgx_cg);
+bool sgx_cgroup_lru_empty(struct misc_cg *root);
 void sgx_cgroup_reclaim_pages_global(struct mm_struct *charge_mm);
 void sgx_cgroup_reclaim_direct(void);
 int __init sgx_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 9a8f91ebd21b..2a23a10d882e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,9 +32,30 @@ static DEFINE_XARRAY(sgx_epc_address_space);
  */
 static struct sgx_epc_lru_list sgx_global_lru;
 
+/*
+ * Get the per-cgroup or global LRU list that tracks the given reclaimable 
page.
+ */
 static inline struct sgx_epc_lru_list *sgx_epc_page_lru(struct sgx_epc_page 
*epc_page)
 {
+#ifdef CONFIG_CGROUP_MISC
+   /*
+* epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
+* life between sgx_alloc_epc_page() and sgx_free_epc_page():
+*
+* In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
+* sgx_get_current_cg() which is the misc cgroup of the current task, or
+* the root by default even if the misc cgroup is disabled by kernel
+* command line.
+*
+* epc_page->sgx_cg is only unset by sgx_free_epc_page().
+*
+* This function is never used before sgx_alloc_epc_page() or after
+* sgx_free_epc_page().
+*/
+   return &epc_page->sgx_cg->lru;
+#else
return &sgx_global_lru;
+#endif
 }
 
 /*
@@ -42,14 +63,10 @@ static inline struct sgx_epc_lru_list 
*sgx_epc_page_lru(struct sgx_epc_page *epc

[PATCH v16 15/16] Docs/x86/sgx: Add description for cgroup support

2024-08-20 Thread Haitao Huang
From: Sean Christopherson 

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kristen Carlson Accardi 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang
Signed-off-by: Haitao Huang
Cc: Sean Christopherson 
Reviewed-by: Bagas Sanjaya 
Reviewed-by: Jarkko Sakkinen 
Acked-by: Kai Huang 
Tested-by: Mikko Ylinen 
Tested-by: Jarkko Sakkinen 
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommits EPC with a cgroup (Mikko)
---
 Documentation/arch/x86/sgx.rst | 83 ++
 1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call.  It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. 
must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running.  The EPC is managed by the SGX subsystem 
and
+is not accounted by the memory controller.  Note that this is true only for EPC
+memory itself, i.e.  normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing 
store
+(normal system memory allocated via shmem).  The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+  misc.capacity
+A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+resource will show the total amount of EPC memory available on the
+platform.
+
+  misc.current
+A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+resource will show the current active EPC memory usage of the cgroup 
and
+its descendants. EPC pages that are swapped out to backing RAM are not
+included in the current count.
+
+  misc.max
+A read-write single value file which exists on non-root cgroups. The
+sgx_epc resource will show the EPC usage hard limit. The default is
+"max".
+
+If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+page fault handling, will be blocked until EPC can be reclaimed from 
the
+cgroup. If there are no pages left that are reclaimable within the same
+group, the kernel returns ENOMEM.
+
+The EPC pages allocated for a guest VM by the virtual EPC driver are 
not
+reclaimable by the host kernel. In case the guest cgroup's limit is
+reached and no reclaimable pages left in the same cgroup, the virtual
+EPC driver returns SIGBUS to the user space process to indicate failure
+on new EPC allocation requests.
+
+The misc.max limit is non-preemptive. If a user writes a limit lower
+than the current usage to this file, the cgroup will not preemptively
+deallocate pages currently in use, and will only start blocking the 
next
+allocation and reclaiming EPC at that time.
+
+  misc.events
+A read-only flat-keyed file which exists on non-root c

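The PAGE_SIZE rounding rule documented above for the sgx_epc interface files can be sanity-checked with plain shell arithmetic. This is a sketch only: the 4096-byte page size is an assumption (the real value is architecture-specific), and the function name is made up for illustration.

```shell
# Mirror of the documented rule: a value written to an sgx_epc interface
# file is rounded down to the closest PAGE_SIZE multiple.
PAGE_SIZE=4096   # assumed page size

round_down_to_page() {
    echo $(( ($1 / PAGE_SIZE) * PAGE_SIZE ))
}

round_down_to_page 10000   # not page-aligned -> 8192
round_down_to_page 8192    # already aligned  -> 8192
```

Writing, say, 10000 to misc.max therefore behaves as if 8192 had been written.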
[PATCH v16 16/16] selftests/sgx: Add scripts for EPC cgroup testing

2024-08-20 Thread Haitao Huang
With different cgroups, the script starts one or multiple concurrent SGX
selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
test case, which loads an enclave of EPC size equal to the EPC capacity
available on the platform. The script checks results against the
expectation set for each cgroup and reports success or failure.

The script creates 3 different cgroups at the beginning with following
expectations:

1) small - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) large - large enough to run up to 4 concurrent tests but fail some if
more than 4 concurrent tests are run. The script starts 4 expecting at
least one test to pass, and then starts 5 expecting at least one test
to fail.
3) larger - limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 8 of them and expects all pass.
Then it reruns the same test with one process randomly killed and
usage checked to be zero after all processes exit.

The script also includes a test with low mem_cg limit and large sgx_epc
limit to verify that the RAM used for per-cgroup reclamation is charged
to a proper mem_cg. For this test, it turns off swapping before start,
and turns swapping back on afterwards.
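As a rough illustration of the three-cgroup setup described above (not the actual selftest script: the cgroup names, limit values, and the "sgx_epc <bytes>" write format are assumptions here, and CG_ROOT stands in for the cgroup v2 mount point so the sketch runs without root):

```shell
# Sketch: small (enclave load should fail), large (~4 concurrent tests),
# larger (limit == capacity, all tests should pass).
CG_ROOT=$(mktemp -d)                    # /sys/fs/cgroup on a real system
CAPACITY=$((4 * 1024 * 1024 * 1024))    # made-up EPC capacity in bytes

setup_cg() {    # args: name limit-in-bytes
    mkdir -p "$CG_ROOT/$1"
    # The misc controller takes "<resource> <value>" writes to misc.max.
    echo "sgx_epc $2" > "$CG_ROOT/$1/misc.max"
}

setup_cg sgx_test_small  $((CAPACITY / 100))
setup_cg sgx_test_large  $((CAPACITY / 2))
setup_cg sgx_test_larger "$CAPACITY"
```

On a real system the test processes would then be started inside each group before the results are compared against the expectations listed above.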

Add README to document how to run the tests.

Signed-off-by: Haitao Huang 
Reviewed-by: Jarkko Sakkinen 
Tested-by: Jarkko Sakkinen 
---
V13:
- More improvement on handling error cases and style fixes.
- Add settings file for custom timeout

V12:
- Integrate the scripts to the "run_tests" target. (Jarkko)

V11:
- Remove cgroups-tools dependency and make scripts ash compatible. (Jarkko)
- Drop support for cgroup v1 and simplify. (Michal, Jarkko)
- Add documentation for functions. (Jarkko)
- Turn off swapping before memcontrol tests and back on after
- Format and style fixes, name for hard coded values

V7:
- Added memcontrol test.

V5:
- Added script with automatic results checking, remove the interactive
script.
- The script can run independent from the series below.
---
 tools/testing/selftests/sgx/Makefile  |   3 +-
 tools/testing/selftests/sgx/README| 109 +++
 tools/testing/selftests/sgx/ash_cgexec.sh |  16 +
 tools/testing/selftests/sgx/config|   4 +
 .../selftests/sgx/run_epc_cg_selftests.sh | 294 ++
 tools/testing/selftests/sgx/settings  |   2 +
 .../selftests/sgx/watch_misc_for_tests.sh |  11 +
 7 files changed, 438 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/sgx/README
 create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
 create mode 100644 tools/testing/selftests/sgx/config
 create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
 create mode 100644 tools/testing/selftests/sgx/settings
 create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh

diff --git a/tools/testing/selftests/sgx/Makefile 
b/tools/testing/selftests/sgx/Makefile
index 03b5e13b872b..3e673b8ace3f 100644
--- a/tools/testing/selftests/sgx/Makefile
+++ b/tools/testing/selftests/sgx/Makefile
@@ -20,7 +20,8 @@ ENCL_LDFLAGS := -Wl,-T,test_encl.lds,--build-id=none
 
 ifeq ($(CAN_BUILD_X86_64), 1)
 TEST_CUSTOM_PROGS := $(OUTPUT)/test_sgx
-TEST_FILES := $(OUTPUT)/test_encl.elf
+TEST_FILES := $(OUTPUT)/test_encl.elf ash_cgexec.sh
+TEST_PROGS := run_epc_cg_selftests.sh
 
 all: $(TEST_CUSTOM_PROGS) $(OUTPUT)/test_encl.elf
 endif
diff --git a/tools/testing/selftests/sgx/README 
b/tools/testing/selftests/sgx/README
new file mode 100644
index 000000000000..f84406bf29a4
--- /dev/null
+++ b/tools/testing/selftests/sgx/README
@@ -0,0 +1,109 @@
+SGX selftests
+
+The SGX selftests include a C program (test_sgx) that covers basic user-space
+facing APIs and a shell script (run_epc_cg_selftests.sh) testing the SGX misc
+cgroup. The SGX cgroup test script requires root privileges and runs a
+specific test case of test_sgx in different cgroups configured by the
+script. More details about the cgroup test can be found below.
+
+All SGX selftests can run with or without kselftest framework.
+
+WITH KSELFTEST FRAMEWORK
+========================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from top level directory of the kernel source:
+ $ make -C tools/testing/selftests TARGETS=sgx
+
+RUN
+---
+
+Run all SGX tests as sudo or root since the cgroup tests need to configure
+cgroup limits in files under /sys/fs/cgroup.
+
+ $ sudo make -C tools/testing/selftests/sgx run_tests
+
+Without sudo, SGX cgroup tests will be skipped.
+
+On platforms with a large Enclave Page Cache (EPC) and/or fewer CPU cores, you
+may need to adjust the timeout in 'settings' to avoid timeouts.
+
+More details about kselftest framework can be found in
+Documentation/dev-tools/kselftest.rst.
+
+WITHOUT KSELFTEST FRAMEWORK
+===========================
+
+BUILD
+-----
+
+Build executable file "test_sgx" from this
+directory (tools/testing/selftests/sgx/):
+
+  $ make
+
+RUN
+---
+
+Run all non-cgrou

Re: [PATCH] virtio_pmem: Check device status before requesting flush

2024-08-20 Thread Philip Chen
Hi,

On Tue, Aug 20, 2024 at 7:23 AM Ira Weiny  wrote:
>
> Philip Chen wrote:
> > On Mon, Aug 19, 2024 at 2:56 PM Ira Weiny  wrote:
> > >
> > > Philip Chen wrote:
> > > > If a pmem device is in a bad status, the driver side could wait for
> > > > host ack forever in virtio_pmem_flush(), causing the system to hang.
> > >
> > > I assume this was supposed to be v2 and you resent this as a proper v2
> > > with a change list from v1?
> > Ah...yes, I'll fix it and re-send it as a v2 patch.
>
> Wait didn't you already do that?  Wasn't this v2?

Yes, but somehow the patch didn't go to my inbox.
(Maybe it's because there is no code change between v1 and v2?)
So I resent another v2 (with some minor change to the comment):
https://lore.kernel.org/all/20240820172256.903251-1-philipc...@chromium.org/
Please take a look.

>
> https://lore.kernel.org/all/20240815010337.2334245-1-philipc...@chromium.org/
>
> Ira



Re: [PATCH v2] virtio_pmem: Check device status before requesting flush

2024-08-20 Thread Philip Chen
Hi,

On Tue, Aug 20, 2024 at 1:01 PM Dave Jiang  wrote:
>
>
>
> On 8/20/24 10:22 AM, Philip Chen wrote:
> > If a pmem device is in a bad status, the driver side could wait for
> > host ack forever in virtio_pmem_flush(), causing the system to hang.
> >
> > So add a status check in the beginning of virtio_pmem_flush() to return
> > early if the device is not activated.
> >
> > Signed-off-by: Philip Chen 
> > ---
> >
> > v2:
> > - Remove change id from the patch description
> > - Add more details to the patch description
> >
> >  drivers/nvdimm/nd_virtio.c | 9 +
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/nvdimm/nd_virtio.c b/drivers/nvdimm/nd_virtio.c
> > index 35c8fbbba10e..97addba06539 100644
> > --- a/drivers/nvdimm/nd_virtio.c
> > +++ b/drivers/nvdimm/nd_virtio.c
> > @@ -44,6 +44,15 @@ static int virtio_pmem_flush(struct nd_region *nd_region)
> >   unsigned long flags;
> >   int err, err1;
> >
> > + /*
> > +  * Don't bother to submit the request to the device if the device is
> > +  * not acticated.
>
> s/acticated/activated/

Thanks for the review.
I'll fix this typo in v3.

In addition to this typo, does anyone have any other concerns?

>
> > +  */
> > + if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_NEEDS_RESET) {
> > + dev_info(&vdev->dev, "virtio pmem device needs a reset\n");
> > + return -EIO;
> > + }
> > +
> >   might_sleep();
> >   req_data = kmalloc(sizeof(*req_data), GFP_KERNEL);
> >   if (!req_data)



[PATCH v7 4/4] module: add install target for modules.builtin.ranges

2024-08-20 Thread Kris Van Hees
When CONFIG_BUILTIN_MODULE_RANGES is enabled, the modules.builtin.ranges
file should be installed in the module install location.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
---
Changes since v3:
 - Only install modules.builtin.ranges if CONFIG_BUILTIN_MODULE_RANGES=y
---
 scripts/Makefile.modinst | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst
index 0afd75472679..c38bf63a33be 100644
--- a/scripts/Makefile.modinst
+++ b/scripts/Makefile.modinst
@@ -30,10 +30,12 @@ $(MODLIB)/modules.order: modules.order FORCE
 quiet_cmd_install_modorder = INSTALL $@
   cmd_install_modorder = sed 's:^\(.*\)\.o$$:kernel/\1.ko:' $< > $@
 
-# Install modules.builtin(.modinfo) even when CONFIG_MODULES is disabled.
+# Install modules.builtin(.modinfo,.ranges) even when CONFIG_MODULES is 
disabled.
 install-y += $(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo)
 
-$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo): $(MODLIB)/%: 
% FORCE
+install-$(CONFIG_BUILTIN_MODULE_RANGES) += $(MODLIB)/modules.builtin.ranges
+
+$(addprefix $(MODLIB)/, modules.builtin modules.builtin.modinfo 
modules.builtin.ranges): $(MODLIB)/%: % FORCE
$(call cmd,install)
 
 endif
-- 
2.45.2




[PATCH v7 1/4] kbuild: add mod(name,file)_flags to assembler flags for module objects

2024-08-20 Thread Kris Van Hees
In order to create the file at build time, modules.builtin.ranges, that
contains the range of addresses for all built-in modules, there needs to
be a way to identify what code is compiled into modules.

To identify what code is compiled into modules during a kernel build,
one can look for the presence of the -DKBUILD_MODFILE and -DKBUILD_MODNAME
options in the compile command lines.  A simple grep in .*.cmd files for
those options is sufficient for this.

Unfortunately, these options are only passed when compiling C source files.
Various modules also include objects built from assembler source, and these
options are not passed in that case.

Adding $(modfile_flags) to modkern_aflags (similar to modkern_cflags), and
adding $(modname_flags) to a_flags (similar to c_flags) makes it possible
to identify which objects are compiled into modules for both C and
assembler source files.

Signed-off-by: Kris Van Hees 
Reviewed-by: Steven Rostedt (Google) 
---
 scripts/Makefile.lib | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index fe3668dc4954..170f462537a8 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -238,7 +238,7 @@ modkern_rustflags = 
 \
 
 modkern_aflags = $(if $(part-of-module),   \
$(KBUILD_AFLAGS_MODULE) $(AFLAGS_MODULE),   \
-   $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL))
+   $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL) 
$(modfile_flags))
 
 c_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
 -include $(srctree)/include/linux/compiler_types.h   \
@@ -248,7 +248,7 @@ c_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) 
$(LINUXINCLUDE) \
 rust_flags = $(_rust_flags) $(modkern_rustflags) 
@$(objtree)/include/generated/rustc_cfg
 
 a_flags= -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
-$(_a_flags) $(modkern_aflags)
+$(_a_flags) $(modkern_aflags) $(modname_flags)
 
 cpp_flags  = -Wp,-MMD,$(depfile) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) \
 $(_cpp_flags)
-- 
2.45.2




[PATCH v7 3/4] scripts: add verifier script for builtin module range data

2024-08-20 Thread Kris Van Hees
The modules.builtin.ranges offset range data for builtin modules is
generated at compile time based on the list of built-in modules and
the vmlinux.map and vmlinux.o.map linker maps.  This data can be used
to determine whether a symbol at a particular address belongs to
module code that was configured to be compiled into the kernel proper
as a built-in module (rather than as a standalone module).

This patch adds a script that uses the generated modules.builtin.ranges
data to annotate the symbols in the System.map with module names if
their address falls within a range that belongs to one or more built-in
modules.

It then processes the vmlinux.map (and if needed, vmlinux.o.map) to
verify the annotation:

  - For each top-level section:
 - For each object in the section:
- Determine whether the object is part of a built-in module
  (using modules.builtin and the .*.cmd file used to compile
   the object as suggested in [0])
- For each symbol in that object, verify that the built-in
  module association (or lack thereof) matches the annotation
  given to the symbol.

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
Reviewed-by: Alan Maguire 
---
Changes since v6:
 - Applied Masahiro Yamada's suggestions to the AWK script.

Changes since v5:
 - Added optional 6th argument to specify kernel build directory.
 - Report error and exit if .*.o.cmd files cannot be read.

Changes since v4:
 - New patch in the series
---
 scripts/verify_builtin_ranges.awk | 356 ++
 1 file changed, 356 insertions(+)
 create mode 100755 scripts/verify_builtin_ranges.awk

diff --git a/scripts/verify_builtin_ranges.awk 
b/scripts/verify_builtin_ranges.awk
new file mode 100755
index 000000000000..93f66e9a8802
--- /dev/null
+++ b/scripts/verify_builtin_ranges.awk
@@ -0,0 +1,356 @@
+#!/usr/bin/gawk -f
+# SPDX-License-Identifier: GPL-2.0
+# verify_builtin_ranges.awk: Verify address range data for builtin modules
+# Written by Kris Van Hees 
+#
+# Usage: verify_builtin_ranges.awk modules.builtin.ranges System.map \
+# modules.builtin vmlinux.map vmlinux.o.map \
+# [ <kernel-build-dir> ]
+#
+
+# Return the module name(s) (if any) associated with the given object.
+#
+# If we have seen this object before, return information from the cache.
+# Otherwise, retrieve it from the corresponding .cmd file.
+#
+function get_module_info(fn, mod, obj, s) {
+   if (fn in omod)
+   return omod[fn];
+
+   if (match(fn, /\/[^/]+$/) == 0)
+   return "";
+
+   obj = fn;
+   mod = "";
+   fn = kdir "/" substr(fn, 1, RSTART) "." substr(fn, RSTART + 1) ".cmd";
+	if (getline s <fn == 1) {
+		if (match(s, /DKBUILD_MODFILE=['"]+[^'"]+/) > 0) {
+			mod = substr(s, RSTART + 16, RLENGTH - 16);
+			gsub(/['"]/, "", mod);
+		}
+   } else {
+   print "ERROR: Failed to read: " fn "\n\n" \
+ "  Invalid kernel build directory (" kdir ")\n" \
+ "  or its content does not match " ARGV[1] >"/dev/stderr";
+   close(fn);
+   total = 0;
+   exit(1);
+   }
+   close(fn);
+
+   # A single module (common case) also reflects objects that are not part
+   # of a module.  Some of those objects have names that are also a module
+   # name (e.g. core).  We check the associated module file name, and if
+   # they do not match, the object is not part of a module.
+   if (mod !~ / /) {
+   if (!(mod in mods))
+   mod = "";
+   }
+
+   gsub(/([^/ ]*\/)+/, "", mod);
+   gsub(/-/, "_", mod);
+
+   # At this point, mod is a single (valid) module name, or a list of
+   # module names (that do not need validation).
+   omod[obj] = mod;
+   close(fn);
+
+   return mod;
+}
+
+# Return a representative integer value for a given hexadecimal address.
+#
+# Since all kernel addresses fall within the same memory region, we can safely
+# strip off the first 6 hex digits before performing the hex-to-dec conversion,
+# thereby avoiding integer overflows.
+#
+function addr2val(val) {
+   sub(/^0x/, "", val);
+   if (length(val) == 16)
+   val = substr(val, 5);
+   return strtonum("0x" val);
+}
+
+# Determine the kernel build directory to use (default is .).
+#
+BEGIN {
+   if (ARGC > 6) {
+   kdir = ARGV[ARGC - 1];
+   ARGV[ARGC - 1] = "";
+   } else
+   kdir = ".";
+}
+
+# (1) Load the built-in module address range data.
+#
+ARGIND == 1 {
+   ranges[FNR] = $0;
+   rcnt++;
+   next;
+}
+
+# (2) Annotate System.map symbols with module names.
+#
+ARGIND == 2 {
+   addr = addr2val($1);
+   name = $3;
+
+   while (addr >= mod_eaddr) {
+   if (sect_symb) {
+   if (sect_symb != name)
+   

[PATCH v7 2/4] kbuild: generate offset range data for builtin modules

2024-08-20 Thread Kris Van Hees
Create file modules.builtin.ranges that can be used to find where
built-in modules are located by their addresses. This will be useful for
tracing tools to find which built-in module the functions at various
addresses belong to.

The offset range data for builtin modules is generated using:
 - modules.builtin: associates object files with module names
 - vmlinux.map: provides load order of sections and offset of first member
per section
 - vmlinux.o.map: provides offset of object file content per section
 - .*.cmd: build cmd file with KBUILD_MODFILE

The generated data will look like:

.text 00000000-00000000 = _text
.text 0000baf0-0000cb10 amd_uncore
.text 0009bd10-0009c8e0 iosf_mbi
...
.text 00b9f080-00ba011a intel_skl_int3472_discrete
.text 00ba0120-00ba03c0 intel_skl_int3472_discrete intel_skl_int3472_tps68470
.text 00ba03c0-00ba08d6 intel_skl_int3472_tps68470
...
.data 00000000-00000000 = _sdata
.data 0000f020-0000f680 amd_uncore

For each ELF section, it lists the offset of the first symbol.  This can
be used to determine the base address of the section at runtime.

Next, it lists (in strict ascending order) offset ranges in that section
that cover the symbols of one or more builtin modules.  Multiple ranges
can apply to a single module, and ranges can be shared between modules.

The CONFIG_BUILTIN_MODULE_RANGES option controls whether offset range data
is generated for kernel modules that are built into the kernel image.
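
A consumer of this data might resolve a section offset to module names along these lines. This is a sketch only: the sample data repeats the example above, the local "ranges" file name is made up, and the real anchor-symbol handling (recovering a section's runtime base address) is omitted.

```shell
# Look up which built-in module(s) cover a given offset in a section,
# using the modules.builtin.ranges format shown above.
cat > ranges <<'EOF'
.text 00000000-00000000 = _text
.text 0000baf0-0000cb10 amd_uncore
.text 0009bd10-0009c8e0 iosf_mbi
.text 00ba0120-00ba03c0 intel_skl_int3472_discrete intel_skl_int3472_tps68470
EOF

lookup() {    # args: section offset-in-hex
    want=$((0x$2))
    while read -r sect range mods; do
        [ "$sect" = "$1" ] || continue
        case "$mods" in "= "*) continue ;; esac   # skip anchor-symbol lines
        start=$((0x${range%-*})); end=$((0x${range#*-}))
        [ "$want" -ge "$start" ] && [ "$want" -lt "$end" ] && echo "$mods"
    done < ranges
}

lookup .text 0000c000    # -> amd_uncore
```

Note that ranges are half-open (the end offset of one range can be the start of the next), hence the strict less-than comparison on the end.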

How it works:

  1. The modules.builtin file is parsed to obtain a list of built-in
 module names and their associated object names (the .ko file that
 the module would be in if it were a loadable module, hereafter
 referred to as <modfile>).  This object name can be used to
 identify objects in the kernel compile because any C or assembler
 code that ends up in a built-in module will have the option
 -DKBUILD_MODFILE=<modfile> present in its build command, and those
 can be found in the .<object>.cmd file in the kernel build tree.

 If an object is part of multiple modules, they will all be listed
 in the KBUILD_MODFILE option argument.

 This allows us to conclusively determine whether an object in the
 kernel build belongs to any modules, and which.

 2. The vmlinux.map is parsed next to determine the base address of each
top level section so that all addresses into the section can be
turned into offsets.  This makes it possible to handle sections
getting loaded at different addresses at system boot.

We also determine an 'anchor' symbol at the beginning of each
section to make it possible to calculate the true base address of
a section at runtime (i.e. symbol address - symbol offset).

We collect start addresses of sections that are included in the top
level section.  This is used when vmlinux is linked using vmlinux.o,
because in that case, we need to look at the vmlinux.o linker map to
know what object a symbol is found in.

And finally, we process each symbol that is listed in vmlinux.map
(or vmlinux.o.map) based on the following structure:

vmlinux linked from vmlinux.a:

  vmlinux.map:
    <top-level section>
      <section>         -- might be same as top level section
        <object>        -- built-in association known
          <symbol>      -- belongs to module(s) object belongs to
      ...

vmlinux linked from vmlinux.o:

  vmlinux.map:
    <top-level section>
      <section>         -- might be same as top level section
        vmlinux.o       -- need to use vmlinux.o.map
          <symbol>      -- ignored
      ...

  vmlinux.o.map:
    <top-level section>
      <object>          -- built-in association known
        <symbol>        -- belongs to module(s) object belongs to
      ...

 3. As sections, objects, and symbols are processed, offset ranges are
constructed in a straightforward way:

  - If the symbol belongs to one or more built-in modules:
  - If we were working on the same module(s), extend the range
to include this object
  - If we were working on another module(s), close that range,
and start the new one
  - If the symbol does not belong to any built-in modules:
  - If we were working on a module(s) range, close that range

Signed-off-by: Kris Van Hees 
Reviewed-by: Nick Alcock 
Reviewed-by: Alan Maguire 
Reviewed-by: Steven Rostedt (Google) 
---
Changes since v6:
 - Applied Masahiro Yamada's suggestions (Kconfig, makefile, script).

Changes since v5:
 - Removed unnecessary compatibility info from option description.

Changes since v4:
 - Improved commit description to explain the why and how.
 - Documented dependency on GNU AWK for CONFIG_BUILTIN_MODULE_RANGES.
 - Improved comments in generate_builtin_ranges.awk
 - Improved logic in generate_builtin_ranges.awk to handle incorrect
   object size information in linker maps

Changes since v3:
 - Consolidated patches 2 through 5 into a single patch
 - Move

Re: [RFC 7/7] vhost: Add new UAPI to support changing Kthread mode

2024-08-20 Thread Christoph Hellwig
On Mon, Aug 19, 2024 at 05:27:33PM +0800, Cindy Lu wrote:
> Add a new UAPI to support setting the vhost device to
> use kthread mode. The user space application needs to use
> VHOST_SET_USE_KTHREAD to set the mode. This setting must
> be set before VHOST_SET_OWNER is set.

Usage of an API is a complete kernel internal detail that has absolutely
no business being exposed to applications.

What is the application visible behavior that the API use is the proxy
for?




Re: [syzbot] [perf?] KASAN: slab-use-after-free Read in uprobe_unregister

2024-08-20 Thread Oleg Nesterov
#syz dup: [syzbot] [perf?] KASAN: slab-use-after-free Read in 
__uprobe_unregister syzbot

https://lore.kernel.org/all/382d39061f59f...@google.com/

Hopefully fixed by
[PATCH tip/perf/core] bpf: fix use-after-free in bpf_uprobe_multi_link_attach()
https://lore.kernel.org/all/20240813152524.ga7...@redhat.com/

On 08/20, syzbot wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:367b5c3d53e5 Add linux-next specific files for 20240816
> git tree:   linux-next
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=109d46a398
> kernel config:  https://syzkaller.appspot.com/x/.config?x=61ba6f3b22ee5467
> dashboard link: https://syzkaller.appspot.com/bug?extid=b32fa80fe75273685f9c
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1099b7d398
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15c513e598
>
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/0b1b4e3cad3c/disk-367b5c3d.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/5bb090f7813c/vmlinux-367b5c3d.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/6674cb0709b1/bzImage-367b5c3d.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b32fa80fe75273685...@syzkaller.appspotmail.com
>
> R10: 0001 R11: 0246 R12: 
> R13:  R14: 0001 R15: 0001
>  
> ==
> BUG: KASAN: slab-use-after-free in consumer_del kernel/events/uprobes.c:772 
> [inline]
> BUG: KASAN: slab-use-after-free in uprobe_unregister+0x99/0x220 
> kernel/events/uprobes.c:1097
> Read of size 8 at addr 888022b9eb40 by task syz-executor227/5226
>
> CPU: 0 UID: 0 PID: 5226 Comm: syz-executor227 Not tainted 
> 6.11.0-rc3-next-20240816-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 08/06/2024
> Call Trace:
>  <TASK>
>  __dump_stack lib/dump_stack.c:94 [inline]
>  dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
>  print_address_description mm/kasan/report.c:377 [inline]
>  print_report+0x169/0x550 mm/kasan/report.c:488
>  kasan_report+0x143/0x180 mm/kasan/report.c:601
>  consumer_del kernel/events/uprobes.c:772 [inline]
>  uprobe_unregister+0x99/0x220 kernel/events/uprobes.c:1097
>  bpf_uprobe_unregister kernel/trace/bpf_trace.c:3119 [inline]
>  bpf_uprobe_multi_link_release+0x8c/0x1b0 kernel/trace/bpf_trace.c:3127
>  bpf_link_free+0xf5/0x250 kernel/bpf/syscall.c:2998
>  bpf_link_put_direct kernel/bpf/syscall.c:3038 [inline]
>  bpf_link_release+0x7b/0x90 kernel/bpf/syscall.c:3045
>  __fput+0x24b/0x890 fs/file_table.c:424
>  task_work_run+0x24f/0x310 kernel/task_work.c:228
>  exit_task_work include/linux/task_work.h:40 [inline]
>  do_exit+0xa2f/0x28e0 kernel/exit.c:939
>  do_group_exit+0x207/0x2c0 kernel/exit.c:1088
>  __do_sys_exit_group kernel/exit.c:1099 [inline]
>  __se_sys_exit_group kernel/exit.c:1097 [inline]
>  __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
>  x64_sys_call+0x26a8/0x26b0 arch/x86/include/generated/asm/syscalls_64.h:232
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> RIP: 0033:0x7f409a857c79
> Code: Unable to access opcode bytes at 0x7f409a857c4f.
> RSP: 002b:7fff689d0b48 EFLAGS: 0246 ORIG_RAX: 00e7
> RAX: ffda RBX:  RCX: 7f409a857c79
> RDX: 003c RSI: 00e7 RDI: 
> RBP: 7f409a8cb370 R08: ffb8 R09: 00a0
> R10: 0001 R11: 0246 R12: 7f409a8cb370
> R13:  R14: 7f409a8cbdc0 R15: 7f409a821340
>  </TASK>
>
> Allocated by task 5226:
>  kasan_save_stack mm/kasan/common.c:47 [inline]
>  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
>  poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
>  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
>  kasan_kmalloc include/linux/kasan.h:260 [inline]
>  __do_kmalloc_node mm/slub.c:4202 [inline]
>  __kmalloc_node_noprof+0x22a/0x440 mm/slub.c:4208
>  __kvmalloc_node_noprof+0x72/0x1b0 mm/util.c:658
>  kvmalloc_array_node_noprof include/linux/slab.h:835 [inline]
>  bpf_uprobe_multi_link_attach+0x43e/0xb90 kernel/trace/bpf_trace.c:3369
>  link_create+0x6e9/0x870 kernel/bpf/syscall.c:5236
>  __sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5697
>  __do_sys_bpf kernel/bpf/syscall.c:5734 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5732 [inline]
>  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5732
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> Freed by task 5226:
>  kasan_save_stack mm/kasan/common.c:47 [inline]
>  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68

Re: [syzbot] [perf?] KASAN: slab-use-after-free Read in uprobe_unregister

2024-08-20 Thread syzbot
> #syz dup: [syzbot] [perf?] KASAN: slab-use-after-free Read in __uprobe_unregister

can't find the dup bug

>
> https://lore.kernel.org/all/382d39061f59f...@google.com/
>
> Hopefully fixed by
> [PATCH tip/perf/core] bpf: fix use-after-free in 
> bpf_uprobe_multi_link_attach()
> https://lore.kernel.org/all/20240813152524.ga7...@redhat.com/
>
> On 08/20, syzbot wrote:
[...]

Re: [PATCH] remoteproc: k3-r5: Fix error handling when power-up failed

2024-08-20 Thread Beleswar Prasad Padhi



On 19-08-2024 20:54, Jan Kiszka wrote:
> From: Jan Kiszka 
>
> By simply bailing out, the driver was violating its rule and internal
> assumptions that either both or no rproc should be initialized. E.g.,
> this could cause the first core to be available but not the second one,
> leading to crashes on its shutdown later on while trying to dereference
> that second instance.

If the rprocs were registered with the device lifecycle managed
devm_rproc_add(), simply bailing out with an error code would work.

> Fixes: 61f6f68447ab ("remoteproc: k3-r5: Wait for core0 power-up before powering up core1")
> Signed-off-by: Jan Kiszka 
> ---
>  drivers/remoteproc/ti_k3_r5_remoteproc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> index 39a47540c590..eb09d2e9b32a 100644
> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> @@ -1332,7 +1332,7 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev)
> 			dev_err(dev,
> 				"Timed out waiting for %s core to power up!\n",
> 				rproc->name);
> -			return ret;
> +			goto err_powerup;
> 		}
> 	}
>
> @@ -1348,6 +1348,7 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev)
> 		}
> 	}
>
> +err_powerup:
> 	rproc_del(rproc);

Please use devm_rproc_add() to avoid having to do rproc_del() manually
here.

> err_add:
> 	k3_r5_reserved_mem_exit(kproc);