Re: [RFC PATCH v1 0/2] send credit update during setting SO_RCVLOWAT

2023-11-15 Thread Stefano Garzarella

On Wed, Nov 08, 2023 at 10:20:02AM +0300, Arseniy Krasnov wrote:

Hello,

  DESCRIPTION

This patchset fixes an old problem where both the rx and tx sides hang, and
adds a test for it. The hang happens due to a non-default SO_RCVLOWAT value
combined with the deferred credit update in virtio/vsock. Link to the
previous (old) patchset:
https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/

Here is what happens step by step:

 TEST

   INITIAL CONDITIONS

1) The vsock buffer size is 128KB.
2) The maximum packet size is 64KB, as defined in a kernel header (yes, it
   is hardcoded, just to remind about that value).
3) SO_RCVLOWAT is at its default, i.e. 1 byte.


STEPS

   SENDER                          RECEIVER
1) Sends 128KB + 1 byte in a
   single buffer. 128KB will be
   sent, but for the last byte
   the sender must wait for free
   space at the peer. The sender
   goes to sleep.

                                2) Reads 64KB; a credit update is
                                   not sent.
                                3) Sets SO_RCVLOWAT to 64KB + 1.
                                4) poll() -> waits forever, as there
                                   is only 64KB available to read.

So in step 4) the receiver also goes to sleep, waiting for enough data or
a connection shutdown message from the sender. The idea of the fix is that
the rx side kicks the tx side to continue transmission (and possibly close
the connection) when rx changes the number of bytes it must receive before
being woken up (i.e. SO_RCVLOWAT) and this value is bigger than the number
of bytes currently available to read.
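
For reference, the receiver-side sequence from steps 2)-4) boils down to
something like this (a minimal userspace sketch; 'fd' is assumed to be a
connected vsock stream socket and the sizes mirror the scenario above):

	#include <poll.h>
	#include <sys/socket.h>

	char buf[64 * 1024];
	int lowat = sizeof(buf) + 1;
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	recv(fd, buf, sizeof(buf), 0);          /* 2) dequeue 64KB, credit update deferred */
	setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, /* 3) raise the wakeup threshold */
		   &lowat, sizeof(lowat));
	poll(&pfd, 1, -1);                      /* 4) blocks forever without this fix */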

I've added a small test for this, but I'm not sure about it, as it uses a hardcoded value


Thanks for adding the test!


for the maximum packet length. This value is defined in a kernel header
and used to control the deferred credit update. As it is not available to
userspace, I can't set the test parameters reliably (if this define is
ever changed, the test may become useless).


I see, I'll leave a comment in the patch!

Thanks,
Stefano



Head for this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=ff269e2cd5adce4ae14f883fc9c8803bc43ee1e9

Arseniy Krasnov (2):
 virtio/vsock: send credit update during setting SO_RCVLOWAT
 vsock/test: SO_RCVLOWAT + deferred credit update test

drivers/vhost/vsock.c   |   2 +
include/linux/virtio_vsock.h|   1 +
net/vmw_vsock/virtio_transport.c|   2 +
net/vmw_vsock/virtio_transport_common.c |  31 ++
net/vmw_vsock/vsock_loopback.c  |   2 +
tools/testing/vsock/vsock_test.c| 131 
6 files changed, 169 insertions(+)

--
2.25.1






Re: [RFC PATCH v1 1/2] virtio/vsock: send credit update during setting SO_RCVLOWAT

2023-11-15 Thread Stefano Garzarella

On Wed, Nov 08, 2023 at 10:20:03AM +0300, Arseniy Krasnov wrote:

This adds sending a credit update message when SO_RCVLOWAT is updated and
its new value is bigger than the number of bytes in the rx queue. It is
needed because 'poll()' waits until the number of bytes in the rx queue is
not smaller than SO_RCVLOWAT, so the sender must be kicked to send more
data. Otherwise a mutual hang-up of tx and rx is possible: the sender
waits for free space while the receiver waits for data in 'poll()'.

Signed-off-by: Arseniy Krasnov 
---
drivers/vhost/vsock.c   |  2 ++
include/linux/virtio_vsock.h|  1 +
net/vmw_vsock/virtio_transport.c|  2 ++
net/vmw_vsock/virtio_transport_common.c | 31 +
net/vmw_vsock/vsock_loopback.c  |  2 ++
5 files changed, 38 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index f75731396b7e..ecfa5c11f5ee 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -451,6 +451,8 @@ static struct virtio_transport vhost_transport = {
.notify_buffer_size   = virtio_transport_notify_buffer_size,

.read_skb = virtio_transport_read_skb,
+
+   .set_rcvlowat = virtio_transport_set_rcvlowat
},

.send_pkt = vhost_transport_send_pkt,
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index ebb3ce63d64d..97dc1bebc69c 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -256,4 +256,5 @@ void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *list);
int virtio_transport_read_skb(struct vsock_sock *vsk, skb_read_actor_t read_actor);
+int virtio_transport_set_rcvlowat(struct vsock_sock *vsk, int val);
#endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index af5bab1acee1..cf3431189d0c 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -539,6 +539,8 @@ static struct virtio_transport virtio_transport = {
.notify_buffer_size   = virtio_transport_notify_buffer_size,

.read_skb = virtio_transport_read_skb,
+
+   .set_rcvlowat = virtio_transport_set_rcvlowat
},

.send_pkt = virtio_transport_send_pkt,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index e22c81435ef7..88a58163046e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1676,6 +1676,37 @@ int virtio_transport_read_skb(struct vsock_sock *vsk, skb_read_actor_t recv_actor)
}
EXPORT_SYMBOL_GPL(virtio_transport_read_skb);

+int virtio_transport_set_rcvlowat(struct vsock_sock *vsk, int val)
+{
+   struct virtio_vsock_sock *vvs = vsk->trans;
+   bool send_update = false;


I'd declare this without initializing it.


+
+   spin_lock_bh(&vvs->rx_lock);
+
+   /* If number of available bytes is less than new
+* SO_RCVLOWAT value, kick sender to send more
+* data, because sender may sleep in its 'send()'
+* syscall waiting for enough space at our side.
+*/
+   if (vvs->rx_bytes < val)
+   send_update = true;


Then here just:
send_update = vvs->rx_bytes < val;


+
+   spin_unlock_bh(&vvs->rx_lock);
+
+   if (send_update) {
+   int err;
+
+   err = virtio_transport_send_credit_update(vsk);
+   if (err < 0)
+   return err;
+   }
+
+   WRITE_ONCE(sk_vsock(vsk)->sk_rcvlowat, val ? : 1);


Not in this patch, but what about doing this in vsock_set_rcvlowat() in 
af_vsock.c?


I mean, avoid returning early when `transport->set_rcvlowat(vsk, val)`
succeeds, so that sk_rcvlowat is set in a single place.
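
For illustration, the consolidated helper in af_vsock.c could look roughly
like this (a sketch of the suggestion, based on the current
vsock_set_rcvlowat() shape; not tested):

	static int vsock_set_rcvlowat(struct sock *sk, int val)
	{
		const struct vsock_transport *transport;
		struct vsock_sock *vsk = vsock_sk(sk);

		if (val > vsk->buffer_size)
			return -EINVAL;

		transport = vsk->transport;

		/* Let the transport kick the peer if needed... */
		if (transport && transport->set_rcvlowat) {
			int err = transport->set_rcvlowat(vsk, val);

			if (err)
				return err;
		}

		/* ...but always update sk_rcvlowat in this single place. */
		WRITE_ONCE(sk->sk_rcvlowat, val ? : 1);
		return 0;
	}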

The rest LGTM!

Stefano


+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_set_rcvlowat);
+
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Asias He");
MODULE_DESCRIPTION("common code for virtio vsock");
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 048640167411..388c157f6633 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -98,6 +98,8 @@ static struct virtio_transport loopback_transport = {
.notify_buffer_size   = virtio_transport_notify_buffer_size,

.read_skb = virtio_transport_read_skb,
+
+   .set_rcvlowat = virtio_transport_set_rcvlowat
},

.send_pkt = vsock_loopback_send_pkt,
--
2.25.1






Re: [RFC PATCH v1 2/2] vsock/test: SO_RCVLOWAT + deferred credit update test

2023-11-15 Thread Stefano Garzarella

On Wed, Nov 08, 2023 at 10:20:04AM +0300, Arseniy Krasnov wrote:

This adds a test which checks that updating the SO_RCVLOWAT value also sends


You can avoid "This adds", and write just "Add test ...".

See 
https://docs.kernel.org/process/submitting-patches.html#describe-your-changes

Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour.

Also in the other patch.


a credit update message. Otherwise a mutual hang-up may happen when the
receiver hasn't sent a credit update and then calls 'poll()' with a
non-default SO_RCVLOWAT value (i.e. waiting for enough bytes to read),
while the sender waits for free space at the receiver's side.

Signed-off-by: Arseniy Krasnov 
---
tools/testing/vsock/vsock_test.c | 131 +++
1 file changed, 131 insertions(+)

diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index c1f7bc9abd22..c71b3875fd16 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -1180,6 +1180,132 @@ static void test_stream_shutrd_server(const struct test_opts *opts)
close(fd);
}

+#define RCVLOWAT_CREDIT_UPD_BUF_SIZE   (1024 * 128)
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)


What about adding a comment like the one in the cover letter about the
dependency on kernel values?

Please add it also in the commit description.

I'm wondering if we should move all the defines that depend on the kernel
into some special header.
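
For example, such a header could look roughly like this (hypothetical file
name and contents, mirroring the kernel value):

	/* tools/testing/vsock/vsock_kernel_defs.h (hypothetical) */
	#ifndef VSOCK_KERNEL_DEFS_H
	#define VSOCK_KERNEL_DEFS_H

	/* Must match VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in
	 * include/linux/virtio_vsock.h, which controls the deferred
	 * credit update in the kernel.
	 */
	#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE	(1024 * 64)

	#endif /* VSOCK_KERNEL_DEFS_H */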


+
+static void test_stream_rcvlowat_def_cred_upd_client(const struct test_opts *opts)
+{
+   size_t buf_size;
+   void *buf;
+   int fd;
+
+   fd = vsock_stream_connect(opts->peer_cid, 1234);
+   if (fd < 0) {
+   perror("connect");
+   exit(EXIT_FAILURE);
+   }
+
+   /* Send 1 byte more than peer's buffer size. */
+   buf_size = RCVLOWAT_CREDIT_UPD_BUF_SIZE + 1;
+
+   buf = malloc(buf_size);
+   if (!buf) {
+   perror("malloc");
+   exit(EXIT_FAILURE);
+   }
+
+   /* Wait until peer sets needed buffer size. */
+   control_expectln("SRVREADY");
+
+   if (send(fd, buf, buf_size, 0) != buf_size) {
+   perror("send failed");
+   exit(EXIT_FAILURE);
+   }
+
+   free(buf);
+   close(fd);
+}
+
+static void test_stream_rcvlowat_def_cred_upd_server(const struct test_opts *opts)
+{
+   size_t recv_buf_size;
+   struct pollfd fds;
+   size_t buf_size;
+   void *buf;
+   int fd;
+
+   fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+   if (fd < 0) {
+   perror("accept");
+   exit(EXIT_FAILURE);
+   }
+
+   buf_size = RCVLOWAT_CREDIT_UPD_BUF_SIZE;
+
+   if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
+  &buf_size, sizeof(buf_size))) {
+   perror("setsockopt(SO_VM_SOCKETS_BUFFER_SIZE)");
+   exit(EXIT_FAILURE);
+   }
+
+   buf = malloc(buf_size);
+   if (!buf) {
+   perror("malloc");
+   exit(EXIT_FAILURE);
+   }
+
+   control_writeln("SRVREADY");
+
+   /* Wait until there will be 128KB of data in rx queue. */
+   while (1) {
+   ssize_t res;
+
+   res = recv(fd, buf, buf_size, MSG_PEEK);
+   if (res == buf_size)
+   break;
+
+   if (res <= 0) {
+   fprintf(stderr, "unexpected 'recv()' return: %zi\n", res);
+   exit(EXIT_FAILURE);
+   }
+   }
+
+   /* There is 128KB of data in the socket's rx queue,
+* dequeue first 64KB, credit update is not sent.
+*/
+   recv_buf_size = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+   recv_buf(fd, buf, recv_buf_size, 0, recv_buf_size);
+   recv_buf_size++;
+
+   /* Updating SO_RCVLOWAT will send credit update. */
+   if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
+  &recv_buf_size, sizeof(recv_buf_size))) {
+   perror("setsockopt(SO_RCVLOWAT)");
+   exit(EXIT_FAILURE);
+   }
+
+   memset(&fds, 0, sizeof(fds));
+   fds.fd = fd;
+   fds.events = POLLIN | POLLRDNORM | POLLERR |
+POLLRDHUP | POLLHUP;
+
+   /* This 'poll()' will return once we receive last byte
+* sent by client.
+*/
+   if (poll(&fds, 1, -1) < 0) {
+   perror("poll");
+   exit(EXIT_FAILURE);
+   }
+
+   if (fds.revents & POLLERR) {
+   fprintf(stderr, "'poll()' error\n");
+   exit(EXIT_FAILURE);
+   }
+
+   if (fds.revents & (POLLIN | POLLRDNORM)) {
+   recv_buf(fd, buf, recv_buf_size, 0, recv_buf_size);
+   } else {
+   /* These flags must be set, as there is at
+    * least 64KB of data

[PATCH] MAINTAINERS: TRACING: Add Mathieu Desnoyers as Reviewer

2023-11-15 Thread Mathieu Desnoyers
In order to make sure I get CC'd on tracing changes for which my input
would be relevant, add my name as a reviewer of the TRACING subsystem.

Signed-off-by: Mathieu Desnoyers 
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: linux-trace-ker...@vger.kernel.org
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4cc6bf79fdd8..a7c2092d0063 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21622,6 +21622,7 @@ F:  drivers/hwmon/pmbus/tps546d24.c
 TRACING
 M: Steven Rostedt 
 M: Masami Hiramatsu 
+R: Mathieu Desnoyers 
 L: linux-kernel@vger.kernel.org
 L: linux-trace-ker...@vger.kernel.org
 S: Maintained
-- 
2.39.2




[PATCH 2/3] modpost: Extended modversion support

2023-11-15 Thread Matthew Maurer
Adds a new format for modversions which stores each field in a separate
ELF section. This initially adds support for variable-length names, but
could later be used to add additional fields to modversions in a
backwards-compatible way if needed.

Adding support for variable length names makes it possible to enable
MODVERSIONS and RUST at the same time.
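
For intuition, the CRCs and the NUL-terminated, variable-length names live
in two parallel ELF sections (__version_ext_crc and __version_ext_names)
that are walked in lockstep. A consumer of the helpers declared in
kernel/module/internal.h below might look roughly like this (a sketch,
assuming modversion_ext_start() returns the entry count or a negative
error):

	struct modversion_info_ext ver;
	ssize_t count;

	count = modversion_ext_start(info, &ver);
	if (count < 0)
		return count;

	for (; count > 0; count--) {
		/* ver.name.value and ver.crc.value point into the two
		 * parallel sections.
		 */
		pr_debug("%s: crc=0x%08x\n", ver.name.value,
			 (u32)*ver.crc.value);
		modversion_ext_advance(&ver);
	}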

Signed-off-by: Matthew Maurer 
---
 arch/powerpc/kernel/module_64.c | 24 +-
 init/Kconfig|  1 -
 kernel/module/internal.h| 16 ++-
 kernel/module/main.c|  9 +++-
 kernel/module/version.c | 77 +
 scripts/mod/modpost.c   | 33 --
 6 files changed, 151 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
index 7112adc597a8..2582353a2048 100644
--- a/arch/powerpc/kernel/module_64.c
+++ b/arch/powerpc/kernel/module_64.c
@@ -355,6 +355,24 @@ static void dedotify_versions(struct modversion_info *vers,
}
 }
 
+static void dedotify_ext_version_names(char *str_seq, unsigned long size)
+{
+   unsigned long out = 0;
+   unsigned long in;
+   char last = '\0';
+
+   for (in = 0; in < size; in++) {
+   if (last == '\0')
+   /* Skip all leading dots */
+   if (str_seq[in] == '.')
+   continue;
+   last = str_seq[in];
+   str_seq[out++] = last;
+   }
+   /* Zero the trailing portion of the names table for robustness */
+   bzero(&str_seq[out], size - out);
+}
+
 /*
  * Undefined symbols which refer to .funcname, hack to funcname. Make .TOC.
  * seem to be defined (value set later).
@@ -424,10 +442,12 @@ int module_frob_arch_sections(Elf64_Ehdr *hdr,
me->arch.toc_section = i;
if (sechdrs[i].sh_addralign < 8)
sechdrs[i].sh_addralign = 8;
-   }
-   else if (strcmp(secstrings+sechdrs[i].sh_name,"__versions")==0)
+   } else if (strcmp(secstrings + sechdrs[i].sh_name, "__versions") == 0)
    dedotify_versions((void *)hdr + sechdrs[i].sh_offset,
      sechdrs[i].sh_size);
+   else if (strcmp(secstrings + sechdrs[i].sh_name, "__version_ext_names") == 0)
+   dedotify_ext_version_names((void *)hdr + sechdrs[i].sh_offset,
+  sechdrs[i].sh_size);
 
if (sechdrs[i].sh_type == SHT_SYMTAB)
dedotify((void *)hdr + sechdrs[i].sh_offset,
diff --git a/init/Kconfig b/init/Kconfig
index 9ffb103fc927..6cac5b4db8f6 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1885,7 +1885,6 @@ config RUST
bool "Rust support"
depends on HAVE_RUST
depends on RUST_IS_AVAILABLE
-   depends on !MODVERSIONS
depends on !GCC_PLUGINS
depends on !RANDSTRUCT
depends on !DEBUG_INFO_BTF || PAHOLE_HAS_LANG_EXCLUDE
diff --git a/kernel/module/internal.h b/kernel/module/internal.h
index c8b7b4dcf782..0c188c96a045 100644
--- a/kernel/module/internal.h
+++ b/kernel/module/internal.h
@@ -80,7 +80,7 @@ struct load_info {
unsigned int used_pages;
 #endif
struct {
-   unsigned int sym, str, mod, vers, info, pcpu;
+   unsigned int sym, str, mod, vers, info, pcpu, vers_ext_crc, vers_ext_name;
} index;
 };
 
@@ -384,6 +384,20 @@ void module_layout(struct module *mod, struct modversion_info *ver, struct kerne
   struct kernel_symbol *ks, struct tracepoint * const *tp);
 int check_modstruct_version(const struct load_info *info, struct module *mod);
 int same_magic(const char *amagic, const char *bmagic, bool has_crcs);
+struct modversion_info_ext_s32 {
+   const s32 *value;
+   const s32 *end;
+};
+struct modversion_info_ext_string {
+   const char *value;
+   const char *end;
+};
+struct modversion_info_ext {
+   struct modversion_info_ext_s32 crc;
+   struct modversion_info_ext_string name;
+};
+ssize_t modversion_ext_start(const struct load_info *info, struct modversion_info_ext *ver);
+int modversion_ext_advance(struct modversion_info_ext *ver);
 #else /* !CONFIG_MODVERSIONS */
 static inline int check_version(const struct load_info *info,
const char *symname,
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..e69b2ae46161 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1886,10 +1886,15 @@ static int elf_validity_cache_copy(struct load_info *info, int flags)
if (!info->name)
info->name = info->mod->name;
 
-   if (flags & MODULE_INIT_IGNORE_MODVERSIONS)
+   if (flags & MODULE_INIT_IGNORE_MODVERSIONS) {
info->index.vers = 0; /* Pretend no __versions section! */
-   else
+

Re: [PATCH v6 01/12] cgroup/misc: Add per resource callbacks for CSS events

2023-11-15 Thread Jarkko Sakkinen
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.
>
> Also add per resource type private data for those callbacks to store and
> access resource specific data.
>
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V6:
> - Create ops struct for per resource callbacks (Jarkko)
> - Drop max_write callback (Dave, Michal)
> - Style fixes (Kai)
> ---
>  include/linux/misc_cgroup.h | 14 ++
>  kernel/cgroup/misc.c| 27 ---
>  2 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index e799b1f8d05b..5dc509c27c3d 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -27,16 +27,30 @@ struct misc_cg;
>  
>  #include 
>  
> +/**
> + * struct misc_operations_struct: per resource callback ops.
> + * @alloc: invoked for resource specific initialization when cgroup is allocated.
> + * @free: invoked for resource specific cleanup when cgroup is deallocated.
> + */
> +struct misc_operations_struct {
> + int (*alloc)(struct misc_cg *cg);
> + void (*free)(struct misc_cg *cg);
> +};

Maybe just misc_operations, or even misc_ops?

> +
>  /**
>   * struct misc_res: Per cgroup per misc type resource
>   * @max: Maximum limit on the resource.
>   * @usage: Current usage of the resource.
>   * @events: Number of times, the resource limit exceeded.
> + * @priv: resource specific data.
> + * @misc_ops: resource specific operations.
>   */
>  struct misc_res {
>   u64 max;
>   atomic64_t usage;
>   atomic64_t events;
> + void *priv;

priv is in the wrong patch; it just confuses the overall picture at this
point. Please move it to 04/12; let's deal only with the callbacks here.

> + const struct misc_operations_struct *misc_ops;
>  };

misc_ops would be at least consistent with this, as misc_res also has an
acronym.

>  
>  /**
> diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
> index 79a3717a5803..d971ede44ebf 100644
> --- a/kernel/cgroup/misc.c
> +++ b/kernel/cgroup/misc.c
> @@ -383,23 +383,37 @@ static struct cftype misc_cg_files[] = {
>  static struct cgroup_subsys_state *
>  misc_cg_alloc(struct cgroup_subsys_state *parent_css)
>  {
> + struct misc_cg *parent_cg, *cg;
>   enum misc_res_type i;
> - struct misc_cg *cg;
> + int ret;
>  
>   if (!parent_css) {
> - cg = &root_cg;
> + parent_cg = cg = &root_cg;
>   } else {
>   cg = kzalloc(sizeof(*cg), GFP_KERNEL);
>   if (!cg)
>   return ERR_PTR(-ENOMEM);
> + parent_cg = css_misc(parent_css);
>   }
>  
>   for (i = 0; i < MISC_CG_RES_TYPES; i++) {
>   WRITE_ONCE(cg->res[i].max, MAX_NUM);
>   atomic64_set(&cg->res[i].usage, 0);
> + if (parent_cg->res[i].misc_ops && parent_cg->res[i].misc_ops->alloc) {
> + ret = parent_cg->res[i].misc_ops->alloc(cg);
> + if (ret)
> + goto alloc_err;

The patch set only has a use case with both operations defined - any
partial combination should never be allowed.

To enforce this invariant you could create a set of helpers (written
off the top of my head):

static int misc_res_init(struct misc_res *res, struct misc_ops *ops)
{
	if (!ops->alloc) {
		pr_err("%s: alloc missing\n", __func__);
		return -EINVAL;
	}

	if (!ops->free) {
		pr_err("%s: free missing\n", __func__);
		return -EINVAL;
	}

	res->misc_ops = ops;
	return 0;
}

static inline int misc_res_alloc(struct misc_cg *cg, struct misc_res *res)
{
	if (!res->misc_ops)
		return 0;

	return res->misc_ops->alloc(cg);
}

static inline void misc_res_free(struct misc_cg *cg, struct misc_res *res)
{
	if (!res->misc_ops)
		return;

	res->misc_ops->free(cg);
}

Now if anything has misc_ops, it will also always have *both* callbacks,
and nothing half-baked gets in.

The above loops would be then:

for (i = 0; i < MISC_CG_RES_TYPES; i++) {
	WRITE_ONCE(cg->res[i].max, MAX_NUM);
	atomic64_set(&cg->res[i].usage, 0);
	ret = misc_res_alloc(cg, &parent_cg->res[i]);
	if (ret)
		goto alloc_err;
}

[PATCH 01/32] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()

2023-11-15 Thread Ilya Leoshkevich
Architectures use assembly code to initialize ftrace_regs and call
ftrace_ops_list_func(). Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on ftrace_ops_list_func() entry. This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the architecture-specific assembly code and always
unpoisoning ftrace_regs in ftrace_ops_list_func().

Signed-off-by: Ilya Leoshkevich 
---
 kernel/trace/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 8de8bec5f366..dfb8b26966aa 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7399,6 +7399,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 void arch_ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
   struct ftrace_ops *op, struct ftrace_regs *fregs)
 {
+   kmsan_unpoison_memory(fregs, sizeof(*fregs));
__ftrace_ops_list_func(ip, parent_ip, NULL, fregs);
 }
 #else
-- 
2.41.0




[PATCH 02/32] kmsan: Make the tests compatible with kmsan.panic=1

2023-11-15 Thread Ilya Leoshkevich
It's useful to have both the tests and kmsan.panic=1 during development,
but right now the warnings that the tests cause lead to kernel panics.

Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/kmsan_test.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/kmsan/kmsan_test.c b/mm/kmsan/kmsan_test.c
index 07d3a3a5a9c5..9bfd11674fe3 100644
--- a/mm/kmsan/kmsan_test.c
+++ b/mm/kmsan/kmsan_test.c
@@ -659,9 +659,13 @@ static void test_exit(struct kunit *test)
 {
 }
 
+static int orig_panic_on_kmsan;
+
 static int kmsan_suite_init(struct kunit_suite *suite)
 {
register_trace_console(probe_console, NULL);
+   orig_panic_on_kmsan = panic_on_kmsan;
+   panic_on_kmsan = 0;
return 0;
 }
 
@@ -669,6 +673,7 @@ static void kmsan_suite_exit(struct kunit_suite *suite)
 {
unregister_trace_console(probe_console, NULL);
tracepoint_synchronize_unregister();
+   panic_on_kmsan = orig_panic_on_kmsan;
 }
 
 static struct kunit_suite kmsan_test_suite = {
-- 
2.41.0




[PATCH 00/32] kmsan: Enable on s390

2023-11-15 Thread Ilya Leoshkevich
Hi,

This series provides the minimal support for Kernel Memory Sanitizer on
s390. Kernel Memory Sanitizer is clang-only instrumentation for finding
accesses to uninitialized memory. The clang support for s390 has already
been merged [1].

With this series, I can successfully boot s390 defconfig and
debug_defconfig with kmsan.panic=1. The tool found one real
s390-specific bug (fixed in master).

Best regards,
Ilya

[1] https://reviews.llvm.org/D148596

Ilya Leoshkevich (32):
  ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()
  kmsan: Make the tests compatible with kmsan.panic=1
  kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled
  kmsan: Increase the maximum store size to 4096
  kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces
  kmsan: Fix kmsan_copy_to_user() on arches with overlapping address
spaces
  kmsan: Remove a useless assignment from
kmsan_vmap_pages_range_noflush()
  kmsan: Remove an x86-specific #include from kmsan.h
  kmsan: Introduce kmsan_memmove_metadata()
  kmsan: Expose kmsan_get_metadata()
  kmsan: Export panic_on_kmsan
  kmsan: Allow disabling KMSAN checks for the current task
  kmsan: Support SLAB_POISON
  kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()
  mm: slub: Let KMSAN access metadata
  mm: kfence: Disable KMSAN when checking the canary
  lib/string: Add KMSAN support to strlcpy() and strlcat()
  lib/zlib: Unpoison DFLTCC output buffers
  kmsan: Accept ranges starting with 0 on s390
  s390: Turn off KMSAN for boot, vdso and purgatory
  s390: Use a larger stack for KMSAN
  s390/boot: Add the KMSAN runtime stub
  s390/checksum: Add a KMSAN check
  s390/cpacf: Unpoison the results of cpacf_trng()
  s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()
  s390/mm: Define KMSAN metadata for vmalloc and modules
  s390/string: Add KMSAN support
  s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs
  s390/uaccess: Add KMSAN support to put_user() and get_user()
  s390/unwind: Disable KMSAN checks
  s390: Implement the architecture-specific kmsan functions
  kmsan: Enable on s390

 Documentation/dev-tools/kmsan.rst   |   4 +-
 arch/s390/Kconfig   |   1 +
 arch/s390/Makefile  |   2 +-
 arch/s390/boot/Makefile |   2 +
 arch/s390/boot/kmsan.c  |   6 ++
 arch/s390/boot/startup.c|   8 ++
 arch/s390/boot/string.c |  15 
 arch/s390/include/asm/checksum.h|   2 +
 arch/s390/include/asm/cpacf.h   |   2 +
 arch/s390/include/asm/kmsan.h   |  36 +
 arch/s390/include/asm/pgtable.h |  10 +++
 arch/s390/include/asm/string.h  |  49 -
 arch/s390/include/asm/thread_info.h |   2 +-
 arch/s390/include/asm/uaccess.h | 110 
 arch/s390/kernel/ftrace.c   |   1 +
 arch/s390/kernel/traps.c|   2 +
 arch/s390/kernel/unwind_bc.c|   2 +
 arch/s390/kernel/vdso32/Makefile|   1 +
 arch/s390/kernel/vdso64/Makefile|   1 +
 arch/s390/purgatory/Makefile|   1 +
 include/linux/kmsan-checks.h|  26 +++
 include/linux/kmsan.h   |  14 
 include/linux/kmsan_types.h |   2 +-
 kernel/trace/ftrace.c   |   1 +
 lib/string.c|   6 ++
 lib/zlib_dfltcc/dfltcc.h|   1 +
 lib/zlib_dfltcc/dfltcc_util.h   |  23 ++
 mm/Kconfig  |   1 +
 mm/kfence/core.c|   5 +-
 mm/kmsan/core.c |   2 +-
 mm/kmsan/hooks.c|  30 +++-
 mm/kmsan/init.c |   4 +-
 mm/kmsan/instrumentation.c  |  11 +--
 mm/kmsan/kmsan.h|   3 +-
 mm/kmsan/kmsan_test.c   |   5 ++
 mm/kmsan/report.c   |   7 +-
 mm/kmsan/shadow.c   |   9 +--
 mm/slub.c   |   5 +-
 38 files changed, 331 insertions(+), 81 deletions(-)
 create mode 100644 arch/s390/boot/kmsan.c
 create mode 100644 arch/s390/include/asm/kmsan.h

-- 
2.41.0




[PATCH 05/32] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces

2023-11-15 Thread Ilya Leoshkevich
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Skip the comparison when this is the case.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/instrumentation.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index 470b0b4afcc4..8a1bbbc723ab 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -20,7 +20,8 @@
 
 static inline bool is_bad_asm_addr(void *addr, uintptr_t size, bool is_store)
 {
-   if ((u64)addr < TASK_SIZE)
+   if (IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) &&
+   (u64)addr < TASK_SIZE)
return true;
if (!kmsan_get_metadata(addr, KMSAN_META_SHADOW))
return true;
-- 
2.41.0




[PATCH 08/32] kmsan: Remove an x86-specific #include from kmsan.h

2023-11-15 Thread Ilya Leoshkevich
Replace the x86-specific asm/pgtable_64_types.h #include with the
linux/pgtable.h one, which all architectures have.

Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")
Suggested-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/kmsan.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index a14744205435..3c0476d8b765 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -10,7 +10,7 @@
 #ifndef __MM_KMSAN_KMSAN_H
 #define __MM_KMSAN_KMSAN_H
 
-#include <asm/pgtable_64_types.h>
+#include <linux/pgtable.h>
 #include 
 #include 
 #include 
-- 
2.41.0




[PATCH 04/32] kmsan: Increase the maximum store size to 4096

2023-11-15 Thread Ilya Leoshkevich
The inline assembly block in s390's chsc() stores that much.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/instrumentation.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index cc3907a9c33a..470b0b4afcc4 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -110,11 +110,10 @@ void __msan_instrument_asm_store(void *addr, uintptr_t size)
 
ua_flags = user_access_save();
/*
-* Most of the accesses are below 32 bytes. The two exceptions so far
-* are clwb() (64 bytes) and FPU state (512 bytes).
-* It's unlikely that the assembly will touch more than 512 bytes.
+* Most of the accesses are below 32 bytes. The exceptions so far are
+* clwb() (64 bytes), FPU state (512 bytes) and chsc() (4096 bytes).
 */
-   if (size > 512) {
+   if (size > 4096) {
WARN_ONCE(1, "assembly store size too big: %ld\n", size);
size = 8;
}
-- 
2.41.0




[PATCH 06/32] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces

2023-11-15 Thread Ilya Leoshkevich
Comparing pointers with TASK_SIZE does not make sense when kernel and
userspace overlap. Assume that we are handling user memory access in
this case.

Reported-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/hooks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 5d6e2dee5692..eafc45f937eb 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -267,7 +267,8 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
return;
 
ua_flags = user_access_save();
-   if ((u64)to < TASK_SIZE) {
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE) ||
+   (u64)to < TASK_SIZE) {
/* This is a user memory access, check it. */
kmsan_internal_check_memory((void *)from, to_copy - left, to,
REASON_COPY_TO_USER);
-- 
2.41.0




[PATCH 03/32] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled

2023-11-15 Thread Ilya Leoshkevich
KMSAN relies on memblock returning all available pages to it
(see kmsan_memblock_free_pages()). It partitions these pages into 3
categories: pages available to the buddy allocator, shadow pages and
origin pages. This partitioning is static.

If new pages appear after kmsan_init_runtime(), it is considered
an error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as
incompatible with KMSAN.

Signed-off-by: Ilya Leoshkevich 
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..4f2f99339fc7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -985,6 +985,7 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   depends on !KMSAN
select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
-- 
2.41.0




[PATCH 09/32] kmsan: Introduce kmsan_memmove_metadata()

2023-11-15 Thread Ilya Leoshkevich
It is useful to manually copy metadata in order to describe the effects
of memmove()-like logic in uninstrumented code or inline asm. Introduce
kmsan_memmove_metadata() for this purpose.

Signed-off-by: Ilya Leoshkevich 
---
 include/linux/kmsan-checks.h | 14 ++
 mm/kmsan/hooks.c | 11 +++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/kmsan-checks.h b/include/linux/kmsan-checks.h
index c4cae333deec..5218973f0ad0 100644
--- a/include/linux/kmsan-checks.h
+++ b/include/linux/kmsan-checks.h
@@ -61,6 +61,17 @@ void kmsan_check_memory(const void *address, size_t size);
 void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
size_t left);
 
+/**
+ * kmsan_memmove_metadata() - Copy kernel memory range metadata.
+ * @dst: start of the destination kernel memory range.
+ * @src: start of the source kernel memory range.
+ * @n:   size of the memory ranges.
+ *
+ * KMSAN will treat the destination range as if its contents were memmove()d
+ * from the source range.
+ */
+void kmsan_memmove_metadata(void *dst, const void *src, size_t n);
+
 #else
 
 static inline void kmsan_poison_memory(const void *address, size_t size,
@@ -77,6 +88,9 @@ static inline void kmsan_copy_to_user(void __user *to, const void *from,
  size_t to_copy, size_t left)
 {
 }
+static inline void kmsan_memmove_metadata(void *dst, const void *src, size_t n)
+{
+}
 
 #endif
 
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index eafc45f937eb..4d477a0a356c 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -286,6 +286,17 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
 }
 EXPORT_SYMBOL(kmsan_copy_to_user);
 
+void kmsan_memmove_metadata(void *dst, const void *src, size_t n)
+{
+   if (!kmsan_enabled || kmsan_in_runtime())
+   return;
+
+   kmsan_enter_runtime();
+   kmsan_internal_memmove_metadata(dst, (void *)src, n);
+   kmsan_leave_runtime();
+}
+EXPORT_SYMBOL(kmsan_memmove_metadata);
+
 /* Helper function to check an URB. */
 void kmsan_handle_urb(const struct urb *urb, bool is_out)
 {
-- 
2.41.0




[PATCH 11/32] kmsan: Export panic_on_kmsan

2023-11-15 Thread Ilya Leoshkevich
When building the kmsan test as a module, modpost fails with the
following error message:

ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined!

Export panic_on_kmsan in order to improve the KMSAN usability for
modules.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/report.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index 02736ec757f2..c79d3b0d2d0d 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -20,6 +20,7 @@ static DEFINE_RAW_SPINLOCK(kmsan_report_lock);
 /* Protected by kmsan_report_lock */
 static char report_local_descr[DESCR_SIZE];
 int panic_on_kmsan __read_mostly;
+EXPORT_SYMBOL_GPL(panic_on_kmsan);
 
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
-- 
2.41.0




[PATCH 13/32] kmsan: Support SLAB_POISON

2023-11-15 Thread Ilya Leoshkevich
Avoid false KMSAN negatives with SLUB_DEBUG by allowing
kmsan_slab_free() to poison the freed memory, and by preventing
init_object() from unpoisoning new allocations.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/hooks.c | 2 +-
 mm/slub.c| 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 7b5814412e9f..7a30274b893c 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -76,7 +76,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object)
return;
 
/* RCU slabs could be legally used after free within the RCU period */
-   if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)))
+   if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
return;
/*
 * If there's a constructor, freed memory must remain in the same state
diff --git a/mm/slub.c b/mm/slub.c
index 63d281dfacdb..8d9aa4d7cb7e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1024,7 +1024,8 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct slab *slab,
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-static void init_object(struct kmem_cache *s, void *object, u8 val)
+__no_sanitize_memory static void
+init_object(struct kmem_cache *s, void *object, u8 val)
 {
u8 *p = kasan_reset_tag(object);
unsigned int poison_size = s->object_size;
-- 
2.41.0




[PATCH 12/32] kmsan: Allow disabling KMSAN checks for the current task

2023-11-15 Thread Ilya Leoshkevich
Like for KASAN, it's useful to temporarily disable KMSAN checks around,
e.g., redzone accesses. Introduce kmsan_disable_current() and
kmsan_enable_current(), which are similar to their KASAN counterparts.

Even though it's not strictly necessary, make them reentrant, in order
to match the KASAN behavior. Repurpose the allow_reporting field for
this.
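
The intended usage pattern then mirrors KASAN's (a minimal sketch;
'redzone' is a placeholder pointer to whatever metadata the caller
inspects):

	u8 val;

	kmsan_disable_current();
	/* The redzone is deliberately uninitialized from KMSAN's point
	 * of view; this read must not produce a report.
	 */
	val = *redzone;
	kmsan_enable_current();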

Signed-off-by: Ilya Leoshkevich 
---
 Documentation/dev-tools/kmsan.rst |  4 ++--
 include/linux/kmsan-checks.h  | 12 
 include/linux/kmsan_types.h   |  2 +-
 mm/kmsan/core.c   |  2 +-
 mm/kmsan/hooks.c  | 14 +-
 mm/kmsan/report.c |  6 +++---
 6 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst
index 323eedad53cd..022a823f5f1b 100644
--- a/Documentation/dev-tools/kmsan.rst
+++ b/Documentation/dev-tools/kmsan.rst
@@ -338,11 +338,11 @@ Per-task KMSAN state
 
 
 Every task_struct has an associated KMSAN task state that holds the KMSAN
-context (see above) and a per-task flag disallowing KMSAN reports::
+context (see above) and a per-task counter disallowing KMSAN reports::
 
   struct kmsan_context {
 ...
-bool allow_reporting;
+unsigned int depth;
 struct kmsan_context_state cstate;
 ...
   }
diff --git a/include/linux/kmsan-checks.h b/include/linux/kmsan-checks.h
index 5218973f0ad0..bab2603685f7 100644
--- a/include/linux/kmsan-checks.h
+++ b/include/linux/kmsan-checks.h
@@ -72,6 +72,10 @@ void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
  */
 void kmsan_memmove_metadata(void *dst, const void *src, size_t n);
 
+void kmsan_enable_current(void);
+
+void kmsan_disable_current(void);
+
 #else
 
 static inline void kmsan_poison_memory(const void *address, size_t size,
@@ -92,6 +96,14 @@ static inline void kmsan_memmove_metadata(void *dst, const void *src, size_t n)
 {
 }
 
+static inline void kmsan_enable_current(void)
+{
+}
+
+static inline void kmsan_disable_current(void)
+{
+}
+
 #endif
 
 #endif /* _LINUX_KMSAN_CHECKS_H */
diff --git a/include/linux/kmsan_types.h b/include/linux/kmsan_types.h
index 8bfa6c98176d..27bb146ece95 100644
--- a/include/linux/kmsan_types.h
+++ b/include/linux/kmsan_types.h
@@ -29,7 +29,7 @@ struct kmsan_context_state {
 struct kmsan_ctx {
struct kmsan_context_state cstate;
int kmsan_in_runtime;
-   bool allow_reporting;
+   unsigned int depth;
 };
 
 #endif /* _LINUX_KMSAN_TYPES_H */
diff --git a/mm/kmsan/core.c b/mm/kmsan/core.c
index c19f47af0424..b8767378cf8a 100644
--- a/mm/kmsan/core.c
+++ b/mm/kmsan/core.c
@@ -43,7 +43,7 @@ void kmsan_internal_task_create(struct task_struct *task)
struct thread_info *info = current_thread_info();
 
__memset(ctx, 0, sizeof(*ctx));
-   ctx->allow_reporting = true;
+   ctx->depth = 0;
kmsan_internal_unpoison_memory(info, sizeof(*info), false);
 }
 
diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
index 4d477a0a356c..7b5814412e9f 100644
--- a/mm/kmsan/hooks.c
+++ b/mm/kmsan/hooks.c
@@ -44,7 +44,7 @@ void kmsan_task_exit(struct task_struct *task)
if (!kmsan_enabled || kmsan_in_runtime())
return;
 
-   ctx->allow_reporting = false;
+   ctx->depth++;
 }
 
 void kmsan_slab_alloc(struct kmem_cache *s, void *object, gfp_t flags)
@@ -434,3 +434,15 @@ void kmsan_check_memory(const void *addr, size_t size)
   REASON_ANY);
 }
 EXPORT_SYMBOL(kmsan_check_memory);
+
+void kmsan_enable_current(void)
+{
+   current->kmsan_ctx.depth--;
+}
+EXPORT_SYMBOL(kmsan_enable_current);
+
+void kmsan_disable_current(void)
+{
+   current->kmsan_ctx.depth++;
+}
+EXPORT_SYMBOL(kmsan_disable_current);
diff --git a/mm/kmsan/report.c b/mm/kmsan/report.c
index c79d3b0d2d0d..edcf53ca428e 100644
--- a/mm/kmsan/report.c
+++ b/mm/kmsan/report.c
@@ -158,12 +158,12 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size,
 
if (!kmsan_enabled)
return;
-   if (!current->kmsan_ctx.allow_reporting)
+   if (current->kmsan_ctx.depth)
return;
if (!origin)
return;
 
-   current->kmsan_ctx.allow_reporting = false;
+   current->kmsan_ctx.depth++;
ua_flags = user_access_save();
raw_spin_lock(&kmsan_report_lock);
pr_err("=\n");
@@ -216,5 +216,5 @@ void kmsan_report(depot_stack_handle_t origin, void *address, int size,
if (panic_on_kmsan)
panic("kmsan.panic set ...\n");
user_access_restore(ua_flags);
-   current->kmsan_ctx.allow_reporting = true;
+   current->kmsan_ctx.depth--;
 }
-- 
2.41.0




[PATCH 07/32] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()

2023-11-15 Thread Ilya Leoshkevich
The value assigned to prot is immediately overwritten on the next line
with PAGE_KERNEL. The right hand side of the assignment has no
side-effects.

Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
Suggested-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/shadow.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index b9d05aff313e..2d57408c78ae 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, unsigned long end,
s_pages[i] = shadow_page_for(pages[i]);
o_pages[i] = origin_page_for(pages[i]);
}
-   prot = __pgprot(pgprot_val(prot) | _PAGE_NX);
prot = PAGE_KERNEL;
 
origin_start = vmalloc_meta((void *)start, KMSAN_META_ORIGIN);
-- 
2.41.0




[PATCH 15/32] mm: slub: Let KMSAN access metadata

2023-11-15 Thread Ilya Leoshkevich
Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes
KMSAN to complain about touching redzones in kfree().

Fix by extending the existing KASAN-related metadata_access_enable()
and metadata_access_disable() functions to KMSAN.

Signed-off-by: Ilya Leoshkevich 
---
 mm/slub.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 8d9aa4d7cb7e..0b52bff99326 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -700,10 +700,12 @@ static int disable_higher_order_debug;
 static inline void metadata_access_enable(void)
 {
kasan_disable_current();
+   kmsan_disable_current();
 }
 
 static inline void metadata_access_disable(void)
 {
+   kmsan_enable_current();
kasan_enable_current();
 }
 
-- 
2.41.0




[PATCH 14/32] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()

2023-11-15 Thread Ilya Leoshkevich
Improve the readability by replacing the custom aligning logic with
ALIGN_DOWN(). Unlike other places where a similar sequence is used,
there is no size parameter that needs to be adjusted, so the standard
macro fits.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/shadow.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
index 2d57408c78ae..9c58f081d84f 100644
--- a/mm/kmsan/shadow.c
+++ b/mm/kmsan/shadow.c
@@ -123,14 +123,12 @@ struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *address, u64 size,
  */
 void *kmsan_get_metadata(void *address, bool is_origin)
 {
-   u64 addr = (u64)address, pad, off;
+   u64 addr = (u64)address, off;
struct page *page;
void *ret;
 
-   if (is_origin && !IS_ALIGNED(addr, KMSAN_ORIGIN_SIZE)) {
-   pad = addr % KMSAN_ORIGIN_SIZE;
-   addr -= pad;
-   }
+   if (is_origin)
+   addr = ALIGN_DOWN(addr, KMSAN_ORIGIN_SIZE);
address = (void *)addr;
if (kmsan_internal_is_vmalloc_addr(address) ||
kmsan_internal_is_module_addr(address))
-- 
2.41.0




[PATCH 10/32] kmsan: Expose kmsan_get_metadata()

2023-11-15 Thread Ilya Leoshkevich
Each s390 CPU has lowcore pages associated with it. Each CPU sees its
own lowcore at virtual address 0 through a hardware mechanism called
prefixing. Additionally, all lowcores are mapped to non-0 virtual
addresses stored in the lowcore_ptr[] array.

When lowcore is accessed through virtual address 0, one needs to
resolve metadata for lowcore_ptr[raw_smp_processor_id()].

Expose kmsan_get_metadata() to make it possible to do this from the
arch code.
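
For illustration, the arch code can then resolve such accesses roughly
like this (a sketch; the hook name and placement are assumptions, the
actual s390 implementation arrives later in this series):

	static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
	{
		/* The prefixed lowcore lives at virtual address 0; resolve
		 * its metadata through this CPU's non-zero alias instead.
		 */
		if ((u64)addr < sizeof(struct lowcore)) {
			void *lc = lowcore_ptr[raw_smp_processor_id()];

			return kmsan_get_metadata(lc + (u64)addr, is_origin);
		}
		return kmsan_get_metadata(addr, is_origin);
	}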

Signed-off-by: Ilya Leoshkevich 
---
 include/linux/kmsan.h  | 14 ++
 mm/kmsan/instrumentation.c |  1 +
 mm/kmsan/kmsan.h   |  1 -
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
index e0c23a32cdf0..ff8fd95733fa 100644
--- a/include/linux/kmsan.h
+++ b/include/linux/kmsan.h
@@ -230,6 +230,15 @@ void kmsan_handle_urb(const struct urb *urb, bool is_out);
  */
 void kmsan_unpoison_entry_regs(const struct pt_regs *regs);
 
+/**
+ * kmsan_get_metadata() - Return a pointer to KMSAN shadow or origins.
+ * @addr:  kernel address.
+ * @is_origin: whether to return origins or shadow.
+ *
+ * Return NULL if metadata cannot be found.
+ */
+void *kmsan_get_metadata(void *addr, bool is_origin);
+
 #else
 
 static inline void kmsan_init_shadow(void)
@@ -329,6 +338,11 @@ static inline void kmsan_unpoison_entry_regs(const struct pt_regs *regs)
 {
 }
 
+static inline void *kmsan_get_metadata(void *addr, bool is_origin)
+{
+   return NULL;
+}
+
 #endif
 
 #endif /* _LINUX_KMSAN_H */
diff --git a/mm/kmsan/instrumentation.c b/mm/kmsan/instrumentation.c
index 8a1bbbc723ab..94b49fac9d8b 100644
--- a/mm/kmsan/instrumentation.c
+++ b/mm/kmsan/instrumentation.c
@@ -14,6 +14,7 @@
 
 #include "kmsan.h"
 #include 
+#include <linux/kmsan.h>
 #include 
 #include 
 #include 
diff --git a/mm/kmsan/kmsan.h b/mm/kmsan/kmsan.h
index 3c0476d8b765..2c743911a8c4 100644
--- a/mm/kmsan/kmsan.h
+++ b/mm/kmsan/kmsan.h
@@ -66,7 +66,6 @@ struct shadow_origin_ptr {
 
 struct shadow_origin_ptr kmsan_get_shadow_origin_ptr(void *addr, u64 size,
 bool store);
-void *kmsan_get_metadata(void *addr, bool is_origin);
 void __init kmsan_init_alloc_meta_for_range(void *start, void *end);
 
 enum kmsan_bug_reason {
-- 
2.41.0




[PATCH 16/32] mm: kfence: Disable KMSAN when checking the canary

2023-11-15 Thread Ilya Leoshkevich
KMSAN warns about check_canary() accessing the canary.

The reason is that, even though set_canary() is properly instrumented
and sets shadow, slub explicitly poisons the canary's address range
afterwards.

Unpoisoning the canary is not the right thing to do: only
check_canary() is supposed to ever touch it. Instead, disable KMSAN
checks around canary read accesses.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kfence/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 3872528d0963..a2ea8e5a1ad9 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -306,7 +306,7 @@ metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state next
 }
 
 /* Check canary byte at @addr. */
-static inline bool check_canary_byte(u8 *addr)
+__no_kmsan_checks static inline bool check_canary_byte(u8 *addr)
 {
struct kfence_metadata *meta;
unsigned long flags;
@@ -341,7 +341,8 @@ static inline void set_canary(const struct kfence_metadata *meta)
*((u64 *)addr) = KFENCE_CANARY_PATTERN_U64;
 }
 
-static inline void check_canary(const struct kfence_metadata *meta)
+__no_kmsan_checks static inline void
+check_canary(const struct kfence_metadata *meta)
 {
const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
unsigned long addr = pageaddr;
-- 
2.41.0




[PATCH 17/32] lib/string: Add KMSAN support to strlcpy() and strlcat()

2023-11-15 Thread Ilya Leoshkevich
Currently KMSAN does not fully propagate metadata in strlcpy() and
strlcat(), because they are built with -ffreestanding and call
memcpy(). In this combination memcpy() calls are not instrumented.

Fix by copying the metadata manually. Add the __STDC_HOSTED__ #ifdef in
case the code is compiled with different flags in the future.

Signed-off-by: Ilya Leoshkevich 
---
 lib/string.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/lib/string.c b/lib/string.c
index be26623953d2..e83c6dd77ec6 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -111,6 +111,9 @@ size_t strlcpy(char *dest, const char *src, size_t size)
if (size) {
size_t len = (ret >= size) ? size - 1 : ret;
__builtin_memcpy(dest, src, len);
+#if __STDC_HOSTED__ == 0
+   kmsan_memmove_metadata(dest, src, len);
+#endif
dest[len] = '\0';
}
return ret;
@@ -261,6 +264,9 @@ size_t strlcat(char *dest, const char *src, size_t count)
if (len >= count)
len = count-1;
__builtin_memcpy(dest, src, len);
+#if __STDC_HOSTED__ == 0
+   kmsan_memmove_metadata(dest, src, len);
+#endif
dest[len] = 0;
return res;
 }
-- 
2.41.0




[PATCH 19/32] kmsan: Accept ranges starting with 0 on s390

2023-11-15 Thread Ilya Leoshkevich
On s390 the virtual address 0 is valid (current CPU's lowcore is mapped
there), therefore KMSAN should not complain about it.

Disable the respective check on s390. There doesn't seem to be a
Kconfig option to describe this situation, so explicitly check for
s390.

Signed-off-by: Ilya Leoshkevich 
---
 mm/kmsan/init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
index ffedf4dbc49d..14f4a432fddd 100644
--- a/mm/kmsan/init.c
+++ b/mm/kmsan/init.c
@@ -33,7 +33,9 @@ static void __init kmsan_record_future_shadow_range(void *start, void *end)
bool merged = false;
 
KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES);
-   KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend);
+   KMSAN_WARN_ON((nstart >= nend) ||
+ (!IS_ENABLED(CONFIG_S390) && !nstart) ||
+ !nend);
nstart = ALIGN_DOWN(nstart, PAGE_SIZE);
nend = ALIGN(nend, PAGE_SIZE);
 
-- 
2.41.0




[PATCH 18/32] lib/zlib: Unpoison DFLTCC output buffers

2023-11-15 Thread Ilya Leoshkevich
The constraints of the DFLTCC inline assembly are not precise: they do
not communicate the size of the output buffers to the compiler, so KMSAN
cannot instrument them automatically.

Add the manual kmsan_unpoison_memory() calls for the output buffers.
The logic is the same as in [1].

[1] https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5

Reported-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 lib/zlib_dfltcc/dfltcc.h  |  1 +
 lib/zlib_dfltcc/dfltcc_util.h | 23 +++
 2 files changed, 24 insertions(+)

diff --git a/lib/zlib_dfltcc/dfltcc.h b/lib/zlib_dfltcc/dfltcc.h
index b96232bdd44d..0f2a16d7a48a 100644
--- a/lib/zlib_dfltcc/dfltcc.h
+++ b/lib/zlib_dfltcc/dfltcc.h
@@ -80,6 +80,7 @@ struct dfltcc_param_v0 {
 uint8_t csb[1152];
 };
 
+static_assert(offsetof(struct dfltcc_param_v0, csb) == 384);
 static_assert(sizeof(struct dfltcc_param_v0) == 1536);
 
 #define CVT_CRC32 0
diff --git a/lib/zlib_dfltcc/dfltcc_util.h b/lib/zlib_dfltcc/dfltcc_util.h
index 4a46b5009f0d..ce2e039a55b5 100644
--- a/lib/zlib_dfltcc/dfltcc_util.h
+++ b/lib/zlib_dfltcc/dfltcc_util.h
@@ -2,6 +2,7 @@
 #ifndef DFLTCC_UTIL_H
 #define DFLTCC_UTIL_H
 
+#include "dfltcc.h"
 #include 
 
 /*
@@ -20,6 +21,7 @@ typedef enum {
 #define DFLTCC_CMPR 2
 #define DFLTCC_XPND 4
 #define HBT_CIRCULAR (1 << 7)
+#define DFLTCC_FN_MASK ((1 << 7) - 1)
 #define HB_BITS 15
 #define HB_SIZE (1 << HB_BITS)
 
@@ -34,6 +36,7 @@ static inline dfltcc_cc dfltcc(
 )
 {
 Byte *t2 = op1 ? *op1 : NULL;
+unsigned char *orig_t2 = t2;
 size_t t3 = len1 ? *len1 : 0;
 const Byte *t4 = op2 ? *op2 : NULL;
 size_t t5 = len2 ? *len2 : 0;
@@ -59,6 +62,26 @@ static inline dfltcc_cc dfltcc(
  : "cc", "memory");
 t2 = r2; t3 = r3; t4 = r4; t5 = r5;
 
+switch (fn & DFLTCC_FN_MASK) {
+case DFLTCC_QAF:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_qaf_param));
+break;
+case DFLTCC_GDHT:
+kmsan_unpoison_memory(param, offsetof(struct dfltcc_param_v0, csb));
+break;
+case DFLTCC_CMPR:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0));
+kmsan_unpoison_memory(
+orig_t2,
+t2 - orig_t2 +
+(((struct dfltcc_param_v0 *)param)->sbb == 0 ? 0 : 1));
+break;
+case DFLTCC_XPND:
+kmsan_unpoison_memory(param, sizeof(struct dfltcc_param_v0));
+kmsan_unpoison_memory(orig_t2, t2 - orig_t2);
+break;
+}
+
 if (op1)
 *op1 = t2;
 if (len1)
-- 
2.41.0




[PATCH 22/32] s390/boot: Add the KMSAN runtime stub

2023-11-15 Thread Ilya Leoshkevich
It should be possible to have inline functions in the s390 header files
that call kmsan_unpoison_memory(). The problem is that these header files
might be included by the decompressor, which does not contain the KMSAN
runtime, causing linker errors.

Not compiling these calls if __SANITIZE_MEMORY__ is not defined - either
by changing kmsan-checks.h or at the call sites - may cause unintended
side effects, since calling these functions from uninstrumented code that
is linked into the kernel is a valid use case.

One might want to explicitly distinguish between the kernel and the
decompressor. Checking for a decompressor-specific #define is quite
heavy-handed, and will have to be done at all call sites.

A more generic approach is to provide a dummy kmsan_unpoison_memory()
definition. This produces some runtime overhead, but only when building
with CONFIG_KMSAN. The benefit is that it does not disturb the existing
KMSAN build logic and call sites don't need to be changed.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/Makefile | 1 +
 arch/s390/boot/kmsan.c  | 6 ++
 2 files changed, 7 insertions(+)
 create mode 100644 arch/s390/boot/kmsan.c

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index 5a05c927f703..826005e2e3aa 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
obj-$(findstring y, $(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) $(CONFIG_PGSTE)) +=
 obj-$(CONFIG_RANDOMIZE_BASE)   += kaslr.o
 obj-y  += $(if $(CONFIG_KERNEL_UNCOMPRESSED),,decompressor.o) info.o
 obj-$(CONFIG_KERNEL_ZSTD) += clz_ctz.o
+obj-$(CONFIG_KMSAN) += kmsan.o
 obj-all := $(obj-y) piggy.o syms.o
 
targets := bzImage section_cmp.boot.data section_cmp.boot.preserved.data $(obj-y)
diff --git a/arch/s390/boot/kmsan.c b/arch/s390/boot/kmsan.c
new file mode 100644
index ..e7b3ac48143e
--- /dev/null
+++ b/arch/s390/boot/kmsan.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+
+void kmsan_unpoison_memory(const void *address, size_t size)
+{
+}
-- 
2.41.0




[PATCH 21/32] s390: Use a larger stack for KMSAN

2023-11-15 Thread Ilya Leoshkevich
Adjust the stack size for the KMSAN-enabled kernel like it was done
for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double
the stack size"). Both tools have similar requirements.

Reviewed-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/Makefile  | 2 +-
 arch/s390/include/asm/thread_info.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/Makefile b/arch/s390/Makefile
index 73873e451686..a7f5386d25ad 100644
--- a/arch/s390/Makefile
+++ b/arch/s390/Makefile
@@ -34,7 +34,7 @@ KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_DEBUG_INFO_DWARF4), $(call cc-option
 KBUILD_CFLAGS_DECOMPRESSOR += $(if $(CONFIG_CC_NO_ARRAY_BOUNDS),-Wno-array-bounds)
 
 UTS_MACHINE:= s390x
-STACK_SIZE := $(if $(CONFIG_KASAN),65536,16384)
+STACK_SIZE := $(if $(CONFIG_KASAN),65536,$(if $(CONFIG_KMSAN),65536,16384))
 CHECKFLAGS += -D__s390__ -D__s390x__
 
 export LD_BFD
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index a674c7d25da5..d02a709717b8 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -16,7 +16,7 @@
 /*
  * General size of kernel stacks
  */
-#ifdef CONFIG_KASAN
+#if defined(CONFIG_KASAN) || defined(CONFIG_KMSAN)
 #define THREAD_SIZE_ORDER 4
 #else
 #define THREAD_SIZE_ORDER 2
-- 
2.41.0




[PATCH 23/32] s390/checksum: Add a KMSAN check

2023-11-15 Thread Ilya Leoshkevich
Add a KMSAN check to the CKSM inline assembly, similar to how it was
done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm
instruction").

Acked-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/checksum.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/checksum.h b/arch/s390/include/asm/checksum.h
index 69837eec2ff5..55ba0ddd8eab 100644
--- a/arch/s390/include/asm/checksum.h
+++ b/arch/s390/include/asm/checksum.h
@@ -13,6 +13,7 @@
 #define _S390_CHECKSUM_H
 
 #include 
+#include <linux/kmsan-checks.h>
 #include 
 
 /*
@@ -35,6 +36,7 @@ static inline __wsum csum_partial(const void *buff, int len, __wsum sum)
};
 
kasan_check_read(buff, len);
+   kmsan_check_memory(buff, len);
asm volatile(
"0: cksm%[sum],%[rp]\n"
"   jo  0b\n"
-- 
2.41.0




[PATCH 26/32] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-11-15 Thread Ilya Leoshkevich
The pages for the KMSAN metadata associated with most kernel mappings
are taken from memblock by the common code. However, vmalloc and module
metadata needs to be defined by the architectures.

Be a little bit more careful than x86: allocate exactly MODULES_LEN
for the module shadow and origins, and then take 2/3 of vmalloc for
the vmalloc shadow and origins. This ensures that users passing small
vmalloc= values on the command line do not cause module metadata
collisions.
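
For a concrete picture, the resulting layout looks like this in ascending
address order (a sketch based on the macros added below; the actual sizes
depend on MODULES_LEN and the vmalloc= parameter):

	VMALLOC_START .. VMALLOC_END    usable vmalloc area (1/3)
	VMALLOC_END   .. +size          vmalloc shadow  (KMSAN_VMALLOC_SIZE)
	              .. +size          vmalloc origins (KMSAN_VMALLOC_SIZE)
	              .. +MODULES_LEN   module shadow
	              .. +MODULES_LEN   module origins
	MODULES_VADDR .. MODULES_END    modules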

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/startup.c|  8 
 arch/s390/include/asm/pgtable.h | 10 ++
 2 files changed, 18 insertions(+)

diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
index 8104e0e3d188..297c1062372a 100644
--- a/arch/s390/boot/startup.c
+++ b/arch/s390/boot/startup.c
@@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
MODULES_VADDR = MODULES_END - MODULES_LEN;
VMALLOC_END = MODULES_VADDR;
+#ifdef CONFIG_KMSAN
+   VMALLOC_END -= MODULES_LEN * 2;
+#endif
 
/* allow vmalloc area to occupy up to about 1/2 of the rest virtual space left */
vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, _REGION3_SIZE));
+#ifdef CONFIG_KMSAN
+   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
+   vmalloc_size = round_down(vmalloc_size / 3, PAGE_SIZE);
+   VMALLOC_END -= vmalloc_size * 2;
+#endif
VMALLOC_START = VMALLOC_END - vmalloc_size;
 
/* split remaining virtual space between 1:1 mapping & vmemmap array */
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 601e87fa8a9a..d764abeb9e6d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -107,6 +107,16 @@ static inline int is_module_addr(void *addr)
return 1;
 }
 
+#ifdef CONFIG_KMSAN
+#define KMSAN_VMALLOC_SIZE (VMALLOC_END - VMALLOC_START)
+#define KMSAN_VMALLOC_SHADOW_START VMALLOC_END
+#define KMSAN_VMALLOC_ORIGIN_START (KMSAN_VMALLOC_SHADOW_START + \
+   KMSAN_VMALLOC_SIZE)
+#define KMSAN_MODULES_SHADOW_START (KMSAN_VMALLOC_ORIGIN_START + \
+   KMSAN_VMALLOC_SIZE)
+#define KMSAN_MODULES_ORIGIN_START (KMSAN_MODULES_SHADOW_START + MODULES_LEN)
+#endif
+
 /*
  * A 64 bit pagetable entry of S390 has following format:
 * |                         PFRA                          |0IPC|  OS  |
-- 
2.41.0
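
A worked example of the carve-up above, assuming a hypothetical
MODULES_LEN of 512 MB: VMALLOC_END first drops by 1 GB for the module
shadow and origins. If the remaining budget then yields vmalloc_size =
3 GB, rounding down to one third leaves 1 GB of usable vmalloc space,
and VMALLOC_END drops by another 2 GB for the vmalloc shadow and
origins. The KMSAN_* macros in pgtable.h lay these four metadata
regions out back to back starting at VMALLOC_END.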




[PATCH 30/32] s390/unwind: Disable KMSAN checks

2023-11-15 Thread Ilya Leoshkevich
The unwind code can read uninitialized frames. Furthermore, even in
the good case, KMSAN does not emit shadow for backchains. Therefore
disable KMSAN checks for the unwinding functions.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/unwind_bc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
index 0ece156fdd7c..7ecaab24783f 100644
--- a/arch/s390/kernel/unwind_bc.c
+++ b/arch/s390/kernel/unwind_bc.c
@@ -49,6 +49,7 @@ static inline bool is_final_pt_regs(struct unwind_state *state,
   READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE;
 }
 
+__no_kmsan_checks
 bool unwind_next_frame(struct unwind_state *state)
 {
struct stack_info *info = &state->stack_info;
@@ -118,6 +119,7 @@ bool unwind_next_frame(struct unwind_state *state)
 }
 EXPORT_SYMBOL_GPL(unwind_next_frame);
 
+__no_kmsan_checks
 void __unwind_start(struct unwind_state *state, struct task_struct *task,
struct pt_regs *regs, unsigned long first_frame)
 {
-- 
2.41.0




[PATCH 25/32] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()

2023-11-15 Thread Ilya Leoshkevich
s390 uses assembly code to initialize ftrace_regs and call
kprobe_ftrace_handler(). Therefore, from KMSAN's point of view,
ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes
KMSAN warnings when running the ftrace testsuite.

Fix by trusting the assembly code and always unpoisoning ftrace_regs in
kprobe_ftrace_handler().

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/ftrace.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..3bad34eaa51e 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -300,6 +300,7 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
if (bit < 0)
return;
 
+   kmsan_unpoison_memory(fregs, sizeof(*fregs));
regs = ftrace_get_regs(fregs);
p = get_kprobe((kprobe_opcode_t *)ip);
if (!regs || unlikely(!p) || kprobe_disabled(p))
-- 
2.41.0
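
This is the converse of the kmsan_check_memory() idiom: when memory is
written by code KMSAN cannot see (assembly, in this case), it has to be
unpoisoned by hand, or later reads of it will warn. The same pattern
appears in patches 24 (cpacf_trng()) and 28 (pt_regs). A minimal
sketch, illustrative only:

#include <linux/kmsan-checks.h>

static void fill_buf(unsigned char *buf, unsigned long len)
{
	/* Stand-in for asm (e.g. TRNG) that fills the buffer. */
	asm volatile("" : : "r" (buf), "r" (len) : "memory");
	/*
	 * KMSAN did not observe the stores above, so mark the buffer
	 * as initialized by hand.
	 */
	kmsan_unpoison_memory(buf, len);
}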




[PATCH 32/32] kmsan: Enable on s390

2023-11-15 Thread Ilya Leoshkevich
Now that everything else is in place, enable KMSAN in Kconfig.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3bec98d20283..160ad2220c53 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -153,6 +153,7 @@ config S390
select HAVE_ARCH_KASAN
select HAVE_ARCH_KASAN_VMALLOC
select HAVE_ARCH_KCSAN
+   select HAVE_ARCH_KMSAN
select HAVE_ARCH_KFENCE
select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
select HAVE_ARCH_SECCOMP_FILTER
-- 
2.41.0




[PATCH 27/32] s390/string: Add KMSAN support

2023-11-15 Thread Ilya Leoshkevich
Add KMSAN support for the s390 implementations of the string functions.
Do this similarly to how it's already done for KASAN, except that the
optimized memset{16,32,64}() functions need to be disabled: it's
important for KMSAN to know that they initialized something.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/string.c| 15 +++
 arch/s390/include/asm/string.h | 49 --
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/arch/s390/boot/string.c b/arch/s390/boot/string.c
index faccb33b462c..6d886c84075b 100644
--- a/arch/s390/boot/string.c
+++ b/arch/s390/boot/string.c
@@ -4,8 +4,14 @@
 #include 
 #undef CONFIG_KASAN
 #undef CONFIG_KASAN_GENERIC
+#undef CONFIG_KMSAN
 #include "../lib/string.c"
 
+/*
+ * Duplicate some functions from the common lib/string.c
+ * instead of fully including it.
+ */
+
 int strncmp(const char *cs, const char *ct, size_t count)
 {
unsigned char c1, c2;
@@ -22,6 +28,15 @@ int strncmp(const char *cs, const char *ct, size_t count)
return 0;
 }
 
+void *memset64(uint64_t *s, uint64_t v, size_t count)
+{
+   uint64_t *xs = s;
+
+   while (count--)
+   *xs++ = v;
+   return s;
+}
+
 char *skip_spaces(const char *str)
 {
while (isspace(*str))
diff --git a/arch/s390/include/asm/string.h b/arch/s390/include/asm/string.h
index 351685de53d2..94925024cb26 100644
--- a/arch/s390/include/asm/string.h
+++ b/arch/s390/include/asm/string.h
@@ -15,15 +15,12 @@
 #define __HAVE_ARCH_MEMCPY /* gcc builtin & arch function */
 #define __HAVE_ARCH_MEMMOVE/* gcc builtin & arch function */
 #define __HAVE_ARCH_MEMSET /* gcc builtin & arch function */
-#define __HAVE_ARCH_MEMSET16   /* arch function */
-#define __HAVE_ARCH_MEMSET32   /* arch function */
-#define __HAVE_ARCH_MEMSET64   /* arch function */
 
 void *memcpy(void *dest, const void *src, size_t n);
 void *memset(void *s, int c, size_t n);
 void *memmove(void *dest, const void *src, size_t n);
 
-#ifndef CONFIG_KASAN
+#if !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
 #define __HAVE_ARCH_MEMCHR /* inline & arch function */
 #define __HAVE_ARCH_MEMCMP /* arch function */
 #define __HAVE_ARCH_MEMSCAN/* inline & arch function */
@@ -36,6 +33,9 @@ void *memmove(void *dest, const void *src, size_t n);
 #define __HAVE_ARCH_STRNCPY/* arch function */
 #define __HAVE_ARCH_STRNLEN/* inline & arch function */
 #define __HAVE_ARCH_STRSTR /* arch function */
+#define __HAVE_ARCH_MEMSET16   /* arch function */
+#define __HAVE_ARCH_MEMSET32   /* arch function */
+#define __HAVE_ARCH_MEMSET64   /* arch function */
 
 /* Prototypes for non-inlined arch strings functions. */
 int memcmp(const void *s1, const void *s2, size_t n);
@@ -44,7 +44,7 @@ size_t strlcat(char *dest, const char *src, size_t n);
 char *strncat(char *dest, const char *src, size_t n);
 char *strncpy(char *dest, const char *src, size_t n);
 char *strstr(const char *s1, const char *s2);
-#endif /* !CONFIG_KASAN */
+#endif /* !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN) */
 
 #undef __HAVE_ARCH_STRCHR
 #undef __HAVE_ARCH_STRNCHR
@@ -74,21 +74,6 @@ void *__memset16(uint16_t *s, uint16_t v, size_t count);
 void *__memset32(uint32_t *s, uint32_t v, size_t count);
 void *__memset64(uint64_t *s, uint64_t v, size_t count);
 
-static inline void *memset16(uint16_t *s, uint16_t v, size_t count)
-{
-   return __memset16(s, v, count * sizeof(v));
-}
-
-static inline void *memset32(uint32_t *s, uint32_t v, size_t count)
-{
-   return __memset32(s, v, count * sizeof(v));
-}
-
-static inline void *memset64(uint64_t *s, uint64_t v, size_t count)
-{
-   return __memset64(s, v, count * sizeof(v));
-}
-
 #if !defined(IN_ARCH_STRING_C) && (!defined(CONFIG_FORTIFY_SOURCE) || defined(__NO_FORTIFY))
 
 #ifdef __HAVE_ARCH_MEMCHR
@@ -194,6 +179,27 @@ static inline size_t strnlen(const char * s, size_t n)
return end - s;
 }
 #endif
+
+#ifdef __HAVE_ARCH_MEMSET16
+static inline void *memset16(uint16_t *s, uint16_t v, size_t count)
+{
+   return __memset16(s, v, count * sizeof(v));
+}
+#endif
+
+#ifdef __HAVE_ARCH_MEMSET32
+static inline void *memset32(uint32_t *s, uint32_t v, size_t count)
+{
+   return __memset32(s, v, count * sizeof(v));
+}
+#endif
+
+#ifdef __HAVE_ARCH_MEMSET64
+static inline void *memset64(uint64_t *s, uint64_t v, size_t count)
+{
+   return __memset64(s, v, count * sizeof(v));
+}
+#endif
 #else /* IN_ARCH_STRING_C */
 void *memchr(const void * s, int c, size_t n);
 void *memscan(void *s, int c, size_t n);
@@ -201,6 +207,9 @@ char *strcat(char *dst, const char *src);
 char *strcpy(char *dst, const char *src);
 size_t strlen(const char *s);
 size_t strnlen(const char * s, size_t n);
+void *memset16(uint16_t *s, uint16_t v, size_t count);
+void *memset32(uint32_t *s, uint32_t v, size_t count);
+void *memset64(uint64_t *s, uint64_t v, size_t count);
 #endif /* !IN_ARCH_STRING_C */
 
 #endif /* __S390_STRING_H_ */
-- 
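
The reason the memset{16,32,64}() wrappers must not resolve to the
assembly implementations under KMSAN can be illustrated as follows (a
sketch, not code from the series): stores emitted by the compiler are
instrumented and mark the destination as initialized, while stores
hidden inside assembly are not.

static void init_visible(unsigned long *p, unsigned long n)
{
	unsigned long i;

	/* Compiler-generated stores: KMSAN unpoisons p[0..n-1]. */
	for (i = 0; i < n; i++)
		p[i] = 0;
}

static void init_hidden(unsigned long *p, unsigned long n)
{
	/*
	 * Stand-in for an asm memset: KMSAN sees no stores here, so
	 * p[0..n-1] stays poisoned and a later read would warn.
	 */
	asm volatile("" : : "r" (p), "r" (n) : "memory");
}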

[PATCH 29/32] s390/uaccess: Add KMSAN support to put_user() and get_user()

2023-11-15 Thread Ilya Leoshkevich
put_user() uses inline assembly with precise constraints, so Clang is
in principle capable of instrumenting it automatically. Unfortunately,
one of the constraints contains a dereferenced user pointer, and Clang
does not currently distinguish user and kernel pointers. Therefore
KMSAN attempts to access shadow for user pointers, which is not the
right thing to do.

The obvious fix of adding __no_sanitize_memory to __put_user_fn() does
not work, since it's __always_inline. And __always_inline cannot be removed
due to the __put_user_bad() trick.

A different obvious fix, using the "a" constraint instead of "+Q",
degrades the code quality, which matters here, since this is a hot
path.

Instead, repurpose the __put_user_asm() macro to define
__put_user_{char,short,int,long}_noinstr() functions and mark them with
__no_sanitize_memory. For the non-KMSAN builds make them
__always_inline in order to keep the generated code quality. Also
define __put_user_{char,short,int,long}() functions, which call the
aforementioned ones and which *are* instrumented, because they call
KMSAN hooks, which may be implemented as macros.

The same applies to get_user() as well.

Acked-by: Heiko Carstens 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/uaccess.h | 110 ++--
 1 file changed, 78 insertions(+), 32 deletions(-)

diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h
index 81ae8a98e7ec..b0715b88b55a 100644
--- a/arch/s390/include/asm/uaccess.h
+++ b/arch/s390/include/asm/uaccess.h
@@ -78,13 +78,23 @@ union oac {
 
 int __noreturn __put_user_bad(void);
 
-#define __put_user_asm(to, from, size) \
-({ \
+#ifdef CONFIG_KMSAN
+#define GET_PUT_USER_NOINSTR_ATTRIBUTES inline __no_sanitize_memory
+#else
+#define GET_PUT_USER_NOINSTR_ATTRIBUTES __always_inline
+#endif
+
+#define DEFINE_PUT_USER(type)  \
+static GET_PUT_USER_NOINSTR_ATTRIBUTES int \
+__put_user_##type##_noinstr(unsigned type __user *to,  \
+   unsigned type *from,\
+   unsigned long size) \
+{  \
union oac __oac_spec = {\
.oac1.as = PSW_BITS_AS_SECONDARY,   \
.oac1.a = 1,\
};  \
-   int __rc;   \
+   int rc; \
\
asm volatile(   \
"   lr  0,%[spec]\n"\
@@ -93,12 +103,28 @@ int __noreturn __put_user_bad(void);
"2:\n"  \
EX_TABLE_UA_STORE(0b, 2b, %[rc])\
EX_TABLE_UA_STORE(1b, 2b, %[rc])\
-   : [rc] "=&d" (__rc), [_to] "+Q" (*(to)) \
+   : [rc] "=&d" (rc), [_to] "+Q" (*(to))   \
: [_size] "d" (size), [_from] "Q" (*(from)),\
  [spec] "d" (__oac_spec.val)   \
: "cc", "0");   \
-   __rc;   \
-})
+   return rc;  \
+}  \
+   \
+static __always_inline int \
+__put_user_##type(unsigned type __user *to, unsigned type *from,   \
+ unsigned long size)   \
+{  \
+   int rc; \
+   \
+   rc = __put_user_##type##_noinstr(to, from, size);   \
+   instrument_put_user(*from, to, size);   \
+   return rc;  \
+}
+
+DEFINE_PUT_USER(char);
+DEFINE_PUT_USER(short);
+DEFINE_PUT_USER(int);
+DEFINE_PUT_USER(long);
 
static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned long size)
 {
@@ -106,24 +132,24 @@ static __always_inline int __put_user_fn(void *x, void __user *ptr, unsigned lon
 
switch (size) {

[PATCH 24/32] s390/cpacf: Unpoison the results of cpacf_trng()

2023-11-15 Thread Ilya Leoshkevich
Prevent KMSAN from complaining about buffers filled by cpacf_trng()
being uninitialized.

Tested-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/cpacf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/cpacf.h b/arch/s390/include/asm/cpacf.h
index b378e2b57ad8..a72b92770c4b 100644
--- a/arch/s390/include/asm/cpacf.h
+++ b/arch/s390/include/asm/cpacf.h
@@ -473,6 +473,8 @@ static inline void cpacf_trng(u8 *ucbuf, unsigned long ucbuf_len,
: [ucbuf] "+&d" (u.pair), [cbuf] "+&d" (c.pair)
: [fc] "K" (CPACF_PRNO_TRNG), [opc] "i" (CPACF_PRNO)
: "cc", "memory", "0");
+   kmsan_unpoison_memory(ucbuf, ucbuf_len);
+   kmsan_unpoison_memory(cbuf, cbuf_len);
 }
 
 /**
-- 
2.41.0




[PATCH 20/32] s390: Turn off KMSAN for boot, vdso and purgatory

2023-11-15 Thread Ilya Leoshkevich
All other sanitizers are disabled for these components as well.

Reviewed-by: Alexander Gordeev 
Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/boot/Makefile  | 1 +
 arch/s390/kernel/vdso32/Makefile | 1 +
 arch/s390/kernel/vdso64/Makefile | 1 +
 arch/s390/purgatory/Makefile | 1 +
 4 files changed, 4 insertions(+)

diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
index c7c81e5f9218..5a05c927f703 100644
--- a/arch/s390/boot/Makefile
+++ b/arch/s390/boot/Makefile
@@ -8,6 +8,7 @@ GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 KBUILD_AFLAGS := $(KBUILD_AFLAGS_DECOMPRESSOR)
 KBUILD_CFLAGS := $(KBUILD_CFLAGS_DECOMPRESSOR)
diff --git a/arch/s390/kernel/vdso32/Makefile b/arch/s390/kernel/vdso32/Makefile
index caec7db6f966..8911c55a7f07 100644
--- a/arch/s390/kernel/vdso32/Makefile
+++ b/arch/s390/kernel/vdso32/Makefile
@@ -37,6 +37,7 @@ GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 # Force dependency (incbin is bad)
 $(obj)/vdso32_wrapper.o : $(obj)/vdso32.so
diff --git a/arch/s390/kernel/vdso64/Makefile b/arch/s390/kernel/vdso64/Makefile
index e3c9085f8fa7..f4f75c334d59 100644
--- a/arch/s390/kernel/vdso64/Makefile
+++ b/arch/s390/kernel/vdso64/Makefile
@@ -41,6 +41,7 @@ GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 # Force dependency (incbin is bad)
 $(obj)/vdso64_wrapper.o : $(obj)/vdso64.so
diff --git a/arch/s390/purgatory/Makefile b/arch/s390/purgatory/Makefile
index 4e930f566878..e8402287b0cd 100644
--- a/arch/s390/purgatory/Makefile
+++ b/arch/s390/purgatory/Makefile
@@ -20,6 +20,7 @@ GCOV_PROFILE := n
 UBSAN_SANITIZE := n
 KASAN_SANITIZE := n
 KCSAN_SANITIZE := n
+KMSAN_SANITIZE := n
 
 KBUILD_CFLAGS := -fno-strict-aliasing -Wall -Wstrict-prototypes
 KBUILD_CFLAGS += -Wno-pointer-sign -Wno-sign-compare
-- 
2.41.0




[PATCH 28/32] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs

2023-11-15 Thread Ilya Leoshkevich
This is normally done by the generic entry code, but the
kernel_stack_overflow() flow bypasses it.

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/kernel/traps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 1d2aa448d103..dd7362806dbb 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include <linux/kmsan.h>
 #include 
 #include 
 #include 
@@ -260,6 +261,7 @@ static void monitor_event_exception(struct pt_regs *regs)
 
 void kernel_stack_overflow(struct pt_regs *regs)
 {
+   kmsan_unpoison_entry_regs(regs);
bust_spinlocks(1);
printk("Kernel stack overflow.\n");
show_regs(regs);
-- 
2.41.0




[PATCH 31/32] s390: Implement the architecture-specific kmsan functions

2023-11-15 Thread Ilya Leoshkevich
arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the
prefix and calling kmsan_get_metadata() again.

kmsan_virt_addr_valid() delegates to virt_addr_valid().

Signed-off-by: Ilya Leoshkevich 
---
 arch/s390/include/asm/kmsan.h | 36 +++
 1 file changed, 36 insertions(+)
 create mode 100644 arch/s390/include/asm/kmsan.h

diff --git a/arch/s390/include/asm/kmsan.h b/arch/s390/include/asm/kmsan.h
new file mode 100644
index ..afec71e9e9ac
--- /dev/null
+++ b/arch/s390/include/asm/kmsan.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_S390_KMSAN_H
+#define _ASM_S390_KMSAN_H
+
+#include <asm/lowcore.h>
+#include <asm/page.h>
+#include <linux/kmsan.h>
+#include <linux/mmzone.h>
+#include <linux/types.h>
+
+#ifndef MODULE
+
+static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
+{
+   if (addr >= (void *)&S390_lowcore &&
+   addr < (void *)(&S390_lowcore + 1)) {
+   /*
+* Different lowcores accessed via S390_lowcore are described
+* by the same struct page. Resolve the prefix manually in
+* order to get a distinct struct page.
+*/
+   addr += (void *)lowcore_ptr[raw_smp_processor_id()] -
+   (void *)&S390_lowcore;
+   return kmsan_get_metadata(addr, is_origin);
+   }
+   return NULL;
+}
+
+static inline bool kmsan_virt_addr_valid(void *addr)
+{
+   return virt_addr_valid(addr);
+}
+
+#endif /* !MODULE */
+
+#endif /* _ASM_S390_KMSAN_H */
-- 
2.41.0




Re: [PATCH v6 07/12] x86/sgx: Introduce EPC page states

2023-11-15 Thread Jarkko Sakkinen
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> Use the lower 2 bits in the flags field of sgx_epc_page struct to track
> EPC states and define an enum for possible states for EPC pages tracked
> for reclamation.
>
> Add the RECLAIM_IN_PROGRESS state to explicitly indicate a page that is
> identified as a candidate for reclaiming, but has not yet been
> reclaimed, instead of relying on list_empty(&epc_page->list). A later
> patch will replace the array on stack with a temporary list to store the
> candidate pages, so list_empty() should no longer be used for this
> purpose.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Signed-off-by: Haitao Huang 
> Cc: Sean Christopherson 
> ---
> V6:
> - Drop UNRECLAIMABLE and use only 2 bits for states (Kai)
> - Combine the patch for RECLAIM_IN_PROGRESS
> - Style fixes (Jarkko and Kai)
> ---
>  arch/x86/kernel/cpu/sgx/encl.c |  2 +-
>  arch/x86/kernel/cpu/sgx/main.c | 33 +-
>  arch/x86/kernel/cpu/sgx/sgx.h  | 62 +++---
>  3 files changed, 76 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..17dc108d3ff7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -1315,7 +1315,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
>  {
>   int ret;
>  
> - WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
>  
>   ret = __eremove(sgx_get_epc_virt_addr(page));
>   if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d347acd717fd..e27ac73d8843 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -315,13 +315,14 @@ static void sgx_reclaim_pages(void)
>   list_del_init(&epc_page->list);
>   encl_page = epc_page->owner;
>  
> - if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> + if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> + sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>   chunk[cnt++] = epc_page;
> - else
> + } else
>   /* The owner is freeing the page. No need to add the
>* page back to the list of reclaimable pages.
>*/
> - epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + sgx_epc_page_reset_state(epc_page);
>   }
>   spin_unlock(&sgx_global_lru.lock);
>  
> @@ -347,6 +348,7 @@ static void sgx_reclaim_pages(void)
>  
>  skip:
>   spin_lock(&sgx_global_lru.lock);
> + sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
>   list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
>   spin_unlock(&sgx_global_lru.lock);
>  
> @@ -370,7 +372,7 @@ static void sgx_reclaim_pages(void)
>   sgx_reclaimer_write(epc_page, &backing[i]);
>  
>   kref_put(&encl_page->encl->refcount, sgx_encl_release);
> - epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + sgx_epc_page_reset_state(epc_page);
>  
>   sgx_free_epc_page(epc_page);
>   }
> @@ -509,7 +511,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
>  void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  {
>   spin_lock(&sgx_global_lru.lock);
> - page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
> + page->flags |= SGX_EPC_PAGE_RECLAIMABLE;
>   list_add_tail(&page->list, &sgx_global_lru.reclaimable);
>   spin_unlock(&sgx_global_lru.lock);
>  }
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
>  int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>  {
>   spin_lock(&sgx_global_lru.lock);
> - if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> - /* The page is being reclaimed. */
> - if (list_empty(&page->list)) {
> - spin_unlock(&sgx_global_lru.lock);
> - return -EBUSY;
> - }
> -
> - list_del(&page->list);
> - page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> + spin_unlock(&sgx_global_lru.lock);
> + return -EBUSY;
>   }
> +
> + list_del(&page->list);
> + sgx_epc_page_reset_state(page);
>   spin_unlock(&sgx_global_lru.lock);
>  
>   return 0;
> @@ -623,6 +623,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>   struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>   s
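
(The quoted patch is truncated above.) For readers following along, here
is a rough sketch of what 2-bit state tracking in the flags word can
look like; the definitions below are hypothetical reconstructions based
on the helper names visible in the diff, not necessarily the patch's
exact code (struct sgx_epc_page is assumed to carry an unsigned long
flags member):

#define SGX_EPC_PAGE_STATE_MASK	GENMASK(1, 0)

enum sgx_epc_page_state {
	SGX_EPC_PAGE_NOT_TRACKED		= 0,
	SGX_EPC_PAGE_RECLAIMABLE		= 1,
	SGX_EPC_PAGE_RECLAIM_IN_PROGRESS	= 2,
};

static inline void sgx_epc_page_set_state(struct sgx_epc_page *page,
					  unsigned long state)
{
	page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
	page->flags |= state & SGX_EPC_PAGE_STATE_MASK;
}

static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
{
	sgx_epc_page_set_state(page, SGX_EPC_PAGE_NOT_TRACKED);
}

static inline bool sgx_epc_page_reclaimable(unsigned long flags)
{
	return (flags & SGX_EPC_PAGE_STATE_MASK) == SGX_EPC_PAGE_RECLAIMABLE;
}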

Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality

2023-11-15 Thread Jarkko Sakkinen
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi 
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem).  The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> globale reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang 
> Signed-off-by: Haitao Huang 
> ---
> V6:
> - Split the original large patch "Limit process EPC usage with misc cgroup controller" and restructure it (Kai)
> ---
>  arch/x86/Kconfig |  13 
>  arch/x86/kernel/cpu/sgx/Makefile |   1 +
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h |  36 ++
>  arch/x86/kernel/cpu/sgx/main.c   |  28 
>  arch/x86/kernel/cpu/sgx/sgx.h|   3 +
>  6 files changed, 184 insertions(+)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>  
> If unsure, say N.
>  
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> +   Provides control over the EPC footprint of tasks in a cgroup via
> +   the Miscellaneous cgroup controller.
> +
> +   EPC is a subset of regular memory that is usable only by SGX
> +   enclaves and is very limited in quantity, e.g. less than 1%
> +   of total DRAM.
> +
> +   Say N if unsure.
> +
>  config X86_USER_SHADOW_STACK
>   bool "X86 userspace shadow stack"
>   depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
>   ioctl.o \
>   main.o
>  obj-$(CONFIG_X86_SGX_KVM)+= virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC)+= epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index ..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include 
> +#include 
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> + return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.

Should have a description explaining what "charging hierarchically" is
all about. This is too cryptic as it stands.

E.g. consider what non-hierarchical charging means. There must be an
opposite end in order for the term to have a meaning (as with anything
expressed in a language).

> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> + int ret;
> +
> + if (sgx_e
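
(The quoted patch is truncated above.) To unpack the term Jarkko is
asking about: hierarchical charging means a charge must fit under the
limit of the target cgroup and of every ancestor up to the root, and is
rolled back if any level would overflow; the misc controller does this
internally in misc_cg_try_charge(). A self-contained sketch of the idea
(illustrative types, not the kernel's):

#include <linux/errno.h>

struct cg {
	struct cg *parent;
	unsigned long usage;
	unsigned long max;
};

static int try_charge_hierarchy(struct cg *cg, unsigned long amount)
{
	struct cg *i, *j;

	/* Walk up: every level must have room for the charge. */
	for (i = cg; i; i = i->parent) {
		if (i->usage + amount > i->max)
			goto revert;
		i->usage += amount;
	}
	return 0;

revert:
	/* Roll back the levels that were already charged. */
	for (j = cg; j != i; j = j->parent)
		j->usage -= amount;
	return -EBUSY;
}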

Re: [PATCH v6 08/12] x86/sgx: Use a list to track to-be-reclaimed pages

2023-11-15 Thread Jarkko Sakkinen
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Sean Christopherson 
>
> Change sgx_reclaim_pages() to use a list rather than an array for
> storing the epc_pages which will be reclaimed. This change is needed
> to transition to the LRU implementation for EPC cgroup support.
>
> When the EPC cgroup is implemented, the reclaiming process will do a
> pre-order tree walk for the subtree starting from the limit-violating
> cgroup.  When each node is visited, candidate pages are selected from
> its "reclaimable" LRU list and moved into this temporary list. Passing a
> list from node to node for temporary storage in this walk is more
> straightforward than using an array.
>
> Signed-off-by: Sean Christopherson 
> Co-developed-by: Kristen Carlson Accardi 
> Signed-off-by: Kristen Carlson Accardi 
> Co-developed-by: Haitao Huang
> Signed-off-by: Haitao Huang
> Cc: Sean Christopherson 
> ---
> V6:
> - Remove extra list_del_init and style fix (Kai)
>
> V4:
> - Changes needed for patch reordering
> - Revised commit message
>
> V3:
> - Removed list wrappers
> ---
>  arch/x86/kernel/cpu/sgx/main.c | 35 +++---
>  1 file changed, 15 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index e27ac73d8843..33bcba313d40 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -296,12 +296,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>   */
>  static void sgx_reclaim_pages(void)
>  {
> - struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>   struct sgx_backing backing[SGX_NR_TO_SCAN];
> + struct sgx_epc_page *epc_page, *tmp;
>   struct sgx_encl_page *encl_page;
> - struct sgx_epc_page *epc_page;
>   pgoff_t page_index;
> - int cnt = 0;
> + LIST_HEAD(iso);
>   int ret;
>   int i;
>  
> @@ -317,7 +316,7 @@ static void sgx_reclaim_pages(void)
>  
>   if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
>   sgx_epc_page_set_state(epc_page, 
>   sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> + list_move_tail(&epc_page->list, &iso);
>   } else
>   /* The owner is freeing the page. No need to add the
>* page back to the list of reclaimable pages.
> @@ -326,8 +325,11 @@ static void sgx_reclaim_pages(void)
>   }
>   spin_unlock(&sgx_global_lru.lock);
>  
> - for (i = 0; i < cnt; i++) {
> - epc_page = chunk[i];
> + if (list_empty(&iso))
> + return;
> +
> + i = 0;
> + list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>   encl_page = epc_page->owner;
>  
>   if (!sgx_reclaimer_age(epc_page))
> @@ -342,6 +344,7 @@ static void sgx_reclaim_pages(void)
>   goto skip;
>   }
>  
> + i++;
>   encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
>   mutex_unlock(&encl_page->encl->lock);
>   continue;
> @@ -349,27 +352,19 @@ static void sgx_reclaim_pages(void)
>  skip:
>   spin_lock(&sgx_global_lru.lock);
>   sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
> - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> + list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
>   spin_unlock(&sgx_global_lru.lock);
>  
>   kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -
> - chunk[i] = NULL;
> - }
> -
> - for (i = 0; i < cnt; i++) {
> - epc_page = chunk[i];
> - if (epc_page)
> - sgx_reclaimer_block(epc_page);
>   }
>  
> - for (i = 0; i < cnt; i++) {
> - epc_page = chunk[i];
> - if (!epc_page)
> - continue;
> + list_for_each_entry(epc_page, &iso, list)
> + sgx_reclaimer_block(epc_page);
>  
> + i = 0;
> + list_for_each_entry_safe(epc_page, tmp, &iso, list) {
>   encl_page = epc_page->owner;
> - sgx_reclaimer_write(epc_page, &backing[i]);
> + sgx_reclaimer_write(epc_page, &backing[i++]);

Couldn't you alternatively use "&backing[--i]" and not reset i to zero
before the loop?

>  
>   kref_put(&encl_page->encl->refcount, sgx_encl_release);
>   sgx_epc_page_reset_state(epc_page);

BR, Jarkko
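
The conversion under discussion is the standard kernel idiom of
isolating entries onto a local list head instead of into a fixed-size
array; a minimal sketch with the <linux/list.h> API (simplified types,
not the SGX code itself):

#include <linux/list.h>

struct page_entry {
	struct list_head list;
};

static void isolate_batch(struct list_head *lru, struct list_head *iso,
			  int max)
{
	struct page_entry *p;
	int i;

	for (i = 0; i < max; i++) {
		p = list_first_entry_or_null(lru, struct page_entry, list);
		if (!p)
			break;
		/* Move rather than delete: p stays on a list for later passes. */
		list_move_tail(&p->list, iso);
	}
}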


Re: [PATCH v6 12/12] selftests/sgx: Add scripts for EPC cgroup testing

2023-11-15 Thread Jarkko Sakkinen
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test.  Each
> of such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 10 of them and expects all to pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>
> Signed-off-by: Haitao Huang 
> ---
> V5:
>
> - Added script with automatic results checking, remove the interactive
> script.
> - The script can run independent from the series below.
> ---
>  .../selftests/sgx/run_epc_cg_selftests.sh | 196 ++
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 ++
>  2 files changed, 209 insertions(+)
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index ..72b93f694753
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,196 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG
> +if [ $? -ne 0 ]; then
> +echo "# Please make sure cgroup-tools is installed, and misc cgroup is 
> mounted."
> +exit 1
> +fi
> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1
> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_ROOT=/sys/fs/cgroup
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> +echo "# cgroup V2 is in use."
> +else
> +echo "# cgroup V1 is in use."
> +CG_ROOT=/sys/fs/cgroup/misc
> +fi

Does the test need to support v1 cgroups?

> +
> +CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity size. So
> +# should fail oversubscribed cases
> +SMALL=$(( CAPACITY / 512 ))
> +
> +# At least load one enclave of capacity size successfully, maybe up to 4.
> +# But some may fail if we run more than 4 concurrent enclaves of capacity size.
> +LARGE=$(( SMALL * 4 ))
> +
> +# Load lots of enclaves
> +LARGER=$CAPACITY
> +echo "# Setting up limits."
> +echo "sgx_epc $SMALL" | tee $CG_ROOT/$TEST_CG_SUB1/misc.max
> +echo "sgx_epc $LARGE" | tee $CG_ROOT/$TEST_CG_SUB2/misc.max
> +echo "sgx_epc $LARGER" | tee $CG_ROOT/$TEST_CG_SUB4/misc.max
> +
> +timestamp=$(date +%Y%m%d_%H%M%S)
> +
> +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
> +
> +echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting 
> failure..."
> +# Always use leaf node of misc cgroups so it works for both v1 and v2
> +# these may fail on OOM
> +cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
> +if [[ $? -eq 0 ]]; then
> +echo "# Fail on SMALL limit, not expecting any test passes."
> +cgdelete -r -g misc:$TEST_ROOT_CG
> +exit 1
> +else
> +echo "# Test failed as expected."
> +fi
> +
> +echo "# PASSED SMALL limit."
> +
> +echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE 
> limit,
> +expecting at least one success"
> +pids=()
> +for i in {1..4}; do
> +(
> +cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
> +) &
> +pids+=($!)
> +done
> +
> +any_success=0
> +for pid in "${pids[@]}"; do
> +wait "$pid"
> +status=$?
> +if [[ $status -eq 0 ]]; then
> +any_success=1
> + echo "# Process $pid returned successfully."
> +fi
> +done
> +
> +if [[ $any_success -eq 0 ]]; then
> +echo "# Failed on LARG

Re: [PATCH v6 12/12] selftests/sgx: Add scripts for EPC cgroup testing

2023-11-15 Thread Haitao Huang




> > +CG_ROOT=/sys/fs/cgroup
> > +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> > +echo "# cgroup V2 is in use."
> > +else
> > +echo "# cgroup V1 is in use."
> > +CG_ROOT=/sys/fs/cgroup/misc
> > +fi
>
> Does the test need to support v1 cgroups?

I thought some distros may still only support v1. I do most of my work
on Ubuntu 22.04, which defaults to v1, so it's convenient for me to
test. But no strong opinion.


Thanks
Haitao


Re: [PATCH v7 00/13] selftests/sgx: Fix compilation errors

2023-11-15 Thread Jarkko Sakkinen
On Wed Nov 8, 2023 at 10:31 PM EET, Jo Van Bulck wrote:
> On 23.10.23 23:32, Jarkko Sakkinen wrote:
> > On Fri Oct 13, 2023 at 2:45 PM EEST, Jo Van Bulck wrote:
> >> On 10.10.23 11:44, Jarkko Sakkinen wrote:
> >>> Folks (sorry for top posting): I've now taken my old NUC7 out of the
> >>> dust and tested the series :-)
> >>>
> >>> Tested-by: Jarkko Sakkinen 
> >>
> >> Thanks for testing this Jarkko! Not sure on next steps, do you want me
> >> to re-post the series with the Tested-by tag for all commits or will you
> >> add that? Let me know if something from my side is needed.
> > 
> > Dave, can you pick these patches to the x86 tree with my tested-by
> > added? Sorry for the latency. It is flu season in Finland and I've
> > been varyingly functional over the last week because of that.
>
> Just a kind follow-up: from what I can see, this series has not been 
> merged into the x86/sgx branch of tip yet (assuming that's where it 
> should go next)?
>
> Apologies if I've overlooked anything, and please let me know if there's 
> something on my end that can help!

I'm cool merging them so it is now up to Dave to pick them into the
tip tree.

> Best,
> Jo

BR, Jarkko


Re: [PATCH v7 00/13] selftests/sgx: Fix compilation errors

2023-11-15 Thread Jarkko Sakkinen
On Wed Nov 8, 2023 at 10:46 PM EET, Dave Hansen wrote:
> On 11/8/23 12:31, Jo Van Bulck wrote:
> > Just a kind follow-up: from what I can see, this series has not been
> > merged into the x86/sgx branch of tip yet (assuming that's where it
> > should go next)?
> > 
> > Apologies if I've overlooked anything, and please let me know if there's
> > something on my end that can help!
>
> Yes, you've missed something.  For your reading pleasure:
>
> https://www.kernel.org/doc/html/latest/process/2.Process.html?highlight=merge%20window
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> I honestly didn't even think about applying this until Jarkko said
> something on the 23rd.  By that point, it was far too late for it to get
> sorted out for 6.7.  So, that puts it in the next merge window.
> Specifically:
>
>   The release candidate -rc1 is the starting point for new
>   patches to be applied which are targeted for the next merge
>   window.
>
> So wait for the next -rc1, and you'll hopefully see your series get merged.

OK, great, thank you Dave.

BR, Jarkko