date:20240814

[PATCH v1 0/2] mremap refactor: check src address for vma boundaries first.

2024-08-14 Thread jeffxu

From: Jeff Xu 

mremap doesn't allow relocate, expand, shrink across VMA boundaries,
refactor the code to check src address range before doing anything on
the destination, i.e. destination won't be unmapped, if src address
failed the boundaries check.

This also allows us to remove can_modify_mm from mremap.c, since
the src address must be single VMA, can_modify_vma is used.

It is likely this will improve the performance on mremap, previously 
the code does sealing check using can_modify_mm for the src address range,
and the new code removed the loop (used by can_modify_mm).

In order to verify this patch doesn't regress on mremap, I added tests in
mseal_test, the test patch can be applied before mremap refactor patch or
checkin independently.

Also this patch doesn't change mseal's existing schematic: if sealing fail,
user can expect the src/dst address isn't updated. So this patch can be
applied regardless if we decided to go with current out-of-loop approach 
or in-loop approach currently in discussion.

Regarding the perf test report by stress-ng [1] title:
8be7258aad: stress-ng.pagemove.page_remaps_per_sec -4.4% regression

The test is using below for testing:
stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --pagemove 64

I can't repro this using ChromeOS, the pagemove test shows large value
of stddev and stderr, and can't reasonably refect the performance impact.

For example: I write a c program [2] to run the above pagemove test 10 times
and calculate the stddev, stderr, for 3 commits:

1> before mseal feature is added:
Ops/sec:
  Mean : 3564.40
  Std Dev  : 2737.35 (76.80% of Mean)
  Std Err  : 865.63 (24.29% of Mean)

2> after mseal feature is added:
Ops/sec:
  Mean : 2703.84
  Std Dev  : 2085.13 (77.12% of Mean)
  Std Err  : 659.38 (24.39% of Mean)

3> after current patch (mremap refactor)
Ops/sec:
  Mean : 3603.67
  Std Dev  : 2422.22 (67.22% of Mean)
  Std Err  : 765.97 (21.26% of Mean)

The result shows 21%-24% stderr, this means whatever perf improvment/impact
there might be won't be measured correctly by this test.

This test machine has 32G memory,  Intel(R) Celeron(R) 7305, 5 CPU.
And I reboot the machine before each test, and take the first 10 runs with
run_stress_ng 10 

(I will run longer duration to see if test still shows large stdDev,StdErr)

[1] https://lore.kernel.org/lkml/202408041602.caa0372-oliver.s...@intel.com/
[2] https://github.com/peaktocreek/mmperf/blob/main/run_stress_ng.c


Jeff Xu (2):
  mseal:selftest mremap across VMA boundaries.
  mseal: refactor mremap to remove can_modify_mm

 mm/internal.h   |  24 ++
 mm/mremap.c |  77 +++
 mm/mseal.c  |  17 --
 tools/testing/selftests/mm/mseal_test.c | 293 +++-
 4 files changed, 353 insertions(+), 58 deletions(-)

-- 
2.46.0.76.ge559c4bf1a-goog

[PATCH v1 1/2] mseal:selftest mremap across VMA boundaries.

2024-08-14 Thread jeffxu

From: Jeff Xu 

Add selftest to mremap across VMA boundaries,
i.e. mremap will fail.

Signed-off-by: Jeff Xu 
---
 tools/testing/selftests/mm/mseal_test.c | 293 +++-
 1 file changed, 292 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mseal_test.c 
b/tools/testing/selftests/mm/mseal_test.c
index 5bce2fe102ab..422cf90fb56c 100644
--- a/tools/testing/selftests/mm/mseal_test.c
+++ b/tools/testing/selftests/mm/mseal_test.c
@@ -1482,6 +1482,47 @@ static void test_seal_mremap_move_dontunmap_anyaddr(bool 
seal)
REPORT_TEST_PASS();
 }
 
+static void test_seal_mremap_move_dontunmap_allocated(bool seal)
+{
+   void *ptr, *ptr2;
+   unsigned long page_size = getpagesize();
+   unsigned long size = 4 * page_size;
+   int ret;
+   void *ret2;
+
+   setup_single_address(size, &ptr);
+   FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+   if (seal) {
+   ret = sys_mseal(ptr, size);
+   FAIL_TEST_IF_FALSE(!ret);
+   }
+
+   /*
+* The new address is allocated.
+*/
+   setup_single_address(size, &ptr2);
+   FAIL_TEST_IF_FALSE(ptr2 != (void *)-1);
+
+   /*
+* remap to allocated address.
+*/
+   ret2 = sys_mremap(ptr, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP,
+   (void *) ptr2);
+   if (seal) {
+   FAIL_TEST_IF_FALSE(ret2 == MAP_FAILED);
+   FAIL_TEST_IF_FALSE(errno == EPERM);
+   } else {
+   /* remap success and but it won't be ptr2 */
+   FAIL_TEST_IF_FALSE(!(ret2 == MAP_FAILED));
+   FAIL_TEST_IF_FALSE(ret2 !=  ptr2);
+   }
+
+   REPORT_TEST_PASS();
+}
+
+
+
 static void test_seal_merge_and_split(void)
 {
void *ptr;
@@ -1746,6 +1787,239 @@ static void test_seal_discard_ro_anon(bool seal)
REPORT_TEST_PASS();
 }
 
+static void test_seal_mremap_shrink_multiple_vmas(bool seal)
+{
+   void *ptr;
+   unsigned long page_size = getpagesize();
+   unsigned long size = 12 * page_size;
+   int ret;
+   void *ret2;
+   int prot;
+
+   setup_single_address(size, &ptr);
+   FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+   ret = sys_mprotect(ptr + 4 * page_size, 4 * page_size, PROT_NONE);
+   FAIL_TEST_IF_FALSE(!ret);
+
+   size = get_vma_size(ptr, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+
+   size = get_vma_size(ptr + 4 * page_size, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+
+   if (seal) {
+   ret = sys_mseal(ptr + 4 * page_size, 4 * page_size);
+   FAIL_TEST_IF_FALSE(!ret);
+   }
+
+   ret2 = sys_mremap(ptr, 12 * page_size, 6 * page_size, 0, 0);
+   if (seal) {
+   FAIL_TEST_IF_FALSE(ret2 == (void *) MAP_FAILED);
+   FAIL_TEST_IF_FALSE(errno == EPERM);
+
+   size = get_vma_size(ptr, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+   FAIL_TEST_IF_FALSE(prot == 0x4);
+
+   size = get_vma_size(ptr + 4 * page_size, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+   FAIL_TEST_IF_FALSE(prot == 0x0);
+   } else {
+   FAIL_TEST_IF_FALSE(ret2 == ptr);
+
+   size = get_vma_size(ptr, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+
+   size = get_vma_size(ptr + 4 * page_size, &prot);
+   FAIL_TEST_IF_FALSE(size == 2 * page_size);
+   }
+
+   REPORT_TEST_PASS();
+}
+
+static void test_seal_mremap_expand_multiple_vmas(bool seal)
+{
+   void *ptr;
+   unsigned long page_size = getpagesize();
+   unsigned long size = 12 * page_size;
+   int ret;
+   void *ret2;
+   int prot;
+
+   setup_single_address(size, &ptr);
+   FAIL_TEST_IF_FALSE(ptr != (void *)-1);
+
+   ret = sys_mprotect(ptr + 4 * page_size, 4 * page_size, PROT_NONE);
+   FAIL_TEST_IF_FALSE(!ret);
+
+   /* ummap last 4 pages. */
+   ret = sys_munmap(ptr + 8 * page_size, 4 * page_size);
+   FAIL_TEST_IF_FALSE(!ret);
+
+   size = get_vma_size(ptr, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+
+   size = get_vma_size(ptr + 4 * page_size, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+
+   if (seal) {
+   ret = sys_mseal(ptr + 4 * page_size, 4 * page_size);
+   FAIL_TEST_IF_FALSE(!ret);
+   }
+
+   ret2 = sys_mremap(ptr, 8 * page_size, 12 * page_size, 0, 0);
+   if (seal) {
+   FAIL_TEST_IF_FALSE(ret2 == (void *) MAP_FAILED);
+
+   size = get_vma_size(ptr, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+   FAIL_TEST_IF_FALSE(prot == 0x4);
+
+   size = get_vma_size(ptr + 4 * page_size, &prot);
+   FAIL_TEST_IF_FALSE(size == 4 * page_size);
+   FAIL_TEST_IF_FALSE(prot == 0x0);
+   } else {
+

[PATCH v1 2/2] mseal: refactor mremap to remove can_modify_mm

2024-08-14 Thread jeffxu

From: Jeff Xu 

mremap doesn't allow relocate, expand, shrink across VMA boundaries,
refactor the code to check src address range before doing anything on
the destination.

This also allow we remove can_modify_mm from mremap, since
the src address must be single VMA, use can_modify_vma instead.

Signed-off-by: Jeff Xu 
---
 mm/internal.h | 24 
 mm/mremap.c   | 77 +--
 mm/mseal.c| 17 
 3 files changed, 61 insertions(+), 57 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..53f0bbbc6449 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1501,6 +1501,24 @@ bool can_modify_mm(struct mm_struct *mm, unsigned long 
start,
unsigned long end);
 bool can_modify_mm_madv(struct mm_struct *mm, unsigned long start,
unsigned long end, int behavior);
+
+static inline bool vma_is_sealed(struct vm_area_struct *vma)
+{
+   return (vma->vm_flags & VM_SEALED);
+}
+
+/*
+ * check if a vma is sealed for modification.
+ * return true, if modification is allowed.
+ */
+static inline bool can_modify_vma(struct vm_area_struct *vma)
+{
+   if (unlikely(vma_is_sealed(vma)))
+   return false;
+
+   return true;
+}
+
 #else
 static inline int can_do_mseal(unsigned long flags)
 {
@@ -1518,6 +1536,12 @@ static inline bool can_modify_mm_madv(struct mm_struct 
*mm, unsigned long start,
 {
return true;
 }
+
+static inline bool can_modify_vma(struct vm_area_struct *vma)
+{
+   return true;
+}
+
 #endif
 
 #ifdef CONFIG_SHRINKER_DEBUG
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc640..3c5bb671a280 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -904,28 +904,7 @@ static unsigned long mremap_to(unsigned long addr, 
unsigned long old_len,
 
/*
 * In mremap_to().
-* Move a VMA to another location, check if src addr is sealed.
-*
-* Place can_modify_mm here because mremap_to()
-* does its own checking for address range, and we only
-* check the sealing after passing those checks.
-*
-* can_modify_mm assumes we have acquired the lock on MM.
 */
-   if (unlikely(!can_modify_mm(mm, addr, addr + old_len)))
-   return -EPERM;
-
-   if (flags & MREMAP_FIXED) {
-   /*
-* In mremap_to().
-* VMA is moved to dst address, and munmap dst first.
-* do_munmap will check if dst is sealed.
-*/
-   ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
-   if (ret)
-   goto out;
-   }
-
if (old_len > new_len) {
ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
if (ret)
@@ -939,6 +918,26 @@ static unsigned long mremap_to(unsigned long addr, 
unsigned long old_len,
goto out;
}
 
+   /*
+* Since we can't remap across vma boundaries,
+* check single vma instead of src address range.
+*/
+   if (unlikely(!can_modify_vma(vma))) {
+   ret = -EPERM;
+   goto out;
+   }
+
+   if (flags & MREMAP_FIXED) {
+   /*
+* In mremap_to().
+* VMA is moved to dst address, and munmap dst first.
+* do_munmap will check if dst is sealed.
+*/
+   ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
+   if (ret)
+   goto out;
+   }
+
/* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */
if (flags & MREMAP_DONTUNMAP &&
!may_expand_vm(mm, vma->vm_flags, old_len >> PAGE_SHIFT)) {
@@ -1079,19 +1078,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
long, old_len,
goto out;
}
 
-   /*
-* Below is shrink/expand case (not mremap_to())
-* Check if src address is sealed, if so, reject.
-* In other words, prevent shrinking or expanding a sealed VMA.
-*
-* Place can_modify_mm here so we can keep the logic related to
-* shrink/expand together.
-*/
-   if (unlikely(!can_modify_mm(mm, addr, addr + old_len))) {
-   ret = -EPERM;
-   goto out;
-   }
-
/*
 * Always allow a shrinking remap: that just unmaps
 * the unnecessary pages..
@@ -1107,7 +1093,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
long, old_len,
}
 
ret = do_vmi_munmap(&vmi, mm, addr + new_len, old_len - new_len,
-   &uf_unmap, true);
+   &uf_unmap, true);
if (ret)
goto out;
 
@@ -1124,6 +1110,15 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned 
long, old_len,
goto out;
}
 
+   /*
+* Since we can't remap a

[PATCH net 0/2] selftests: Fix udpgro failures

2024-08-14 Thread Hangbin Liu

There are 2 issues for the current udpgro test. The first one is the testing
doesn't record all the failures, which may report pass but the test actually
failed. e.g.
https://netdev-3.bots.linux.dev/vmksft-net/results/725661/45-udpgro-sh/stdout

The other one is after commit d7db7775ea2e ("net: veth: do not manipulate
GRO when using XDP"), there is no need to load xdp program to enable GRO
on veth device.

Hangbin Liu (2):
  selftests: udpgro: report error when receive failed
  selftests: udpgro: no need to load xdp for gro

 tools/testing/selftests/net/udpgro.sh | 50 +--
 1 file changed, 25 insertions(+), 25 deletions(-)

-- 
2.45.0

[PATCH net 1/2] selftests: udpgro: report error when receive failed

2024-08-14 Thread Hangbin Liu

Currently, we only check the latest senders's exit code. If the receiver
report failed, it is not recoreded. Fix it by checking the exit code
of all the involved processes.

Before:
  bad GRO lookup  ok
  multiple GRO socks  ./udpgso_bench_rx: recv: bad packet 
len, got 1452, expected 14520

 ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520

 failed
 $ echo $?
 0

After:
  bad GRO lookup  ok
  multiple GRO socks  ./udpgso_bench_rx: recv: bad packet 
len, got 1452, expected 14520

 ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520

 failed
 $ echo $?
 1

Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO")
Suggested-by: Paolo Abeni 
Signed-off-by: Hangbin Liu 
---
 tools/testing/selftests/net/udpgro.sh | 41 ---
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
index 11a1ebda564f..7e0164247b83 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -49,14 +49,15 @@ run_one() {
 
cfg_veth
 
-   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} 
&& \
-   echo "ok" || \
-   echo "failed" &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} &
+   local PID1=$!
 
wait_local_port_listen ${PEER_NS} 8000 udp
./udpgso_bench_tx ${tx_args}
-   ret=$?
-   wait $(jobs -p)
+   check_err $?
+   wait ${PID1}
+   check_err $?
+   [ "$ret" -eq 0 ] && echo "ok" || echo "failed"
return $ret
 }
 
@@ -93,16 +94,17 @@ run_one_nat() {
# ... so that GRO will match the UDP_GRO enabled socket, but packets
# will land on the 'plain' one
ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -b ${addr1} 
-n 0 &
-   pid=$!
-   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${family} -b 
${addr2%/*} ${rx_args} && \
-   echo "ok" || \
-   echo "failed"&
+   local PID1=$!
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${family} -b 
${addr2%/*} ${rx_args} &
+   local PID2=$!
 
wait_local_port_listen "${PEER_NS}" 8000 udp
./udpgso_bench_tx ${tx_args}
-   ret=$?
-   kill -INT $pid
-   wait $(jobs -p)
+   check_err $?
+   kill -INT ${PID1}
+   wait ${PID2}
+   check_err $?
+   [ "$ret" -eq 0 ] && echo "ok" || echo "failed"
return $ret
 }
 
@@ -115,16 +117,21 @@ run_one_2sock() {
cfg_veth
 
ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} 
-p 12345 &
-   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 2000 -R 10 ${rx_args} 
&& \
-   echo "ok" || \
-   echo "failed" &
+   local PID1=$!
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 2000 -R 10 ${rx_args} &
+   local PID2=$!
 
wait_local_port_listen "${PEER_NS}" 12345 udp
./udpgso_bench_tx ${tx_args} -p 12345
+   check_err $?
wait_local_port_listen "${PEER_NS}" 8000 udp
./udpgso_bench_tx ${tx_args}
-   ret=$?
-   wait $(jobs -p)
+   check_err $?
+   wait ${PID1}
+   check_err $?
+   wait ${PID2}
+   check_err $?
+   [ "$ret" -eq 0 ] && echo "ok" || echo "failed"
return $ret
 }
 
-- 
2.45.0

[PATCH net 2/2] selftests: udpgro: no need to load xdp for gro

2024-08-14 Thread Hangbin Liu

After commit d7db7775ea2e ("net: veth: do not manipulate GRO when using
XDP"), there is no need to load XDP program to enable GRO. On the other
hand, the current test is failed due to loading the XDP program. e.g.

 # selftests: net: udpgro.sh
 # ipv4
 #  no GRO  ok
 #  no GRO chk cmsg ok
 #  GRO ./udpgso_bench_rx: recv: bad packet 
len, got 1472, expected 14720
 #
 # failed

 [...]

 #  bad GRO lookup  ok
 #  multiple GRO socks  ./udpgso_bench_rx: recv: bad packet 
len, got 1452, expected 14520
 #
 # ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
 #
 # failed
 ok 1 selftests: net: udpgro.sh

After fix, all the test passed.

 # ./udpgro.sh
 ipv4
  no GRO  ok
  [...]
  multiple GRO socks  ok

Fixes: d7db7775ea2e ("net: veth: do not manipulate GRO when using XDP")
Reported-by: Yi Chen 
Closes: https://issues.redhat.com/browse/RHEL-53858
Signed-off-by: Hangbin Liu 
---
 tools/testing/selftests/net/udpgro.sh | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
index 7e0164247b83..180b337c8345 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -7,8 +7,6 @@ source net_helper.sh
 
 readonly PEER_NS="ns-peer-$(mktemp -u XX)"
 
-BPF_FILE="xdp_dummy.bpf.o"
-
 # set global exit status, but never reset nonzero one.
 check_err()
 {
@@ -38,7 +36,7 @@ cfg_veth() {
ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
ip -netns "${PEER_NS}" link set dev veth1 up
-   ip -n "${PEER_NS}" link set veth1 xdp object ${BPF_FILE} section xdp
+   ip netns exec "${PEER_NS}" ethtool -K veth1 gro on
 }
 
 run_one() {
@@ -203,11 +201,6 @@ run_all() {
return $ret
 }
 
-if [ ! -f ${BPF_FILE} ]; then
-   echo "Missing ${BPF_FILE}. Run 'make' first"
-   exit -1
-fi
-
 if [[ $# -eq 0 ]]; then
run_all
 elif [[ $1 == "__subprocess" ]]; then
-- 
2.45.0

[PATCH net v3] selftest: af_unix: Fix kselftest compilation warnings

2024-08-14 Thread Abhinav Jain

Change expected_buf from (const void *) to (const char *)
in function __recvpair().
This change fixes the below warnings during test compilation:

```
In file included from msg_oob.c:14:
msg_oob.c: In function ‘__recvpair’:

../../kselftest_harness.h:106:40: warning: format ‘%s’ expects argument
of type ‘char *’,but argument 6 has type ‘const void *’ [-Wformat=]

../../kselftest_harness.h:101:17: note: in expansion of macro ‘__TH_LOG’
msg_oob.c:235:17: note: in expansion of macro ‘TH_LOG’

../../kselftest_harness.h:106:40: warning: format ‘%s’ expects argument
of type ‘char *’,but argument 6 has type ‘const void *’ [-Wformat=]

../../kselftest_harness.h:101:17: note: in expansion of macro ‘__TH_LOG’
msg_oob.c:259:25: note: in expansion of macro ‘TH_LOG’
```

Fixes: d098d77232c3 ("selftest: af_unix: Add msg_oob.c.")
Signed-off-by: Abhinav Jain 
Reviewed-by: Kuniyuki Iwashima 
---

v2:
https://lore.kernel.org/all/20240812191122.1092806-1-jain.abhinav...@gmail.com/

- Change the parameter expected_buf from (const void *) to (const char *)
  in the function __recvpair() as per the feedback in v1.
- Add Fixes tag as per feedback in v1.

v1:
https://lore.kernel.org/netdev/20240810134037.669765-1-jain.abhinav...@gmail.com
---
 tools/testing/selftests/net/af_unix/msg_oob.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/af_unix/msg_oob.c 
b/tools/testing/selftests/net/af_unix/msg_oob.c
index 16d0c172eaeb..535eb2c3d7d1 100644
--- a/tools/testing/selftests/net/af_unix/msg_oob.c
+++ b/tools/testing/selftests/net/af_unix/msg_oob.c
@@ -209,7 +209,7 @@ static void __sendpair(struct __test_metadata *_metadata,
 
 static void __recvpair(struct __test_metadata *_metadata,
   FIXTURE_DATA(msg_oob) *self,
-  const void *expected_buf, int expected_len,
+  const char *expected_buf, int expected_len,
   int buf_len, int flags)
 {
int i, ret[2], recv_errno[2], expected_errno = 0;
-- 
2.34.1

Re: [PATCH net v2] selftest: af_unix: Fix kselftest compilation warnings

2024-08-14 Thread Abhinav Jain

On Tue, 13 Aug 2024 18:21:06 -0700, Jakub Kicinski wrote:
> Some patchwork malfunction, the patch didn't get registered :(
> Could you resend?
>
> Please keep Kuniyuki's review tag and address his feedback.

Sure. I have submitted v3 keeping the above in mind, please review:
https://lore.kernel.org/all/20240814080743.1156166-1-jain.abhinav...@gmail.com/

Also, @Jakub, please kindly check this and revert (another patch on which you
have helped a lot already, need one small input and I can send the next 
version):
https://lore.kernel.org/all/20240810175509.404094-1-jain.abhinav...@gmail.com/

Re: [PATCH net-next v2] net: netconsole: selftests: Create a new netconsole selftest

2024-08-14 Thread Breno Leitao

Hello Jakub,

On Tue, Aug 13, 2024 at 03:37:16PM -0700, Jakub Kicinski wrote:
> On Tue, 13 Aug 2024 11:38:16 -0700 Breno Leitao wrote:
> > Adds a selftest that creates two virtual interfaces, assigns one to a
> > new namespace, and assigns IP addresses to both.
> > 
> > It listens on the destination interface using socat and configures a
> > dynamic target on netconsole, pointing to the destination IP address.
> > 
> > The test then checks if the message was received properly on the
> > destination interface.
> 
> We're getting a:
> 
> SKIP: directory /sys/kernel/config/netconsole does not exist. Check if 
> NETCONSOLE_DYNAMIC is enabled
> 
> https://netdev-3.bots.linux.dev/vmksft-net-drv-dbg/results/726381/4-netcons-basic-sh/stdout
> 
> Gotta extend tools/testing/selftests/drivers/net/config

Thanks, I've added the following changes in
tools/testing/selftests/drivers/net/config:

CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y

Then I was able to get the test executing running the following
commands:

# vng --build  --config tools/testing/selftests/net/config
# vng -v --run . --user root --cpus 4 -- \
make -C tools/testing/selftests TARGETS=drivers/net  \
TEST_PROGS="netcons_basic.sh" TEST_GEN_PROGS="" run_tests
.
# timeout set to 45
# selftests: drivers/net: netcons_basic.sh
[   11.172987] netdevsim netdevsim407 eni407np1: renamed from eth1
[   12.441965] printk: legacy console [netcon_ext0] enabled
[   12.443049] netpoll: netconsole: local port 6665
[   12.444104] netpoll: netconsole: local IPv4 address 192.168.1.1
[   12.445281] netpoll: netconsole: interface 'eni407np1'
[   12.446299] netpoll: netconsole: remote port 
[   12.447246] netpoll: netconsole: remote IPv4 address 192.168.1.2
[   12.448419] netpoll: netconsole: remote ethernet address 
aa:c8:83:a5:05:b3
[   12.450646] netconsole: network logging started
[   13.547405] netconsole selftest: netcons_GjLI0
ok 1 selftests: drivers/net: netcons_basic.sh

I will wait a bit more, and then integrate this change into v2.

Thanks for the review
--breno

Re: [PATCH RFT v8 4/9] fork: Add shadow stack support to clone3()

2024-08-14 Thread Catalin Marinas

On Tue, Aug 13, 2024 at 07:58:26PM +0100, Mark Brown wrote:
> On Tue, Aug 13, 2024 at 05:25:47PM +0100, Catalin Marinas wrote:
> > However, the x86 would be slightly inconsistent here between clone() and
> > clone3(). I guess it depends how you look at it. The classic clone()
> > syscall, if one doesn't pass CLONE_VM but does set new stack, there's no
> > new shadow stack allocated which I'd expect since it's a new stack.
> > Well, I doubt anyone cares about this scenario. Are there real cases of
> > !CLONE_VM but with a new stack?
> 
> ISTR the concerns were around someone being clever with vfork() but I
> don't remember anything super concrete.  In terms of the inconsistency
> here that was actually another thing that came up - if userspace
> specifies a stack for clone3() it'll just get used even with CLONE_VFORK
> so it seemed to make sense to do the same thing for the shadow stack.
> This was part of the thinking when we were looking at it, if you can
> specify a regular stack you should be able to specify a shadow stack.

Yes, I agree. But by this logic, I was wondering why the current clone()
behaviour does not allocate a shadow stack when a new stack is
requested with CLONE_VFORK. That's rather theoretical though and we may
not want to change the ABI.

> > > > I'm confused that we need to consume the token here. I could not find
> > > > the default shadow stack allocation doing this, only setting it via
> > > > create_rstor_token() (or I did not search enough). In the default case,
> 
> > > As discussed for a couple of previous versions if we don't have the
> > > token and userspace can specify any old shadow stack page as the shadow
> > > stack this allows clone3() to be used to overwrite the shadow stack of
> > > another thread, you can point to a shadow stack page which is currently
> 
> > IIUC, the kernel-allocated shadow stack will have the token always set
> > while the user-allocated one will be cleared. I was looking to
> 
> No, when the kernel allocates we don't bother with tokens at all.  We
> only look for and clear a token with the user specified shadow stack.

Ah, you are right, I misread the alloc_shstk() function. It takes a
set_res_tok parameter which is false for the normal allocation.

> > I guess I was rather questioning the current choices than the new
> > clone3() ABI. But even for the new clone3() ABI, does it make sense to
> > set up a shadow stack if the current stack isn't changed? We'll end up
> > with a lot of possible combinations that will never get tested but
> > potentially become obscure ABI. Limiting the options to the sane choices
> > only helps with validation and unsurprising changes later on.
> 
> OTOH if we add the restrictions it's more code (and more test code) to
> check, and thinking about if we've missed some important use case.  Not
> that it's a *huge* amount of code, like I say I'd not be too unhappy
> with adding a restriction on having a regular stack specified in order
> to specify a shadow stack.

I guess we just follow the normal stack behaviour for clone3(), at least
we'd be consistent with that.

Anyway, I understood this patch now and the ABI decisions. FWIW:

Reviewed-by: Catalin Marinas

Re: [PATCH bpf-next v4 2/2] selftests/bpf: Add mptcp subflow subtest

2024-08-14 Thread Matthieu Baerts

Hi Martin,

Thank you for your reply!

On 14/08/2024 03:12, Martin KaFai Lau wrote:
> On 8/5/24 2:52 AM, Matthieu Baerts (NGI0) wrote:
>> +static int endpoint_init(char *flags)
>> +{
>> +    SYS(fail, "ip -net %s link add veth1 type veth peer name veth2",
>> NS_TEST);
>> +    SYS(fail, "ip -net %s addr add %s/24 dev veth1", NS_TEST, ADDR_1);
>> +    SYS(fail, "ip -net %s link set dev veth1 up", NS_TEST);
>> +    SYS(fail, "ip -net %s addr add %s/24 dev veth2", NS_TEST, ADDR_2);
>> +    SYS(fail, "ip -net %s link set dev veth2 up", NS_TEST);
>> +    if (SYS_NOFAIL("ip -net %s mptcp endpoint add %s %s", NS_TEST,
>> ADDR_2, flags)) {
>> +    printf("'ip mptcp' not supported, skip this test.\n");
>> +    test__skip();
> 
> It is always a skip now in bpf CI:
> 
> #171/3   mptcp/subflow:SKIP
> 
> This test is a useful addition for the bpf CI selftest.
> 
> It can't catch regression if it is always a skip in bpf CI though.

Indeed, for the moment, this test is skipped in bpf CI.

The MPTCP CI checks the MPTCP BPF selftests that are on top of net and
net-next at least once a day. It is always running with the last stable
version of iproute2, so this test is not skipped:

   #169/3   mptcp/subflow:OK

https://github.com/multipath-tcp/mptcp_net-next/actions/runs/10384566794/job/28751869426#step:7:11080

> iproute2 needs to be updated (cc: Daniel Xu and Manu, the outdated
> iproute2 is something that came up multiple times).
> 
> Not sure when the iproute2 can be updated. In the mean time, your v3 is
> pretty close to getting pm_nl_ctl compiled. Is there other blocker on this?

I will try to find some time to check the modifications I suggested in
the v3, but I don't know how long it will take to have them ready, as
they might require some adaptations of the CI side as well, I need to
check. On the other hand, I understood adding a duplicated version of
the mptcp.h UAPI header is not an option either.

So not to block this (already old) series, I thought it would help to
first focus on this version using 'ip mptcp', while I'm looking at the
selftests modifications. If these modifications are successful, I can
always resend the patch 2/3 from the v3 later, and using 'pm_nl_ctl'
instead of 'ip mptcp', to be able to work with IPRoute2 5.5.

Do you think that could work like that?

>> +    goto fail;
>> +    }
>> +
>> +    return 0;
>> +fail:
>> +    return -1;
>> +}
>> +
>> +static int _ss_search(char *src, char *dst, char *port, char *keyword)
>> +{
>> +    return SYS_NOFAIL("ip netns exec %s ss -enita src %s dst %s %s %d
>> | grep -q '%s'",
>> +  NS_TEST, src, dst, port, PORT_1, keyword);
>> +}
>> +
>> +static int ss_search(char *src, char *keyword)
>> +{
>> +    return _ss_search(src, ADDR_1, "dport", keyword);
>> +}
>> +
>> +static void run_subflow(char *new)
>> +{
>> +    int server_fd, client_fd, err;
>> +    char cc[TCP_CA_NAME_MAX];
>> +    socklen_t len = sizeof(cc);
>> +
>> +    server_fd = start_mptcp_server(AF_INET, ADDR_1, PORT_1, 0);
>> +    if (!ASSERT_GE(server_fd, 0, "start_mptcp_server"))
>> +    return;
>> +
>> +    client_fd = connect_to_fd(server_fd, 0);
>> +    if (!ASSERT_GE(client_fd, 0, "connect to fd"))
>> +    goto fail;
>> +
>> +    err = getsockopt(server_fd, SOL_TCP, TCP_CONGESTION, cc, &len);
>> +    if (!ASSERT_OK(err, "getsockopt(srv_fd, TCP_CONGESTION)"))
>> +    goto fail;
>> +
>> +    send_byte(client_fd);
>> +
>> +    ASSERT_OK(ss_search(ADDR_1, "fwmark:0x1"), "ss_search fwmark:0x1");
>> +    ASSERT_OK(ss_search(ADDR_2, "fwmark:0x2"), "ss_search fwmark:0x2");
>> +    ASSERT_OK(ss_search(ADDR_1, new), "ss_search new cc");
>> +    ASSERT_OK(ss_search(ADDR_2, cc), "ss_search default cc");
> 
> Is there a getsockopt way instead of ss + grep?

No there isn't: from the userspace, the app communicates with the MPTCP
socket, which can have multiple paths (subflows, a TCP socket). To keep
the compatibility with TCP, [gs]etsockopt() will look at/modify the
whole MPTCP connection. For example, in some cases, a setsockopt() will
propagate the option to all the subflows. Depending on the option, the
modification might only apply to the first subflow, or to the
user-facing socket.

For advanced users who want to have different options set to the
different subflows of an MPTCP connection, they can use BPF: that's what
is being validated here. In other words, doing a 'getsockopt()' from the
userspace program here will not show all the different marks and TCP CC
that can be set per subflow with BPF. We can see that in the test: a
getsockopt() is done on the MPTCP socket to retrieve the default TCP CC
('cc' which is certainly 'cubic'), but we expect to find another one
('new' which is 'reno'), set by the BPF program from patch 1/2. I guess
we could use bpf to do a getsockopt() per subflow, but that's seems a
bit cheated to have the BPF test program setting something and checking
if it is set. Here, it is an external way. Because it is done from a
dedicated netns, it

Re: [PATCH net 1/2] selftests: udpgro: report error when receive failed

2024-08-14 Thread Paolo Abeni


On 8/14/24 09:57, Hangbin Liu wrote:

Currently, we only check the latest senders's exit code. If the receiver
report failed, it is not recoreded. Fix it by checking the exit code
of all the involved processes.

Before:
   bad GRO lookup  ok
   multiple GRO socks  ./udpgso_bench_rx: recv: bad packet 
len, got 1452, expected 14520

  ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520

  failed
  $ echo $?
  0

After:
   bad GRO lookup  ok
   multiple GRO socks  ./udpgso_bench_rx: recv: bad packet 
len, got 1452, expected 14520

  ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520

  failed
  $ echo $?
  1

Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO")
Suggested-by: Paolo Abeni 
Signed-off-by: Hangbin Liu 
---
  tools/testing/selftests/net/udpgro.sh | 41 ---
  1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
index 11a1ebda564f..7e0164247b83 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -49,14 +49,15 @@ run_one() {
  
  	cfg_veth
  
-	ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} && \

-   echo "ok" || \
-   echo "failed" &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} &
+   local PID1=$!
  
  	wait_local_port_listen ${PEER_NS} 8000 udp

./udpgso_bench_tx ${tx_args}
-   ret=$?
-   wait $(jobs -p)
+   check_err $?
+   wait ${PID1}
+   check_err $?
+   [ "$ret" -eq 0 ] && echo "ok" || echo "failed"


I think that with the above, in case of a failure, every test after the 
failing one will should fail, regardless of the actual results, am I 
correct?


Thanks,

Paolo

Re: [PATCH net 2/2] selftests: udpgro: no need to load xdp for gro

2024-08-14 Thread Toke Høiland-Jørgensen

Hangbin Liu  writes:

> After commit d7db7775ea2e ("net: veth: do not manipulate GRO when using
> XDP"), there is no need to load XDP program to enable GRO. On the other
> hand, the current test is failed due to loading the XDP program. e.g.
>
>  # selftests: net: udpgro.sh
>  # ipv4
>  #  no GRO  ok
>  #  no GRO chk cmsg ok
>  #  GRO ./udpgso_bench_rx: recv: bad 
> packet len, got 1472, expected 14720
>  #
>  # failed
>
>  [...]
>
>  #  bad GRO lookup  ok
>  #  multiple GRO socks  ./udpgso_bench_rx: recv: bad 
> packet len, got 1452, expected 14520
>  #
>  # ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
>  #
>  # failed
>  ok 1 selftests: net: udpgro.sh
>
> After fix, all the test passed.
>
>  # ./udpgro.sh
>  ipv4
>   no GRO  ok
>   [...]
>   multiple GRO socks  ok
>
> Fixes: d7db7775ea2e ("net: veth: do not manipulate GRO when using XDP")
> Reported-by: Yi Chen 
> Closes: https://issues.redhat.com/browse/RHEL-53858
> Signed-off-by: Hangbin Liu 

Reviewed-by: Toke Høiland-Jørgensen

Re: [PATCH net 2/2] selftests: udpgro: no need to load xdp for gro

2024-08-14 Thread Paolo Abeni


On 8/14/24 09:57, Hangbin Liu wrote:

After commit d7db7775ea2e ("net: veth: do not manipulate GRO when using
XDP"), there is no need to load XDP program to enable GRO. On the other
hand, the current test is failed due to loading the XDP program. e.g.

  # selftests: net: udpgro.sh
  # ipv4
  #  no GRO  ok
  #  no GRO chk cmsg ok
  #  GRO ./udpgso_bench_rx: recv: bad 
packet len, got 1472, expected 14720
  #
  # failed

  [...]

  #  bad GRO lookup  ok
  #  multiple GRO socks  ./udpgso_bench_rx: recv: bad 
packet len, got 1452, expected 14520
  #
  # ./udpgso_bench_rx: recv: bad packet len, got 1452, expected 14520
  #
  # failed
  ok 1 selftests: net: udpgro.sh

After fix, all the test passed.

  # ./udpgro.sh
  ipv4
   no GRO  ok
   [...]
   multiple GRO socks  ok

Fixes: d7db7775ea2e ("net: veth: do not manipulate GRO when using XDP")
Reported-by: Yi Chen 
Closes: https://issues.redhat.com/browse/RHEL-53858
Signed-off-by: Hangbin Liu 


LGTM,

Acked-by: Paolo Abeni

Re: [PATCH RFT v8 1/9] Documentation: userspace-api: Add shadow stack API documentation

2024-08-14 Thread Catalin Marinas

On Thu, Aug 08, 2024 at 09:15:22AM +0100, Mark Brown wrote:
> There are a number of architectures with shadow stack features which we are
> presenting to userspace with as consistent an API as we can (though there
> are some architecture specifics). Especially given that there are some
> important considerations for userspace code interacting directly with the
> feature let's provide some documentation covering the common aspects.
> 
> Signed-off-by: Mark Brown 

Reviewed-by: Catalin Marinas

Re: [PATCH RFT v8 3/9] mm: Introduce ARCH_HAS_USER_SHADOW_STACK

2024-08-14 Thread Catalin Marinas

On Thu, Aug 08, 2024 at 09:15:24AM +0100, Mark Brown wrote:
> Since multiple architectures have support for shadow stacks and we need to
> select support for this feature in several places in the generic code
> provide a generic config option that the architectures can select.
> 
> Suggested-by: David Hildenbrand 
> Acked-by: David Hildenbrand 
> Reviewed-by: Deepak Gupta 
> Reviewed-by: Rick Edgecombe 
> Signed-off-by: Mark Brown 

Reviewed-by: Catalin Marinas

Re: [PATCH net-next v2] net: netconsole: selftests: Create a new netconsole selftest

2024-08-14 Thread Petr Machata



Breno Leitao  writes:

> Adds a selftest that creates two virtual interfaces, assigns one to a
> new namespace, and assigns IP addresses to both.
>
> It listens on the destination interface using socat and configures a
> dynamic target on netconsole, pointing to the destination IP address.
>
> The test then checks if the message was received properly on the
> destination interface.
>
> Signed-off-by: Breno Leitao 
> ---
> Changelog:
>
> v2:
>  * Change the location of the path (Jakub)
>  * Move from veth to netdevsim
>  * Other small changes in dependency checks and cleanup
>
> v1:
>  * https://lore.kernel.org/all/zqyuhn770pjso...@gmail.com/
>
>  MAINTAINERS   |   1 +
>  tools/testing/selftests/drivers/net/Makefile  |   1 +
>  .../selftests/drivers/net/netcons_basic.sh| 223 ++
>  3 files changed, 225 insertions(+)
>  create mode 100755 tools/testing/selftests/drivers/net/netcons_basic.sh
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a9dace908305..ded45f1dff7e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -15770,6 +15770,7 @@ M:Breno Leitao 
>  S:   Maintained
>  F:   Documentation/networking/netconsole.rst
>  F:   drivers/net/netconsole.c
> +F:   tools/testing/selftests/drivers/net/netcons_basic.sh
>  
>  NETDEVSIM
>  M:   Jakub Kicinski 
> diff --git a/tools/testing/selftests/drivers/net/Makefile 
> b/tools/testing/selftests/drivers/net/Makefile
> index e54f382bcb02..928530b26abc 100644
> --- a/tools/testing/selftests/drivers/net/Makefile
> +++ b/tools/testing/selftests/drivers/net/Makefile
> @@ -3,6 +3,7 @@
>  TEST_INCLUDES := $(wildcard lib/py/*.py)
>  
>  TEST_PROGS := \
> + netcons_basic.sh \
>   ping.py \
>   queues.py \
>   stats.py \
> diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh 
> b/tools/testing/selftests/drivers/net/netcons_basic.sh
> new file mode 100755
> index ..e0e58fc7e89f
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
> @@ -0,0 +1,223 @@
> +#!/usr/bin/env bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# This test creates two netdevsim virtual interfaces, assigns one of them 
> (the
> +# "destination interface") to a new namespace, and assigns IP addresses to 
> both
> +# interfaces.
> +#
> +# It listens on the destination interface using socat and configures a 
> dynamic
> +# target on netconsole, pointing to the destination IP address.
> +#
> +# Finally, it checks whether the message was received properly on the
> +# destination interface.  Note that this test may pollute the kernel log 
> buffer
> +# (dmesg) and relies on dynamic configuration and namespaces being 
> configured.
> +#
> +# Author: Breno Leitao 
> +
> +set -euo pipefail
> +
> +SCRIPTDIR=$(dirname "$(readlink -e "${BASH_SOURCE[0]}")")
> +
> +# Simple script to test dynamic targets in netconsole
> +SRCIF="" # to be populated later
> +SRCIP=192.168.1.1
> +DSTIF="" # to be populated later
> +DSTIP=192.168.1.2
> +
> +PORT=""
> +MSG="netconsole selftest"
> +TARGET=$(mktemp -u netcons_X)
> +NETCONS_CONFIGFS="/sys/kernel/config/netconsole"
> +NETCONS_PATH="${NETCONS_CONFIGFS}"/"${TARGET}"
> +# This will have some tmp values appended to it in set_network()
> +NAMESPACE="netconsns_dst"
> +
> +# IDs for netdevsim
> +NSIM_DEV_1_ID=$((256 + RANDOM % 256))
> +NSIM_DEV_2_ID=$((512 + RANDOM % 256))
> +
> +# Used to create and delete namespaces
> +source "${SCRIPTDIR}"/../../net/lib.sh
> +
> +# Create netdevsim interfaces
> +create_ifaces() {
> + local NSIM_DEV_SYS_NEW=/sys/bus/netdevsim/new_device
> +
> + echo "$NSIM_DEV_2_ID" > "$NSIM_DEV_SYS_NEW"
> + echo "$NSIM_DEV_1_ID" > "$NSIM_DEV_SYS_NEW"
> + udevadm settle || true
> +
> + local 
> NSIM_DEV_1_SYS=/sys/bus/netdevsim/devices/netdevsim"$NSIM_DEV_1_ID"
> + local 
> NSIM_DEV_2_SYS=/sys/bus/netdevsim/devices/netdevsim"$NSIM_DEV_2_ID"
> +
> + # These are global variables
> + SRCIF=$(find "$NSIM_DEV_1_SYS"/net -maxdepth 1 -type d ! \
> + -path "$NSIM_DEV_1_SYS"/net -exec basename {} \;)
> + DSTIF=$(find "$NSIM_DEV_2_SYS"/net -maxdepth 1 -type d ! \
> + -path "$NSIM_DEV_2_SYS"/net -exec basename {} \;)
> +}
> +
> +link_ifaces() {
> + local NSIM_DEV_SYS_LINK="/sys/bus/netdevsim/link_device"
> + local SRCIF_IFIDX=$(cat /sys/class/net/"$SRCIF"/ifindex)
> + local DSTIF_IFIDX=$(cat /sys/class/net/"$DSTIF"/ifindex)
> +
> + exec {NAMESPACE_FD} + exec {INITNS_FD} +
> + # Bind the dst interface to namespace
> + ip link set "${DSTIF}" netns "${NAMESPACE}"
> +
> + # Linking one device to the other one (on the other namespace}
> + echo "${INITNS_FD}:$SRCIF_IFIDX $NAMESPACE_FD:$DSTIF_IFIDX" > 
> $NSIM_DEV_SYS_LINK
> + if [ $? -ne 0 ]; then
> + echo "linking netdevsim1 with netdevsim2 should succeed"
> + cleanup
> + exit ${ksft_skip}
> + fi
> +}
> +
> +function configure_ip() {
> + # Configure the IPs for

Re: [PATCH v2] selftests: fix relative rpath usage

2024-08-14 Thread Shuah Khan


On 8/13/24 10:33, Eugene Syromiatnikov wrote:

On Mon, Aug 12, 2024 at 05:03:45PM -0600, Shuah Khan wrote:

On 8/12/24 10:56, Eugene Syromiatnikov wrote:

The relative RPATH ("./") supplied to linker options in CFLAGS is resolved
relative to current working directory and not the executable directory,
which will lead in incorrect resolution when the test executables are run

>from elsewhere.  Changing it to $ORIGIN makes it resolve relative

to the directory in which the executables reside, which is supposedly
the desired behaviour.  This patch also moves these CFLAGS to lib.mk,
so the RPATH is provided for all selftest binaries, which is arguably
a useful default.


Can you elaborate on the erros you would see if this isn't fixed? I understand
that check-rpaths tool - howebver I would like to know how it manifests and


One would be unable to execute the test binaries that require additional
locally built dynamic libraries outside the directories in which they reside:

 [build@builder selftests]$ alsa/mixer-test
 alsa/mixer-test: error while loading shared libraries: libatest.so: cannot 
open shared object file: No such file or directory


how would you reproduce this problem while running selftests?


This usually doesn't come up in a regular selftests usage so far, as they
are usually run via make, and make descends into specific test directories
to execute make the respective make targets there, triggering the execution
of the specific test bineries.



Right. selftests are run usually via make and when they are installed run 
through
a script which descends into specific test directories where the tests are 
installed.

Unless we see the problem using kselftest use-case, there is no reason the make 
changes.
Sorry I am not going be taking these patches.

thanks,
-- Shuah

Re: [PATCH net-next v2] net: netconsole: selftests: Create a new netconsole selftest

2024-08-14 Thread Matthieu Baerts

Hi Petr, Breno,

On 14/08/2024 12:24, Petr Machata wrote:
> 
> Breno Leitao  writes:
> 
>> Adds a selftest that creates two virtual interfaces, assigns one to a
>> new namespace, and assigns IP addresses to both.
>>
>> It listens on the destination interface using socat and configures a
>> dynamic target on netconsole, pointing to the destination IP address.
>>
>> The test then checks if the message was received properly on the
>> destination interface.

(...)

>> diff --git a/tools/testing/selftests/drivers/net/netcons_basic.sh 
>> b/tools/testing/selftests/drivers/net/netcons_basic.sh
>> new file mode 100755
>> index ..e0e58fc7e89f
>> --- /dev/null
>> +++ b/tools/testing/selftests/drivers/net/netcons_basic.sh
>> @@ -0,0 +1,223 @@

(...)

> +NAMESPACE="netconsns_dst"

(...)

>> +function set_network() {
>> +# This is coming from lib.sh. And it does unbound variable access
>> +set +u
>> +setup_ns "${NAMESPACE}"
>> +set -u
> 
> It would make sense to fix lib.sh. I think this is what is needed?
> 
> modified   tools/testing/selftests/net/lib.sh
> @@ -178,7 +178,7 @@ setup_ns()
>   fi
>  
>   # Some test may setup/remove same netns multi times
> - if [ -z "${!ns_name}" ]; then
> + if ! declare -p "$ns_name" &> /dev/null; then
>   eval "${ns_name}=${ns_name,,}-$(mktemp -u XX)"
>   else
>   cleanup_ns "${!ns_name}"
> 
> CC'd Geliang Tang , Hangbin Liu ,
> Matthieu Baerts (NGI0)  who were in the vicinity
> in the past.
Thank you for having CCed me.

I don't know if lib.sh needs to be modified: setup_ns() is supposed to
be called with the name of an existing variable. Can you not define this
variable before?

I mean: the modification from Petr looks good to me to support 'set -u',
but it sounds safer to define the variable before in the script, just in
case it is defined by in the environment, before starting the test, and
not taking the expected path.

Note that in all the other selftests, setup_ns() is called with the name
of the variable, not a variable like you did, e.g.

  NAMESPACE=
  setup_ns NAMESPACE

instead of:

  NAMESPACE="netconsns_dst"
  setup_ns "${NAMESPACE}"
  NAMESPACE=${NS_LIST[0]}

Maybe better to do like the others?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

Re: [PATCH v3 3/3] selftests/tracing: Add hist poll() support test

2024-08-14 Thread Shuah Khan


On 8/13/24 18:11, Masami Hiramatsu (Google) wrote:

From: Masami Hiramatsu (Google) 

Add a testcase for poll() on hist file. This introduces a helper binary
to the ftracetest, because there is no good way to reliably execute
poll() on hist file.

Signed-off-by: Masami Hiramatsu (Google) 
---
  Changes in v2:
   - Update poll command to support both of POLLIN and POLLPRI, and timeout.
   - Identify unsupported stable kernel if poll-in returns soon.
   - Test both of POLLIN and POLLPRI.
---
  tools/testing/selftests/ftrace/Makefile|2 +
  tools/testing/selftests/ftrace/poll.c  |   62 +
  .../ftrace/test.d/trigger/trigger-hist-poll.tc |   74 
  3 files changed, 138 insertions(+)
  create mode 100644 tools/testing/selftests/ftrace/poll.c
  create mode 100644 
tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc

diff --git a/tools/testing/selftests/ftrace/Makefile 
b/tools/testing/selftests/ftrace/Makefile
index a1e955d2de4c..49d96bb16355 100644
--- a/tools/testing/selftests/ftrace/Makefile
+++ b/tools/testing/selftests/ftrace/Makefile
@@ -6,4 +6,6 @@ TEST_PROGS := ftracetest-ktap
  TEST_FILES := test.d settings
  EXTRA_CLEAN := $(OUTPUT)/logs/*
  
+TEST_GEN_PROGS = poll

+
  include ../lib.mk
diff --git a/tools/testing/selftests/ftrace/poll.c 
b/tools/testing/selftests/ftrace/poll.c
new file mode 100644
index ..8003a59fe042
--- /dev/null
+++ b/tools/testing/selftests/ftrace/poll.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple poll on a file.
+ *
+ * Copyright (c) 2024 Google LLC.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define BUFSIZE 4096
+
+/*
+ * Usage:
+ *  poll   [timeout]
+ */
+int main(int argc, char *argv[])
+{
+   struct pollfd pfd;
+   char buf[BUFSIZE];
+   int timeout = -1;
+   int ret;
+
+   if (argc < 3)
+   return -1;
+
+   if (!strcmp(argv[1], "in"))
+   pfd.events = POLLIN;
+   else if (!strcmp(argv[1], "pri"))
+   pfd.events = POLLPRI;
+
+   pfd.fd = open(argv[2], O_RDONLY);
+   if (pfd.fd < 0) {
+   perror("open");
+   return -1;
+   }
+
+   if (argc == 4)
+   timeout = atoi(argv[3]);


This code can be simpler and more maintainable using getopt.
Any reason why you didn't use it?


+
+   /* Reset poll by read if POLLIN is specified. */
+   if (pfd.events & POLLIN)
+   do {} while (read(pfd.fd, buf, BUFSIZE) == BUFSIZE);
+
+   ret = poll(&pfd, 1, timeout);
+   if (ret < 0 && errno != EINTR) {
+   perror("poll")> +  return -1;
+   }
+   close(pfd.fd);
+
+   /* If timeout happned, return code is 0 */


Spelling - happened


+   if (ret == 0)
+   return 1;
+
+   return 0;
+}
diff --git a/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc 
b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
new file mode 100644
index ..53bea74e2234
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
@@ -0,0 +1,74 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: event trigger - test poll wait on histogram
+# requires: set_event events/sched/sched_process_free/trigger 
events/sched/sched_process_free/hist
+# flags: instance
+
+POLL=${FTRACETEST_ROOT}/poll
+
+if [ ! -x ${POLL} ]; then
+  echo "poll program is not compiled!"
+  exit_unresolved
+fi
+
+EVENT=events/sched/sched_process_free/
+
+# Check poll ops is supported. Before implementing poll on hist file, it
+# returns soon with POLLIN | POLLOUT, but not POLLPRI.
+
+# This must wait >1 sec and return 1 (timeout).
+set +e
+${POLL} in ${EVENT}/hist 1000
+ret=$?
+set -e
+if [ ${ret} != 1 ]; then
+  echo "poll on hist file is not supported"
+  exit_unsupported
+fi
+
+# Test POLLIN
+echo > trace
+echo "hist:key=comm" > ${EVENT}/trigger
+echo 1 > ${EVENT}/enable
+
+# This sleep command will exit after 2 seconds.
+sleep 2 &
+BGPID=$!
+# if timeout happens, poll returns 1.
+${POLL} in ${EVENT}/hist 4000
+echo 0 > tracing_on
+
+if [ -d /proc/${BGPID} ]; then
+  echo "poll exits too soon"
+  kill -KILL ${BGPID} ||:
+  exit_fail
+fi
+
+if ! grep -qw "sleep" trace; then
+  echo "poll exits before event happens"
+  exit_fail
+fi
+
+# Test POLLPRI
+echo > trace
+echo 1 > tracing_on
+
+# This sleep command will exit after 2 seconds.
+sleep 2 &
+BGPID=$!
+# if timeout happens, poll returns 1.
+${POLL} pri ${EVENT}/hist 4000
+echo 0 > tracing_on
+
+if [ -d /proc/${BGPID} ]; then
+  echo "poll exits too soon"
+  kill -KILL ${BGPID} ||:
+  exit_fail
+fi
+
+if ! grep -qw "sleep" trace; then
+  echo "poll exits before event happens"
+  exit_fail
+fi
+
+exit_pass



thanks,
-- Shuah

Re: [PATCH] selftests/ftrace: Add required dependency for kprobe tests

2024-08-14 Thread Shuah Khan


On 8/13/24 18:53, Masami Hiramatsu (Google) wrote:

Hi Shuah,

Can you pick this? I confirmed this can be applied on v6.11-rc3.



On Thu, 13 Jun 2024 07:12:10 +0900
"Masami Hiramatsu (Google)"  wrote:


From: Masami Hiramatsu (Google) 

kprobe_args_{char,string}.tc are using available_filter_functions file
which is provided by function tracer. Thus if function tracer is disabled,
these tests are failed on recent kernels because tracefs_create_dir is
not raised events by adding a dynamic event.
Add available_filter_functions to requires line.

Fixes: 7c1130ea5cae ("test: ftrace: Fix kprobe test for eventfs")
Signed-off-by: Masami Hiramatsu (Google) 


Applied to linux-kselftest next for Linux 6.12-rc1

thanks,
-- Shuah

Re: [PATCH v2] selftests: fix relative rpath usage

2024-08-14 Thread Eugene Syromiatnikov

On Wed, Aug 14, 2024 at 05:14:08AM -0600, Shuah Khan wrote:
> On 8/13/24 10:33, Eugene Syromiatnikov wrote:
> >On Mon, Aug 12, 2024 at 05:03:45PM -0600, Shuah Khan wrote:
> >>On 8/12/24 10:56, Eugene Syromiatnikov wrote:
> >>>The relative RPATH ("./") supplied to linker options in CFLAGS is resolved
> >>>relative to current working directory and not the executable directory,
> >>>which will lead in incorrect resolution when the test executables are run
> >>>from elsewhere.  Changing it to $ORIGIN makes it resolve relative
> >>>to the directory in which the executables reside, which is supposedly
> >>>the desired behaviour.  This patch also moves these CFLAGS to lib.mk,
> >>>so the RPATH is provided for all selftest binaries, which is arguably
> >>>a useful default.
> >>
> >>Can you elaborate on the erros you would see if this isn't fixed? I 
> >>understand
> >>that check-rpaths tool - howebver I would like to know how it manifests and
> >
> >One would be unable to execute the test binaries that require additional
> >locally built dynamic libraries outside the directories in which they reside:
> >
> > [build@builder selftests]$ alsa/mixer-test
> > alsa/mixer-test: error while loading shared libraries: libatest.so: 
> > cannot open shared object file: No such file or directory
> >
> >>how would you reproduce this problem while running selftests?
> >
> >This usually doesn't come up in a regular selftests usage so far, as they
> >are usually run via make, and make descends into specific test directories
> >to execute make the respective make targets there, triggering the execution
> >of the specific test bineries.
> >
> 
> Right. selftests are run usually via make and when they are installed run 
> through
> a script which descends into specific test directories where the tests are 
> installed.
> 
> Unless we see the problem using kselftest use-case, there is no reason the 
> make changes.

The reason has been outlined in the commit message: relative paths in
RPATH/RUNPATH are incorrect and ought to be fixed.

> Sorry I am not going be taking these patches.

I see, by the same token, kernel maintainers reject any patches that fix
compilation/build warnings, I guess.

> thanks,
> -- Shuah

Re: [PATCH RFT v8 4/9] fork: Add shadow stack support to clone3()

2024-08-14 Thread Mark Brown

On Wed, Aug 14, 2024 at 10:38:54AM +0100, Catalin Marinas wrote:
> On Tue, Aug 13, 2024 at 07:58:26PM +0100, Mark Brown wrote:

> > ISTR the concerns were around someone being clever with vfork() but I
> > don't remember anything super concrete.  In terms of the inconsistency
> > here that was actually another thing that came up - if userspace
> > specifies a stack for clone3() it'll just get used even with CLONE_VFORK
> > so it seemed to make sense to do the same thing for the shadow stack.
> > This was part of the thinking when we were looking at it, if you can
> > specify a regular stack you should be able to specify a shadow stack.

> Yes, I agree. But by this logic, I was wondering why the current clone()
> behaviour does not allocate a shadow stack when a new stack is
> requested with CLONE_VFORK. That's rather theoretical though and we may
> not want to change the ABI.

The default for vfork() is to reuse both the normal and shadow stacks,
clone3() does make it all much more flexible.  All the shadow stack
ABI predates clone3(), even if it ended up getting merged after.

> Anyway, I understood this patch now and the ABI decisions. FWIW:

> Reviewed-by: Catalin Marinas 

Thanks!


signature.asc
Description: PGP signature

Re: [PATCH] kselftest/arm64: signal: fix/refactor SVE vector length enumeration

2024-08-14 Thread Dave Martin

On Tue, Aug 13, 2024 at 01:55:46PM +0100, Andre Przywara wrote:
> On Tue, 13 Aug 2024 12:00:06 +0100
> Mark Brown  wrote:
> 
> Hi broonie,
> 
> > On Mon, Aug 12, 2024 at 03:09:24PM +0100, Andre Przywara wrote:
> > 
> > > + /* Did we find the lowest supported VL? */
> > > + if (use_sme && vq < sve_vq_from_vl(vl))
> > > + break;  
> > 
> > We don't need the use_sme check here, SVE is just architecturally
> > guaranteed to never trip the && case.  Unless you add a warning for
> > broken implementations I'd just skip it.
> 
> Ah, thanks, I wasn't sure about that, and wanted to mimic the existing
> code as close as possible. Will surely just drop it.
> 
> Thanks,
> Andre

Maybe at least worth a comment?

I was looking at this code the other week, and this puzzled me until I
went and looked back at the architecture.

Cheers
---Dave
>

Re: [PATCH net-next v19 06/13] memory-provider: dmabuf devmem memory provider

2024-08-14 Thread Pavel Begunkov


On 8/13/24 22:13, Mina Almasry wrote:

Implement a memory provider that allocates dmabuf devmem in the form of
net_iov.

The provider receives a reference to the struct netdev_dmabuf_binding
via the pool->mp_priv pointer. The driver needs to set this pointer for
the provider in the net_iov.

The provider obtains a reference on the netdev_dmabuf_binding which
guarantees the binding and the underlying mapping remains alive until
the provider is destroyed.

Usage of PP_FLAG_DMA_MAP is required for this memory provide such that
the page_pool can provide the driver with the dma-addrs of the devmem.

Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order !=
0.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Kaiyuan Zhang 
Signed-off-by: Mina Almasry 
Reviewed-by: Pavel Begunkov 

---

v19:
- Add PP_FLAG_ALLOW_UNREADABLE_NETMEM flag. It serves 2 purposes, (a)
   it guards drivers that don't support unreadable netmem (net_iov
   backed) from accidentally getting exposed to it, and (b) drivers that
   wish to create header pools can unset it for that pool to force
   readable netmem.
- Add page_pool_check_memory_provider, which verifies that the driver
   has created a page_pool with the expected configuration. This is used
   to report to the user if the mp configuration succeeded, and also
   verify that the driver is doing the right thing.
- Don't reset niov->dma_addr on allocation/free.

v17:
- Use ASSERT_RTNL (Jakub)

v16:
- Add DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked()), to catch cases if
   page_pool_init without rtnl_locking when the queue is provided. In
   this case, the queue configuration may be changed while we're initing
   the page_pool, which could be a race.

v13:
- Return on warning (Pavel).
- Fixed pool->recycle_stats not being freed on error (Pavel).
- Applied reviewed-by from Pavel.

v11:
- Rebase to not use the ops. (Christoph)

v8:
- Use skb_frag_size instead of frag->bv_len to fix patch-by-patch build
   error

v6:
- refactor new memory provider functions into net/core/devmem.c (Pavel)

v2:
- Disable devmem for p.order != 0

v1:
- static_branch check in page_is_page_pool_iov() (Willem & Paolo).
- PP_DEVMEM -> PP_IOV (David).
- Require PP_FLAG_DMA_MAP (Jakub).


...

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 301f4250ca82..2f2a7f4dee4c 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -17,6 +17,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #include "page_pool_priv.h"

@@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device 
*dev, u32 rxq_idx,
if (err)
goto err_xa_erase;
  
+	err = page_pool_check_memory_provider(dev, rxq, binding);


Frankly, I pretty much don't like it.

1. We do it after reconfiguring the queue just to fail and reconfigure
it again.

2. It should be a part of the common path like netdev_rx_queue_restart(),
not specific to devmem TCP.

These two can be fixed by moving the check into
netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
that the callback where we init page pools.

3. That implicit check gives me bad feeling, instead of just getting
direct feedback from the driver, either it's a flag or an error
returned, we have to try to figure what exactly the driver did, with
a high chance this inference will fail us at some point.

And page_pool_check_memory_provider() is not that straightforward,
it doesn't walk through pools of a queue. Not looking too deep,
but it seems like the nested loop can be moved out with the same
effect, so it first looks for a pool in the device and the follows
with the bound_rxqs. And seems the bound_rxqs check would always turn
true, you set the binding into the map in
net_devmem_bind_dmabuf_to_queue() before the restart and it'll be there
after restart for page_pool_check_memory_provider(). Maybe I missed
something, but it's not super clear.

4. And the last thing Jakub mentioned is that we need to be prepared
to expose a flag to the userspace for whether a queue supports
netiov. Not really doable in a sane manner with such implicit
post configuration checks.

And that brings us back to the first approach I mentioned, where
we have a flag in the queue structure, drivers set it, and
netdev_rx_queue_restart() checks it before any callback. That's
where the thread with Jakub stopped, and it reads like at least
he's not against the idea.



+   if (err)
+   goto err_xa_erase;
+
return 0;
  
  err_xa_erase:

@@ -305,4 +310,69 @@ void dev_dmabuf_uninstall(struct net_device *dev)
xa_erase(&binding->bound_rxqs, xa_idx);
}
  }
+

...

diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 3a3277ba167b..cbc54ee4f670 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -344,6 +344,32 @@ void page_pool_unlist(struct page_pool *pool)
mutex_unlock(&page_pools_lock);
  }
  
+int page_pool_check_memory_provider(struct net_device *d

Re: [PATCH net 1/2] selftests: udpgro: report error when receive failed

2024-08-14 Thread Hangbin Liu

On Wed, Aug 14, 2024 at 03:57:57PM +0800, Hangbin Liu wrote:
> ---
>  tools/testing/selftests/net/udpgro.sh | 41 ---
>  1 file changed, 24 insertions(+), 17 deletions(-)
> 
> diff --git a/tools/testing/selftests/net/udpgro.sh 
> b/tools/testing/selftests/net/udpgro.sh
> index 11a1ebda564f..7e0164247b83 100755
> --- a/tools/testing/selftests/net/udpgro.sh
> +++ b/tools/testing/selftests/net/udpgro.sh
> @@ -49,14 +49,15 @@ run_one() {
>  
>   cfg_veth
>  
> - ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} 
> && \
> - echo "ok" || \
> - echo "failed" &
> + ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} &
> + local PID1=$!
>  
>   wait_local_port_listen ${PEER_NS} 8000 udp
>   ./udpgso_bench_tx ${tx_args}
> - ret=$?
> - wait $(jobs -p)
> + check_err $?
> + wait ${PID1}
> + check_err $?
> + [ "$ret" -eq 0 ] && echo "ok" || echo "failed"
>   return $ret
>  }

Self NACK. The ret need to define first in each function, or the check_err
will failed... I forgot to the update the patch before post..

Thanks
Hangbin

Re: [PATCH net 1/2] selftests: udpgro: report error when receive failed

2024-08-14 Thread Hangbin Liu

On Wed, Aug 14, 2024 at 12:19:22PM +0200, Paolo Abeni wrote:
> > --- a/tools/testing/selftests/net/udpgro.sh
> > +++ b/tools/testing/selftests/net/udpgro.sh
> > @@ -49,14 +49,15 @@ run_one() {
> > cfg_veth
> > -   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} 
> > && \
> > -   echo "ok" || \
> > -   echo "failed" &
> > +   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -C 1000 -R 10 ${rx_args} &
> > +   local PID1=$!
> > wait_local_port_listen ${PEER_NS} 8000 udp
> > ./udpgso_bench_tx ${tx_args}
> > -   ret=$?
> > -   wait $(jobs -p)
> > +   check_err $?
> > +   wait ${PID1}
> > +   check_err $?
> > +   [ "$ret" -eq 0 ] && echo "ok" || echo "failed"
> 
> I think that with the above, in case of a failure, every test after the
> failing one will should fail, regardless of the actual results, am I
> correct?

No, only the failed test echo "failed". The passed tests still
report "ok". The "check_err $?" in run_all function only record none 0
ret as return value.

Thanks
Hangbin

Re: [PATCH v1 0/2] mremap refactor: check src address for vma boundaries first.

2024-08-14 Thread Liam R. Howlett

* jef...@chromium.org  [240814 03:14]:
> From: Jeff Xu 
> 
> mremap doesn't allow relocate, expand, shrink across VMA boundaries,
> refactor the code to check src address range before doing anything on
> the destination, i.e. destination won't be unmapped, if src address
> failed the boundaries check.
> 
> This also allows us to remove can_modify_mm from mremap.c, since
> the src address must be single VMA, can_modify_vma is used.

I don't think sending out a separate patch to address the same thing as
the patch you said you were testing [1] is the correct approach.  You
had already sent suggestions on mremap changes - why send this patch set
instead of making another suggestion?

Maybe send your selftest to be included with the initial patch set would
work?  Does this test pass with the other patch set?

[1] 
https://lore.kernel.org/all/calmywfs0v07z5vhedt1h3hd+3--yr6va0zuqeaato+-8mur...@mail.mail.com/

Re: [PATCH net v2] selftest: af_unix: Fix kselftest compilation warnings

2024-08-14 Thread Jakub Kicinski

On Wed, 14 Aug 2024 13:40:54 +0530 Abhinav Jain wrote:
> Also, @Jakub, please kindly check this and revert (another patch on which you
> have helped a lot already, need one small input and I can send the next 
> version):
> https://lore.kernel.org/all/20240810175509.404094-1-jain.abhinav...@gmail.com/

I read the questions and I don't have an immediate answer.
Please do your best, and we'll review the next version on the list.
Unfortunately there isn't enough hours in the day to help everyone 
I'd like to help.

Re: [PATCH v10 23/40] arm64/signal: Set up and restore the GCS context for signal handlers

2024-08-14 Thread Dave Martin

On Thu, Aug 01, 2024 at 01:06:50PM +0100, Mark Brown wrote:
> When invoking a signal handler we use the GCS configuration and stack
> for the current thread.
> 
> Since we implement signal return by calling the signal handler with a
> return address set up pointing to a trampoline in the vDSO we need to
> also configure any active GCS for this by pushing a frame for the
> trampoline onto the GCS.  If we do not do this then signal return will
> generate a GCS protection fault.
> 
> In order to guard against attempts to bypass GCS protections via signal
> return we only allow returning with GCSPR_EL0 pointing to an address
> where it was previously preempted by a signal.  We do this by pushing a
> cap onto the GCS, this takes the form of an architectural GCS cap token
> with the top bit set and token type of 0 which we add on signal entry
> and validate and pop off on signal return.  The combination of the top
> bit being set and the token type mean that this can't be interpreted as
> a valid token or address.
> 
> Reviewed-by: Thiago Jung Bauermann 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/include/asm/gcs.h |   1 +
>  arch/arm64/kernel/signal.c   | 134 
> +--
>  arch/arm64/mm/gcs.c  |   1 +
>  3 files changed, 131 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c

[...]

> +#ifdef CONFIG_ARM64_GCS
> +
> +static int gcs_signal_entry(__sigrestore_t sigtramp, struct ksignal *ksig)
> +{
> + unsigned long __user *gcspr_el0;
> + int ret = 0;
> +
> + if (!system_supports_gcs())
> + return 0;
> +
> + if (!task_gcs_el0_enabled(current))
> + return 0;
> +
> + /*
> +  * We are entering a signal handler, current register state is
> +  * active.
> +  */
> + gcspr_el0 = (unsigned long __user *)read_sysreg_s(SYS_GCSPR_EL0);
> +
> + /*
> +  * Push a cap and the GCS entry for the trampoline onto the GCS.
> +  */
> + put_user_gcs((unsigned long)sigtramp, gcspr_el0 - 2, &ret);
> + put_user_gcs(GCS_SIGNAL_CAP(gcspr_el0 - 1), gcspr_el0 - 1, &ret);
> + if (ret != 0)
> + return ret;

What happens if we went wrong here, or if the signal we are delivering
was caused by a GCS overrun or bad GCSPR_EL0 in the first place?

It feels like a program has no way to rescue itself from excessive
recursion in some thread.  Is there something equivalent to
sigaltstack()?

Or is the shadow stack always supposed to be big enough to cope with
recursion that exhausts the main stack and alternate signal stack (and
if so, how is this ensured)?

> +
> + gcsb_dsync();
> +
> + gcspr_el0 -= 2;
> + write_sysreg_s((unsigned long)gcspr_el0, SYS_GCSPR_EL0);
> +
> + return 0;
> +}

[...]

Cheers
---Dave

Re: [PATCH net-next v19 06/13] memory-provider: dmabuf devmem memory provider

2024-08-14 Thread Mina Almasry

On Wed, Aug 14, 2024 at 10:11 AM Pavel Begunkov  wrote:
...
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 301f4250ca82..2f2a7f4dee4c 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -17,6 +17,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >
> >   #include "page_pool_priv.h"
> > @@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device 
> > *dev, u32 rxq_idx,
> >   if (err)
> >   goto err_xa_erase;
> >
> > + err = page_pool_check_memory_provider(dev, rxq, binding);
>
> Frankly, I pretty much don't like it.
>
> 1. We do it after reconfiguring the queue just to fail and reconfigure
> it again.
>

I don't see an issue with that? Or is it just me?

> 2. It should be a part of the common path like netdev_rx_queue_restart(),
> not specific to devmem TCP.
>
> These two can be fixed by moving the check into
> netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
> that the callback where we init page pools.
>

The only reason is that the page_pool_check_memory_provider() needs to
know the memory provider to check for. Separating them keep
netdev_rx_queue_restart() usable for other future use cases that don't
expect a memory provider to be bound, but you are correct in that this
can be easily resolved by passing the binding to
netdev_rx_queue_restart() and doing the
page_pool_check_memory_providers() check inside of that function.

> 3. That implicit check gives me bad feeling, instead of just getting
> direct feedback from the driver, either it's a flag or an error
> returned, we have to try to figure what exactly the driver did, with
> a high chance this inference will fail us at some point.
>

This is where I get a bit confused. Jakub did mention that it is
desirable for core to verify that the driver did the right thing,
instead of trusting that a driver did the right thing without
verifying. Relying on a flag from the driver opens the door for the
driver to say "I support this" but actually not create the mp
page_pool. In my mind the explicit check is superior to getting
feedback from the driver.

Additionally this approach lets us detect support in core using 10
lines of code or so, rather than ask every driver that wants to
support mp to add boilerplate code to declare support (and run into
subtle bugs when this boilerplate is missing). There are minor pros
and cons to each approach; I don't see a showstopping reason to go
with one over the other.

> And page_pool_check_memory_provider() is not that straightforward,
> it doesn't walk through pools of a queue.

Right, we don't save the pp of a queue, only a netdev. The outer loop
checks all the pps of the netdev to find one with the correct binding,
and the inner loop checks that this binding is attached to the correct
queue.

> Not looking too deep,
> but it seems like the nested loop can be moved out with the same
> effect, so it first looks for a pool in the device and the follows
> with the bound_rxqs. And seems the bound_rxqs check would always turn
> true, you set the binding into the map in
> net_devmem_bind_dmabuf_to_queue() before the restart and it'll be there
> after restart for page_pool_check_memory_provider(). Maybe I missed
> something, but it's not super clear.
>
> 4. And the last thing Jakub mentioned is that we need to be prepared
> to expose a flag to the userspace for whether a queue supports
> netiov. Not really doable in a sane manner with such implicit
> post configuration checks.
>

I don't see a very strong reason to expose the flag to the userspace
now. userspace can try to bind dmabuf and get an EOPNOTSUPP if the
operation is not supported, right? In the future if passing the flag
to userspace becomes needed for some usecase, we do need feedback from
the driver, and it would be trivial to add similarly to what you
suggested.

> And that brings us back to the first approach I mentioned, where
> we have a flag in the queue structure, drivers set it, and
> netdev_rx_queue_restart() checks it before any callback. That's
> where the thread with Jakub stopped, and it reads like at least
> he's not against the idea.

Hmm, the netdev_rx_queue array is created in core, not by the driver,
does the driver set this flag during initialization? We could run into
subtle bugs with races if a code path checks for support after core
has allocated the netdev_rx_queue array but before the driver has had
a chance to declare support, right? Maybe a minor issue. Instead we
could add an ndo to the queue API that lets the driver tell us that it
could support binding on a given rx queue, and check that in
net_devmem_bind_dmabuf_to_queue() right before we do the bind?

But this is only if declaring support to userspace becomes needed for
some use case. At the moment I'm under the impression that verifying
in core that the driver did the right thing is preferred, and I'd like
to minimize the boilerplate the driver needs to implement

Re: [PATCH v10 24/40] arm64/signal: Expose GCS state in signal frames

2024-08-14 Thread Dave Martin

On Thu, Aug 01, 2024 at 01:06:51PM +0100, Mark Brown wrote:
> Add a context for the GCS state and include it in the signal context when
> running on a system that supports GCS. We reuse the same flags that the
> prctl() uses to specify which GCS features are enabled and also provide the
> current GCS pointer.
> 
> We do not support enabling GCS via signal return, there is a conflict
> between specifying GCSPR_EL0 and allocation of a new GCS and this is not
> an ancticipated use case.  We also enforce GCS configuration locking on
> signal return.
> 
> Reviewed-by: Thiago Jung Bauermann 
> Signed-off-by: Mark Brown 
> ---
>  arch/arm64/include/uapi/asm/sigcontext.h |   9 +++
>  arch/arm64/kernel/signal.c   | 106 
> +++
>  2 files changed, 115 insertions(+)

[...]

> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c

[...]

> @@ -999,6 +1092,13 @@ static int setup_sigframe_layout(struct 
> rt_sigframe_user_layout *user,
>   return err;
>   }
>  
> + if (add_all || task_gcs_el0_enabled(current)) {
> + err = sigframe_alloc(user, &user->gcs_offset,
> +  sizeof(struct gcs_context));
> + if (err)
> + return err;
> + }
> +

Who turns on GCS?  I have a concern that if libc is new enough to be
built for GCS then the libc startup code will to turn it on, even if
the binary stack running on top of libc is old.

Whether a given library should break old binaries is a bit of a grey
area, but I think that libraries that deliberately export stable ABIs
probably shouldn't.

With that in mind, does any GCS state need to be saved at all?

Is there any scenario where it is legitimate for the signal handler to
change the shadow stack mode or to return with an altered GCSPR_EL0?

Is the guarded stack considered necessary (or at least beneficial) for
backtracing, or is the regular stack sufficient?

(I'm assuming that unwind tables / debug info should allow the shadow
stack to be unwound anyway; rather this question is about whether
software can straightforwardly find out the interrupted GCSPR_EL0
without this information... and whether it needs to.)

>   if (system_supports_sve() || system_supports_sme()) {
>   unsigned int vq = 0;
>  
> @@ -1099,6 +1199,12 @@ static int setup_sigframe(struct 
> rt_sigframe_user_layout *user,
>   __put_user_error(current->thread.fault_code, &esr_ctx->esr, 
> err);
>   }
>  
> + if (system_supports_gcs() && err == 0 && user->gcs_offset) {
> + struct gcs_context __user *gcs_ctx =
> + apply_user_offset(user, user->gcs_offset);
> + err |= preserve_gcs_context(gcs_ctx);
> + }
> +

[...]

Cheers
---Dave

Re: [PATCH net-next v13 02/14] mm: move the page fragment allocator from page_alloc into its own file

2024-08-14 Thread Alexander H Duyck

On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
> Inspired by [1], move the page fragment allocator from page_alloc
> into its own c file and header file, as we are about to make more
> change for it to replace another page_frag implementation in
> sock.c
> 
> As this patchset is going to replace 'struct page_frag' with
> 'struct page_frag_cache' in sched.h, including page_frag_cache.h
> in sched.h has a compiler error caused by interdependence between
> mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler
> error by moving 'struct page_frag_cache' to mm_types_task.h as
> suggested by Alexander, see [3].
> 
> 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowe...@redhat.com/
> 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956...@gmail.com/
> 3. 
> https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=ym_aia4oomd-2wxyko42bizawmt_...@mail.gmail.com/
> CC: David Howells 
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> ---
>  include/linux/gfp.h   |  22 ---
>  include/linux/mm_types.h  |  18 ---
>  include/linux/mm_types_task.h |  18 +++
>  include/linux/page_frag_cache.h   |  31 
>  include/linux/skbuff.h|   1 +
>  mm/Makefile   |   1 +
>  mm/page_alloc.c   | 136 
>  mm/page_frag_cache.c  | 145 ++
>  .../selftests/mm/page_frag/page_frag_test.c   |   2 +-
>  9 files changed, 197 insertions(+), 177 deletions(-)
>  create mode 100644 include/linux/page_frag_cache.h
>  create mode 100644 mm/page_frag_cache.c
> 
> 

...

> diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
> new file mode 100644
> index ..a758cb65a9b3
> --- /dev/null
> +++ b/include/linux/page_frag_cache.h
> @@ -0,0 +1,31 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_PAGE_FRAG_CACHE_H
> +#define _LINUX_PAGE_FRAG_CACHE_H
> +
> +#include 
> +#include 
> +#include 
> +

Minor nit. These should usually be in alphabetical order. So
mm_types_task.h should be between log2.h and types.h.

Re: [PATCH] selftests/ftrace: Add required dependency for kprobe tests

2024-08-14 Thread Google

On Wed, 14 Aug 2024 05:43:29 -0600
Shuah Khan  wrote:

> On 8/13/24 18:53, Masami Hiramatsu (Google) wrote:
> > Hi Shuah,
> > 
> > Can you pick this? I confirmed this can be applied on v6.11-rc3.
> > 
> > 
> > 
> > On Thu, 13 Jun 2024 07:12:10 +0900
> > "Masami Hiramatsu (Google)"  wrote:
> > 
> >> From: Masami Hiramatsu (Google) 
> >>
> >> kprobe_args_{char,string}.tc are using available_filter_functions file
> >> which is provided by function tracer. Thus if function tracer is disabled,
> >> these tests are failed on recent kernels because tracefs_create_dir is
> >> not raised events by adding a dynamic event.
> >> Add available_filter_functions to requires line.
> >>
> >> Fixes: 7c1130ea5cae ("test: ftrace: Fix kprobe test for eventfs")
> >> Signed-off-by: Masami Hiramatsu (Google) 
> 
> Applied to linux-kselftest next for Linux 6.12-rc1

Thanks!

> 
> thanks,
> -- Shuah


-- 
Masami Hiramatsu (Google)

Re: [PATCH v3 3/3] selftests/tracing: Add hist poll() support test

2024-08-14 Thread Google

On Wed, 14 Aug 2024 05:38:24 -0600
Shuah Khan  wrote:

> On 8/13/24 18:11, Masami Hiramatsu (Google) wrote:
> > From: Masami Hiramatsu (Google) 
> > 
> > Add a testcase for poll() on hist file. This introduces a helper binary
> > to the ftracetest, because there is no good way to reliably execute
> > poll() on hist file.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >   Changes in v2:
> >- Update poll command to support both of POLLIN and POLLPRI, and timeout.
> >- Identify unsupported stable kernel if poll-in returns soon.
> >- Test both of POLLIN and POLLPRI.
> > ---
> >   tools/testing/selftests/ftrace/Makefile|2 +
> >   tools/testing/selftests/ftrace/poll.c  |   62 
> > +
> >   .../ftrace/test.d/trigger/trigger-hist-poll.tc |   74 
> > 
> >   3 files changed, 138 insertions(+)
> >   create mode 100644 tools/testing/selftests/ftrace/poll.c
> >   create mode 100644 
> > tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
> > 
> > diff --git a/tools/testing/selftests/ftrace/Makefile 
> > b/tools/testing/selftests/ftrace/Makefile
> > index a1e955d2de4c..49d96bb16355 100644
> > --- a/tools/testing/selftests/ftrace/Makefile
> > +++ b/tools/testing/selftests/ftrace/Makefile
> > @@ -6,4 +6,6 @@ TEST_PROGS := ftracetest-ktap
> >   TEST_FILES := test.d settings
> >   EXTRA_CLEAN := $(OUTPUT)/logs/*
> >   
> > +TEST_GEN_PROGS = poll
> > +
> >   include ../lib.mk
> > diff --git a/tools/testing/selftests/ftrace/poll.c 
> > b/tools/testing/selftests/ftrace/poll.c
> > new file mode 100644
> > index ..8003a59fe042
> > --- /dev/null
> > +++ b/tools/testing/selftests/ftrace/poll.c
> > @@ -0,0 +1,62 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Simple poll on a file.
> > + *
> > + * Copyright (c) 2024 Google LLC.
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define BUFSIZE 4096
> > +
> > +/*
> > + * Usage:
> > + *  poll   [timeout]
> > + */
> > +int main(int argc, char *argv[])
> > +{
> > +   struct pollfd pfd;
> > +   char buf[BUFSIZE];
> > +   int timeout = -1;
> > +   int ret;
> > +
> > +   if (argc < 3)
> > +   return -1;
> > +
> > +   if (!strcmp(argv[1], "in"))
> > +   pfd.events = POLLIN;
> > +   else if (!strcmp(argv[1], "pri"))
> > +   pfd.events = POLLPRI;
> > +
> > +   pfd.fd = open(argv[2], O_RDONLY);
> > +   if (pfd.fd < 0) {
> > +   perror("open");
> > +   return -1;
> > +   }
> > +
> > +   if (argc == 4)
> > +   timeout = atoi(argv[3]);
> 
> This code can be simpler and more maintainable using getopt.
> Any reason why you didn't use it?

There is no reason. OK, let me use getopt to clean it up.

> 
> > +
> > +   /* Reset poll by read if POLLIN is specified. */
> > +   if (pfd.events & POLLIN)
> > +   do {} while (read(pfd.fd, buf, BUFSIZE) == BUFSIZE);
> > +
> > +   ret = poll(&pfd, 1, timeout);
> > +   if (ret < 0 && errno != EINTR) {
> > +   perror("poll")> +   return -1;
> > +   }
> > +   close(pfd.fd);
> > +
> > +   /* If timeout happned, return code is 0 */
> 
> Spelling - happened

Oops, thanks!

> 
> > +   if (ret == 0)
> > +   return 1;
> > +
> > +   return 0;
> > +}
> > diff --git 
> > a/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc 
> > b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
> > new file mode 100644
> > index ..53bea74e2234
> > --- /dev/null
> > +++ b/tools/testing/selftests/ftrace/test.d/trigger/trigger-hist-poll.tc
> > @@ -0,0 +1,74 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0
> > +# description: event trigger - test poll wait on histogram
> > +# requires: set_event events/sched/sched_process_free/trigger 
> > events/sched/sched_process_free/hist
> > +# flags: instance
> > +
> > +POLL=${FTRACETEST_ROOT}/poll
> > +
> > +if [ ! -x ${POLL} ]; then
> > +  echo "poll program is not compiled!"
> > +  exit_unresolved
> > +fi
> > +
> > +EVENT=events/sched/sched_process_free/
> > +
> > +# Check poll ops is supported. Before implementing poll on hist file, it
> > +# returns soon with POLLIN | POLLOUT, but not POLLPRI.
> > +
> > +# This must wait >1 sec and return 1 (timeout).
> > +set +e
> > +${POLL} in ${EVENT}/hist 1000
> > +ret=$?
> > +set -e
> > +if [ ${ret} != 1 ]; then
> > +  echo "poll on hist file is not supported"
> > +  exit_unsupported
> > +fi
> > +
> > +# Test POLLIN
> > +echo > trace
> > +echo "hist:key=comm" > ${EVENT}/trigger
> > +echo 1 > ${EVENT}/enable
> > +
> > +# This sleep command will exit after 2 seconds.
> > +sleep 2 &
> > +BGPID=$!
> > +# if timeout happens, poll returns 1.
> > +${POLL} in ${EVENT}/hist 4000
> > +echo 0 > tracing_on
> > +
> > +if [ -d /proc/${BGPID} ]; then
> > +  echo "poll exits too soon"
> > +  kill -KILL ${BGPID} ||:
> > +  exit_fail
> > +fi
> > +
> > +if ! grep -qw "sleep" trace; the

Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-08-14 Thread Alexander H Duyck

On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
> Currently the page_frag API is returning 'virtual address'
> or 'va' when allocing and expecting 'virtual address' or
> 'va' as input when freeing.
> 
> As we are about to support new use cases that the caller
> need to deal with 'struct page' or need to deal with both
> 'va' and 'struct page'. In order to differentiate the API
> handling between 'va' and 'struct page', add '_va' suffix
> to the corresponding API mirroring the page_pool_alloc_va()
> API of the page_pool. So that callers expecting to deal with
> va, page or both va and page may call page_frag_alloc_va*,
> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
> 
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> Reviewed-by: Subbaraya Sundeep 
> Acked-by: Chuck Lever 
> Acked-by: Sagi Grimberg 
> ---
>  drivers/net/ethernet/google/gve/gve_rx.c  |  4 ++--
>  drivers/net/ethernet/intel/ice/ice_txrx.c |  2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +-
>  drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  2 +-
>  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  4 ++--
>  .../marvell/octeontx2/nic/otx2_common.c   |  2 +-
>  drivers/net/ethernet/mediatek/mtk_wed_wo.c|  4 ++--
>  drivers/nvme/host/tcp.c   |  8 +++
>  drivers/nvme/target/tcp.c | 22 +--
>  drivers/vhost/net.c   |  6 ++---
>  include/linux/page_frag_cache.h   | 21 +-
>  include/linux/skbuff.h|  2 +-
>  kernel/bpf/cpumap.c   |  2 +-
>  mm/page_frag_cache.c  | 12 +-
>  net/core/skbuff.c | 16 +++---
>  net/core/xdp.c|  2 +-
>  net/rxrpc/txbuf.c | 15 +++--
>  net/sunrpc/svcsock.c  |  6 ++---
>  .../selftests/mm/page_frag/page_frag_test.c   | 13 ++-
>  19 files changed, 75 insertions(+), 70 deletions(-)
> 

I still say no to this patch. It is an unnecessary name change and adds
no value. If you insist on this patch I will reject the set every time.

The fact is it is polluting the git history and just makes things
harder to maintain without adding any value as you aren't changing what
the function does and there is no need for this. In addition it just
makes it that much harder to backport fixes in the future as people
will have to work around the rename.

Re: [PATCH v10 23/40] arm64/signal: Set up and restore the GCS context for signal handlers

2024-08-14 Thread Mark Brown

On Wed, Aug 14, 2024 at 03:51:42PM +0100, Dave Martin wrote:
> On Thu, Aug 01, 2024 at 01:06:50PM +0100, Mark Brown wrote:

> > +   put_user_gcs((unsigned long)sigtramp, gcspr_el0 - 2, &ret);
> > +   put_user_gcs(GCS_SIGNAL_CAP(gcspr_el0 - 1), gcspr_el0 - 1, &ret);
> > +   if (ret != 0)
> > +   return ret;

> What happens if we went wrong here, or if the signal we are delivering
> was caused by a GCS overrun or bad GCSPR_EL0 in the first place?

> It feels like a program has no way to rescue itself from excessive
> recursion in some thread.  Is there something equivalent to
> sigaltstack()?

> Or is the shadow stack always supposed to be big enough to cope with
> recursion that exhausts the main stack and alternate signal stack (and
> if so, how is this ensured)?

There's no sigaltstack() for GCS, this is also the ABI with the existing
shadow stack on x86 and should be addressed in a cross architecture
fashion.  There have been some discussions about providing a shadow alt
stack but they've generally been circular and inconclusive, there were a
bunch of tradeoffs for corner cases and nobody had a clear sense as to
what a good solution should be.  It was a bit unclear that actively
doing anything was worthwhile.  The issues were IIRC around unwinders
and disjoint shadow stacks, compatibility with non-shadow stacks and
behaviour when we overflow the shadow stack.  I think there were also
some applications trying to be very clever with alt stacks that needed
to be interacted with and complicated everything but I could be
misremembering there.

Practically speaking since we're only storing return addresses the
default GCS should be extremely large so it's unlikely to come up
without first encountering and handling issues on the normal stack.
Users allocating their own shadow stacks should be careful.  This isn't
really satisfying but is probably fine in practice, there's certainly
not been any pressure yet from the existing x86 deployments (though at
present nobody can explicitly select their own shadow stack size,
perhaps it'll become more of an issue when the clone3() stuff is in).

signature.asc
Description: PGP signature

Re: [PATCH net-next v2] net: netconsole: selftests: Create a new netconsole selftest

2024-08-14 Thread Breno Leitao

On Wed, Aug 14, 2024 at 12:24:46PM +0200, Petr Machata wrote:
> 
> Breno Leitao  writes:

> > +   fi
> > +
> > +   if ! grep -q "${MSG}" "${TMPFILENAME}"; then
> > +   echo "FAIL: ${MSG} not found in ${TMPFILENAME}" >&2
> > +   cat "${TMPFILENAME}" >&2
> > +   return ${ksft_fail}
> > +   fi
> > +
> > +   # Delete the file once it is validated, otherwise keep it
> > +   # for debugging purposes
> > +   rm "${TMPFILENAME}"
> 
> Seeing the removal within the validation function is odd, I would expect
> it to be part of cleanup().

Thanks for all the other feedbacks, all of them make sense.

Regarding this one, I kept like this, because I only remove the file if
the test succeed, otherwise I keep the file here for debugging purposes,
as described in the comment above.

If that is not a good practice, I am more than happy to move this
to cleanup.

Thanks for the detailed review,
--breno

[PATCH 1/3] selftests/bpf: Support AF_UNIX SOCK_DGRAM socket pair creation

2024-08-14 Thread Michal Luczaj

Handle AF_UNIX in init_addr_loopback(). For pair creation, bind() the peer
socket to make SOCK_DGRAM connect() happy.

Signed-off-by: Michal Luczaj 
---
 .../bpf/prog_tests/sockmap_helpers.h  | 29 +++
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h 
b/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h
index 38e35c72bdaa..c50efa834a11 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h
@@ -1,6 +1,7 @@
 #ifndef __SOCKMAP_HELPERS__
 #define __SOCKMAP_HELPERS__
 
+#include 
 #include 
 
 /* include/linux/net.h */
@@ -283,6 +284,15 @@ static inline void init_addr_loopback6(struct 
sockaddr_storage *ss,
*len = sizeof(*addr6);
 }
 
+static inline void init_addr_loopback_unix(struct sockaddr_storage *ss,
+  socklen_t *len)
+{
+   struct sockaddr_un *addr = memset(ss, 0, sizeof(*ss));
+
+   addr->sun_family = AF_UNIX;
+   *len = sizeof(sa_family_t);
+}
+
 static inline void init_addr_loopback_vsock(struct sockaddr_storage *ss,
socklen_t *len)
 {
@@ -304,6 +314,9 @@ static inline void init_addr_loopback(int family, struct 
sockaddr_storage *ss,
case AF_INET6:
init_addr_loopback6(ss, len);
return;
+   case AF_UNIX:
+   init_addr_loopback_unix(ss, len);
+   return;
case AF_VSOCK:
init_addr_loopback_vsock(ss, len);
return;
@@ -390,21 +403,27 @@ static inline int create_pair(int family, int sotype, int 
*p0, int *p1)
 {
__close_fd int s, c = -1, p = -1;
struct sockaddr_storage addr;
-   socklen_t len = sizeof(addr);
+   socklen_t len;
int err;
 
s = socket_loopback(family, sotype);
if (s < 0)
return s;
 
-   err = xgetsockname(s, sockaddr(&addr), &len);
-   if (err)
-   return err;
-
c = xsocket(family, sotype, 0);
if (c < 0)
return c;
 
+   init_addr_loopback(family, &addr, &len);
+   err = xbind(c, sockaddr(&addr), len);
+   if (err)
+   return err;
+
+   len = sizeof(addr);
+   err = xgetsockname(s, sockaddr(&addr), &len);
+   if (err)
+   return err;
+
err = connect(c, sockaddr(&addr), len);
if (err) {
if (errno != EINPROGRESS) {
-- 
2.46.0

[PATCH 2/3] selftests/bpf: Allow setting BPF_F_INGRESS in prog_msg_verdict()

2024-08-14 Thread Michal Luczaj

Let a selftest set BPF_F_INGRESS for map/hash redirect.

In run_tests(), explicitly reset skel->bss->test_ingress to false. Earlier
tests might have left it flipped.

Signed-off-by: Michal Luczaj 
---
 tools/testing/selftests/bpf/prog_tests/sockmap_listen.c | 2 ++
 tools/testing/selftests/bpf/progs/test_sockmap_listen.c | 6 --
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c 
b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index da5a6fb03b69..a5e7d27760cf 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -1850,6 +1850,8 @@ static void test_udp_unix_redir(struct 
test_sockmap_listen *skel, struct bpf_map
 static void run_tests(struct test_sockmap_listen *skel, struct bpf_map *map,
  int family)
 {
+   skel->bss->test_ingress = false;
+
test_ops(skel, map, family, SOCK_STREAM);
test_ops(skel, map, family, SOCK_DGRAM);
test_redir(skel, map, family, SOCK_STREAM);
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c 
b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
index b7250eb9c30c..5a3504d5dfc3 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
@@ -106,9 +106,11 @@ int prog_msg_verdict(struct sk_msg_md *msg)
int verdict;
 
if (test_sockmap)
-   verdict = bpf_msg_redirect_map(msg, &sock_map, zero, 0);
+   verdict = bpf_msg_redirect_map(msg, &sock_map, zero,
+  test_ingress ? BPF_F_INGRESS : 
0);
else
-   verdict = bpf_msg_redirect_hash(msg, &sock_hash, &zero, 0);
+   verdict = bpf_msg_redirect_hash(msg, &sock_hash, &zero,
+   test_ingress ? BPF_F_INGRESS : 
0);
 
count = bpf_map_lookup_elem(&verdict_map, &verdict);
if (count)
-- 
2.46.0

[PATCH 3/3] selftests/bpf: Add selftest for sockmap/hashmap redirection

2024-08-14 Thread Michal Luczaj

Test redirection logic.

BPF_MAP_TYPE_SOCKMAP
BPF_MAP_TYPE_SOCKHASH
✕
sk_msg-to-egress
sk_msg-to-ingress
sk_skb-to-egress
sk_skb-to-ingress
✕
AF_INET, SOCK_STREAM
AF_INET6, SOCK_STREAM
AF_INET, SOCK_DGRAM
AF_INET6, SOCK_DGRAM
AF_UNIX, SOCK_STREAM
AF_UNIX, SOCK_DGRAM
AF_VSOCK, SOCK_STREAM
AF_VSOCK, SOCK_SEQPACKET

Suggested-by: Jakub Sitnicki 
Signed-off-by: Michal Luczaj 
---
 .../bpf/prog_tests/sockmap_helpers.h  |  58 
 .../selftests/bpf/prog_tests/sockmap_redir.c  | 315 ++
 2 files changed, 373 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_redir.c

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h 
b/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h
index c50efa834a11..db0a7b4dc8be 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_helpers.h
@@ -16,6 +16,9 @@
 #define VMADDR_CID_LOCAL 1
 #endif
 
+#define u32(v) ((u32){(v)})
+#define u64(v) ((u64){(v)})
+
 #define __always_unused__attribute__((__unused__))
 
 /* include/linux/cleanup.h */
@@ -485,4 +488,59 @@ static inline int create_socket_pairs(int family, int 
sotype, int *c0, int *c1,
return err;
 }
 
+static inline const char *socket_kind_to_str(int sock_fd)
+{
+   int domain = 0, type = 0;
+   socklen_t opt_len;
+
+   opt_len = sizeof(domain);
+   if (getsockopt(sock_fd, SOL_SOCKET, SO_DOMAIN, &domain, &opt_len))
+   FAIL_ERRNO("getsockopt(SO_DOMAIN)");
+
+   opt_len = sizeof(type);
+   if (getsockopt(sock_fd, SOL_SOCKET, SO_TYPE, &type, &opt_len))
+   FAIL_ERRNO("getsockopt(SO_TYPE)");
+
+   switch (domain) {
+   case AF_INET:
+   switch (type) {
+   case SOCK_STREAM:
+   return "tcp4";
+   case SOCK_DGRAM:
+   return "udp4";
+   }
+   break;
+   case AF_INET6:
+   switch (type) {
+   case SOCK_STREAM:
+   return "tcp6";
+   case SOCK_DGRAM:
+   return "udp6";
+   }
+   break;
+   case AF_UNIX:
+   switch (type) {
+   case SOCK_STREAM:
+   return "u_str";
+   case SOCK_DGRAM:
+   return "u_dgr";
+   case SOCK_SEQPACKET:
+   return "u_seq";
+   }
+   break;
+   case AF_VSOCK:
+   switch (type) {
+   case SOCK_STREAM:
+   return "v_str";
+   case SOCK_DGRAM:
+   return "v_dgr";
+   case SOCK_SEQPACKET:
+   return "v_seq";
+   }
+   break;
+   }
+
+   return "???";
+}
+
 #endif // __SOCKMAP_HELPERS__
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_redir.c 
b/tools/testing/selftests/bpf/prog_tests/sockmap_redir.c
new file mode 100644
index ..335dd348b019
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_redir.c
@@ -0,0 +1,315 @@
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "test_progs.h"
+#include "sockmap_helpers.h"
+#include "test_sockmap_listen.skel.h"
+
+enum prog_kind {
+   SK_MSG_EGRESS,
+   SK_MSG_INGRESS,
+   SK_SKB_EGRESS,
+   SK_SKB_INGRESS,
+};
+
+struct {
+   enum prog_kind prog_kind;
+   const char *in, *out;
+} supported[] = {
+   /* Send to local: TCP -> any but vsock */
+   { SK_MSG_INGRESS,   "tcp",  "tcp"   },
+   { SK_MSG_INGRESS,   "tcp",  "udp"   },
+   { SK_MSG_INGRESS,   "tcp",  "u_str" },
+   { SK_MSG_INGRESS,   "tcp",  "u_dgr" },
+   /* Send to egress: TCP -> TCP */
+   { SK_MSG_EGRESS,"tcp",  "tcp"   },
+   /* Ingress to egress: any -> any */
+   { SK_SKB_EGRESS,"any",  "any"   },
+   /* Ingress to local: any -> any but vsock */
+   { SK_SKB_INGRESS,   "any",  "tcp"   },
+   { SK_SKB_INGRESS,   "any",  "udp"   },
+   { SK_SKB_INGRESS,   "any",  "u_str" },
+   { SK_SKB_INGRESS,   "any",  "u_dgr" },
+};
+
+enum {
+   SEND_INNER = 0,
+   SEND_OUTER,
+};
+
+enum {
+   RECV_INNER = 0,
+   RECV_OUTER,
+};
+
+enum map_kind {
+   SOCKMAP,
+   SOCKHASH,
+};
+
+struct redir_spec {
+   const char *name;
+   int idx_send;
+   int idx_recv;
+   enum prog_kind prog_kind;
+};
+
+struct socket_spec {
+   int family;
+   int sotype;
+   int send_flags;
+   int in[2];
+   int out[2];
+};
+
+static int socket_spec_pairs(struct socket_spec *s)
+{
+   return create_socket_pairs(s->family, s->sotype,
+  &s->in[0], &s->out[0],
+  &s->in[1], &s->out[1]);
+}

Re: [PATCH bpf-next v2 0/6] selftests/bpf: Various sockmap-related fixes

2024-08-14 Thread Michal Luczaj

On 8/6/24 19:45, Jakub Sitnicki wrote:
> On Tue, Aug 06, 2024 at 07:18 PM +02, Michal Luczaj wrote:
>> Great, thanks for the review. With this completed, I guess we can unwind
>> the (mail) stack to [1]. Is that ingress-to-local et al. something you
>> wanted to take care of yourself or can I give it a try?
>> [1] https://lore.kernel.org/netdev/87msmqn9ws@cloudflare.com/
> 
> I haven't stated any work on. You're welcome to tackle that.
> 
> All I have is a toy test that I've used to generate the redirect matrix.
> Perhaps it can serve as inspiration:
> 
> https://github.com/jsitnicki/sockmap-redir-matrix

All right, please let me know if this is more or less what you meant and
I'll post the whole series for a review (+patch to purge sockmap_listen of
redir tests, fix misnomers). Mostly I've just copypasted your code
(mangling it terribly along the way), so I feel silly claiming the
authorship. Should I assign you as an author?

Note that the patches are based on [2], which has not reached bpf-next
(patchwork says: "Needs ACK").

[2] [PATCH bpf-next v2 0/6] selftests/bpf: Various sockmap-related fixes

https://lore.kernel.org/bpf/20240731-selftest-sockmap-fixes-v2-0-08a0c73ab...@rbox.co/

Re: [PATCH net-next v19 06/13] memory-provider: dmabuf devmem memory provider

2024-08-14 Thread Pavel Begunkov


On 8/14/24 15:55, Mina Almasry wrote:

On Wed, Aug 14, 2024 at 10:11 AM Pavel Begunkov  wrote:
...

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 301f4250ca82..2f2a7f4dee4c 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -17,6 +17,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 

   #include "page_pool_priv.h"
@@ -153,6 +154,10 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device 
*dev, u32 rxq_idx,
   if (err)
   goto err_xa_erase;

+ err = page_pool_check_memory_provider(dev, rxq, binding);


Frankly, I pretty much don't like it.

1. We do it after reconfiguring the queue just to fail and reconfigure
it again.



I don't see an issue with that? Or is it just me?


Not a bug, just over excessive harassing of the interface,
which can be easily be avoided.



2. It should be a part of the common path like netdev_rx_queue_restart(),
not specific to devmem TCP.

These two can be fixed by moving the check into
netdev_rx_queue_restart() just after ->ndo_queue_mem_alloc, assuming
that the callback where we init page pools.



The only reason is that the page_pool_check_memory_provider() needs to
know the memory provider to check for. Separating them keep
netdev_rx_queue_restart() usable for other future use cases that don't
expect a memory provider to be bound, but you are correct in that this
can be easily resolved by passing the binding to
netdev_rx_queue_restart() and doing the
page_pool_check_memory_providers() check inside of that function.


It's already passed inside the queue.

netdev_rx_queue_restart() {
if (rxq->mp_params && !rxq->netiov_supported)
fail;
}


3. That implicit check gives me bad feeling, instead of just getting
direct feedback from the driver, either it's a flag or an error
returned, we have to try to figure what exactly the driver did, with
a high chance this inference will fail us at some point.



This is where I get a bit confused. Jakub did mention that it is
desirable for core to verify that the driver did the right thing,
instead of trusting that a driver did the right thing without
verifying. Relying on a flag from the driver opens the door for the
driver to say "I support this" but actually not create the mp
page_pool. In my mind the explicit check is superior to getting
feedback from the driver.


You can apply the same argument to anything, but not like
after each for example ->ndo_start_xmit we dig into the
interface's pending queue to make sure it was actually queued.

And even if you check that there is a page pool, the driver
can just create an empty pool that it'll never use. There
are always ways to make it wrong.

Yes, there is a difference, and I'm not against it as a
WARN_ON_ONCE after failing it in a more explicit way.

Jakub might have a different opinion on how it should look
like, and we can clarify on that, but I do believe it's a
confusing interface that can be easily made better.


Additionally this approach lets us detect support in core using 10
lines of code or so, rather than ask every driver that wants to
support mp to add boilerplate code to declare support (and run into
subtle bugs when this boilerplate is missing). There are minor pros


Right, 10 lines of some odd code, which not even clear off the
bat why it's there if we get an error code from the restart /
callbacks, vs one line of "boilerplate" per driver a la

rxq->netiov_supported = true;

that can be added while implementing the queue api. If it's
missing the user gets back not a subtle error.



and cons to each approach; I don't see a showstopping reason to go
with one over the other.


And page_pool_check_memory_provider() is not that straightforward,
it doesn't walk through pools of a queue.


Right, we don't save the pp of a queue, only a netdev. The outer loop
checks all the pps of the netdev to find one with the correct binding,
and the inner loop checks that this binding is attached to the correct
queue.


That's the thing, I doubt about the second part.

net_devmem_bind_dmabuf_to_queue() {
err = xa_alloc(&binding->bound_rxqs, &xa_idx, rxq);
if (err)
return err;

netdev_rx_queue_restart();

// page_pool_check_memory_provider
...
xa_for_each(&binding->bound_rxqs, xa_idx, binding_rxq) {
if (rxq == binding_rxq)
return success;
}

Can't b4 the patches for some reason, but that's the highlight
from the patchset, correct me if I'm wrong. That xa_for_each
check is always true because you put the queue in there right
before it, and I don't that anyone could've erased it.

The problem here is that it seems the ->bound_rxqs state doesn't
depend on what page pools were actually created and with what mp.

So if you try to add a binding to 2 queues of the same interface,
the first succeeds and the second silently fails, then the
following net_devmem_bind_dmabuf_to_queue() for the second queue
will say that ever

Re: [PATCH v10 24/40] arm64/signal: Expose GCS state in signal frames

2024-08-14 Thread Mark Brown

On Wed, Aug 14, 2024 at 04:09:51PM +0100, Dave Martin wrote:
> On Thu, Aug 01, 2024 at 01:06:51PM +0100, Mark Brown wrote:

> > +   if (add_all || task_gcs_el0_enabled(current)) {
> > +   err = sigframe_alloc(user, &user->gcs_offset,
> > +sizeof(struct gcs_context));
> > +   if (err)
> > +   return err;
> > +   }

> Who turns on GCS?  I have a concern that if libc is new enough to be
> built for GCS then the libc startup code will to turn it on, even if
> the binary stack running on top of libc is old.

It should normally be the dynamic linker which should be looking for
annotatations in the binaries it's loading before it decides if it's
going to turn on GCS (and libc doing something similar if it's going to
dlopen() things in a process with GCS enabled).

> Is there any scenario where it is legitimate for the signal handler to
> change the shadow stack mode or to return with an altered GCSPR_EL0?

If userspace can rewrite the stack pointer on return (eg, to return to a
different context as part of userspace threading) then it will need to
be able to also update GCSPR_EL0 to something consistent otherwise
attempting to return from the interrupted context isn't going to go
well.  Changing the mode is a bit more exotic, as it is in general.
It's as much to provide information to the signal handler as anything
else.

> Is the guarded stack considered necessary (or at least beneficial) for
> backtracing, or is the regular stack sufficient?

It's potentially beneficial, being less vulnerable to corruption and
simpler to parse if all you're interested in is return addresses.
Profiling in particular was mentioned, grabbing a linear block of memory
will hopefully be less overhead than chasing down the stack.  The
regular stack should generally be sufficient though.

signature.asc
Description: PGP signature

Re: [PATCH v1 01/16] iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl

2024-08-14 Thread Nicolin Chen

On Wed, Aug 07, 2024 at 01:10:42PM -0700, Nicolin Chen wrote:

>  /**
> @@ -876,4 +877,33 @@ struct iommu_fault_alloc {
> __u32 out_fault_fd;
>  };
>  #define IOMMU_FAULT_QUEUE_ALLOC _IO(IOMMUFD_TYPE, 
> IOMMUFD_CMD_FAULT_QUEUE_ALLOC)
> +
> +/**
> + * enum iommu_viommu_type - Virtual IOMMU Type
> + * @IOMMU_VIOMMU_TYPE_DEFAULT: Core-managed VIOMMU type
> + */
> +enum iommu_viommu_type {
> +   IOMMU_VIOMMU_TYPE_DEFAULT,

This needs a "= 0x0". I fixed locally along with others in this series.

Nicolin

Re: [PATCH v1 0/2] mremap refactor: check src address for vma boundaries first.

2024-08-14 Thread Jeff Xu

On Wed, Aug 14, 2024 at 7:40 AM Liam R. Howlett  wrote:
>
> * jef...@chromium.org  [240814 03:14]:
> > From: Jeff Xu 
> >
> > mremap doesn't allow relocate, expand, shrink across VMA boundaries,
> > refactor the code to check src address range before doing anything on
> > the destination, i.e. destination won't be unmapped, if src address
> > failed the boundaries check.
> >
> > This also allows us to remove can_modify_mm from mremap.c, since
> > the src address must be single VMA, can_modify_vma is used.
>
> I don't think sending out a separate patch to address the same thing as
> the patch you said you were testing [1] is the correct approach.  You
> had already sent suggestions on mremap changes - why send this patch set
> instead of making another suggestion?
>
As indicated in the cover letter, this patch aims to improve mremap
performance while preserving existing mseal's semantics. And this
patch can go in-dependantly regardless of in-loop out-loop discussion.

[1] link in your email is broken, but I assume you meant Pedro's V1/V2
of in-loop change. In-loop change has a semantic/regression risk to
mseal, and will take longer time to review/test/prove and bake.

We can leave in-loop discussion in Pedro's thread, I hope the V3 of
Pedro's patch adds more testing coverage and addresses existing
comments in V2.

Thanks
-Jeff

-Jeff

Re: [PATCH 0/6] Extend pmu_counters_test to AMD CPUs

2024-08-14 Thread Colton Lewis

Sean Christopherson  writes:

On Tue, Aug 13, 2024, Colton Lewis wrote:

Sean Christopherson  writes:

> On Tue, Aug 13, 2024, Colton Lewis wrote:
> > (I was positive I had sent this already, but I couldn't find it on  
the

> > mailing list to reply to and ask for reviews.)

> You did[*], it's sitting in my todo folder.  Two things.

> 1. Err on the side of caution when potentially resending, and tag
> everything
> RESEND.  Someone seeing a RESEND version without having seen the
> original version
> is no big deal.  But someone seeing two copies of the same
> patches/emails can get
> quite confusing.

Sorry for jumping the gun. I couldn't find the original patches in my
email or on the (wrong) list and panicked.

Ha, no worries.  FWIW, I highly recommend using lore if you can't  
(quickly) find
something in your own mailbox.  If it hit a tracked list, lore will have  
it.  And
if you use our corporate mail, the retention policy is 18 months unless  
you go
out of your way to tag mails to be kept, i.e. lore is more trustworthy in  
the

long run.

https://lore.kernel.org/all/?q=f:coltonle...@google.com

I did. I have a bookmark for lore, but I now realize that bookmark was
only searching the kvmarm list and this series isn't going to be there
for obvious reasons. I've fixed that.

Re: [PATCH v1 05/16] iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl

2024-08-14 Thread Nicolin Chen

On Wed, Aug 07, 2024 at 01:10:46PM -0700, Nicolin Chen wrote:
> @@ -135,7 +135,14 @@ void iommufd_device_destroy(struct iommufd_object *obj)
>  {
> struct iommufd_device *idev =
> container_of(obj, struct iommufd_device, obj);
> +   struct iommufd_vdev_id *vdev_id, *curr;
> 
> +   list_for_each_entry(vdev_id, &idev->vdev_id_list, idev_item) {
> +   curr = xa_cmpxchg(&vdev_id->viommu->vdev_ids, 
> vdev_id->vdev_id,
> + vdev_id, NULL, GFP_KERNEL);
> +   WARN_ON(curr != vdev_id);
> +   kfree(vdev_id);
> +   }

Kevin already pointed out previously during the RFC review that
we probably should do one vdev_id per idev. And Jason expressed
okay to either way. I didn't plan to change this part until this
week for the VIRQ series.

My rethinking is that an idev is attached to one (and only one)
nested HWPT. The nested HWPT is associated to one (and only one)
VIOMMU object. So, it's unlikely we can a second vdev_id, i.e.
idev->vdev_id is enough.

This helps us to build a device-based virq report function:
+void iommufd_device_report_virq(struct device *dev, unsigned int data_type,
+   void *data_ptr, size_t data_len);

I built a link from device to viommu reusing Baolu's work:
struct device -> struct iommu_group -> struct iommu_attach_handle
-> struct iommufd_attach_handle -> struct iommufd_device (idev)
-> struct iommufd_vdev_id (idev->vdev_id)

The vdev_id struct holds viommu and virtual ID, so allowing us
to add another two helpers:
+struct iommufd_viommu *iommufd_device_get_viommu(struct device *dev);
+u64 iommufd_device_get_virtual_id(struct device *dev);

A driver that reports event/irq per device can use these helpers
to report virq via the core-managed VIOMMU object. (If a driver
has some non-per-device type of IRQs, it would have to allocate
a driver-managed VIOMMU object instead.)

I have both a revised VIOMMU series and a new VIRQ series ready.
Will send in the following days after some testing/polishing.

Thanks
Nicolin

[PATCH] selftests/net/pmtu.sh: Fix typo in error message

2024-08-14 Thread Abhash Jha

The word 'expected' was spelled as 'exepcted'.
Fixed the typo in this patch.

Signed-off-by: Abhash Jha 
---
 tools/testing/selftests/net/pmtu.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index cfc849580..62eceb385 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -1347,7 +1347,7 @@ test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() 
{
size=$(du -sb $tmpoutfile)
size=${size%%/tmp/*}
 
-   [ $size -ne 1048576 ] && err "File size $size mismatches 
exepcted value in locally bridged vxlan test" && return 1
+   [ $size -ne 1048576 ] && err "File size $size mismatches 
expected value in locally bridged vxlan test" && return 1
done
 
rm -f "$tmpoutfile"
-- 
2.43.0

Re: [PATCH] selftests/net/pmtu.sh: Fix typo in error message

2024-08-14 Thread Jakub Kicinski

On Wed, 14 Aug 2024 22:44:03 +0530 Abhash Jha wrote:
> The word 'expected' was spelled as 'exepcted'.
> Fixed the typo in this patch.
> 
> Signed-off-by: Abhash Jha 

Missing cc: netdev@ please repost

Re: [PATCH net-next v19 00/13] Device Memory TCP

2024-08-14 Thread Jakub Kicinski

On Tue, 13 Aug 2024 21:13:02 + Mina Almasry wrote:
> v18 got a thorough review (thanks!), and this iteration addresses the
> feedback.
> 
> Major changes:
> - Prevent deactivating mp bound queues.
> - Prevent installing xdp on mp bound netdevs, or installing mps on xdp
>   installed netdevs.
> - Fix corner cases in netlink API vis-a-vis missing attributes.
> - Iron out the unreadable netmem driver support story. To be honest, the
>   conversation with Jakub & Pavel got a bit confusing for me. I've
>   implemented an approach in this set that makes sense to me, and
>   AFAICT, addresses the requirements. It may be good as-is, or it
>   may be a conversation starter/continuer. To be honest IMO there
>   are many ways to skin this cat and I don't see an extremely strong
>   reason to go for one approach over another. Here is one approach you
>   may like.
> - Don't reset niov dma_addr on allocation & free.
> - Add some tests to the selftest that catches some of the issues around
>   missing netlink attributes or deactivating mp-bound queues.

Something is going awry in two existing test:

https://netdev.bots.linux.dev/contest.html?branch=net-next-2024-08-14--15-00&pw-n=0&pass=0

Example:

https://netdev-3.bots.linux.dev/vmksft-net-drv/results/727462/2-queues-py/stdout

I'll take a closer look at the code in the evening, but gotta discard
if from pw already..
-- 
pw-bot: cr

[PATCH] selftests/net/pmtu.sh: Fix typo in error message

2024-08-14 Thread Abhash Jha

The word 'expected' was spelled as 'exepcted'.
Fixed the typo in this patch.

Signed-off-by: Abhash Jha 
---
 tools/testing/selftests/net/pmtu.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index cfc849580..62eceb385 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -1347,7 +1347,7 @@ test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() 
{
size=$(du -sb $tmpoutfile)
size=${size%%/tmp/*}
 
-   [ $size -ne 1048576 ] && err "File size $size mismatches 
exepcted value in locally bridged vxlan test" && return 1
+   [ $size -ne 1048576 ] && err "File size $size mismatches 
expected value in locally bridged vxlan test" && return 1
done
 
rm -f "$tmpoutfile"
-- 
2.43.0

[PATCH net v6 0/2] Enhance network interface feature testing

2024-08-14 Thread Abhinav Jain

This small series includes fixes for creation of veth pairs for
networkless kernels & adds tests for turning the different network
interface features on and off in selftests/net/netdevice.sh script.

Changes in v6:
Use XFAIL for ethtool operations that are unsupported instead of SKIP.

Changes in v5:
https://lore.kernel.org/all/20240808122452.25683-1-jain.abhinav...@gmail.com/

Rectify the syntax for ip add link.
Fix the veth_created condition check.

Changes in v4:
https://lore.kernel.org/all/20240807175717.7775-1-jain.abhinav...@gmail.com/

Move veth creation/removal to the main shell script.
Tested using vng on a networkless kernel and the script works, sample
output below the changes.

Changes in v3:
https://lore.kernel.org/all/20240614113240.41550-1-jain.abhinav...@gmail.com/

Add a check for netdev, create veth pair for testing.
Restore feature to its initial state.

Changes in v2:
https://lore.kernel.org/all/20240609132124.51683-1-jain.abhinav...@gmail.com/

Remove tail usage; use read to parse the features from temp file.

v1:
https://lore.kernel.org/all/20240606212714.27472-1-jain.abhinav...@gmail.com/

```
# selftests: net: netdevice.sh
# No valid network device found, creating veth pair
# PASS: veth0: set interface up
# PASS: veth0: set MAC address
# SKIP: veth0: set IP address
# PASS: veth0: ethtool list features
# PASS: veth0: Turned off feature: rx-checksumming
# PASS: veth0: Turned on feature: rx-checksumming
# PASS: veth0: Restore feature rx-checksumming to initial state on
# Actual changes:


# PASS: veth0: Restore feature rx-gro-list to initial state off
# PASS: veth0: Turned off feature: rx-udp-gro-forwarding
# PASS: veth0: Turned on feature: rx-udp-gro-forwarding
# PASS: veth0: Restore feature rx-udp-gro-forwarding to initial state off
# Cannot get register dump: Operation not supported
# XFAIL: veth0: ethtool dump not supported
# PASS: veth0: ethtool stats
# PASS: veth0: stop interface
```

Abhinav Jain (2):
  selftests: net: Create veth pair for testing in networkless kernel
  selftests: net: Add on/off checks for non-fixed features of interface

 tools/testing/selftests/net/netdevice.sh | 55 +++-
 1 file changed, 53 insertions(+), 2 deletions(-)

--
2.34.1

[PATCH net v6 1/2] selftests: net: Create veth pair for testing in networkless kernel

2024-08-14 Thread Abhinav Jain

Check if the netdev list is empty and create veth pair to be used for
feature on/off testing.
Remove the veth pair after testing is complete.

Signed-off-by: Abhinav Jain 
---
 tools/testing/selftests/net/netdevice.sh | 16 
 1 file changed, 16 insertions(+)

diff --git a/tools/testing/selftests/net/netdevice.sh 
b/tools/testing/selftests/net/netdevice.sh
index e3afcb424710..0c32950fdd17 100755
--- a/tools/testing/selftests/net/netdevice.sh
+++ b/tools/testing/selftests/net/netdevice.sh
@@ -129,6 +129,7 @@ kci_netdev_ethtool()
 
kci_netdev_ethtool_test 74 'dump' "ethtool -d $netdev"
kci_netdev_ethtool_test 94 'stats' "ethtool -S $netdev"
+
return 0
 }
 
@@ -196,10 +197,25 @@ if [ ! -e "$TMP_LIST_NETDEV" ];then
 fi
 
 ip link show |grep '^[0-9]' | grep -oE 
'[[:space:]].*eth[0-9]*:|[[:space:]].*enp[0-9]s[0-9]:' | cut -d\  -f2 | cut -d: 
-f1> "$TMP_LIST_NETDEV"
+
+if [ ! -s "$TMP_LIST_NETDEV" ]; then
+   echo "No valid network device found, creating veth pair"
+   ip link add veth0 type veth peer name veth1
+   echo "veth0" > "$TMP_LIST_NETDEV"
+   echo "veth1" >> "$TMP_LIST_NETDEV"
+   veth_created=1
+fi
+
 while read netdev
 do
kci_test_netdev "$netdev"
 done < "$TMP_LIST_NETDEV"
 
+#clean up veth interface pair if it was created
+if [ "$veth_created" ]; then
+   ip link delete veth0
+   echo "Removed veth pair"
+fi
+
 rm "$TMP_LIST_NETDEV"
 exit 0
-- 
2.34.1

[PATCH net v6 2/2] selftests: net: Add on/off checks for non-fixed features of interface

2024-08-14 Thread Abhinav Jain

Implement on/off testing for all non-fixed features via while loop.
Save the initial state so that it can be restored after on/off checks.
Use XFAIL for unsupported ethtool API.

Signed-off-by: Abhinav Jain 
---
 tools/testing/selftests/net/netdevice.sh | 39 ++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/netdevice.sh 
b/tools/testing/selftests/net/netdevice.sh
index 0c32950fdd17..1c9c7c358ee6 100755
--- a/tools/testing/selftests/net/netdevice.sh
+++ b/tools/testing/selftests/net/netdevice.sh
@@ -86,7 +86,7 @@ kci_netdev_ethtool_test()
ret=$?
if [ $ret -ne 0 ];then
if [ $ret -eq "$1" ];then
-   echo "SKIP: $netdev: ethtool $2 not supported"
+   echo "XFAIL: $netdev: ethtool $2 not supported"
return $ksft_skip
else
echo "FAIL: $netdev: ethtool $2"
@@ -124,7 +124,42 @@ kci_netdev_ethtool()
return 1
fi
echo "PASS: $netdev: ethtool list features"
-   #TODO for each non fixed features, try to turn them on/off
+
+   while read -r FEATURE VALUE FIXED; do
+   [ "$FEATURE" != "Features" ] || continue # Skip "Features"
+   [ "$FIXED" != "[fixed]" ] || continue # Skip fixed features
+   feature="${FEATURE%:*}"
+
+   initial_state=$(ethtool -k "$netdev" | grep "$feature:" \
+   | awk '{print $2}')
+   ethtool --offload "$netdev" "$feature" off
+   if [ $? -eq 0 ]; then
+   echo "PASS: $netdev: Turned off feature: $feature"
+   else
+   echo "FAIL: $netdev: Failed to turn off feature:" \
+   "$feature"
+   fi
+
+   ethtool --offload "$netdev" "$feature" on
+   if [ $? -eq 0 ]; then
+   echo "PASS: $netdev: Turned on feature: $feature"
+   else
+   echo "FAIL: $netdev: Failed to turn on feature:" \
+   "$feature"
+   fi
+
+   #restore the feature to its initial state
+   ethtool --offload "$netdev" "$feature" "$initial_state"
+   if [ $? -eq 0 ]; then
+   echo "PASS: $netdev: Restore feature $feature" \
+   "to initial state $initial_state"
+   else
+   echo "FAIL: $netdev: Failed to restore feature" \
+   "$feature to default $initial_state"
+   fi
+
+   done < "$TMP_ETHTOOL_FEATURES"
+
rm "$TMP_ETHTOOL_FEATURES"
 
kci_netdev_ethtool_test 74 'dump' "ethtool -d $netdev"
-- 
2.34.1

Re: [PATCH] arm64/vdso: Remove --hash-style=sysv

2024-08-14 Thread Xi Ruoyao

On Thu, 2024-07-18 at 10:34 -0700, Fangrui Song wrote:
> glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
> for more than one decade in many Linux distributions.  Using
> --hash-style=sysv might imply unaddressed issues and confuse readers.
> 
> Just drop the option and rely on the linker default, which is likely
> "both", or "gnu" when the distribution really wants to eliminate sysv
> hash overhead.
> 
> Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").
> 
> Signed-off-by: Fangrui Song 

Hi Fangrui,

If I read tools/testing/selftests/vDSO/parse_vdso.c correctly, it does
know DT_GNU_HASH as at now.  Thus after this change the vDSO selftests
are skipped with "Couldn't find __vdso_gettimeofday" etc if the distro
enables --hash-style=gnu by default.

So it seems we need to add DT_GNU_HASH support for parse_vdso.c to keep
test coverage.

> ---
>  arch/arm64/kernel/vdso/Makefile   | 2 +-
>  arch/arm64/kernel/vdso32/Makefile | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
> index d63930c82839..d11da6461278 100644
> --- a/arch/arm64/kernel/vdso/Makefile
> +++ b/arch/arm64/kernel/vdso/Makefile
> @@ -21,7 +21,7 @@ btildflags-$(CONFIG_ARM64_BTI_KERNEL) += -z force-bti
>  # potential future proofing if we end up with internal calls to the exported
>  # routines, as x86 does (see 6f121e548f83 ("x86, vdso: Reimplement vdso.so
>  # preparation in build-time C")).
> -ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv   \
> +ldflags-y := -shared -soname=linux-vdso.so.1 \
>    -Bsymbolic --build-id=sha1 -n $(btildflags-y)
>  
>  ifdef CONFIG_LD_ORPHAN_WARN
> diff --git a/arch/arm64/kernel/vdso32/Makefile 
> b/arch/arm64/kernel/vdso32/Makefile
> index cc4508c604b2..25a2cb6317f3 100644
> --- a/arch/arm64/kernel/vdso32/Makefile
> +++ b/arch/arm64/kernel/vdso32/Makefile
> @@ -98,7 +98,7 @@ VDSO_AFLAGS += -D__ASSEMBLY__
>  # From arm vDSO Makefile
>  VDSO_LDFLAGS += -Bsymbolic --no-undefined -soname=linux-vdso.so.1
>  VDSO_LDFLAGS += -z max-page-size=4096 -z common-page-size=4096
> -VDSO_LDFLAGS += -shared --hash-style=sysv --build-id=sha1
> +VDSO_LDFLAGS += -shared --build-id=sha1
>  VDSO_LDFLAGS += --orphan-handling=$(CONFIG_LD_ORPHAN_WARN_LEVEL)
>  
>  

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University

Re: [PATCH v1 0/2] mremap refactor: check src address for vma boundaries first.

2024-08-14 Thread Liam R. Howlett

* Jeff Xu  [240814 12:57]:
> On Wed, Aug 14, 2024 at 7:40 AM Liam R. Howlett  
> wrote:
> >
> > * jef...@chromium.org  [240814 03:14]:
> > > From: Jeff Xu 
> > >
> > > mremap doesn't allow relocate, expand, shrink across VMA boundaries,
> > > refactor the code to check src address range before doing anything on
> > > the destination, i.e. destination won't be unmapped, if src address
> > > failed the boundaries check.
> > >
> > > This also allows us to remove can_modify_mm from mremap.c, since
> > > the src address must be single VMA, can_modify_vma is used.
> >
> > I don't think sending out a separate patch to address the same thing as
> > the patch you said you were testing [1] is the correct approach.  You
> > had already sent suggestions on mremap changes - why send this patch set
> > instead of making another suggestion?
> >
> As indicated in the cover letter, this patch aims to improve mremap
> performance while preserving existing mseal's semantics.

They are not worth preserving.

> And this
> patch can go in-dependantly regardless of in-loop out-loop discussion.

No, it conflicts with the other mremap patch as it changes the same
code - in a very similar way.

> 
> [1] link in your email is broken, but I assume you meant Pedro's V1/V2
> of in-loop change.

Yes, the email where you delayed discussing the fix so that you could
test it.  Which brings up the question you didn't answer and deleted:
Does your testing pass on those patches?

> In-loop change has a semantic/regression risk to
> mseal, and will take longer time to review/test/prove and bake.

There are no uses, so the risk is minimal.

> We can leave in-loop discussion in Pedro's thread,

No, it is directly linked to these patches as this should have just been
a comment on a patch in that series.

> I hope the V3 of
> Pedro's patch adds more testing coverage and addresses existing
> comments in V2.

The majority of the comments to V2 are mine, you only told us that
splitting a sealed vma is wrong (after I asked you directly to answer)
and then you made a comment about testing of the patch set. Besides the
direct responses to me, your comment was "wait for me to test".

You are holding us hostage by asking for more testing but not sharing
what is and is not valid for mseal() - or even answering questions on
tests you run.  Splitting a vma doesn't change the memory, but that's
not allowed for some reason.

These patches should be rejected in favour of fixing the feature like it
should have been written in the first place.  Anything less is just to
simplify backports and avoiding testing - "avoiding the business logic".

Liam

[1] 
https://lore.kernel.org/all/calmywfvurjbgyfw7x5qrl4cqozjy92nefas750xalxo7o7c...@mail.gmail.com/

Re: [PATCH net-next v13 02/14] mm: move the page fragment allocator from page_alloc into its own file

2024-08-14 Thread Andrew Morton

On Thu, 8 Aug 2024 20:37:02 +0800 Yunsheng Lin  wrote:

> Inspired by [1], move the page fragment allocator from page_alloc
> into its own c file and header file, as we are about to make more
> change for it to replace another page_frag implementation in
> sock.c

Acked-by: Andrew Morton 

This presently has no conflicts with mm.git.

Re: [PATCH v1 05/16] iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl

2024-08-14 Thread Jason Gunthorpe

On Wed, Aug 14, 2024 at 10:09:22AM -0700, Nicolin Chen wrote:

> This helps us to build a device-based virq report function:
> +void iommufd_device_report_virq(struct device *dev, unsigned int data_type,
> +   void *data_ptr, size_t data_len);
> 
> I built a link from device to viommu reusing Baolu's work:
> struct device -> struct iommu_group -> struct iommu_attach_handle
> -> struct iommufd_attach_handle -> struct iommufd_device (idev)
> -> struct iommufd_vdev_id (idev->vdev_id)

That makes sense, the vdev id would be 1:1 with the struct device, and
the iommufd_device is also supposed to be 1:1 with the struct device.

Jason

Re: [PATCH bpf-next v4 2/2] selftests/bpf: Add mptcp subflow subtest

2024-08-14 Thread Martin KaFai Lau


On 8/5/24 2:52 AM, Matthieu Baerts (NGI0) wrote:

+static void run_subflow(char *new)
+{
+   int server_fd, client_fd, err;
+   char cc[TCP_CA_NAME_MAX];
+   socklen_t len = sizeof(cc);
+
+   server_fd = start_mptcp_server(AF_INET, ADDR_1, PORT_1, 0);
+   if (!ASSERT_GE(server_fd, 0, "start_mptcp_server"))
+   return;
+
+   client_fd = connect_to_fd(server_fd, 0);
+   if (!ASSERT_GE(client_fd, 0, "connect to fd"))
+   goto fail;
+
+   err = getsockopt(server_fd, SOL_TCP, TCP_CONGESTION, cc, &len);
+   if (!ASSERT_OK(err, "getsockopt(srv_fd, TCP_CONGESTION)"))
+   goto fail;

client_fd is leaked.


+
+   send_byte(client_fd);
+
+   ASSERT_OK(ss_search(ADDR_1, "fwmark:0x1"), "ss_search fwmark:0x1");
+   ASSERT_OK(ss_search(ADDR_2, "fwmark:0x2"), "ss_search fwmark:0x2");
+   ASSERT_OK(ss_search(ADDR_1, new), "ss_search new cc");
+   ASSERT_OK(ss_search(ADDR_2, cc), "ss_search default cc");
+
+   close(client_fd);
+fail:
+   close(server_fd);
+}

Re: [PATCH bpf-next v4 2/2] selftests/bpf: Add mptcp subflow subtest

2024-08-14 Thread Martin KaFai Lau


On 8/14/24 3:04 AM, Matthieu Baerts wrote:

Hi Martin,

Thank you for your reply!

On 14/08/2024 03:12, Martin KaFai Lau wrote:

On 8/5/24 2:52 AM, Matthieu Baerts (NGI0) wrote:

+static int endpoint_init(char *flags)
+{
+    SYS(fail, "ip -net %s link add veth1 type veth peer name veth2",
NS_TEST);
+    SYS(fail, "ip -net %s addr add %s/24 dev veth1", NS_TEST, ADDR_1);
+    SYS(fail, "ip -net %s link set dev veth1 up", NS_TEST);
+    SYS(fail, "ip -net %s addr add %s/24 dev veth2", NS_TEST, ADDR_2);
+    SYS(fail, "ip -net %s link set dev veth2 up", NS_TEST);
+    if (SYS_NOFAIL("ip -net %s mptcp endpoint add %s %s", NS_TEST,
ADDR_2, flags)) {
+    printf("'ip mptcp' not supported, skip this test.\n");
+    test__skip();


It is always a skip now in bpf CI:

#171/3   mptcp/subflow:SKIP

This test is a useful addition for the bpf CI selftest.

It can't catch regression if it is always a skip in bpf CI though.


Indeed, for the moment, this test is skipped in bpf CI.

The MPTCP CI checks the MPTCP BPF selftests that are on top of net and
net-next at least once a day. It is always running with the last stable
version of iproute2, so this test is not skipped:

#169/3   mptcp/subflow:OK

https://github.com/multipath-tcp/mptcp_net-next/actions/runs/10384566794/job/28751869426#step:7:11080


iproute2 needs to be updated (cc: Daniel Xu and Manu, the outdated
iproute2 is something that came up multiple times).

Not sure when the iproute2 can be updated. In the mean time, your v3 is
pretty close to getting pm_nl_ctl compiled. Is there other blocker on this?


I will try to find some time to check the modifications I suggested in
the v3, but I don't know how long it will take to have them ready, as
they might require some adaptations of the CI side as well, I need to
check. On the other hand, I understood adding a duplicated version of
the mptcp.h UAPI header is not an option either.

So not to block this (already old) series, I thought it would help to
first focus on this version using 'ip mptcp', while I'm looking at the
selftests modifications. If these modifications are successful, I can
always resend the patch 2/3 from the v3 later, and using 'pm_nl_ctl'
instead of 'ip mptcp', to be able to work with IPRoute2 5.5.

Do you think that could work like that?


If there is CI started covering it, staying with the 'ip mptcp' is fine.

The bpf CI has to start testing it asap also. The iproute2 package will need to 
be updated on the bpf CI side. I think this has to be done regardless.


It will be useful to avoid the uapi header dup on its own. The last one you have 
seems pretty close.





+    goto fail;
+    }
+
+    return 0;
+fail:
+    return -1;
+}
+
+static int _ss_search(char *src, char *dst, char *port, char *keyword)
+{
+    return SYS_NOFAIL("ip netns exec %s ss -enita src %s dst %s %s %d
| grep -q '%s'",
+  NS_TEST, src, dst, port, PORT_1, keyword);
+}
+
+static int ss_search(char *src, char *keyword)
+{
+    return _ss_search(src, ADDR_1, "dport", keyword);
+}
+
+static void run_subflow(char *new)
+{
+    int server_fd, client_fd, err;
+    char cc[TCP_CA_NAME_MAX];
+    socklen_t len = sizeof(cc);
+
+    server_fd = start_mptcp_server(AF_INET, ADDR_1, PORT_1, 0);
+    if (!ASSERT_GE(server_fd, 0, "start_mptcp_server"))
+    return;
+
+    client_fd = connect_to_fd(server_fd, 0);
+    if (!ASSERT_GE(client_fd, 0, "connect to fd"))
+    goto fail;
+
+    err = getsockopt(server_fd, SOL_TCP, TCP_CONGESTION, cc, &len);
+    if (!ASSERT_OK(err, "getsockopt(srv_fd, TCP_CONGESTION)"))
+    goto fail;
+
+    send_byte(client_fd);
+
+    ASSERT_OK(ss_search(ADDR_1, "fwmark:0x1"), "ss_search fwmark:0x1");
+    ASSERT_OK(ss_search(ADDR_2, "fwmark:0x2"), "ss_search fwmark:0x2");
+    ASSERT_OK(ss_search(ADDR_1, new), "ss_search new cc");
+    ASSERT_OK(ss_search(ADDR_2, cc), "ss_search default cc");


Is there a getsockopt way instead of ss + grep?


No there isn't: from the userspace, the app communicates with the MPTCP
socket, which can have multiple paths (subflows, a TCP socket). To keep
the compatibility with TCP, [gs]etsockopt() will look at/modify the
whole MPTCP connection. For example, in some cases, a setsockopt() will
propagate the option to all the subflows. Depending on the option, the
modification might only apply to the first subflow, or to the
user-facing socket.

For advanced users who want to have different options set to the
different subflows of an MPTCP connection, they can use BPF: that's what
is being validated here. In other words, doing a 'getsockopt()' from the
userspace program here will not show all the different marks and TCP CC
that can be set per subflow with BPF. We can see that in the test: a
getsockopt() is done on the MPTCP socket to retrieve the default TCP CC
('cc' which is certainly 'cubic'), but we expect to find another one
('new' which is 'reno'), set by the BPF program from patch 1/2. I guess
we could use bpf to do a getsockopt() pe

[PATCH net 00/14] VLAN fixes for Ocelot driver

2024-08-14 Thread Vladimir Oltean

This is a collection of patches I've gathered over the past several
months.

Patches 1-6/14 are supporting patches for selftests.

Patch 9/14 fixes PTP TX from a VLAN upper of a VLAN-aware bridge port
when using the "ocelot-8021q" tagging protocol. Patch 7/14 is its
supporting selftest.

Patch 10/14 fixes the QoS class used by PTP in the same case as above.
It is hard to quantify - there is no selftest.

Patch 11/14 fixes potential data corruption during PTP TX in the same
case as above. Again, there is no selftest.

Patch 13/14 fixes RX in the same case as above - 8021q upper of a
VLAN-aware bridge port, with the "ocelot-8021q" tagging protocol. Patch
12/14 is a supporting patch for this in the DSA core, and 7/14 is also
its selftest.

Patch 14/14 ensures that VLAN-aware bridges offloaded to Ocelot only
react to the ETH_P_8021Q TPID, and treat absolutely everything else as
VLAN-untagged, including ETH_P_8021AD. Patch 8/14 is the supporting
selftest.

Vladimir Oltean (14):
  selftests: net: local_termination: refactor macvlan creation/deletion
  selftests: net: local_termination: parameterize sending interface
  selftests: net: local_termination: parameterize test name
  selftests: net: local_termination: add one more test for VLAN-aware
bridges
  selftests: net: local_termination: introduce new tests which capture
VLAN behavior
  selftests: net: local_termination: don't use xfail_on_veth()
  selftests: net: local_termination: add PTP frames to the mix
  selftests: net: bridge_vlan_aware: test that other TPIDs are seen as
untagged
  net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and
register injection
  net: mscc: ocelot: fix QoS class for injected packets with
"ocelot-8021q"
  net: mscc: ocelot: serialize access to the injection/extraction groups
  net: dsa: provide a software untagging function on RX for VLAN-aware
bridges
  net: dsa: felix: fix VLAN tag loss on CPU reception with ocelot-8021q
  net: mscc: ocelot: treat 802.1ad tagged traffic as 802.1Q-untagged

 drivers/net/dsa/ocelot/felix.c| 126 -
 drivers/net/ethernet/mscc/ocelot.c| 279 +++-
 drivers/net/ethernet/mscc/ocelot_fdma.c   |   3 +-
 drivers/net/ethernet/mscc/ocelot_vcap.c   |   1 +
 drivers/net/ethernet/mscc/ocelot_vsc7514.c|   4 +
 include/linux/dsa/ocelot.h|  47 ++
 include/net/dsa.h |  16 +-
 include/soc/mscc/ocelot.h |  12 +-
 include/soc/mscc/ocelot_vcap.h|   2 +
 net/dsa/tag.c |   5 +-
 net/dsa/tag.h | 135 --
 net/dsa/tag_ocelot.c  |  37 +-
 .../net/forwarding/bridge_vlan_aware.sh   |  54 ++-
 tools/testing/selftests/net/forwarding/lib.sh |  57 +++
 .../net/forwarding/local_termination.sh   | 431 +++---
 15 files changed, 1036 insertions(+), 173 deletions(-)

-- 
2.34.1

[PATCH net 01/14] selftests: net: local_termination: refactor macvlan creation/deletion

2024-08-14 Thread Vladimir Oltean

This will be used in other subtests as well; make new macvlan_create()
and macvlan_destroy() functions.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/local_termination.sh   | 30 +++
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index 4b364cdf3ef0..36f3d577d0be 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -247,19 +247,29 @@ bridge_destroy()
ip link del br0
 }
 
-standalone()
+macvlan_create()
 {
-   h1_create
-   h2_create
+   local lower=$1
 
-   ip link add link $h2 name macvlan0 type macvlan mode private
+   ip link add link $lower name macvlan0 type macvlan mode private
ip link set macvlan0 address $MACVLAN_ADDR
ip link set macvlan0 up
+}
 
-   run_test $h2
-
+macvlan_destroy()
+{
ip link del macvlan0
+}
+
+standalone()
+{
+   h1_create
+   h2_create
+   macvlan_create $h2
+
+   run_test $h2
 
+   macvlan_destroy
h2_destroy
h1_destroy
 }
@@ -268,15 +278,11 @@ bridge()
 {
h1_create
bridge_create
-
-   ip link add link br0 name macvlan0 type macvlan mode private
-   ip link set macvlan0 address $MACVLAN_ADDR
-   ip link set macvlan0 up
+   macvlan_create br0
 
run_test br0
 
-   ip link del macvlan0
-
+   macvlan_destroy
bridge_destroy
h1_destroy
 }
-- 
2.34.1

[PATCH net 02/14] selftests: net: local_termination: parameterize sending interface

2024-08-14 Thread Vladimir Oltean

In future changes we will want to subject the DUT, $h2, to additional
VLAN-tagged traffic. For that, we need to run the tests using $h1.100 as
a sending interface, rather than the currently hardcoded $h1.

Add a parameter to run_test() and modify its 2 callers to explicitly
pass $h1, as was implicit before.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/local_termination.sh   | 39 ++-
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index 36f3d577d0be..92f0e242d119 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -104,44 +104,45 @@ mc_route_destroy()
 
 run_test()
 {
-   local rcv_if_name=$1
-   local smac=$(mac_get $h1)
+   local send_if_name=$1; shift
+   local rcv_if_name=$1; shift
+   local smac=$(mac_get $send_if_name)
local rcv_dmac=$(mac_get $rcv_if_name)
 
tcpdump_start $rcv_if_name
 
-   mc_route_prepare $h1
+   mc_route_prepare $send_if_name
mc_route_prepare $rcv_if_name
 
-   send_uc_ipv4 $h1 $rcv_dmac
-   send_uc_ipv4 $h1 $MACVLAN_ADDR
-   send_uc_ipv4 $h1 $UNKNOWN_UC_ADDR1
+   send_uc_ipv4 $send_if_name $rcv_dmac
+   send_uc_ipv4 $send_if_name $MACVLAN_ADDR
+   send_uc_ipv4 $send_if_name $UNKNOWN_UC_ADDR1
 
ip link set dev $rcv_if_name promisc on
-   send_uc_ipv4 $h1 $UNKNOWN_UC_ADDR2
-   mc_send $h1 $UNKNOWN_IPV4_MC_ADDR2
-   mc_send $h1 $UNKNOWN_IPV6_MC_ADDR2
+   send_uc_ipv4 $send_if_name $UNKNOWN_UC_ADDR2
+   mc_send $send_if_name $UNKNOWN_IPV4_MC_ADDR2
+   mc_send $send_if_name $UNKNOWN_IPV6_MC_ADDR2
ip link set dev $rcv_if_name promisc off
 
mc_join $rcv_if_name $JOINED_IPV4_MC_ADDR
-   mc_send $h1 $JOINED_IPV4_MC_ADDR
+   mc_send $send_if_name $JOINED_IPV4_MC_ADDR
mc_leave
 
mc_join $rcv_if_name $JOINED_IPV6_MC_ADDR
-   mc_send $h1 $JOINED_IPV6_MC_ADDR
+   mc_send $send_if_name $JOINED_IPV6_MC_ADDR
mc_leave
 
-   mc_send $h1 $UNKNOWN_IPV4_MC_ADDR1
-   mc_send $h1 $UNKNOWN_IPV6_MC_ADDR1
+   mc_send $send_if_name $UNKNOWN_IPV4_MC_ADDR1
+   mc_send $send_if_name $UNKNOWN_IPV6_MC_ADDR1
 
ip link set dev $rcv_if_name allmulticast on
-   send_uc_ipv4 $h1 $UNKNOWN_UC_ADDR3
-   mc_send $h1 $UNKNOWN_IPV4_MC_ADDR3
-   mc_send $h1 $UNKNOWN_IPV6_MC_ADDR3
+   send_uc_ipv4 $send_if_name $UNKNOWN_UC_ADDR3
+   mc_send $send_if_name $UNKNOWN_IPV4_MC_ADDR3
+   mc_send $send_if_name $UNKNOWN_IPV6_MC_ADDR3
ip link set dev $rcv_if_name allmulticast off
 
mc_route_destroy $rcv_if_name
-   mc_route_destroy $h1
+   mc_route_destroy $send_if_name
 
sleep 1
 
@@ -267,7 +268,7 @@ standalone()
h2_create
macvlan_create $h2
 
-   run_test $h2
+   run_test $h1 $h2
 
macvlan_destroy
h2_destroy
@@ -280,7 +281,7 @@ bridge()
bridge_create
macvlan_create br0
 
-   run_test br0
+   run_test $h1 br0
 
macvlan_destroy
bridge_destroy
-- 
2.34.1

[PATCH net 03/14] selftests: net: local_termination: parameterize test name

2024-08-14 Thread Vladimir Oltean

There are upcoming tests which verify the RX filtering of a bridge
(or bridge port), but under differing vlan_filtering conditions.
Since we currently print $h2 (the DUT) in the log_test() output, it
becomes necessary to make a further distinction between tests, to not
give the user the impression that the exact same thing is run twice.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/local_termination.sh   | 38 ++-
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index 92f0e242d119..af284edaf401 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -68,10 +68,11 @@ send_uc_ipv4()
 
 check_rcv()
 {
-   local if_name=$1
-   local type=$2
-   local pattern=$3
-   local should_receive=$4
+   local if_name=$1; shift
+   local type=$1; shift
+   local pattern=$1; shift
+   local should_receive=$1; shift
+   local test_name="$1"; shift
local should_fail=
 
[ $should_receive = true ] && should_fail=0 || should_fail=1
@@ -81,7 +82,7 @@ check_rcv()
 
check_err_fail "$should_fail" "$?" "reception"
 
-   log_test "$if_name: $type"
+   log_test "$test_name: $type"
 }
 
 mc_route_prepare()
@@ -106,6 +107,7 @@ run_test()
 {
local send_if_name=$1; shift
local rcv_if_name=$1; shift
+   local test_name="$1"; shift
local smac=$(mac_get $send_if_name)
local rcv_dmac=$(mac_get $rcv_if_name)
 
@@ -150,61 +152,61 @@ run_test()
 
check_rcv $rcv_if_name "Unicast IPv4 to primary MAC address" \
"$smac > $rcv_dmac, ethertype IPv4 (0x0800)" \
-   true
+   true "$test_name"
 
check_rcv $rcv_if_name "Unicast IPv4 to macvlan MAC address" \
"$smac > $MACVLAN_ADDR, ethertype IPv4 (0x0800)" \
-   true
+   true "$test_name"
 
xfail_on_veth $h1 \
check_rcv $rcv_if_name "Unicast IPv4 to unknown MAC address" \
"$smac > $UNKNOWN_UC_ADDR1, ethertype IPv4 (0x0800)" \
-   false
+   false "$test_name"
 
check_rcv $rcv_if_name "Unicast IPv4 to unknown MAC address, promisc" \
"$smac > $UNKNOWN_UC_ADDR2, ethertype IPv4 (0x0800)" \
-   true
+   true "$test_name"
 
xfail_on_veth $h1 \
check_rcv $rcv_if_name \
"Unicast IPv4 to unknown MAC address, allmulti" \
"$smac > $UNKNOWN_UC_ADDR3, ethertype IPv4 (0x0800)" \
-   false
+   false "$test_name"
 
check_rcv $rcv_if_name "Multicast IPv4 to joined group" \
"$smac > $JOINED_MACV4_MC_ADDR, ethertype IPv4 (0x0800)" \
-   true
+   true "$test_name"
 
xfail_on_veth $h1 \
check_rcv $rcv_if_name \
"Multicast IPv4 to unknown group" \
"$smac > $UNKNOWN_MACV4_MC_ADDR1, ethertype IPv4 
(0x0800)" \
-   false
+   false "$test_name"
 
check_rcv $rcv_if_name "Multicast IPv4 to unknown group, promisc" \
"$smac > $UNKNOWN_MACV4_MC_ADDR2, ethertype IPv4 (0x0800)" \
-   true
+   true "$test_name"
 
check_rcv $rcv_if_name "Multicast IPv4 to unknown group, allmulti" \
"$smac > $UNKNOWN_MACV4_MC_ADDR3, ethertype IPv4 (0x0800)" \
-   true
+   true "$test_name"
 
check_rcv $rcv_if_name "Multicast IPv6 to joined group" \
"$smac > $JOINED_MACV6_MC_ADDR, ethertype IPv6 (0x86dd)" \
-   true
+   true "$test_name"
 
xfail_on_veth $h1 \
check_rcv $rcv_if_name "Multicast IPv6 to unknown group" \
"$smac > $UNKNOWN_MACV6_MC_ADDR1, ethertype IPv6 
(0x86dd)" \
-   false
+   false "$test_name"
 
check_rcv $rcv_if_name "Multicast IPv6 to unknown group, promisc" \
"$smac > $UNKNOWN_MACV6_MC_ADDR2, ethertype IPv6 (0x86dd)" \
-   true
+   true "$test_name"
 
check_rcv $rcv_if_name "Multicast IPv6 to unknown group, allmulti" \
"$smac > $UNKNOWN_MACV6_MC_ADDR3, ethertype IPv6 (0x86dd)" \
-   true
+   true "$test_name"
 
tcpdump_cleanup $rcv_if_name
 }
-- 
2.34.1

[PATCH net 04/14] selftests: net: local_termination: add one more test for VLAN-aware bridges

2024-08-14 Thread Vladimir Oltean

The current bridge() test is for packet reception on a VLAN-unaware
bridge. Some things are different enough with VLAN-aware bridges that
it's worth renaming this test into vlan_unaware_bridge(), and add a new
vlan_aware_bridge() test.

The two will share the same implementation: bridge() becomes a common
function, which receives $vlan_filtering as an argument. Rename it to
test_bridge() at the same time, because just bridge() pollutes the
global namespace and we cannot invoke the binary with the same name from
the iproute2 package currently.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/local_termination.sh   | 22 +++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index af284edaf401..5aa364b40e33 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-ALL_TESTS="standalone bridge"
+ALL_TESTS="standalone vlan_unaware_bridge vlan_aware_bridge"
 NUM_NETIFS=2
 PING_COUNT=1
 REQUIRE_MTOOLS=yes
@@ -233,7 +233,9 @@ h2_destroy()
 
 bridge_create()
 {
-   ip link add br0 type bridge
+   local vlan_filtering=$1
+
+   ip link add br0 type bridge vlan_filtering $vlan_filtering
ip link set br0 address $BRIDGE_ADDR
ip link set br0 up
 
@@ -277,10 +279,12 @@ standalone()
h1_destroy
 }
 
-bridge()
+test_bridge()
 {
+   local vlan_filtering=$1
+
h1_create
-   bridge_create
+   bridge_create $vlan_filtering
macvlan_create br0
 
run_test $h1 br0
@@ -290,6 +294,16 @@ bridge()
h1_destroy
 }
 
+vlan_unaware_bridge()
+{
+   test_bridge 0
+}
+
+vlan_aware_bridge()
+{
+   test_bridge 1
+}
+
 cleanup()
 {
pre_cleanup
-- 
2.34.1

[PATCH net 05/14] selftests: net: local_termination: introduce new tests which capture VLAN behavior

2024-08-14 Thread Vladimir Oltean

Add more coverage to the local termination selftest as follows:
- 8021q upper of $h2
- 8021q upper of $h2, where $h2 is a port of a VLAN-unaware bridge
- 8021q upper of $h2, where $h2 is a port of a VLAN-aware bridge
- 8021q upper of VLAN-unaware br0, which is the upper of $h2
- 8021q upper of VLAN-aware br0, which is the upper of $h2

Especially the cases with traffic sent through the VLAN upper of a
VLAN-aware bridge port will be immediately relevant when we will start
transmitting PTP packets as an additional kind of traffic.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/local_termination.sh   | 117 --
 1 file changed, 110 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index 5aa364b40e33..e22c6a693bef 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -1,7 +1,9 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-ALL_TESTS="standalone vlan_unaware_bridge vlan_aware_bridge"
+ALL_TESTS="standalone vlan_unaware_bridge vlan_aware_bridge test_vlan \
+  vlan_over_vlan_unaware_bridged_port 
vlan_over_vlan_aware_bridged_port \
+  vlan_over_vlan_unaware_bridge vlan_over_vlan_aware_bridge"
 NUM_NETIFS=2
 PING_COUNT=1
 REQUIRE_MTOOLS=yes
@@ -231,6 +233,30 @@ h2_destroy()
simple_if_fini $h2 $H2_IPV4/24 $H2_IPV6/64
 }
 
+h1_vlan_create()
+{
+   simple_if_init $h1
+   vlan_create $h1 100 v$h1 $H1_IPV4/24 $H1_IPV6/64
+}
+
+h1_vlan_destroy()
+{
+   vlan_destroy $h1 100
+   simple_if_fini $h1
+}
+
+h2_vlan_create()
+{
+   simple_if_init $h2
+   vlan_create $h2 100 v$h2 $H2_IPV4/24 $H2_IPV6/64
+}
+
+h2_vlan_destroy()
+{
+   vlan_destroy $h2 100
+   simple_if_fini $h2
+}
+
 bridge_create()
 {
local vlan_filtering=$1
@@ -241,14 +267,10 @@ bridge_create()
 
ip link set $h2 master br0
ip link set $h2 up
-
-   simple_if_init br0 $H2_IPV4/24 $H2_IPV6/64
 }
 
 bridge_destroy()
 {
-   simple_if_fini br0 $H2_IPV4/24 $H2_IPV6/64
-
ip link del br0
 }
 
@@ -272,7 +294,7 @@ standalone()
h2_create
macvlan_create $h2
 
-   run_test $h1 $h2
+   run_test $h1 $h2 "$h2"
 
macvlan_destroy
h2_destroy
@@ -285,11 +307,13 @@ test_bridge()
 
h1_create
bridge_create $vlan_filtering
+   simple_if_init br0 $H2_IPV4/24 $H2_IPV6/64
macvlan_create br0
 
-   run_test $h1 br0
+   run_test $h1 br0 "vlan_filtering=$vlan_filtering bridge"
 
macvlan_destroy
+   simple_if_fini br0 $H2_IPV4/24 $H2_IPV6/64
bridge_destroy
h1_destroy
 }
@@ -304,6 +328,85 @@ vlan_aware_bridge()
test_bridge 1
 }
 
+test_vlan()
+{
+   h1_vlan_create
+   h2_vlan_create
+   macvlan_create $h2.100
+
+   run_test $h1.100 $h2.100 "VLAN upper"
+
+   macvlan_destroy
+   h2_vlan_destroy
+   h1_vlan_destroy
+}
+
+vlan_over_bridged_port()
+{
+   local vlan_filtering=$1
+
+   h1_vlan_create
+   h2_vlan_create
+   bridge_create $vlan_filtering
+   macvlan_create $h2.100
+
+   run_test $h1.100 $h2.100 "VLAN over vlan_filtering=$vlan_filtering 
bridged port"
+
+   macvlan_destroy
+   bridge_destroy
+   h2_vlan_destroy
+   h1_vlan_destroy
+}
+
+vlan_over_vlan_unaware_bridged_port()
+{
+   vlan_over_bridged_port 0
+}
+
+vlan_over_vlan_aware_bridged_port()
+{
+   vlan_over_bridged_port 1
+}
+
+vlan_over_bridge()
+{
+   local vlan_filtering=$1
+
+   h1_vlan_create
+   bridge_create $vlan_filtering
+   simple_if_init br0
+   vlan_create br0 100 vbr0 $H2_IPV4/24 $H2_IPV6/64
+   macvlan_create br0.100
+
+   if [ $vlan_filtering = 1 ]; then
+   bridge vlan add dev $h2 vid 100 master
+   bridge vlan add dev br0 vid 100 self
+   fi
+
+   run_test $h1.100 br0.100 "VLAN over vlan_filtering=$vlan_filtering 
bridge"
+
+   if [ $vlan_filtering = 1 ]; then
+   bridge vlan del dev br0 vid 100 self
+   bridge vlan del dev $h2 vid 100 master
+   fi
+
+   macvlan_destroy
+   vlan_destroy br0 100
+   simple_if_fini br0
+   bridge_destroy
+   h1_vlan_destroy
+}
+
+vlan_over_vlan_unaware_bridge()
+{
+   vlan_over_bridge 0
+}
+
+vlan_over_vlan_aware_bridge()
+{
+   vlan_over_bridge 1
+}
+
 cleanup()
 {
pre_cleanup
-- 
2.34.1

[PATCH net 06/14] selftests: net: local_termination: don't use xfail_on_veth()

2024-08-14 Thread Vladimir Oltean

xfail_on_veth() for this test is an incorrect approximation which gives
false positives and false negatives.

When local_termination fails with "reception succeeded, but should have failed",
it is because the DUT ($h2) accepts packets even when not configured as
promiscuous. This is not something specific to veth; even the bridge
behaves that way, but this is not captured by the xfail_on_veth test.

The IFF_UNICAST_FLT flag is not explicitly exported to user space, but
it can somewhat be determined from the interface's behavior. We have to
create a macvlan upper with a different MAC address. This forces a
dev_uc_add() call in the kernel. When the unicast filtering list is
not empty, but the device doesn't support IFF_UNICAST_FLT,
__dev_set_rx_mode() force-enables promiscuity on the interface, to
ensure correct behavior (that the requested address is received).

We can monitor the change in the promiscuity flag and infer from it
whether the device supports unicast filtering.

There is no equivalent thing for allmulti, unfortunately. We never know
what's hiding behind a device which has allmulti=off. Whether it will
actually perform RX multicast filtering of unknown traffic is a strong
"maybe". The bridge driver, for example, completely ignores the flag.
We'll have to keep the xfail behavior, but instead of XFAIL on just
veth, always XFAIL.

Signed-off-by: Vladimir Oltean 
---
 tools/testing/selftests/net/forwarding/lib.sh | 57 ++
 .../net/forwarding/local_termination.sh   | 58 ++-
 2 files changed, 99 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index cb0fcd6f0293..c992e385159c 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -500,6 +500,11 @@ check_err_fail()
fi
 }
 
+xfail()
+{
+   FAIL_TO_XFAIL=yes "$@"
+}
+
 xfail_on_slow()
 {
if [[ $KSFT_MACHINE_SLOW = yes ]]; then
@@ -1120,6 +1125,39 @@ mac_get()
ip -j link show dev $if_name | jq -r '.[]["address"]'
 }
 
+ether_addr_to_u64()
+{
+   local addr="$1"
+   local order="$((1 << 40))"
+   local val=0
+   local byte
+
+   addr="${addr//:/ }"
+
+   for byte in $addr; do
+   byte="0x$byte"
+   val=$((val + order * byte))
+   order=$((order >> 8))
+   done
+
+   printf "0x%x" $val
+}
+
+u64_to_ether_addr()
+{
+   local val=$1
+   local byte
+   local i
+
+   for ((i = 40; i >= 0; i -= 8)); do
+   byte=$(((val & (0xff << i)) >> i))
+   printf "%02x" $byte
+   if [ $i -ne 0 ]; then
+   printf ":"
+   fi
+   done
+}
+
 ipv6_lladdr_get()
 {
local if_name=$1
@@ -2236,3 +2274,22 @@ absval()
 
echo $((v > 0 ? v : -v))
 }
+
+has_unicast_flt()
+{
+   local dev=$1; shift
+   local mac_addr=$(mac_get $dev)
+   local tmp=$(ether_addr_to_u64 $mac_addr)
+   local promisc
+
+   ip link set $dev up
+   ip link add link $dev name macvlan-tmp type macvlan mode private
+   ip link set macvlan-tmp address $(u64_to_ether_addr $((tmp + 1)))
+   ip link set macvlan-tmp up
+
+   promisc=$(ip -j -d link show dev $dev | jq -r '.[].promiscuity')
+
+   ip link del macvlan-tmp
+
+   [[ $promisc == 1 ]] && echo "no" || echo "yes"
+}
diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index e22c6a693bef..80ea4c10d764 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -109,9 +109,11 @@ run_test()
 {
local send_if_name=$1; shift
local rcv_if_name=$1; shift
+   local no_unicast_flt=$1; shift
local test_name="$1"; shift
local smac=$(mac_get $send_if_name)
local rcv_dmac=$(mac_get $rcv_if_name)
+   local should_receive
 
tcpdump_start $rcv_if_name
 
@@ -160,26 +162,26 @@ run_test()
"$smac > $MACVLAN_ADDR, ethertype IPv4 (0x0800)" \
true "$test_name"
 
-   xfail_on_veth $h1 \
-   check_rcv $rcv_if_name "Unicast IPv4 to unknown MAC address" \
-   "$smac > $UNKNOWN_UC_ADDR1, ethertype IPv4 (0x0800)" \
-   false "$test_name"
+   [ $no_unicast_flt = true ] && should_receive=true || 
should_receive=false
+   check_rcv $rcv_if_name "Unicast IPv4 to unknown MAC address" \
+   "$smac > $UNKNOWN_UC_ADDR1, ethertype IPv4 (0x0800)" \
+   $should_receive "$test_name"
 
check_rcv $rcv_if_name "Unicast IPv4 to unknown MAC address, promisc" \
"$smac > $UNKNOWN_UC_ADDR2, ethertype IPv4 (0x0800)" \
true "$test_name"
 
-   xfail_on_veth $h1 \
-   check_rcv $rcv_if_name \
-

[PATCH net 07/14] selftests: net: local_termination: add PTP frames to the mix

2024-08-14 Thread Vladimir Oltean

A breakage in the felix DSA driver shows we do not have enough test
coverage. More generally, it is sufficiently special that it is likely
drivers will treat it differently.

This is not meant to be a full PTP test, it just makes sure that PTP
packets sent to the different addresses corresponding to their profiles
are received correctly. The local_termination selftest seemed like the
most appropriate place for this addition.

PTP RX/TX in some cases makes no sense (over a bridge) and this is why
$skip_ptp exists. And in others - PTP over a bridge port - the IP stack
needs convincing through the available bridge netfilter hooks to leave
the PTP packets alone and not stolen by the bridge rx_handler. It is
safe to assume that users have that figured out already. This is a
driver level test, and by using tcpdump, all that extra setup is out of
scope here.

send_non_ip() was an unfinished idea; written but never used.
Replace it with a more generic send_raw(), and send 3 PTP packet types
times 3 transports.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/local_termination.sh   | 161 --
 1 file changed, 148 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/local_termination.sh 
b/tools/testing/selftests/net/forwarding/local_termination.sh
index 80ea4c10d764..648868f74604 100755
--- a/tools/testing/selftests/net/forwarding/local_termination.sh
+++ b/tools/testing/selftests/net/forwarding/local_termination.sh
@@ -39,9 +39,68 @@ UNKNOWN_MACV6_MC_ADDR1="33:33:01:02:03:05"
 UNKNOWN_MACV6_MC_ADDR2="33:33:01:02:03:06"
 UNKNOWN_MACV6_MC_ADDR3="33:33:01:02:03:07"
 
-NON_IP_MC="01:02:03:04:05:06"
-NON_IP_PKT="00:04 48:45:4c:4f"
-BC="ff:ff:ff:ff:ff:ff"
+PTP_1588_L2_SYNC=" \
+01:1b:19:00:00:00 00:00:de:ad:be:ef 88:f7 00 02 \
+00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 3e 37 63 ff fe cf 17 0e 00 01 00 00 00 00 \
+00 00 00 00 00 00 00 00 00 00"
+PTP_1588_L2_FOLLOW_UP=" \
+01:1b:19:00:00:00 00:00:de:ad:be:ef 88:f7 08 02 \
+00 2c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 3e 37 63 ff fe cf 17 0e 00 01 00 00 02 00 \
+00 00 66 83 c5 f1 17 97 ed f0"
+PTP_1588_L2_PDELAY_REQ=" \
+01:80:c2:00:00:0e 00:00:de:ad:be:ef 88:f7 02 02 \
+00 36 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 3e 37 63 ff fe cf 17 0e 00 01 00 06 05 7f \
+00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 00 00"
+PTP_1588_IPV4_SYNC=" \
+01:00:5e:00:01:81 00:00:de:ad:be:ef 08:00 45 00 \
+00 48 0a 9a 40 00 01 11 cb 88 c0 00 02 01 e0 00 \
+01 81 01 3f 01 3f 00 34 a3 c8 00 02 00 2c 00 00 \
+02 00 00 00 00 00 00 00 00 00 00 00 00 00 3e 37 \
+63 ff fe cf 17 0e 00 01 00 00 00 00 00 00 00 00 \
+00 00 00 00 00 00"
+PTP_1588_IPV4_FOLLOW_UP="
+01:00:5e:00:01:81 00:00:de:ad:be:ef 08:00 45 00 \
+00 48 0a 9b 40 00 01 11 cb 87 c0 00 02 01 e0 00 \
+01 81 01 40 01 40 00 34 a3 c8 08 02 00 2c 00 00 \
+00 00 00 00 00 00 00 00 00 00 00 00 00 00 3e 37 \
+63 ff fe cf 17 0e 00 01 00 00 02 00 00 00 66 83 \
+c6 0f 1d 9a 61 87"
+PTP_1588_IPV4_PDELAY_REQ=" \
+01:00:5e:00:00:6b 00:00:de:ad:be:ef 08:00 45 00 \
+00 52 35 a9 40 00 01 11 a1 85 c0 00 02 01 e0 00 \
+00 6b 01 3f 01 3f 00 3e a2 bc 02 02 00 36 00 00 \
+00 00 00 00 00 00 00 00 00 00 00 00 00 00 3e 37 \
+63 ff fe cf 17 0e 00 01 00 01 05 7f 00 00 00 00 \
+00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
+PTP_1588_IPV6_SYNC=" \
+33:33:00:00:01:81 00:00:de:ad:be:ef 86:dd 60 06 \
+7c 2f 00 36 11 01 20 01 0d b8 00 01 00 00 00 00 \
+00 00 00 00 00 01 ff 0e 00 00 00 00 00 00 00 00 \
+00 00 00 00 01 81 01 3f 01 3f 00 36 2e 92 00 02 \
+00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 3e 37 63 ff fe cf 17 0e 00 01 00 00 00 00 \
+00 00 00 00 00 00 00 00 00 00 00 00"
+PTP_1588_IPV6_FOLLOW_UP=" \
+33:33:00:00:01:81 00:00:de:ad:be:ef 86:dd 60 0a \
+00 bc 00 36 11 01 20 01 0d b8 00 01 00 00 00 00 \
+00 00 00 00 00 01 ff 0e 00 00 00 00 00 00 00 00 \
+00 00 00 00 01 81 01 40 01 40 00 36 2e 92 08 02 \
+00 2c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 3e 37 63 ff fe cf 17 0e 00 01 00 00 02 00 \
+00 00 66 83 c6 2a 32 09 bd 74 00 00"
+PTP_1588_IPV6_PDELAY_REQ=" \
+33:33:00:00:00:6b 00:00:de:ad:be:ef 86:dd 60 0c \
+5c fd 00 40 11 01 fe 80 00 00 00 00 00 00 3c 37 \
+63 ff fe cf 17 0e ff 02 00 00 00 00 00 00 00 00 \
+00 00 00 00 00 6b 01 3f 01 3f 00 40 b4 54 02 02 \
+00 36 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 3e 37 63 ff fe cf 17 0e 00 01 00 01 05 7f \
+00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 \
+00 00 00 00 00 00"
 
 # Disable promisc to ensure we don't receive unknown MAC DA packets
 export TCPDUMP_EXTRA_FLAGS="-pl"
@@ -49,13 +108,15 @@ export TCPDUMP_EXTRA_FLAGS="-pl"
 h1=${NETIFS[p1]}
 h2=${NETIFS[p2]}
 
-send_non_ip()
+send_raw()
 {
-   local if_name=$1
-   local smac=$2
-   local dmac=$3
+   local if_name=$1; shift
+   local pkt="$1"; shift
+   local smac=$(mac_get $if_name)
 
-   $MZ -q $if_name "$dmac $smac $NON_IP_PKT"
+   pkt="${pkt/00:00:de:ad:be:ef/$smac}"
+
+   $MZ -q $if_name "$p

[PATCH net 08/14] selftests: net: bridge_vlan_aware: test that other TPIDs are seen as untagged

2024-08-14 Thread Vladimir Oltean

The bridge VLAN implementation w.r.t. VLAN protocol is described in
merge commit 1a0b20b25732 ("Merge branch 'bridge-next'"). We are only
sensitive to those VLAN tags whose TPID is equal to the bridge's
vlan_protocol. Thus, an 802.1ad VLAN should be treated as 802.1Q-untagged.

Add 3 tests which validate that:
- 802.1ad-tagged traffic is learned into the PVID of an 802.1Q-aware
  bridge
- Double-tagged traffic is forwarded when just the PVID of the port is
  present in the VLAN group of the ports
- Double-tagged traffic is not forwarded when the PVID of the port is
  absent from the VLAN group of the ports

The test passes with both veth and ocelot.

Signed-off-by: Vladimir Oltean 
---
 .../net/forwarding/bridge_vlan_aware.sh   | 54 ++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh 
b/tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh
index 64bd00fe9a4f..90f8a244ea90 100755
--- a/tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh
+++ b/tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 
-ALL_TESTS="ping_ipv4 ping_ipv6 learning flooding vlan_deletion extern_learn"
+ALL_TESTS="ping_ipv4 ping_ipv6 learning flooding vlan_deletion extern_learn 
other_tpid"
 NUM_NETIFS=4
 CHECK_TC="yes"
 source lib.sh
@@ -142,6 +142,58 @@ extern_learn()
bridge fdb del de:ad:be:ef:13:37 dev $swp1 master vlan 1 &> /dev/null
 }
 
+other_tpid()
+{
+   local mac=de:ad:be:ef:13:37
+
+   # Test that packets with TPID 802.1ad VID 3 + TPID 802.1Q VID 5 are
+   # classified as untagged by a bridge with vlan_protocol 802.1Q, and
+   # are processed in the PVID of the ingress port (here 1). Not VID 3,
+   # and not VID 5.
+   RET=0
+
+   tc qdisc add dev $h2 clsact
+   tc filter add dev $h2 ingress protocol all pref 1 handle 101 \
+   flower dst_mac $mac action drop
+   ip link set $h2 promisc on
+   ethtool -K $h2 rx-vlan-filter off rx-vlan-stag-filter off
+
+   $MZ -q $h1 -c 1 -b $mac -a own "88:a8 00:03 81:00 00:05 08:00 
aa-aa-aa-aa-aa-aa-aa-aa-aa"
+   sleep 1
+
+   # Match on 'self' addresses as well, for those drivers which
+   # do not push their learned addresses to the bridge software
+   # database
+   bridge -j fdb show $swp1 | \
+   jq -e ".[] | select(.mac == \"$(mac_get $h1)\") | select(.vlan 
== 1)" &> /dev/null
+   check_err $? "FDB entry was not learned when it should"
+
+   log_test "FDB entry in PVID for VLAN-tagged with other TPID"
+
+   RET=0
+   tc -j -s filter show dev $h2 ingress \
+   | jq -e ".[] | select(.options.handle == 101) \
+   | select(.options.actions[0].stats.packets == 1)" &> /dev/null
+   check_err $? "Packet was not forwarded when it should"
+   log_test "Reception of VLAN with other TPID as untagged"
+
+   bridge vlan del dev $swp1 vid 1
+
+   $MZ -q $h1 -c 1 -b $mac -a own "88:a8 00:03 81:00 00:05 08:00 
aa-aa-aa-aa-aa-aa-aa-aa-aa"
+   sleep 1
+
+   RET=0
+   tc -j -s filter show dev $h2 ingress \
+   | jq -e ".[] | select(.options.handle == 101) \
+   | select(.options.actions[0].stats.packets == 1)" &> /dev/null
+   check_err $? "Packet was forwarded when should not"
+   log_test "Reception of VLAN with other TPID as untagged (no PVID)"
+
+   bridge vlan add dev $swp1 vid 1 pvid untagged
+   ip link set $h2 promisc off
+   tc qdisc del dev $h2 clsact
+}
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.34.1

[PATCH net 09/14] net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection

2024-08-14 Thread Vladimir Oltean

Problem description
---

On an NXP LS1028A (felix DSA driver) with the following configuration:

- ocelot-8021q tagging protocol
- VLAN-aware bridge (with STP) spanning at least swp0 and swp1
- 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
- ptp4l on swp0.700 and swp1.700

we see that the ptp4l instances do not see each other's traffic,
and they all go to the grand master state due to the
ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.

Jumping to the conclusion for the impatient
---

There is a zero-day bug in the ocelot switchdev driver in the way it
handles VLAN-tagged packet injection. The correct logic already exists in
the source code, in function ocelot_xmit_get_vlan_info() added by commit
5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
But it is used only for normal NPI-based injection with the DSA "ocelot"
tagging protocol. The other injection code paths (register-based and
FDMA-based) roll their own wrong logic. This affects and was noticed on
the DSA "ocelot-8021q" protocol because it uses register-based injection.

By moving ocelot_xmit_get_vlan_info() to a place that's common for both
the DSA tagger and the ocelot switch library, it can also be called from
ocelot_port_inject_frame() in ocelot.c.

We need to touch the lines with ocelot_ifh_port_set()'s prototype
anyway, so let's rename it to something clearer regarding what it does,
and add a kernel-doc. ocelot_ifh_set_basic() should do.

Investigation notes
---

Debugging reveals that PTP event (aka those carrying timestamps, like
Sync) frames injected into swp0.700 (but also swp1.700) hit the wire
with two VLAN tags:

: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
  ~~~
0010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
  ~~~
0020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
0030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
0040: 00 00

The second (unexpected) VLAN tag makes felix_check_xtr_pkt() ->
ptp_classify_raw() fail to see these as PTP packets at the link
partner's receiving end, and return PTP_CLASS_NONE (because the BPF
classifier is not written to expect 2 VLAN tags).

The reason why packets have 2 VLAN tags is because the transmission
code treats VLAN incorrectly.

Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX
feature. Therefore, at xmit time, all VLANs should be in the skb head,
and none should be in the hwaccel area. This is done by:

static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
  netdev_features_t features)
{
if (skb_vlan_tag_present(skb) &&
!vlan_hw_offload_capable(features, skb->vlan_proto))
skb = __vlan_hwaccel_push_inside(skb);
return skb;
}

But ocelot_port_inject_frame() handles things incorrectly:

ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));

void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op)
{
(...)
if (vlan_tag)
ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
(...)
}

The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head
is by calling:

static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
{
skb->vlan_present = 0;
}

which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get().
This means that ocelot, when it calls skb_vlan_tag_get(), sees
(and uses) a residual skb->vlan_tci, while the same VLAN tag is
_already_ in the skb head.

The trivial fix for double VLAN headers is to replace the content of
ocelot_ifh_port_set() with:

if (skb_vlan_tag_present(skb))
ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));

but this would not be correct either, because, as mentioned,
vlan_hw_offload_capable() is false for us, so we'd be inserting dead
code and we'd always transmit packets with VID=0 in the injection frame
header.

I can't actually test the ocelot switchdev driver and rely exclusively
on code inspection, but I don't think traffic from 8021q uppers has ever
been injected properly, and not double-tagged. Thus I'm blaming the
introduction of VLAN fields in the injection header - early driver code.

As hinted at in the early conclusion, what we _want_ to happen for
VLAN transmission was already described once in commit 5ca721c54d86
("net: dsa: tag_ocelot: set the classified VLAN during xmit").

ocelot_xmit_get_vlan_info() intends to ensure that if the port through
which we're transmitting is under a VLAN-aware bridge, the outer VLAN
tag from the skb head is stripped from there and inserted into the
injection frame header (so that the packet is processed in hardware
through that actual VLAN). And in all other cases, the packet is sent
with VID=0 in the injection frame header, since the port is VLAN-unaware
and has logic to

[PATCH net 10/14] net: mscc: ocelot: fix QoS class for injected packets with "ocelot-8021q"

2024-08-14 Thread Vladimir Oltean

There are 2 distinct code paths (listed below) in the source code which
set up an injection header for Ocelot(-like) switches. Code path (2)
lacks the QoS class and source port being set correctly. Especially the
improper QoS classification is a problem for the "ocelot-8021q"
alternative DSA tagging protocol, because we support tc-taprio and each
packet needs to be scheduled precisely through its time slot. This
includes PTP, which is normally assigned to a traffic class other than
0, but would be sent through TC 0 nonetheless.

The code paths are:

(1) ocelot_xmit_common() from net/dsa/tag_ocelot.c - called only by the
standard "ocelot" DSA tagging protocol which uses NPI-based
injection - sets up bit fields in the tag manually to account for
a small difference (destination port offset) between Ocelot and
Seville. Namely, ocelot_ifh_set_dest() is omitted out of
ocelot_xmit_common(), because there's also seville_ifh_set_dest().

(2) ocelot_ifh_set_basic(), called by:
- ocelot_fdma_prepare_skb() for FDMA transmission of the ocelot
  switchdev driver
- ocelot_port_xmit() -> ocelot_port_inject_frame() for
  register-based transmission of the ocelot switchdev driver
- felix_port_deferred_xmit() -> ocelot_port_inject_frame() for the
  DSA tagger ocelot-8021q when it must transmit PTP frames (also
  through register-based injection).
sets the bit fields according to its own logic.

The problem is that (2) doesn't call ocelot_ifh_set_qos_class().
Copying that logic from ocelot_xmit_common() fixes that.

Unfortunately, although desirable, it is not easily possible to
de-duplicate code paths (1) and (2), and make net/dsa/tag_ocelot.c
directly call ocelot_ifh_set_basic()), because of the ocelot/seville
difference. This is the "minimal" fix with some logic duplicated (but
at least more consolidated).

Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP 
timestamping")
Signed-off-by: Vladimir Oltean 
---
 drivers/net/ethernet/mscc/ocelot.c  | 10 +-
 drivers/net/ethernet/mscc/ocelot_fdma.c |  1 -
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.c 
b/drivers/net/ethernet/mscc/ocelot.c
index 69a4e5a90475..9301716e21d5 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1208,13 +1208,21 @@ void ocelot_ifh_set_basic(void *ifh, struct ocelot 
*ocelot, int port,
  u32 rew_op, struct sk_buff *skb)
 {
struct ocelot_port *ocelot_port = ocelot->ports[port];
+   struct net_device *dev = skb->dev;
u64 vlan_tci, tag_type;
+   int qos_class;
 
ocelot_xmit_get_vlan_info(skb, ocelot_port->bridge, &vlan_tci,
  &tag_type);
 
+   qos_class = netdev_get_num_tc(dev) ?
+   netdev_get_prio_tc_map(dev, skb->priority) : skb->priority;
+
+   memset(ifh, 0, OCELOT_TAG_LEN);
ocelot_ifh_set_bypass(ifh, 1);
+   ocelot_ifh_set_src(ifh, BIT_ULL(ocelot->num_phys_ports));
ocelot_ifh_set_dest(ifh, BIT_ULL(port));
+   ocelot_ifh_set_qos_class(ifh, qos_class);
ocelot_ifh_set_tag_type(ifh, tag_type);
ocelot_ifh_set_vlan_tci(ifh, vlan_tci);
if (rew_op)
@@ -1225,7 +1233,7 @@ EXPORT_SYMBOL(ocelot_ifh_set_basic);
 void ocelot_port_inject_frame(struct ocelot *ocelot, int port, int grp,
  u32 rew_op, struct sk_buff *skb)
 {
-   u32 ifh[OCELOT_TAG_LEN / 4] = {0};
+   u32 ifh[OCELOT_TAG_LEN / 4];
unsigned int i, count, last;
 
ocelot_write_rix(ocelot, QS_INJ_CTRL_GAP_SIZE(1) |
diff --git a/drivers/net/ethernet/mscc/ocelot_fdma.c 
b/drivers/net/ethernet/mscc/ocelot_fdma.c
index 87b59cc5e441..00326ae8c708 100644
--- a/drivers/net/ethernet/mscc/ocelot_fdma.c
+++ b/drivers/net/ethernet/mscc/ocelot_fdma.c
@@ -665,7 +665,6 @@ static int ocelot_fdma_prepare_skb(struct ocelot *ocelot, 
int port, u32 rew_op,
 
ifh = skb_push(skb, OCELOT_TAG_LEN);
skb_put(skb, ETH_FCS_LEN);
-   memset(ifh, 0, OCELOT_TAG_LEN);
ocelot_ifh_set_basic(ifh, ocelot, port, rew_op, skb);
 
return 0;
-- 
2.34.1

[PATCH net 11/14] net: mscc: ocelot: serialize access to the injection/extraction groups

2024-08-14 Thread Vladimir Oltean

As explained by Horatiu Vultur in commit 603ead96582d ("net: sparx5: Add
spinlock for frame transmission from CPU") which is for a similar
hardware design, multiple CPUs can simultaneously perform injection
or extraction. There are only 2 register groups for injection and 2
for extraction, and the driver only uses one of each. So we'd better
serialize access using spin locks, otherwise frame corruption is
possible.

Note that unlike in sparx5, FDMA in ocelot does not have this issue
because struct ocelot_fdma_tx_ring already contains an xmit_lock.

I guess this is mostly a problem for NXP LS1028A, as that is dual core.
I don't think VSC7514 is. So I'm blaming the commit where LS1028A (aka
the felix DSA driver) started using register-based packet injection and
extraction.

Fixes: 0a6f17c6ae21 ("net: dsa: tag_ocelot_8021q: add support for PTP 
timestamping")
Signed-off-by: Vladimir Oltean 
---
 drivers/net/dsa/ocelot/felix.c | 11 +
 drivers/net/ethernet/mscc/ocelot.c | 52 ++
 drivers/net/ethernet/mscc/ocelot_vsc7514.c |  4 ++
 include/soc/mscc/ocelot.h  |  9 
 4 files changed, 76 insertions(+)

diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
index e554699f06d4..8d31ff18c5c7 100644
--- a/drivers/net/dsa/ocelot/felix.c
+++ b/drivers/net/dsa/ocelot/felix.c
@@ -528,7 +528,9 @@ static int felix_tag_8021q_setup(struct dsa_switch *ds)
 * so we need to be careful that there are no extra frames to be
 * dequeued over MMIO, since we would never know to discard them.
 */
+   ocelot_lock_xtr_grp_bh(ocelot, 0);
ocelot_drain_cpu_queue(ocelot, 0);
+   ocelot_unlock_xtr_grp_bh(ocelot, 0);
 
return 0;
 }
@@ -1518,6 +1520,8 @@ static void felix_port_deferred_xmit(struct kthread_work 
*work)
int port = xmit_work->dp->index;
int retries = 10;
 
+   ocelot_lock_inj_grp(ocelot, 0);
+
do {
if (ocelot_can_inject(ocelot, 0))
break;
@@ -1526,6 +1530,7 @@ static void felix_port_deferred_xmit(struct kthread_work 
*work)
} while (--retries);
 
if (!retries) {
+   ocelot_unlock_inj_grp(ocelot, 0);
dev_err(ocelot->dev, "port %d failed to inject skb\n",
port);
ocelot_port_purge_txtstamp_skb(ocelot, port, skb);
@@ -1535,6 +1540,8 @@ static void felix_port_deferred_xmit(struct kthread_work 
*work)
 
ocelot_port_inject_frame(ocelot, port, 0, rew_op, skb);
 
+   ocelot_unlock_inj_grp(ocelot, 0);
+
consume_skb(skb);
kfree(xmit_work);
 }
@@ -1694,6 +1701,8 @@ static bool felix_check_xtr_pkt(struct ocelot *ocelot)
if (!felix->info->quirk_no_xtr_irq)
return false;
 
+   ocelot_lock_xtr_grp(ocelot, grp);
+
while (ocelot_read(ocelot, QS_XTR_DATA_PRESENT) & BIT(grp)) {
struct sk_buff *skb;
unsigned int type;
@@ -1730,6 +1739,8 @@ static bool felix_check_xtr_pkt(struct ocelot *ocelot)
ocelot_drain_cpu_queue(ocelot, 0);
}
 
+   ocelot_unlock_xtr_grp(ocelot, grp);
+
return true;
 }
 
diff --git a/drivers/net/ethernet/mscc/ocelot.c 
b/drivers/net/ethernet/mscc/ocelot.c
index 9301716e21d5..f4e027a6fe95 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -1099,6 +1099,48 @@ void ocelot_ptp_rx_timestamp(struct ocelot *ocelot, 
struct sk_buff *skb,
 }
 EXPORT_SYMBOL(ocelot_ptp_rx_timestamp);
 
+void ocelot_lock_inj_grp(struct ocelot *ocelot, int grp)
+__acquires(&ocelot->inj_lock)
+{
+   spin_lock(&ocelot->inj_lock);
+}
+EXPORT_SYMBOL_GPL(ocelot_lock_inj_grp);
+
+void ocelot_unlock_inj_grp(struct ocelot *ocelot, int grp)
+  __releases(&ocelot->inj_lock)
+{
+   spin_unlock(&ocelot->inj_lock);
+}
+EXPORT_SYMBOL_GPL(ocelot_unlock_inj_grp);
+
+void ocelot_lock_xtr_grp(struct ocelot *ocelot, int grp)
+__acquires(&ocelot->inj_lock)
+{
+   spin_lock(&ocelot->inj_lock);
+}
+EXPORT_SYMBOL_GPL(ocelot_lock_xtr_grp);
+
+void ocelot_unlock_xtr_grp(struct ocelot *ocelot, int grp)
+  __releases(&ocelot->inj_lock)
+{
+   spin_unlock(&ocelot->inj_lock);
+}
+EXPORT_SYMBOL_GPL(ocelot_unlock_xtr_grp);
+
+void ocelot_lock_xtr_grp_bh(struct ocelot *ocelot, int grp)
+   __acquires(&ocelot->xtr_lock)
+{
+   spin_lock_bh(&ocelot->xtr_lock);
+}
+EXPORT_SYMBOL_GPL(ocelot_lock_xtr_grp_bh);
+
+void ocelot_unlock_xtr_grp_bh(struct ocelot *ocelot, int grp)
+ __releases(&ocelot->xtr_lock)
+{
+   spin_unlock_bh(&ocelot->xtr_lock);
+}
+EXPORT_SYMBOL_GPL(ocelot_unlock_xtr_grp_bh);
+
 int ocelot_xtr_poll_frame(struct ocelot *ocelot, int grp, struct sk_buff 
**nskb)
 {
u64 timestamp, src_port, len;
@@ -1109,6 +1151,8 @@ int ocelot_xtr_poll_frame(struct ocelot *o

[PATCH net 12/14] net: dsa: provide a software untagging function on RX for VLAN-aware bridges

2024-08-14 Thread Vladimir Oltean

Through code analysis, I realized that the ds->untag_bridge_pvid logic
is contradictory - see the newly added FIXME above the kernel-doc for
dsa_software_untag_vlan_unaware_bridge().

Moreover, for the Felix driver, I need something very similar, but which
is actually _not_ contradictory: untag the bridge PVID on RX, but for
VLAN-aware bridges. The existing logic does it for VLAN-unaware bridges.

Since I don't want to change the functionality of drivers which were
supposedly properly tested with the ds->untag_bridge_pvid flag, I have
introduced a new one: ds->untag_vlan_aware_bridge_pvid, and I have
refactored the DSA reception code into a common path for both flags.

TODO: both flags should be unified under a single ds->software_vlan_untag,
which users of both current flags should set. This is not something that
can be carried out right away. It needs very careful examination of all
drivers which make use of this functionality, since some of them
actually get this wrong in the first place.

For example, commit 9130c2d30c17 ("net: dsa: microchip: ksz8795: Use
software untagging on CPU port") uses this in a driver which has
ds->configure_vlan_while_not_filtering = true. The latter mechanism has
been known for many years to be broken by design:
https://lore.kernel.org/netdev/CABumfLzJmXDN_W-8Z=p9kykuvi_hhs7o_pobkekhs2bkaiy...@mail.gmail.com/
and we have the situation of 2 bugs canceling each other. There is no
private VLAN, and the port follows the PVID of the VLAN-unaware bridge.
So, it's kinda ok for that driver to use the ds->untag_bridge_pvid
mechanism, in a broken way.

Signed-off-by: Vladimir Oltean 
---
 include/net/dsa.h |  16 +++---
 net/dsa/tag.c |   5 +-
 net/dsa/tag.h | 135 +++---
 3 files changed, 118 insertions(+), 38 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index b06f97ae3da1..d7a6c2930277 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -403,14 +403,18 @@ struct dsa_switch {
 */
u32 configure_vlan_while_not_filtering:1;
 
-   /* If the switch driver always programs the CPU port as egress tagged
-* despite the VLAN configuration indicating otherwise, then setting
-* @untag_bridge_pvid will force the DSA receive path to pop the
-* bridge's default_pvid VLAN tagged frames to offer a consistent
-* behavior between a vlan_filtering=0 and vlan_filtering=1 bridge
-* device.
+   /* Pop the default_pvid of VLAN-unaware bridge ports from tagged frames.
+* DEPRECATED: Do NOT set this field in new drivers. Instead look at
+* the dsa_software_vlan_untag() comments.
 */
u32 untag_bridge_pvid:1;
+   /* Pop the default_pvid of VLAN-aware bridge ports from tagged frames.
+* Useful if the switch cannot preserve the VLAN tag as seen on the
+* wire for user port ingress, and chooses to send all frames as
+* VLAN-tagged to the CPU, including those which were originally
+* untagged.
+*/
+   u32 untag_vlan_aware_bridge_pvid:1;
 
/* Let DSA manage the FDB entries towards the
 * CPU, based on the software bridge database.
diff --git a/net/dsa/tag.c b/net/dsa/tag.c
index 6e402d49afd3..79ad105902d9 100644
--- a/net/dsa/tag.c
+++ b/net/dsa/tag.c
@@ -105,8 +105,9 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct 
net_device *dev,
 
p = netdev_priv(skb->dev);
 
-   if (unlikely(cpu_dp->ds->untag_bridge_pvid)) {
-   nskb = dsa_untag_bridge_pvid(skb);
+   if (unlikely(cpu_dp->ds->untag_bridge_pvid ||
+cpu_dp->ds->untag_vlan_aware_bridge_pvid)) {
+   nskb = dsa_software_vlan_untag(skb);
if (!nskb) {
kfree_skb(skb);
return 0;
diff --git a/net/dsa/tag.h b/net/dsa/tag.h
index f6b9c73718df..d5707870906b 100644
--- a/net/dsa/tag.h
+++ b/net/dsa/tag.h
@@ -44,46 +44,81 @@ static inline struct net_device 
*dsa_conduit_find_user(struct net_device *dev,
return NULL;
 }
 
-/* If under a bridge with vlan_filtering=0, make sure to send pvid-tagged
- * frames as untagged, since the bridge will not untag them.
+/**
+ * dsa_software_untag_vlan_aware_bridge: Software untagging for VLAN-aware 
bridge
+ * @skb: Pointer to received socket buffer (packet)
+ * @br: Pointer to bridge upper interface of ingress port
+ * @vid: Parsed VID from packet
+ *
+ * The bridge can process tagged packets. Software like STP/PTP may not. The
+ * bridge can also process untagged packets, to the same effect as if they were
+ * tagged with the PVID of the ingress port. So packets tagged with the PVID of
+ * the bridge port must be software-untagged, to support both use cases.
  */
-static inline struct sk_buff *dsa_untag_bridge_pvid(struct sk_buff *skb)
+static inline void dsa_software_untag_vlan_aware_bridge(struct sk_buff *skb,
+

[PATCH net 13/14] net: dsa: felix: fix VLAN tag loss on CPU reception with ocelot-8021q

2024-08-14 Thread Vladimir Oltean

There is a major design bug with ocelot-8021q, which is that it expects
more of the hardware than the hardware can actually do. The short
summary of the issue is that when a port is under a VLAN-aware bridge
and we use this tagging protocol, VLAN upper interfaces of this port do
not see RX traffic.

We use VCAP ES0 (egress rewriter) rules towards the tag_8021q CPU port
to encapsulate packets with an outer tag, later stripped by software,
that depends on the source user port. We do this so that packets can be
identified in ocelot_rcv(). To be precise, we create rules with
push_outer_tag = OCELOT_ES0_TAG and push_inner_tag = 0.

With this configuration, we expect the switch to keep the inner tag
configuration as found in the packet (if it was untagged on user port
ingress, keep it untagged, otherwise preserve the VLAN tag unmodified
as the inner tag towards the tag_8021q CPU port). But this is not what
happens.

Instead, table "Tagging Combinations" from the user manual suggests
that when the ES0 action is "PUSH_OUTER_TAG=1 and PUSH_INNER_TAG=0",
there will be "no inner tag". Experimentation further clarifies what
this means.

It appears that this "inner tag" which is not pushed into the packet on
its egress towards the CPU is none other than the classified VLAN.

When the ingress user port is standalone or under a VLAN-unaware bridge,
the classified VLAN is a discardable quantity: it is a fixed value - the
result of ocelot_vlan_unaware_pvid()'s configuration, and actually
independent of the VID from any 802.1Q header that may be in the frame.
It is actually preferable to discard the "inner tag" in this case.

The problem is when the ingress port is under a VLAN-aware bridge.
Then, the classified VLAN is taken from the frame's 802.1Q header, with
a fallback on the bridge port's PVID. It would be very good to not
discard the "inner tag" here, because if we do, we break communication
with any 8021q VLAN uppers that the port might have. These have a
processing path outside the bridge.

There seems to be nothing else we can do except to change the
configuration for VCAP ES0 rules, to actually push the inner VLAN into
the frame. There are 2 options for that, first is to push a fixed value
specified in the rule, and second is to push a fixed value, plus
(aka arithmetic +) the classified VLAN. We choose the second option,
and we select that fixed value as 0. Thus, what is pushed in the inner
tag is just the classified VLAN.

>From there, we need to perform software untagging, in the receive path,
of stuff that was untagged on the wire.

Fixes: 7c83a7c539ab ("net: dsa: add a second tagger for Ocelot switches based 
on tag_8021q")
Signed-off-by: Vladimir Oltean 
---
 drivers/net/dsa/ocelot/felix.c | 115 +++--
 1 file changed, 109 insertions(+), 6 deletions(-)

diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
index 8d31ff18c5c7..4a705f7333f4 100644
--- a/drivers/net/dsa/ocelot/felix.c
+++ b/drivers/net/dsa/ocelot/felix.c
@@ -61,11 +61,46 @@ static int felix_cpu_port_for_conduit(struct dsa_switch *ds,
return cpu_dp->index;
 }
 
+/**
+ * felix_update_tag_8021q_rx_rule - Update VCAP ES0 tag_8021q rule after
+ * vlan_filtering change
+ * @outer_tagging_rule: Pointer to VCAP filter on which the update is performed
+ * @vlan_filtering: Current bridge VLAN filtering setting
+ *
+ * Source port identification for tag_8021q is done using VCAP ES0 rules on the
+ * CPU port(s). The ES0 tag B (inner tag from the packet) can be configured as
+ * either:
+ * - push_inner_tag=0: the inner tag is never pushed into the frame
+ *(and we lose info about the classified VLAN). This is
+ *good when the classified VLAN is a discardable quantity
+ *for the software RX path: it is either set to
+ *OCELOT_STANDALONE_PVID, or to
+ *ocelot_vlan_unaware_pvid(bridge).
+ * - push_inner_tag=1: the inner tag is always pushed. This is good when the
+ *classified VLAN is not a discardable quantity (the port
+ *is under a VLAN-aware bridge, and software needs to
+ *continue processing the packet in the same VLAN as the
+ *hardware).
+ * The point is that what is good for a VLAN-unaware port is not good for a
+ * VLAN-aware port, and vice versa. Thus, the RX tagging rules must be kept in
+ * sync with the VLAN filtering state of the port.
+ */
+static void
+felix_update_tag_8021q_rx_rule(struct ocelot_vcap_filter *outer_tagging_rule,
+  bool vlan_filtering)
+{
+   if (vlan_filtering)
+   outer_tagging_rule->action.push_inner_tag = OCELOT_ES0_TAG;
+   else
+   outer_tagging_rule->action.push_inner_tag = OCELOT_NO_ES0_TAG;
+}
+
 /* Set up VCAP ES0 rules for pushing a tag_8021q VLAN towards the CPU such that
  * the tagger can

[PATCH net 14/14] net: mscc: ocelot: treat 802.1ad tagged traffic as 802.1Q-untagged

2024-08-14 Thread Vladimir Oltean

I was revisiting the topic of 802.1ad treatment in the Ocelot switch [0]
and realized that not only is its basic VLAN classification pipeline
improper for offloading vlan_protocol 802.1ad bridges, but also improper
for offloading regular 802.1Q bridges already.

Namely, 802.1ad-tagged traffic should be treated as VLAN-untagged by
bridged ports, but this switch treats it as if it was 802.1Q-tagged with
the same VID as in the 802.1ad header. This is markedly different to
what the Linux bridge expects; see the "other_tpid()" function in
tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh.

An idea came to me that the VCAP IS1 TCAM is more powerful than I'm
giving it credit for, and that it actually overwrites the classified VID
before the VLAN Table lookup takes place. In other words, it can be
used even to save a packet from being dropped on ingress due to VLAN
membership.

Add a sophisticated TCAM rule hardcoded into the driver to force the
switch to behave like a Linux bridge with vlan_filtering 1 vlan_protocol
802.1Q.

Regarding the lifetime of the filter: eventually the bridge will
disappear, and vlan_filtering on the port will be restored to 0 for
standalone mode. Then the filter will be deleted.

[0]: https://lore.kernel.org/netdev/20201009122947.nvhye4hvcha3tljh@skbuf/

Fixes: 7142529f1688 ("net: mscc: ocelot: add VLAN filtering")
Signed-off-by: Vladimir Oltean 
---
 drivers/net/ethernet/mscc/ocelot.c  | 188 ++--
 drivers/net/ethernet/mscc/ocelot_vcap.c |   1 +
 include/soc/mscc/ocelot_vcap.h  |   2 +
 3 files changed, 180 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot.c 
b/drivers/net/ethernet/mscc/ocelot.c
index f4e027a6fe95..3d72aa7b1305 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -453,9 +453,158 @@ static u16 ocelot_vlan_unaware_pvid(struct ocelot *ocelot,
return VLAN_N_VID - bridge_num - 1;
 }
 
+/**
+ * ocelot_update_vlan_reclassify_rule() - Make switch aware only to bridge 
VLAN TPID
+ *
+ * @ocelot: Switch private data structure
+ * @port: Index of ingress port
+ *
+ * IEEE 802.1Q-2018 clauses "5.5 C-VLAN component conformance" and "5.6 S-VLAN
+ * component conformance" suggest that a C-VLAN component should only recognize
+ * and filter on C-Tags, and an S-VLAN component should only recognize and
+ * process based on C-Tags.
+ *
+ * In Linux, as per commit 1a0b20b25732 ("Merge branch 'bridge-next'"), C-VLAN
+ * components are largely represented by a bridge with vlan_protocol 802.1Q,
+ * and S-VLAN components by a bridge with vlan_protocol 802.1ad.
+ *
+ * Currently the driver only offloads vlan_protocol 802.1Q, but the hardware
+ * design is non-conformant, because the switch assigns each frame to a VLAN
+ * based on an entirely different question, as detailed in figure "Basic VLAN
+ * Classification Flow" from its manual and reproduced below.
+ *
+ * Set TAG_TYPE, PCP, DEI, VID to port-default values in VLAN_CFG register
+ * if VLAN_AWARE_ENA[port] and frame has outer tag then:
+ *   if VLAN_INNER_TAG_ENA[port] and frame has inner tag then:
+ * TAG_TYPE = (Frame.InnerTPID <> 0x8100)
+ * Set PCP, DEI, VID to values from inner VLAN header
+ *   else:
+ * TAG_TYPE = (Frame.OuterTPID <> 0x8100)
+ * Set PCP, DEI, VID to values from outer VLAN header
+ *   if VID == 0 then:
+ * VID = VLAN_CFG.VLAN_VID
+ *
+ * Summarized, the switch will recognize both 802.1Q and 802.1ad TPIDs as VLAN
+ * "with equal rights", and just set the TAG_TYPE bit to 0 (if 802.1Q) or to 1
+ * (if 802.1ad). It will classify based on whichever of the tags is "outer", no
+ * matter what TPID that may have (or "inner", if VLAN_INNER_TAG_ENA[port]).
+ *
+ * In the VLAN Table, the TAG_TYPE information is not accessible - just the
+ * classified VID is - so it is as if each VLAN Table entry is for 2 VLANs:
+ * C-VLAN X, and S-VLAN X.
+ *
+ * Whereas the Linux bridge behavior is to only filter on frames with a TPID
+ * equal to the vlan_protocol, and treat everything else as VLAN-untagged.
+ *
+ * Consider an ingress packet tagged with 802.1ad VID=3 and 802.1Q VID=5,
+ * received on a bridge vlan_filtering=1 vlan_protocol=802.1Q port. This frame
+ * should be treated as 802.1Q-untagged, and classified to the PVID of that
+ * bridge port. Not to VID=3, and not to VID=5.
+ *
+ * The VCAP IS1 TCAM has everything we need to overwrite the choices made in
+ * the basic VLAN classification pipeline: it can match on TAG_TYPE in the key,
+ * and it can modify the classified VID in the action. Thus, for each port
+ * under a vlan_filtering bridge, we can insert a rule in VCAP IS1 lookup 0 to
+ * match on 802.1ad tagged frames and modify their classified VID to the 802.1Q
+ * PVID of the port. This effectively makes it appear to the outside world as
+ * if those packets were processed as VLAN-untagged.
+ *
+ * The rule needs to be updated each time the bridge PVID changes, and needs
+ * to be dele

Re: [PATCH RFT v8 4/9] fork: Add shadow stack support to clone3()

2024-08-14 Thread Edgecombe, Rick P

On Thu, 2024-08-08 at 09:15 +0100, Mark Brown wrote:
> +int arch_shstk_post_fork(struct task_struct *t, struct kernel_clone_args
> *args)
> +{
> +   /*
> +    * SSP is aligned, so reserved bits and mode bit are a zero, just mark
> +    * the token 64-bit.
> +    */
> +   struct mm_struct *mm;
> +   unsigned long addr, ssp;
> +   u64 expected;
> +   u64 val;
> +   int ret = -EINVAL;

We should probably?
if (!features_enabled(ARCH_SHSTK_SHSTK))
return 0;

> +
> +   ssp = args->shadow_stack + args->shadow_stack_size;
> +   addr = ssp - SS_FRAME_SIZE;
> +   expected = ssp | BIT(0);
> +
> +   mm = get_task_mm(t);
> +   if (!mm)
> +   return -EFAULT;

We could check that the VMA is shadow stack here. I'm not sure what could go
wrong though. If you point it to RW memory it could start the thread with that
as a shadow stack and just blow up at the first call. It might be nicer to fail
earlier though.

> +
> +   /* This should really be an atomic cmpxchg.  It is not. */
> +   if (access_remote_vm(mm, addr, &val, sizeof(val),
> +    FOLL_FORCE) != sizeof(val))
> +   goto out;
> +
> +   if (val != expected)
> +   goto out;
> +   val = 0;

After a token is consumed normally, it doesn't set it to zero. Instead it sets
it to a "previous-ssp token". I don't think we actually want to do that here
though because it involves the old SSP, which doesn't really apply in this case.
I don't see any problem with zero, but was there any special thinking behind it?

> +   if (access_remote_vm(mm, addr, &val, sizeof(val),
> +    FOLL_FORCE | FOLL_WRITE) != sizeof(val))
> +   goto out;

The GUPs still seem a bit unfortunate for a couple reasons:
 - We could do a CMPXCHG version and are just not (I see ARM has identical code
in gcs_consume_token()). It's not the only race like this though FWIW.
 - I *think* this is the only unprivileged FOLL_FORCE that can write to the
current process in the kernel. As is, it could be used on normal RO mappings, at
least in a limited way. Maybe another point for the VMA check. We'd want to
check that it is normal shadow stack?
 - Lingering doubts about the wisdom of doing GUPs during task creation.

I don't think they are show stoppers, but the VMA check would be nice to have in
the first upstream support.

> +
> +   ret = 0;
> +
> +out:
> +   mmput(mm);
> +   return ret;
> +}
> +
> 
[snip]
> 
>  
> +static void shstk_post_fork(struct task_struct *p,
> +   struct kernel_clone_args *args)
> +{
> +   if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> +   return;
> +
> +   if (!args->shadow_stack)
> +   return;
> +
> +   if (arch_shstk_post_fork(p, args) != 0)
> +   force_sig_fault_to_task(SIGSEGV, SEGV_CPERR, NULL, p);
> +}
> +

Hmm, is this forcing the signal on the new task, which is set up on a user
provided shadow stack that failed the token check? It would handle the signal
with an arbitrary SSP then I think. We should probably fail the clone call in
the parent instead, which can be done by doing the work in copy_process(). Do
you see a problem with doing it at the end of copy_process()? I don't know if
there could be ordering constraints.

Re: [PATCH RFT v8 0/9] fork: Support shadow stacks in clone3()

2024-08-14 Thread Edgecombe, Rick P

On Thu, 2024-08-08 at 10:54 -0700, Kees Cook wrote:
> Tested-by: Kees Cook 

I regression tested it with the CET enabled glibc selftests. No issues.

Re: [PATCH net v6 0/2] Enhance network interface feature testing

2024-08-14 Thread Jakub Kicinski

On Wed, 14 Aug 2024 19:15:15 + Abhinav Jain wrote:
> Changes in v6:
> Use XFAIL for ethtool operations that are unsupported instead of SKIP.

One more:

tools/testing/selftests/net/netdevice.sh:   echo "SKIP: $netdev: set IP 
address"

I think the SKIP -> XFAIL conversion should be a separate patch (for
total of 3 patches in the series).
-- 
pw-bot: cr

Re: [PATCH net v6 0/2] Enhance network interface feature testing

2024-08-14 Thread Jakub Kicinski

On Wed, 14 Aug 2024 17:56:51 -0700 Jakub Kicinski wrote:
> On Wed, 14 Aug 2024 19:15:15 + Abhinav Jain wrote:
> > Changes in v6:
> > Use XFAIL for ethtool operations that are unsupported instead of SKIP.  
> 
> One more:
> 
> tools/testing/selftests/net/netdevice.sh:   echo "SKIP: $netdev: set IP 
> address"
> 
> I think the SKIP -> XFAIL conversion should be a separate patch (for
> total of 3 patches in the series).

P.S. and please change the subject to [PATCH net-next], it's a net-next
change, not a net fix.

Re: [PATCH] arm64/vdso: Remove --hash-style=sysv

2024-08-14 Thread Fangrui Song

On Wed, Aug 14, 2024 at 12:56 PM Xi Ruoyao  wrote:
>
> On Thu, 2024-07-18 at 10:34 -0700, Fangrui Song wrote:
> > glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
> > for more than one decade in many Linux distributions.  Using
> > --hash-style=sysv might imply unaddressed issues and confuse readers.
> >
> > Just drop the option and rely on the linker default, which is likely
> > "both", or "gnu" when the distribution really wants to eliminate sysv
> > hash overhead.
> >
> > Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").
> >
> > Signed-off-by: Fangrui Song 
>
> Hi Fangrui,
>
> If I read tools/testing/selftests/vDSO/parse_vdso.c correctly, it does
> know DT_GNU_HASH as at now.  Thus after this change the vDSO selftests
> are skipped with "Couldn't find __vdso_gettimeofday" etc if the distro
> enables --hash-style=gnu by default.
>
> So it seems we need to add DT_GNU_HASH support for parse_vdso.c to keep
> test coverage.

Hi Xi,

Perhaps the selftests file needs DT_GNU_HASH support like
https://github.com/abseil/abseil-cpp/commit/1278ee9bd9bd4916181521fac96d6fa1100e38e6


> > ---
> >  arch/arm64/kernel/vdso/Makefile   | 2 +-
> >  arch/arm64/kernel/vdso32/Makefile | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/vdso/Makefile 
> > b/arch/arm64/kernel/vdso/Makefile
> > index d63930c82839..d11da6461278 100644
> > --- a/arch/arm64/kernel/vdso/Makefile
> > +++ b/arch/arm64/kernel/vdso/Makefile
> > @@ -21,7 +21,7 @@ btildflags-$(CONFIG_ARM64_BTI_KERNEL) += -z force-bti
> >  # potential future proofing if we end up with internal calls to the 
> > exported
> >  # routines, as x86 does (see 6f121e548f83 ("x86, vdso: Reimplement vdso.so
> >  # preparation in build-time C")).
> > -ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv   \
> > +ldflags-y := -shared -soname=linux-vdso.so.1 \
> >-Bsymbolic --build-id=sha1 -n $(btildflags-y)
> >
> >  ifdef CONFIG_LD_ORPHAN_WARN
> > diff --git a/arch/arm64/kernel/vdso32/Makefile 
> > b/arch/arm64/kernel/vdso32/Makefile
> > index cc4508c604b2..25a2cb6317f3 100644
> > --- a/arch/arm64/kernel/vdso32/Makefile
> > +++ b/arch/arm64/kernel/vdso32/Makefile
> > @@ -98,7 +98,7 @@ VDSO_AFLAGS += -D__ASSEMBLY__
> >  # From arm vDSO Makefile
> >  VDSO_LDFLAGS += -Bsymbolic --no-undefined -soname=linux-vdso.so.1
> >  VDSO_LDFLAGS += -z max-page-size=4096 -z common-page-size=4096
> > -VDSO_LDFLAGS += -shared --hash-style=sysv --build-id=sha1
> > +VDSO_LDFLAGS += -shared --build-id=sha1
> >  VDSO_LDFLAGS += --orphan-handling=$(CONFIG_LD_ORPHAN_WARN_LEVEL)
> >
> >
>
> --
> Xi Ruoyao 
> School of Aerospace Science and Technology, Xidian University



-- 
宋方睿

[PATCH] lib/math: Add int_pow test suite

2024-08-14 Thread Luis Felipe Hernandez

Adds test suite for integer based power function.

Signed-off-by: Luis Felipe Hernandez 
---
 lib/math/Makefile   |  1 +
 lib/math/test_int_pow.c | 70 +
 2 files changed, 71 insertions(+)
 create mode 100644 lib/math/test_int_pow.c

diff --git a/lib/math/Makefile b/lib/math/Makefile
index 91fcdb0c9efe..ba564bf4fb00 100644
--- a/lib/math/Makefile
+++ b/lib/math/Makefile
@@ -6,4 +6,5 @@ obj-$(CONFIG_PRIME_NUMBERS) += prime_numbers.o
 obj-$(CONFIG_RATIONAL) += rational.o
 
 obj-$(CONFIG_TEST_DIV64)   += test_div64.o
+obj-$(CONFIG_TEST_INT_POW) += test_int_pow.o
 obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
diff --git a/lib/math/test_int_pow.c b/lib/math/test_int_pow.c
new file mode 100644
index ..ecabe71d01cc
--- /dev/null
+++ b/lib/math/test_int_pow.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+
+#include 
+
+
+static void test_pow_0(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, 1, int_pow(64, 0));
+}
+
+static void test_pow_1(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, 64, int_pow(64, 1));
+}
+
+static void test_base_0(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, 0, int_pow(0, 5));
+}
+
+static void test_base_1(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, 1, int_pow(1, 100));
+   KUNIT_EXPECT_EQ(test, 1, int_pow(1, 0));
+}
+
+static void test_base_0_pow_0(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, 1, int_pow(0, 0));
+}
+
+static void test_base_2_pow_2(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, 4, int_pow(2, 2));
+}
+
+static void test_max_base(struct kunit *test)
+{
+   KUNIT_EXPECT_EQ(test, U64_MAX, int_pow(U64_MAX, 1));
+}
+
+static void test_large_result(struct kunit *test)
+{
+   UNIT_EXPECT_EQ(test, 1152921504606846976, int_pow(2, 60));
+}
+
+static struct kunit_case math_int_pow_test_cases[] = {
+   KUNIT_CASE(test_pow_0),
+   KUNIT_CASE(test_pow_1),
+   KUNIT_CASE(test_base_0),
+   KUNIT_CASE(test_base_1),
+   KUNIT_CASE(test_base_0_pow_0),
+   KUNIT_CASE(test_base_2_pow_2),
+   KUNIT_CASE(test_max_base),
+   KUNIT_CASE(test_large_result),
+   {}
+};
+
+static struct kunit_suite int_pow_test_suite = {
+   .name = "lib-math-int_pow",
+   .test_cases = math_int_pow_test_cases,
+};
+
+kunit_test_suites(&int_pow_test_suite);
+
+MODULE_DESCRIPTION("math.int_pow KUnit test suite");
+MODULE_LICENSE("GPL v2");
-- 
2.46.0

Re: [PATCH bpf-next v4 0/4] selftests/bpf: convert three other cgroup tests to test_progs

2024-08-14 Thread patchwork-bot+netdevbpf

Hello:

This series was applied to bpf/bpf-next.git (master)
by Martin KaFai Lau :

On Tue, 13 Aug 2024 14:45:04 +0200 you wrote:
> Hello,
> this series brings a new set of test converted to the test_progs framework.
> Since the tests are quite small, I chose to group three tests conversion in
> the same series, but feel free to let me know if I should keep one series
> per test. The series focuses on cgroup testing and converts the following
> tests:
> - get_cgroup_id_user
> - cgroup_storage
> - test_skb_cgroup_id_user
> 
> [...]

Here is the summary with links:
  - [bpf-next,v4,1/4] selftests/bpf: convert get_current_cgroup_id_user to 
test_progs
https://git.kernel.org/bpf/bpf-next/c/a4ae5c31e0f2
  - [bpf-next,v4,2/4] selftests/bpf: convert test_cgroup_storage to test_progs
https://git.kernel.org/bpf/bpf-next/c/37a14cfd667a
  - [bpf-next,v4,3/4] selftests/bpf: add proper section name to bpf prog and 
rename it
https://git.kernel.org/bpf/bpf-next/c/7b4400a0a69b
  - [bpf-next,v4,4/4] selftests/bpf: convert test_skb_cgroup_id_user to 
test_progs
https://git.kernel.org/bpf/bpf-next/c/f957c230e173

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

Re: [PATCH bpf-next v4 4/4] selftests/bpf: convert test_skb_cgroup_id_user to test_progs

2024-08-14 Thread Martin KaFai Lau


On 8/13/24 5:45 AM, Alexis Lothoré (eBPF Foundation) wrote:


+#define DST_ADDR "ff02::1"


[ ... ]


+static int wait_local_ip(void)
+{
+   char *ping_cmd = ping_command(AF_INET6);
+   int i, err;
+
+   for (i = 0; i < WAIT_AUTO_IP_MAX_ATTEMPT; i++) {
+   err = SYS_NOFAIL("%s -c 1 -W 1 %s%%%s", ping_cmd, DST_ADDR,
+VETH_1);


I tried in my qemu. This loop takes at least 3-4 iteration to get the ping 
through. This test could become flaky if the CI is busy.


I have been always wondering why some of the (non) test_progs has this practice.
I traced a little. I think it has something to do with the "ff02::1" used in the 
test and/or the local link address is not ready. I have not further nailed it 
down but I think it is close enough.


It will be easier to use a nodad configured v6 addr.

I take this chance to use an easier "::1" address for the test here instead of 
ff02::1. This also removed the need to add veth pair and no need to ping first.


Applied with the "::1" changes mentioned above.

Thanks for migrating the tests to test_progs. This is long overdue.





+   if (!err)
+   break;
+   }
+
+   return err;
+}
+

Re: [PATCH net-next v13 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-08-14 Thread Yunsheng Lin

On 2024/8/14 23:49, Alexander H Duyck wrote:
> On Thu, 2024-08-08 at 20:37 +0800, Yunsheng Lin wrote:
>> Currently the page_frag API is returning 'virtual address'
>> or 'va' when allocing and expecting 'virtual address' or
>> 'va' as input when freeing.
>>
>> As we are about to support new use cases that the caller
>> need to deal with 'struct page' or need to deal with both
>> 'va' and 'struct page'. In order to differentiate the API
>> handling between 'va' and 'struct page', add '_va' suffix
>> to the corresponding API mirroring the page_pool_alloc_va()
>> API of the page_pool. So that callers expecting to deal with
>> va, page or both va and page may call page_frag_alloc_va*,
>> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
>>
>> CC: Alexander Duyck 
>> Signed-off-by: Yunsheng Lin 
>> Reviewed-by: Subbaraya Sundeep 
>> Acked-by: Chuck Lever 
>> Acked-by: Sagi Grimberg 
>> ---
>>  drivers/net/ethernet/google/gve/gve_rx.c  |  4 ++--
>>  drivers/net/ethernet/intel/ice/ice_txrx.c |  2 +-
>>  drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +-
>>  drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  2 +-
>>  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  4 ++--
>>  .../marvell/octeontx2/nic/otx2_common.c   |  2 +-
>>  drivers/net/ethernet/mediatek/mtk_wed_wo.c|  4 ++--
>>  drivers/nvme/host/tcp.c   |  8 +++
>>  drivers/nvme/target/tcp.c | 22 +--
>>  drivers/vhost/net.c   |  6 ++---
>>  include/linux/page_frag_cache.h   | 21 +-
>>  include/linux/skbuff.h|  2 +-
>>  kernel/bpf/cpumap.c   |  2 +-
>>  mm/page_frag_cache.c  | 12 +-
>>  net/core/skbuff.c | 16 +++---
>>  net/core/xdp.c|  2 +-
>>  net/rxrpc/txbuf.c | 15 +++--
>>  net/sunrpc/svcsock.c  |  6 ++---
>>  .../selftests/mm/page_frag/page_frag_test.c   | 13 ++-
>>  19 files changed, 75 insertions(+), 70 deletions(-)
>>
> 
> I still say no to this patch. It is an unnecessary name change and adds
> no value. If you insist on this patch I will reject the set every time.
> 
> The fact is it is polluting the git history and just makes things
> harder to maintain without adding any value as you aren't changing what
> the function does and there is no need for this. In addition it just

I guess I have to disagree with the above 'no need for this' part for
now, as mentioned in [1]:

"There are three types of API as proposed in this patchset instead of
two types of API:
1. page_frag_alloc_va() returns [va].
2. page_frag_alloc_pg() returns [page, offset].
3. page_frag_alloc() returns [va] & [page, offset].

You seemed to miss that we need a third naming for the type 3 API.
Do you see type 3 API as a valid API? if yes, what naming are you
suggesting for it? if no, why it is not a valid API?"


1. https://lore.kernel.org/all/ca6be29e-ab53-4673-9624-90d41616a...@huawei.com/

> makes it that much harder to backport fixes in the future as people
> will have to work around the rename.
>

[PATCH] selftests/vDSO: support DT_GNU_HASH

2024-08-14 Thread Fangrui Song

glibc added support for DT_GNU_HASH in 2006 and DT_HASH has been
obsoleted for more than one decade in many Linux distributions.

Many vDSOs support DT_GNU_HASH. This patch adds selftests support.

Signed-off-by: Fangrui Song 
---
 tools/testing/selftests/vDSO/parse_vdso.c | 105 --
 1 file changed, 79 insertions(+), 26 deletions(-)

diff --git a/tools/testing/selftests/vDSO/parse_vdso.c 
b/tools/testing/selftests/vDSO/parse_vdso.c
index 4ae417372e9e..35cb545da13e 100644
--- a/tools/testing/selftests/vDSO/parse_vdso.c
+++ b/tools/testing/selftests/vDSO/parse_vdso.c
@@ -47,6 +47,7 @@ static struct vdso_info
/* Symbol table */
ELF(Sym) *symtab;
const char *symstrings;
+   ELF(Word) *gnu_hash;
ELF(Word) *bucket, *chain;
ELF(Word) nbucket, nchain;
 
@@ -75,6 +76,16 @@ static unsigned long elf_hash(const char *name)
return h;
 }
 
+static uint32_t gnu_hash(const char *name)
+{
+   const unsigned char *s = (void *)name;
+   uint32_t h = 5381;
+
+   for (; *s; s++)
+   h += h * 32 + *s;
+   return h;
+}
+
 void vdso_init_from_sysinfo_ehdr(uintptr_t base)
 {
size_t i;
@@ -117,6 +128,7 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
 */
ELF(Word) *hash = 0;
vdso_info.symstrings = 0;
+   vdso_info.gnu_hash = 0;
vdso_info.symtab = 0;
vdso_info.versym = 0;
vdso_info.verdef = 0;
@@ -137,6 +149,11 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
((uintptr_t)dyn[i].d_un.d_ptr
 + vdso_info.load_offset);
break;
+   case DT_GNU_HASH:
+   vdso_info.gnu_hash =
+   (ELF(Word) *)((uintptr_t)dyn[i].d_un.d_ptr +
+ vdso_info.load_offset);
+   break;
case DT_VERSYM:
vdso_info.versym = (ELF(Versym) *)
((uintptr_t)dyn[i].d_un.d_ptr
@@ -149,17 +166,26 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
break;
}
}
-   if (!vdso_info.symstrings || !vdso_info.symtab || !hash)
+   if (!vdso_info.symstrings || !vdso_info.symtab ||
+   (!hash && !vdso_info.gnu_hash))
return;  /* Failed */
 
if (!vdso_info.verdef)
vdso_info.versym = 0;
 
/* Parse the hash table header. */
-   vdso_info.nbucket = hash[0];
-   vdso_info.nchain = hash[1];
-   vdso_info.bucket = &hash[2];
-   vdso_info.chain = &hash[vdso_info.nbucket + 2];
+   if (vdso_info.gnu_hash) {
+   vdso_info.nbucket = vdso_info.gnu_hash[0];
+   /* The bucket array is located after the header (4 uint32) and 
the bloom
+  filter (size_t array of gnu_hash[2] elements). */
+   vdso_info.bucket = vdso_info.gnu_hash + 4 +
+  sizeof(size_t) / 4 * vdso_info.gnu_hash[2];
+   } else {
+   vdso_info.nbucket = hash[0];
+   vdso_info.nchain = hash[1];
+   vdso_info.bucket = &hash[2];
+   vdso_info.chain = &hash[vdso_info.nbucket + 2];
+   }
 
/* That's all we need. */
vdso_info.valid = true;
@@ -203,6 +229,26 @@ static bool vdso_match_version(ELF(Versym) ver,
&& !strcmp(name, vdso_info.symstrings + aux->vda_name);
 }
 
+static bool check_sym(ELF(Sym) *sym, ELF(Word) i, const char *name,
+ const char *version, unsigned long ver_hash)
+{
+   /* Check for a defined global or weak function w/ right name. */
+   if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC)
+   return false;
+   if (ELF64_ST_BIND(sym->st_info) != STB_GLOBAL &&
+   ELF64_ST_BIND(sym->st_info) != STB_WEAK)
+   return false;
+   if (strcmp(name, vdso_info.symstrings + sym->st_name))
+   return false;
+
+   /* Check symbol version. */
+   if (vdso_info.versym &&
+   !vdso_match_version(vdso_info.versym[i], version, ver_hash))
+   return false;
+
+   return true;
+}
+
 void *vdso_sym(const char *version, const char *name)
 {
unsigned long ver_hash;
@@ -210,29 +256,36 @@ void *vdso_sym(const char *version, const char *name)
return 0;
 
ver_hash = elf_hash(version);
-   ELF(Word) chain = vdso_info.bucket[elf_hash(name) % vdso_info.nbucket];
+   ELF(Word) i;
 
-   for (; chain != STN_UNDEF; chain = vdso_info.chain[chain]) {
-   ELF(Sym) *sym = &vdso_info.symtab[chain];
+   if (vdso_info.gnu_hash) {
+   uint32_t h1 = gnu_hash(name), h2, *hashval;
 
-   /* Check for a defined global or weak function w/ right name. */
-   if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC)
-   continue;
-

Re: [PATCH] arm64/vdso: Remove --hash-style=sysv

2024-08-14 Thread Fangrui Song

On Wed, Aug 14, 2024 at 6:23 PM Fangrui Song  wrote:
>
> On Wed, Aug 14, 2024 at 12:56 PM Xi Ruoyao  wrote:
> >
> > On Thu, 2024-07-18 at 10:34 -0700, Fangrui Song wrote:
> > > glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
> > > for more than one decade in many Linux distributions.  Using
> > > --hash-style=sysv might imply unaddressed issues and confuse readers.
> > >
> > > Just drop the option and rely on the linker default, which is likely
> > > "both", or "gnu" when the distribution really wants to eliminate sysv
> > > hash overhead.
> > >
> > > Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").
> > >
> > > Signed-off-by: Fangrui Song 
> >
> > Hi Fangrui,
> >
> > If I read tools/testing/selftests/vDSO/parse_vdso.c correctly, it does
> > know DT_GNU_HASH as at now.  Thus after this change the vDSO selftests
> > are skipped with "Couldn't find __vdso_gettimeofday" etc if the distro
> > enables --hash-style=gnu by default.
> >
> > So it seems we need to add DT_GNU_HASH support for parse_vdso.c to keep
> > test coverage.
>
> Hi Xi,
>
> Perhaps the selftests file needs DT_GNU_HASH support like
> https://github.com/abseil/abseil-cpp/commit/1278ee9bd9bd4916181521fac96d6fa1100e38e6

Created 
https://lore.kernel.org/linux-kselftest/20240815032614.2747224-1-mask...@google.com/T/#u
([PATCH] selftests/vDSO: support DT_GNU_HASH)

>
> > > ---
> > >  arch/arm64/kernel/vdso/Makefile   | 2 +-
> > >  arch/arm64/kernel/vdso32/Makefile | 2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/kernel/vdso/Makefile 
> > > b/arch/arm64/kernel/vdso/Makefile
> > > index d63930c82839..d11da6461278 100644
> > > --- a/arch/arm64/kernel/vdso/Makefile
> > > +++ b/arch/arm64/kernel/vdso/Makefile
> > > @@ -21,7 +21,7 @@ btildflags-$(CONFIG_ARM64_BTI_KERNEL) += -z force-bti
> > >  # potential future proofing if we end up with internal calls to the 
> > > exported
> > >  # routines, as x86 does (see 6f121e548f83 ("x86, vdso: Reimplement 
> > > vdso.so
> > >  # preparation in build-time C")).
> > > -ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv   \
> > > +ldflags-y := -shared -soname=linux-vdso.so.1 \
> > >-Bsymbolic --build-id=sha1 -n $(btildflags-y)
> > >
> > >  ifdef CONFIG_LD_ORPHAN_WARN
> > > diff --git a/arch/arm64/kernel/vdso32/Makefile 
> > > b/arch/arm64/kernel/vdso32/Makefile
> > > index cc4508c604b2..25a2cb6317f3 100644
> > > --- a/arch/arm64/kernel/vdso32/Makefile
> > > +++ b/arch/arm64/kernel/vdso32/Makefile
> > > @@ -98,7 +98,7 @@ VDSO_AFLAGS += -D__ASSEMBLY__
> > >  # From arm vDSO Makefile
> > >  VDSO_LDFLAGS += -Bsymbolic --no-undefined -soname=linux-vdso.so.1
> > >  VDSO_LDFLAGS += -z max-page-size=4096 -z common-page-size=4096
> > > -VDSO_LDFLAGS += -shared --hash-style=sysv --build-id=sha1
> > > +VDSO_LDFLAGS += -shared --build-id=sha1
> > >  VDSO_LDFLAGS += --orphan-handling=$(CONFIG_LD_ORPHAN_WARN_LEVEL)
> > >
> > >
> >
> > --
> > Xi Ruoyao 
> > School of Aerospace Science and Technology, Xidian University
>
>
>
> --
> 宋方睿



-- 
宋方睿

Re: [PATCH v1 0/2] mremap refactor: check src address for vma boundaries first.

2024-08-14 Thread Jeff Xu

On Wed, Aug 14, 2024 at 12:55 PM Liam R. Howlett
 wrote:
> The majority of the comments to V2 are mine, you only told us that
> splitting a sealed vma is wrong (after I asked you directly to answer)
> and then you made a comment about testing of the patch set. Besides the
> direct responses to me, your comment was "wait for me to test".
>
Please share this link for  " Besides the direct responses to me, your
comment was "wait for me to test".
Or  pop up that email by responding to it, to remind me.  Thanks.

> You are holding us hostage by asking for more testing but not sharing
> what is and is not valid for mseal() - or even answering questions on
> tests you run.
https://docs.kernel.org/process/submitting-patches.html#don-t-get-discouraged-or-impatient

> These patches should be rejected in favour of fixing the feature like it
> should have been written in the first place.
This is not ture.

Without removing arch_unmap, it is impossible to implement in-loop.
And I have mentioned this during initial discussion of mseal patch, as
well as when Pedro expressed the interest on in-loop approach.  If you
like reference, I can find the links for you.

I'm glad that arch_unmap is removed now and resulting in much cleaner
code, it has always been a question/mysterial to me ever since I read
that code.   Thanks to Linus's leadership and Michael Ellerman's quick
response,  this is now resolved.

Best regards,
-Jeff

Re: [PATCH net v3] selftest: af_unix: Fix kselftest compilation warnings

2024-08-14 Thread patchwork-bot+netdevbpf

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski :

On Wed, 14 Aug 2024 13:37:43 +0530 you wrote:
> Change expected_buf from (const void *) to (const char *)
> in function __recvpair().
> This change fixes the below warnings during test compilation:
> 
> ```
> In file included from msg_oob.c:14:
> msg_oob.c: In function ‘__recvpair’:
> 
> [...]

Here is the summary with links:
  - [net,v3] selftest: af_unix: Fix kselftest compilation warnings
https://git.kernel.org/netdev/net/c/6c569b77f030

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

Re: [PATCH] selftests/vDSO: support DT_GNU_HASH

2024-08-14 Thread Xi Ruoyao

On Wed, 2024-08-14 at 20:26 -0700, Fangrui Song wrote:
> glibc added support for DT_GNU_HASH in 2006 and DT_HASH has been
> obsoleted for more than one decade in many Linux distributions.
> 
> Many vDSOs support DT_GNU_HASH. This patch adds selftests support.
> 
> Signed-off-by: Fangrui Song 

Tested-by: Xi Ruoyao 

> ---
>  tools/testing/selftests/vDSO/parse_vdso.c | 105 --
>  1 file changed, 79 insertions(+), 26 deletions(-)
> 
> diff --git a/tools/testing/selftests/vDSO/parse_vdso.c 
> b/tools/testing/selftests/vDSO/parse_vdso.c
> index 4ae417372e9e..35cb545da13e 100644
> --- a/tools/testing/selftests/vDSO/parse_vdso.c
> +++ b/tools/testing/selftests/vDSO/parse_vdso.c
> @@ -47,6 +47,7 @@ static struct vdso_info
>   /* Symbol table */
>   ELF(Sym) *symtab;
>   const char *symstrings;
> + ELF(Word) *gnu_hash;
>   ELF(Word) *bucket, *chain;
>   ELF(Word) nbucket, nchain;
>  
> @@ -75,6 +76,16 @@ static unsigned long elf_hash(const char *name)
>   return h;
>  }
>  
> +static uint32_t gnu_hash(const char *name)
> +{
> + const unsigned char *s = (void *)name;
> + uint32_t h = 5381;
> +
> + for (; *s; s++)
> + h += h * 32 + *s;
> + return h;
> +}
> +
>  void vdso_init_from_sysinfo_ehdr(uintptr_t base)
>  {
>   size_t i;
> @@ -117,6 +128,7 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
>    */
>   ELF(Word) *hash = 0;
>   vdso_info.symstrings = 0;
> + vdso_info.gnu_hash = 0;
>   vdso_info.symtab = 0;
>   vdso_info.versym = 0;
>   vdso_info.verdef = 0;
> @@ -137,6 +149,11 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
>   ((uintptr_t)dyn[i].d_un.d_ptr
>    + vdso_info.load_offset);
>   break;
> + case DT_GNU_HASH:
> + vdso_info.gnu_hash =
> + (ELF(Word) *)((uintptr_t)dyn[i].d_un.d_ptr +
> +   vdso_info.load_offset);
> + break;
>   case DT_VERSYM:
>   vdso_info.versym = (ELF(Versym) *)
>   ((uintptr_t)dyn[i].d_un.d_ptr
> @@ -149,17 +166,26 @@ void vdso_init_from_sysinfo_ehdr(uintptr_t base)
>   break;
>   }
>   }
> - if (!vdso_info.symstrings || !vdso_info.symtab || !hash)
> + if (!vdso_info.symstrings || !vdso_info.symtab ||
> +     (!hash && !vdso_info.gnu_hash))
>   return;  /* Failed */
>  
>   if (!vdso_info.verdef)
>   vdso_info.versym = 0;
>  
>   /* Parse the hash table header. */
> - vdso_info.nbucket = hash[0];
> - vdso_info.nchain = hash[1];
> - vdso_info.bucket = &hash[2];
> - vdso_info.chain = &hash[vdso_info.nbucket + 2];
> + if (vdso_info.gnu_hash) {
> + vdso_info.nbucket = vdso_info.gnu_hash[0];
> + /* The bucket array is located after the header (4 uint32) and 
> the bloom
> +    filter (size_t array of gnu_hash[2] elements). */
> + vdso_info.bucket = vdso_info.gnu_hash + 4 +
> +    sizeof(size_t) / 4 * vdso_info.gnu_hash[2];
> + } else {
> + vdso_info.nbucket = hash[0];
> + vdso_info.nchain = hash[1];
> + vdso_info.bucket = &hash[2];
> + vdso_info.chain = &hash[vdso_info.nbucket + 2];
> + }
>  
>   /* That's all we need. */
>   vdso_info.valid = true;
> @@ -203,6 +229,26 @@ static bool vdso_match_version(ELF(Versym) ver,
>   && !strcmp(name, vdso_info.symstrings + aux->vda_name);
>  }
>  
> +static bool check_sym(ELF(Sym) *sym, ELF(Word) i, const char *name,
> +   const char *version, unsigned long ver_hash)
> +{
> + /* Check for a defined global or weak function w/ right name. */
> + if (ELF64_ST_TYPE(sym->st_info) != STT_FUNC)
> + return false;
> + if (ELF64_ST_BIND(sym->st_info) != STB_GLOBAL &&
> +     ELF64_ST_BIND(sym->st_info) != STB_WEAK)
> + return false;
> + if (strcmp(name, vdso_info.symstrings + sym->st_name))
> + return false;
> +
> + /* Check symbol version. */
> + if (vdso_info.versym &&
> +     !vdso_match_version(vdso_info.versym[i], version, ver_hash))
> + return false;
> +
> + return true;
> +}
> +
>  void *vdso_sym(const char *version, const char *name)
>  {
>   unsigned long ver_hash;
> @@ -210,29 +256,36 @@ void *vdso_sym(const char *version, const char *name)
>   return 0;
>  
>   ver_hash = elf_hash(version);
> - ELF(Word) chain = vdso_info.bucket[elf_hash(name) % vdso_info.nbucket];
> + ELF(Word) i;
>  
> - for (; chain != STN_UNDEF; chain = vdso_info.chain[chain]) {
> - ELF(Sym) *sym = &vdso_info.symtab[chain];
> + if (vdso_info.gnu_hash) {
> + uint32_t h1 = gnu_hash(name), h2, *hashval;
>  
> -

90 matches

Mail list logo