Re: [PATCH bpf-next] selftests/bpf: Fix test_attach_probe for powerpc uprobes

2021-03-02 Thread Yonghong Song




On 3/1/21 11:04 AM, Jiri Olsa wrote:

When testing uprobes, the test gets the GEP (Global Entry Point)
address from kallsyms, but then the function is called locally,
so the uprobe is not triggered.

Fix this by adjusting the address to the LEP (Local Entry Point)
for the powerpc arch.

Signed-off-by: Jiri Olsa 
---
  .../selftests/bpf/prog_tests/attach_probe.c| 18 +-
  1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
index a0ee87c8e1ea..c3cfb48d3ed0 100644
--- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
+++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
@@ -2,6 +2,22 @@
  #include 
  #include "test_attach_probe.skel.h"
  
+#if defined(__powerpc64__)

+/*
+ * We get the GEP (Global Entry Point) address from kallsyms,
+ * but then the function is called locally, so we need to adjust
+ * the address to get LEP (Local Entry Point).


Any documentation in the kernel about this behavior? This will
help to validate the change without trying with powerpc64 qemu...


+ */
+#define LEP_OFFSET 8
+
+static ssize_t get_offset(ssize_t offset)
+{
+   return offset + LEP_OFFSET;
+}
+#else
+#define get_offset(offset) (offset)
+#endif
+
  ssize_t get_base_addr() {
size_t start, offset;
char buf[256];
@@ -36,7 +52,7 @@ void test_attach_probe(void)
if (CHECK(base_addr < 0, "get_base_addr",
  "failed to find base addr: %zd", base_addr))
return;
-   uprobe_offset = (size_t)&get_base_addr - base_addr;
+   uprobe_offset = get_offset((size_t)&get_base_addr - base_addr);
  
  	skel = test_attach_probe__open_and_load();

if (CHECK(!skel, "skel_open", "failed to open skeleton\n"))
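
For readers unfamiliar with the powerpc64 ELFv2 ABI: a function there can have two entry points. The global entry point (GEP) begins with instructions that set up the TOC pointer in r2; calls from within the same module usually skip that prologue and land on the local entry point (LEP). The sketch below restates the offset adjustment from the patch as a standalone helper. The 8-byte distance is the value the patch hard-codes (two 4-byte instructions); `adjust_uprobe_offset` and its `is_ppc64_elfv2` parameter are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Distance from global to local entry point, as hard-coded by the patch. */
#define LEP_OFFSET 8

/*
 * Hypothetical helper: turn a symbol address into a base-relative
 * uprobe offset, skipping the global-entry prologue when requested.
 */
static size_t adjust_uprobe_offset(size_t sym_addr, size_t base_addr,
                                   bool is_ppc64_elfv2)
{
    size_t offset = sym_addr - base_addr;

    return is_ppc64_elfv2 ? offset + LEP_OFFSET : offset;
}
```

On other architectures the helper leaves the offset untouched, which matches the `#else` branch of the patch.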



Re: [PATCH net] net: l2tp: reduce log level when passing up invalid packets

2021-03-02 Thread Matthias Schiffer

On 2/23/21 10:47 AM, Tom Parkin wrote:

On  Mon, Feb 22, 2021 at 14:31:38 -0800, Jakub Kicinski wrote:

On Mon, 22 Feb 2021 17:40:16 +0100 Matthias Schiffer wrote:

This will not be sufficient for my usecase: To stay compatible with older
versions of fastd, I can't set the T flag in the first packet of the
handshake, as it won't be known whether the peer has a new enough fastd
version to understand packets that have this bit set. Luckily, the second
handshake byte is always 0 in fastd's protocol, so these packets fail the
tunnel version check and are passed to userspace regardless.

I'm aware that this usecase is far outside of the original intentions of the
code and can only be described as a hack, but I still consider this a
regression in the kernel, as it was working fine in the past, without
visible warnings.
  


I'm sorry, but for the reasons stated above I disagree about it being
a regression.


Hmm, is it common for protocol implementations in the kernel to warn about
invalid packets they receive? While L2TP uses connected sockets and thus
usually no unrelated packets end up in the socket, a simple UDP port scan
originating from the configured remote address/port will trigger the "short
packet" warning now (nmap uses a zero-length payload for UDP scans by
default). Log spam caused by a malicious party might also be a concern.


Indeed, seems like appropriate counters would be a good fit here?
The prints are both potentially problematic for security and lossy.


Yes, I agree with this argument.



Sounds good, I'll send an updated patch adding a counter for invalid packets.

By now I've found another project affected by the kernel warnings:
https://github.com/wlanslovenija/tunneldigger/issues/160





Re: [PATCH net] net: expand textsearch ts_state to fit skb_seq_state

2021-03-02 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Mon,  1 Mar 2021 15:09:44 + you wrote:
> From: Willem de Bruijn 
> 
> The referenced commit expands the skb_seq_state used by
> skb_find_text with a 4B frag_off field, growing it to 48B.
> 
> This exceeds container ts_state->cb, causing a stack corruption:
> 
> [...]

Here is the summary with links:
  - [net] net: expand textsearch ts_state to fit skb_seq_state
https://git.kernel.org/netdev/net/c/b228c9b05876

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
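
The overflow described in the quoted commit message happens because one structure is stored inside a fixed-size scratch buffer that belongs to another. A compile-time size check is a common way to catch this class of bug; the sketch below shows the idea in plain C11. The struct layouts are invented for illustration (`demo_seq_state` and `demo_ts_state` are hypothetical and do not mirror the kernel definitions), and whether the applied commit adds such a check is not stated in this message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator state that must fit inside the scratch area. */
struct demo_seq_state {
    uint32_t lower_offset;
    uint32_t upper_offset;
    uint32_t frag_idx;
    uint32_t stepped_offset;
    void *root;
    void *cur;
    uint8_t *frag_data;
    uint32_t frag_off;   /* a late-added field like this grows the struct */
};

/* Hypothetical container with an opaque, fixed-size control buffer. */
struct demo_ts_state {
    unsigned int offset;
    char cb[48];
};

/* Break the build, instead of the stack, if the state outgrows cb. */
_Static_assert(sizeof(struct demo_seq_state) <=
               sizeof(((struct demo_ts_state *)0)->cb),
               "seq state does not fit in ts_state.cb");

/* Runtime view of the same check: returns 1 when the state fits. */
static int demo_state_fits(void)
{
    return sizeof(struct demo_seq_state) <=
           sizeof(((struct demo_ts_state *)0)->cb);
}
```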




Re: [PATCH net] tcp: add sanity tests to TCP_QUEUE_SEQ

2021-03-02 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Mon,  1 Mar 2021 10:29:17 -0800 you wrote:
> From: Eric Dumazet 
> 
> Qingyu Li reported a syzkaller bug where the repro
> changes RCV SEQ _after_ restoring data in the receive queue.
> 
> mprotect(0x4aa000, 12288, PROT_READ) = 0
> mmap(0x1000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x1000
> mmap(0x2000, 16777216, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2000
> mmap(0x2100, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2100
> socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
> setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
> connect(3, {sa_family=AF_INET6, sin6_port=htons(0), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
> setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [1], 4) = 0
> sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0x0003\0\0", iov_len=20}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
> setsockopt(3, SOL_TCP, TCP_REPAIR, [0], 4) = 0
> setsockopt(3, SOL_TCP, TCP_QUEUE_SEQ, [128], 4) = 0
> recvfrom(3, NULL, 20, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
> 
> [...]

Here is the summary with links:
  - [net] tcp: add sanity tests to TCP_QUEUE_SEQ
https://git.kernel.org/netdev/net/c/8811f4a9836e





Re: [PATCH net] hv_netvsc: Fix validation in netvsc_linkstatus_callback()

2021-03-02 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Mon,  1 Mar 2021 19:25:30 +0100 you wrote:
> Contrary to the RNDIS protocol specification, certain (pre-Fe)
> implementations of Hyper-V's vSwitch did not account for the status
> buffer field in the length of an RNDIS packet; the bug was fixed in
> newer implementations.  Validate the status buffer fields using the
> length of the 'vmtransfer_page' packet (all implementations), that
> is known/validated to be less than or equal to the receive section
> size and not smaller than the length of the RNDIS message.
> 
> [...]

Here is the summary with links:
  - [net] hv_netvsc: Fix validation in netvsc_linkstatus_callback()
https://git.kernel.org/netdev/net/c/3946688edbc5





Re: [PATCH net] net: dsa: tag_mtk: fix 802.1ad VLAN egress

2021-03-02 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Tue,  2 Mar 2021 00:01:59 +0800 you wrote:
> A different TPID bit is used for 802.1ad VLAN frames.
> 
> Reported-by: Ilario Gelmetti 
> Fixes: f0af34317f4b ("net: dsa: mediatek: combine MediaTek tag with VLAN tag")
> Signed-off-by: DENG Qingfang 
> ---
>  net/dsa/tag_mtk.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)

Here is the summary with links:
  - [net] net: dsa: tag_mtk: fix 802.1ad VLAN egress
https://git.kernel.org/netdev/net/c/9200f515c41f





Re: [net 01/15] net/mlx5e: E-switch, Fix rate calculation for overflow

2021-03-02 Thread Saeed Mahameed
On Sat, 2021-02-27 at 13:14 +0100, Arnd Bergmann wrote:
> On Fri, Feb 12, 2021 at 3:59 AM Saeed Mahameed 
> wrote:
> > 
> > From: Parav Pandit 
> > 
> > rate_bytes_ps is a 64-bit field. It is passed as a 32-bit field to
> > apply_police_params(). Because of this, when the police rate is higher
> > than 4Gbps, the 32-bit calculation ignores the carry. This results
> > in an incorrect rate configuration on the device.
> > 
> > Fix it by performing 64-bit calculation.
> 
> I just stumbled over this commit while looking at an unrelated
> problem.
> 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > index dd0bfbacad47..717fbaa6ce73 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > @@ -5040,7 +5040,7 @@ static int apply_police_params(struct
> > mlx5e_priv *priv, u64 rate,
> >  */
> >     if (rate) {
> >     rate = (rate * BITS_PER_BYTE) + 50;
> > -   rate_mbps = max_t(u32, do_div(rate, 100), 1);
> > +   rate_mbps = max_t(u64, do_div(rate, 100), 1);
> 
> I think there are still multiple issues with this line:
> 
> - Before commit 1fe3e3166b35 ("net/mlx5e: E-switch, Fix rate
> calculation for
>   overflow"), it was trying to calculate rate divided by 100, but
> now
>   it uses the remainder of the division rather than the quotient. I
> assume
>   this was meant to use div_u64() instead of do_div().
> 

Yes, I already have a patch lined up to fix this issue.

Thanks for spotting this.

> - Both div_u64() and do_div() return a 32-bit number, and '1' is a
> constant
>   that also comfortably fits into a 32-bit number, so changing the
> max_t
>   to return a 64-bit type has no effect on the result
> 

as of the above comment, we shouldn't be using the return value of
do_div().


> - The maximum of an arbitrary unsigned integer and '1' is either one
> or zero,
>    so there doesn't need to be an expensive division here at all.
> From the
>    comment it sounds like the intention was to use 'min_t()' instead
> of 'max_t()'.
>    It has however used 'max_t' since the code was first introduced.
> 

If the input rate is less than 1 Mbps, the quotient will be 0;
otherwise we want the quotient. We don't allow 0, so max_t(rate, 1)
should be used. What am I missing?




Re: [PATCH 0/5] Use obj_cgroup APIs to change kmem pages

2021-03-02 Thread Roman Gushchin
Hi Muchun!

On Mon, Mar 01, 2021 at 02:22:22PM +0800, Muchun Song wrote:
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are changed via the new APIs of obj_cgroup. This new APIs
> introduce a struct obj_cgroup instead of using struct mem_cgroup directly
> to charge slab objects. It prevents long-living objects from pinning the
> original memory cgroup in the memory. But there are still some corner
> objects (e.g. allocations larger than order-1 page on SLUB) which are
> not charged via the API of obj_cgroup. Those objects (include the pages
> which are allocated from buddy allocator directly) are charged as kmem
> pages which still hold a reference to the memory cgroup.

Yes, this is a good idea, large kmallocs should be treated the same
way as small ones.

> 
> E.g. We know that the kernel stack is charged as kmem pages because the
> size of the kernel stack can be greater than 2 pages (e.g. 16KB on x86_64
> or arm64). If we create a thread (suppose the thread stack is charged to
> memory cgroup A) and then move it from memory cgroup A to memory cgroup
> B. Because the kernel stack of the thread hold a reference to the memory
> cgroup A. The thread can pin the memory cgroup A in the memory even if
> we remove the cgroup A. If we want to see this scenario by using the
> following script. We can see that the system has added 500 dying cgroups.
> 
>   #!/bin/bash
> 
>   cat /proc/cgroups | grep memory
> 
>   cd /sys/fs/cgroup/memory
>   echo 1 > memory.move_charge_at_immigrate
> 
>   for i in range{1..500}
>   do
>   mkdir kmem_test
>   echo $$ > kmem_test/cgroup.procs
>   sleep 3600 &
>   echo $$ > cgroup.procs
>   echo `cat kmem_test/cgroup.procs` > cgroup.procs
>   rmdir kmem_test
>   done
> 
>   cat /proc/cgroups | grep memory

Well, moving processes between cgroups always created a lot of issues
and corner cases and this one is definitely not the worst. So this problem
looks a bit artificial, unless I'm missing something. But if it doesn't
introduce any new performance costs and doesn't make the code more complex,
I have nothing against it.

Btw, can you, please, run the spell-checker on commit logs? There are many
typos (starting from the title of the series, I guess), which make the patchset
look less appealing.

Thank you!

> 
> This patchset aims to make those kmem pages drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the above test script.
> 
> Patch 1-3 are using obj_cgroup APIs to charge kmem pages. The remote
> memory cgroup charing APIs is a mechanism to charge kernel memory to a
> given memory cgroup. So I also make it use the APIs of obj_cgroup.
> Patch 4-5 are doing this.
> 
> Muchun Song (5):
>   mm: memcontrol: introduce obj_cgroup_{un}charge_page
>   mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem
> page
>   mm: memcontrol: reparent the kmem pages on cgroup removal
>   mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM
>   mm: memcontrol: use object cgroup for remote memory cgroup charging
> 
>  fs/buffer.c  |  10 +-
>  fs/notify/fanotify/fanotify.c|   6 +-
>  fs/notify/fanotify/fanotify_user.c   |   2 +-
>  fs/notify/group.c|   3 +-
>  fs/notify/inotify/inotify_fsnotify.c |   8 +-
>  fs/notify/inotify/inotify_user.c |   2 +-
>  include/linux/bpf.h  |   2 +-
>  include/linux/fsnotify_backend.h |   2 +-
>  include/linux/memcontrol.h   | 109 +++---
>  include/linux/sched.h|   6 +-
>  include/linux/sched/mm.h |  30 ++--
>  kernel/bpf/syscall.c |  35 ++---
>  kernel/fork.c|   4 +-
>  mm/memcontrol.c  | 276 
> ++-
>  mm/page_alloc.c  |   4 +-
>  15 files changed, 324 insertions(+), 175 deletions(-)
> 
> -- 
> 2.11.0
> 


Re: [PATCH bpf-next] selftests/bpf: Fix test_attach_probe for powerpc uprobes

2021-03-02 Thread Andrii Nakryiko
On Mon, Mar 1, 2021 at 11:11 AM Jiri Olsa  wrote:
>
> When testing uprobes, the test gets the GEP (Global Entry Point)
> address from kallsyms, but then the function is called locally,
> so the uprobe is not triggered.
>
> Fix this by adjusting the address to the LEP (Local Entry Point)
> for the powerpc arch.
>
> Signed-off-by: Jiri Olsa 
> ---
>  .../selftests/bpf/prog_tests/attach_probe.c| 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/attach_probe.c b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
> index a0ee87c8e1ea..c3cfb48d3ed0 100644
> --- a/tools/testing/selftests/bpf/prog_tests/attach_probe.c
> +++ b/tools/testing/selftests/bpf/prog_tests/attach_probe.c
> @@ -2,6 +2,22 @@
>  #include 
>  #include "test_attach_probe.skel.h"
>
> +#if defined(__powerpc64__)
> +/*
> + * We get the GEP (Global Entry Point) address from kallsyms,
> + * but then the function is called locally, so we need to adjust
> + * the address to get LEP (Local Entry Point).
> + */
> +#define LEP_OFFSET 8
> +
> +static ssize_t get_offset(ssize_t offset)

if we mark this function __weak global, would it work as is? Would it
get an address of a global entry point? I know nothing about this GEP
vs LEP stuff, interesting :)

> +{
> +   return offset + LEP_OFFSET;
> +}
> +#else
> +#define get_offset(offset) (offset)
> +#endif
> +
>  ssize_t get_base_addr() {
> size_t start, offset;
> char buf[256];
> @@ -36,7 +52,7 @@ void test_attach_probe(void)
> if (CHECK(base_addr < 0, "get_base_addr",
>   "failed to find base addr: %zd", base_addr))
> return;
> -   uprobe_offset = (size_t)&get_base_addr - base_addr;
> +   uprobe_offset = get_offset((size_t)&get_base_addr - base_addr);
>
> skel = test_attach_probe__open_and_load();
> if (CHECK(!skel, "skel_open", "failed to open skeleton\n"))
> --
> 2.29.2
>


Re: [PATCH 4/5] mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM

2021-03-02 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 02:22:26PM +0800, Muchun Song wrote:
> The remote memcg charging APIs are a mechanism to charge kernel memory
> to a given memcg. So we can move the infrastructure to the scope of
> the CONFIG_MEMCG_KMEM.

This is not a good idea, because there is nothing kmem-specific
in the idea of remote charging, and we definitely will see cases
when user memory is charged to a process other than the current one.

> 
> As a bonus, on !CONFIG_MEMCG_KMEM build some functions and variables
> can be compiled out.
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/sched.h| 2 ++
>  include/linux/sched/mm.h | 2 +-
>  kernel/fork.c| 2 +-
>  mm/memcontrol.c  | 4 
>  4 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee46f5cab95b..c2d488eddf85 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1314,7 +1314,9 @@ struct task_struct {
>  
>   /* Number of pages to reclaim on returning to userland: */
>   unsigned intmemcg_nr_pages_over_high;
> +#endif
>  
> +#ifdef CONFIG_MEMCG_KMEM
>   /* Used by memcontrol for targeted memcg charge: */
>   struct mem_cgroup   *active_memcg;
>  #endif
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 1ae08b8462a4..64a72975270e 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -294,7 +294,7 @@ static inline void memalloc_nocma_restore(unsigned int 
> flags)
>  }
>  #endif
>  
> -#ifdef CONFIG_MEMCG
> +#ifdef CONFIG_MEMCG_KMEM
>  DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  /**
>   * set_active_memcg - Starts the remote memcg charging scope.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..d66718bc82d5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -942,7 +942,7 @@ static struct task_struct *dup_task_struct(struct 
> task_struct *orig, int node)
>   tsk->use_memdelay = 0;
>  #endif
>  
> -#ifdef CONFIG_MEMCG
> +#ifdef CONFIG_MEMCG_KMEM
>   tsk->active_memcg = NULL;
>  #endif
>   return tsk;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 39cb8c5bf8b2..092dc4588b43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -76,8 +76,10 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>  
>  struct mem_cgroup *root_mem_cgroup __read_mostly;
>  
> +#ifdef CONFIG_MEMCG_KMEM
>  /* Active memory cgroup to use from an interrupt context */
>  DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> +#endif
>  
>  /* Socket memory accounting disabled? */
>  static bool cgroup_memory_nosocket;
> @@ -1054,6 +1056,7 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
> mm_struct *mm)
>  }
>  EXPORT_SYMBOL(get_mem_cgroup_from_mm);
>  
> +#ifdef CONFIG_MEMCG_KMEM
>  static __always_inline struct mem_cgroup *active_memcg(void)
>  {
>   if (in_interrupt())
> @@ -1074,6 +1077,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>  
>   return false;
>  }
> +#endif
>  
>  /**
>   * mem_cgroup_iter - iterate over memory cgroup hierarchy
> -- 
> 2.11.0
> 


[PATCH net v3] net: fix race between napi kthread mode and busy poll

2021-03-02 Thread Wei Wang
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi. napi_disable() could grab the SCHED bit as well.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
napi_schedule() if the threaded mode is enabled, and gets cleared
in napi_complete_done(), and we only poll the napi in kthread if this
bit is set. This helps distinguish the ownership of the napi between
kthread and other scenarios and fixes the race issue.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov 
Suggested-by: Jakub Kicinski 
Signed-off-by: Wei Wang 
Cc: Alexander Duyck 
Cc: Eric Dumazet 
Cc: Paolo Abeni 
Cc: Hannes Frederic Sowa 
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c| 14 +-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..682908707c1a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,6 +360,7 @@ enum {
NAPI_STATE_IN_BUSY_POLL,/* sk_busy_loop() owns this NAPI */
NAPI_STATE_PREFER_BUSY_POLL,/* prefer busy-polling over softirq processing*/
NAPI_STATE_THREADED,/* The poll is performed inside its own thread*/
+   NAPI_STATE_SCHED_THREADED,  /* Napi is currently scheduled in threaded mode */
 };
 
 enum {
@@ -372,6 +373,7 @@ enum {
NAPIF_STATE_IN_BUSY_POLL= BIT(NAPI_STATE_IN_BUSY_POLL),
NAPIF_STATE_PREFER_BUSY_POLL= BIT(NAPI_STATE_PREFER_BUSY_POLL),
NAPIF_STATE_THREADED= BIT(NAPI_STATE_THREADED),
+   NAPIF_STATE_SCHED_THREADED  = BIT(NAPI_STATE_SCHED_THREADED),
 };
 
 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..03c4763de351 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4294,6 +4294,8 @@ static inline void napi_schedule(struct softnet_data 
*sd,
 */
thread = READ_ONCE(napi->thread);
if (thread) {
+   if (thread->state != TASK_INTERRUPTIBLE)
+   set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
wake_up_process(thread);
return;
}
@@ -6486,6 +6488,7 @@ bool napi_complete_done(struct napi_struct *n, int 
work_done)
WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
 
new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
+ NAPIF_STATE_SCHED_THREADED |
  NAPIF_STATE_PREFER_BUSY_POLL);
 
/* If STATE_MISSED was set, leave STATE_SCHED set,
@@ -6968,16 +6971,25 @@ static int napi_poll(struct napi_struct *n, struct 
list_head *repoll)
 
 static int napi_thread_wait(struct napi_struct *napi)
 {
+   bool woken = false;
+
set_current_state(TASK_INTERRUPTIBLE);
 
while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-   if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+   /* Testing SCHED_THREADED bit here to make sure the current
+* kthread owns this napi and could poll on this napi.
+* Testing SCHED bit is not enough because SCHED bit might be
+* set by some other busy poll thread or by napi_disable().
+*/
+   if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {
WARN_ON(!list_empty(&napi->poll_list));
__set_current_state(TASK_RUNNING);
return 0;
}
 
schedule();
+   /* woken being true indicates this thread owns this napi. */
+   woken = true;
set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);
-- 
2.30.1.766.gb4fecdf3b7-goog
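
The ownership idea in this patch can be modelled outside the kernel with two flag bits in one atomic word. What follows is a simplified userspace sketch using C11 atomics, not the kernel implementation: the `DEMO_*` bit names and the `demo_*` functions are hypothetical, taking the SCHED bit and marking the threaded owner are merged into one step, and wakeups and the `woken` handling are deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_SCHED          (1u << 0)  /* someone owns the poll */
#define DEMO_SCHED_THREADED (1u << 1)  /* ...and that owner is the kthread */

struct demo_napi {
    atomic_uint state;
};

/* Try to take ownership; only a threaded schedule also sets the kthread bit. */
static bool demo_schedule(struct demo_napi *n, bool threaded)
{
    unsigned int bits = DEMO_SCHED | (threaded ? DEMO_SCHED_THREADED : 0);
    unsigned int old = atomic_load(&n->state);

    do {
        if (old & DEMO_SCHED)
            return false;   /* already owned by someone else */
    } while (!atomic_compare_exchange_weak(&n->state, &old, old | bits));
    return true;
}

/* Completion drops both bits, as the patch does in napi_complete_done(). */
static void demo_complete(struct demo_napi *n)
{
    atomic_fetch_and(&n->state, ~(DEMO_SCHED | DEMO_SCHED_THREADED));
}

/* The kthread may poll only when its own bit is set, not merely SCHED. */
static bool demo_kthread_may_poll(struct demo_napi *n)
{
    return atomic_load(&n->state) & DEMO_SCHED_THREADED;
}
```

In this model a busy-poll style owner that takes only `DEMO_SCHED` never makes `demo_kthread_may_poll()` return true, which is the distinction the new bit is meant to provide.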



Re: [PATCH bpf-next] bpf: fix missing * in bpf.h

2021-03-02 Thread Joe Stringer
On Fri, Feb 26, 2021 at 8:51 AM Quentin Monnet  wrote:
>
> 2021-02-24 10:59 UTC-0800 ~ Andrii Nakryiko 
> > On Wed, Feb 24, 2021 at 7:55 AM Daniel Borkmann  
> > wrote:
> >>
> >> On 2/23/21 3:43 PM, Jesper Dangaard Brouer wrote:
> >>> On Tue, 23 Feb 2021 20:45:54 +0800
> >>> Hangbin Liu  wrote:
> >>>
> >>>> Commit 34b2021cc616 ("bpf: Add BPF-helper for MTU checking") lost a *
> >>>> in bpf.h. This will make bpf_helpers_doc.py stop building
> >>>> bpf_helper_defs.h immediately after bpf_check_mtu, which will affect
> >>>> future add functions.
> >>>>
> >>>> Fixes: 34b2021cc616 ("bpf: Add BPF-helper for MTU checking")
> >>>> Signed-off-by: Hangbin Liu 
> >>>> ---
> >>>>   include/uapi/linux/bpf.h   | 2 +-
> >>>>   tools/include/uapi/linux/bpf.h | 2 +-
> >>>>   2 files changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> Thanks for fixing that!
> >>>
> >>> Acked-by: Jesper Dangaard Brouer 
> >>
> >> Thanks guys, applied!
> >>
> >>> I thought I had already fixed that, but I must have missed or reintroduced
> >>> this when I was rolling back broken ideas in V13.
> >>>
> >>> I usually run this command to check the man-page (before submitting):
> >>>
> >>>   ./scripts/bpf_helpers_doc.py | rst2man | man -l -
> >>
> >> [+ Andrii] maybe this could be included to run as part of CI to catch such
> >> things in advance?
> >
> > We do something like that as part of bpftool build, so there is no
> > reason we can't add this to selftests/bpf/Makefile as well.
>
> Hi, pretty sure this is the case already? [0]
>
> This helps catching RST formatting issues, for example if a description
> is using invalid markup, and reported by rst2man. My understanding is
> that in the current case, the missing star simply ends the block for the
> helpers documentation from the parser point of view, it's not considered
> an error.
>
> I see two possible workarounds:
>
> 1) Check that the number of helpers found ("len(self.helpers)") is equal
> to the number of helpers in the file, but that requires knowing how many
> helpers we have in the first place (e.g. parsing "__BPF_FUNC_MAPPER(FN)").

This is not so difficult as long as we stick to one symbol per line:

diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index e2ffac2b7695..74cdcc2bbf18 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -183,25 +183,51 @@ class HeaderParser(object):
 self.reader.readline()
 self.line = self.reader.readline()

+    def get_elem_count(self, target):
+        self.seek_to(target, 'Could not find symbol "%s"' % target)
+        end_re = re.compile('^$')
+        count = 0
+        while True:
+            capture = end_re.match(self.line)
+            if capture:
+                break
+            self.line = self.reader.readline()
+            count += 1
+
+        # The last line (either '};' or '/* */') doesn't count.
+        return count
+

I can either roll this into my docs update v2, or hold onto it for
another dedicated patch fixup. Either way I'm trialing this out
locally to regression-test my own docs update PR and make sure I'm not
breaking one of the various output formats.


[PATCH] iwlwifi: fix ARCH=i386 compilation warnings

2021-03-02 Thread Pierre-Louis Bossart
An unsigned long variable should rely on '%lu' format strings, not '%zd'

Fixes: a1a6a4cf49ece ("iwlwifi: pnvm: implement reading PNVM from UEFI")
Signed-off-by: Pierre-Louis Bossart 
---
warnings found with v5.12-rc1 and next-20210301

 drivers/net/wireless/intel/iwlwifi/fw/pnvm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/intel/iwlwifi/fw/pnvm.c 
b/drivers/net/wireless/intel/iwlwifi/fw/pnvm.c
index fd070ca5e517..40f2109a097f 100644
--- a/drivers/net/wireless/intel/iwlwifi/fw/pnvm.c
+++ b/drivers/net/wireless/intel/iwlwifi/fw/pnvm.c
@@ -271,12 +271,12 @@ static int iwl_pnvm_get_from_efi(struct iwl_trans *trans,
err = efivar_entry_get(pnvm_efivar, NULL, &package_size, package);
if (err) {
IWL_DEBUG_FW(trans,
-"PNVM UEFI variable not found %d (len %zd)\n",
+"PNVM UEFI variable not found %d (len %lu)\n",
 err, package_size);
goto out;
}
 
-   IWL_DEBUG_FW(trans, "Read PNVM fro UEFI with size %zd\n", package_size);
+   IWL_DEBUG_FW(trans, "Read PNVM fro UEFI with size %lu\n", package_size);
 
*data = kmemdup(package->data, *len, GFP_KERNEL);
if (!*data)
-- 
2.25.1
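
As a quick illustration of the rule in the commit message: `%zd` and `%zu` are for `ssize_t` and `size_t`, while an `unsigned long` takes `%lu`. On many 64-bit builds the two types have the same width, which hides the mismatch; on i386 `size_t` is `unsigned int`, so the compiler warns. A minimal userspace sketch follows (the `demo_format_len` and `demo_format_size` helpers are hypothetical):

```c
#include <stddef.h>
#include <stdio.h>

/* Format an unsigned long length with the matching %lu specifier. */
static int demo_format_len(char *buf, size_t buflen, unsigned long package_size)
{
    return snprintf(buf, buflen, "len %lu", package_size);
}

/* size_t values use %zu instead. */
static int demo_format_size(char *buf, size_t buflen, size_t n)
{
    return snprintf(buf, buflen, "size %zu", n);
}
```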



Re: [PATCH] e1000e: use proper #include guard name in hw.h

2021-03-02 Thread Nguyen, Anthony L
On Sat, 2021-02-27 at 10:58 +0100, Greg Kroah-Hartman wrote:
> The include guards for the e1000e and e1000 hw.h files are the same,
> so add the proper "E" term to the hw.h file for the e1000e driver.

There's a patch in process that addresses this issue [1].

> This resolves some static analyzer warnings, like the one found by
> the "lgtm.com" tool.
> 
> Cc: Jesse Brandeburg 
> Cc: Tony Nguyen 
> Cc: "David S. Miller" 
> Cc: Jakub Kicinski 
> Cc: intel-wired-...@lists.osuosl.org
> Signed-off-by: Greg Kroah-Hartman 

[1] https://patchwork.ozlabs.org/project/intel-wired-lan/patch/20210222040005.20126-1-tseew...@gmail.com/

Thanks,
Tony


Re: [PATCH net 1/3] net: dsa: rtl4_a: Pad using __skb_put_padto()

2021-03-02 Thread Florian Fainelli



On 3/1/2021 5:32 AM, Linus Walleij wrote:
> The eth_skb_pad() function will cause a double free
> on failure since dsa_slave_xmit() will try to free
> the frame if we return NULL. Fix this by using
> __skb_put_padto() instead.
> 
> Fixes: 86dd9868b878 ("net: dsa: tag_rtl4_a: Support also egress tags")
> Reported-by: DENG Qingfang 
> Cc: Mauri Sandberg 
> Signed-off-by: Linus Walleij 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net 2/3] net: dsa: rtl4_a: Drop skb_cow_head()

2021-03-02 Thread Florian Fainelli



On 3/1/2021 5:32 AM, Linus Walleij wrote:
> The DSA core already provides the tag headroom, drop this.
> 
> Fixes: 86dd9868b878 ("net: dsa: tag_rtl4_a: Support also egress tags")
> Reported-by: Andrew Lunn 
> Reported-by: DENG Qingfang 
> Cc: Mauri Sandberg 
> Signed-off-by: Linus Walleij 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v3] ath10k: skip the wait for completion to recovery in shutdown path

2021-03-02 Thread Abhishek Kumar
This patch seems to address the comments on v2. Overall this patch LGTM.

Reviewed-by: Abhishek Kumar 

On Tue, Feb 23, 2021 at 6:29 AM Youghandhar Chintala
 wrote:
>
> Currently in the shutdown callback we wait for recovery to complete
> before freeing up the resources. This results in an additional two-second
> delay during shutdown and thereby increases the shutdown time.
>
> As an attempt to take less time during shutdown, remove the wait for
> recovery completion in the shutdown callback and add an API for freeing
> the resources that are common to shutting down and removing
> the module.
>
> Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1
>
> Signed-off-by: Youghandhar Chintala 
> Change-Id: I65bc27b5adae1fedc7f7b367ef13aafbd01f8c0c
> ---
> Changes from v2:
> -Corrected commit text and added common API for freeing the
>  resources for shutdown and unloading the module
> ---
>  drivers/net/wireless/ath/ath10k/snoc.c | 29 ++
>  1 file changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/snoc.c 
> b/drivers/net/wireless/ath/ath10k/snoc.c
> index 84666f72bdfa..70b3f2bd1c81 100644
> --- a/drivers/net/wireless/ath/ath10k/snoc.c
> +++ b/drivers/net/wireless/ath/ath10k/snoc.c
> @@ -1781,17 +1781,11 @@ static int ath10k_snoc_probe(struct platform_device 
> *pdev)
> return ret;
>  }
>
> -static int ath10k_snoc_remove(struct platform_device *pdev)
> +static int ath10k_snoc_free_resources(struct ath10k *ar)
>  {
> -   struct ath10k *ar = platform_get_drvdata(pdev);
> struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
>
> -   ath10k_dbg(ar, ATH10K_DBG_SNOC, "snoc remove\n");
> -
> -   reinit_completion(&ar->driver_recovery);
> -
> -   if (test_bit(ATH10K_SNOC_FLAG_RECOVERY, &ar_snoc->flags))
> -   wait_for_completion_timeout(&ar->driver_recovery, 3 * HZ);
> +   ath10k_dbg(ar, ATH10K_DBG_SNOC, "snoc free resources\n");
>
> set_bit(ATH10K_SNOC_FLAG_UNREGISTERING, &ar_snoc->flags);
>
> @@ -1805,12 +1799,29 @@ static int ath10k_snoc_remove(struct platform_device 
> *pdev)
> return 0;
>  }
>
> +static int ath10k_snoc_remove(struct platform_device *pdev)
> +{
> +   struct ath10k *ar = platform_get_drvdata(pdev);
> +   struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
> +
> +   ath10k_dbg(ar, ATH10K_DBG_SNOC, "snoc remove\n");
> +
> +   reinit_completion(&ar->driver_recovery);
> +
> +   if (test_bit(ATH10K_SNOC_FLAG_RECOVERY, &ar_snoc->flags))
> +   wait_for_completion_timeout(&ar->driver_recovery, 3 * HZ);
> +
> +   ath10k_snoc_free_resources(ar);
> +
> +   return 0;
> +}
> +
>  static void ath10k_snoc_shutdown(struct platform_device *pdev)
>  {
> struct ath10k *ar = platform_get_drvdata(pdev);
>
> ath10k_dbg(ar, ATH10K_DBG_SNOC, "snoc shutdown\n");
> -   ath10k_snoc_remove(pdev);
> +   ath10k_snoc_free_resources(ar);
>  }
>
>  static struct platform_driver ath10k_snoc_driver = {
> --
> 2.29.0
>


Re: [PATCH net 3/3] net: dsa: rtl4_a: Syntax fixes

2021-03-02 Thread Florian Fainelli



On 3/1/2021 5:32 AM, Linus Walleij wrote:
> Some errors spotted in the initial patch: use reverse
> christmas tree for nice code looks and fix a spelling
> mistake.
> 
> Reported-by: Andrew Lunn 
> Reported-by: DENG Qingfang 
> Cc: Mauri Sandberg 
> Signed-off-by: Linus Walleij 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v1 3/7] net: ipa: gsi: Avoid some writes during irq setup for older IPA

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> On some IPA versions (v3.1 and older), writing to registers
> GSI_INTER_EE_SRC_CH_IRQ_OFFSET and GSI_INTER_EE_SRC_EV_CH_IRQ_OFFSET
> will generate a fault and the SoC will lockup.
> 
> Avoid clearing CH and EV_CH interrupts on GSI probe to fix this bad
> behavior: we are anyway not going to get spurious interrupts.

I think the reason for this might be that these registers
are located at a different offset for IPA v3.1.

I'd rather get it right and actively disable these
interrupts rather than assume they won't fire.

Also...  you included an extra blank line; avoid that.

-Alex

> Signed-off-by: AngeloGioacchino Del Regno 
> 
> ---
>  drivers/net/ipa/gsi.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
> index 6315336b3ca8..b5460cbb085c 100644
> --- a/drivers/net/ipa/gsi.c
> +++ b/drivers/net/ipa/gsi.c
> @@ -207,11 +207,14 @@ static void gsi_irq_setup(struct gsi *gsi)
>   iowrite32(0, gsi->virt + GSI_CNTXT_SRC_IEOB_IRQ_MSK_OFFSET);
>  
>   /* Reverse the offset adjustment for inter-EE register offsets */
> - adjust = gsi->version < IPA_VERSION_4_5 ? 0 : GSI_EE_REG_ADJUST;
> - iowrite32(0, gsi->virt + adjust + GSI_INTER_EE_SRC_CH_IRQ_OFFSET);
> - iowrite32(0, gsi->virt + adjust + GSI_INTER_EE_SRC_EV_CH_IRQ_OFFSET);
> + if (gsi->version > IPA_VERSION_3_1) {
> + adjust = gsi->version < IPA_VERSION_4_5 ? 0 : GSI_EE_REG_ADJUST;
> + iowrite32(0, gsi->virt + adjust + 
> GSI_INTER_EE_SRC_CH_IRQ_OFFSET);
> + iowrite32(0, gsi->virt + adjust + 
> GSI_INTER_EE_SRC_EV_CH_IRQ_OFFSET);
> + }
>  
>   iowrite32(0, gsi->virt + GSI_CNTXT_GSI_IRQ_EN_OFFSET);
> +
>  }
>  
>  /* Turn off all GSI interrupts when we're all done */
> 



Re: [PATCH v1 0/7] Add support for IPA v3.1, GSI v1.0, MSM8998 IPA

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> Hey all!
> 
> This time around I thought that it would be nice to get some modem
> action going on. We have it, it's working (ish), so just.. why not.
> 
> This series adds support for IPA v3.1 (featuring GSI v1.0) and also
> takes account for some bits that are shared with other unimplemented
> IPA v3 variants and it is specifically targeting MSM8998, for which
> support is added.

It was more like "next month" rather than "next week," but I
finally took some more time to look at this today.

Again I think it's surprising how little code you had
to implement to get something that seems to be at least
modestly functional.

FYI I have undertaken an effort to make the upstream code
suitable for use for any IPA version (3.0-4.11) in the
past few months.  Most of what I've done is in line with
the things you found were necessary for IPA v3.1 support.
Early on I got most of the support for IPA v4.5 upstream,
and have been holding off trying to get other similar
changes out for review for other versions until I've had
more of a chance to test some of what's new in IPA v4.5.

In the coming weeks I will start posting more of this
work for review.  You'll see that I'm modifying many
things you do in your series (such as making version
checks not assume only v3.5.1 and v4.2 are supported).
My priority is on newer versions, but I want the code
to be (at least) correct for IPA v3.0, v3.1, and v3.5
as well.

What might be best is for you to consider using the
patches when I send them out.  I'll gladly give you some
credit when I do if you like (suggested-by, reviewed-by,
tested-by, whatever you feel is appropriate).  Please
let me know if you would like to be on the Cc list for
this sort of change.

> Since the userspace isn't entirely ready (as far as I can see) for
> data connection (3g/lte/whatever) through the modem, it was possible
> to only partially test this series.

Yes we're still figuring out how the upstream tools need
to interact with the kernel for configuration.

> Specifically, loading the IPA firmware and setting up the interface
> went just fine, along with a basic setup of the network interface
> that got exposed by this driver.

This is great to hear.

> With this series, the benefits that I see are:
>  1. The modem doesn't crash anymore when trying to setup a data
> connection, as now the modem firmware seems to be happy with
> having IPA initialized and ready;
>  2. Other random modem crashes while picking up LTE home network
> signal (even just for calling, nothing fancy) seem to be gone.
> 
> These are the reasons why I think that this series is ready for
> upstream action. It's *at least* stabilizing the platform when
> the modem is up.
> 
> This was tested on the F(x)Tec Pro 1 (MSM8998) smartphone.

I unfortunately can't promise you you'll have the full
connection up and running, but we can probably get very
close.

It would be very helpful for you (someone other than me,
that is) to participate in validating the changes I am
now finalizing.  I hope you're willing.

I'll offer a few more specific comments on each of your
patches.

-Alex


> AngeloGioacchino Del Regno (7):
>   net: ipa: Add support for IPA v3.1 with GSI v1.0
>   net: ipa: endpoint: Don't read unexistant register on IPAv3.1
>   net: ipa: gsi: Avoid some writes during irq setup for older IPA
>   net: ipa: gsi: Use right masks for GSI v1.0 channels hw param
>   net: ipa: Add support for IPA on MSM8998
>   dt-bindings: net: qcom-ipa: Document qcom,sc7180-ipa compatible
>   dt-bindings: net: qcom-ipa: Document qcom,msm8998-ipa compatible
> 
>  .../devicetree/bindings/net/qcom,ipa.yaml |   7 +-
>  drivers/net/ipa/Makefile  |   3 +-
>  drivers/net/ipa/gsi.c |  33 +-
>  drivers/net/ipa/gsi_reg.h |   5 +
>  drivers/net/ipa/ipa_data-msm8998.c| 407 ++
>  drivers/net/ipa/ipa_data.h|   5 +
>  drivers/net/ipa/ipa_endpoint.c|  26 +-
>  drivers/net/ipa/ipa_main.c|  12 +-
>  drivers/net/ipa/ipa_reg.h |   3 +
>  drivers/net/ipa/ipa_version.h |   1 +
>  10 files changed, 480 insertions(+), 22 deletions(-)
>  create mode 100644 drivers/net/ipa/ipa_data-msm8998.c
> 



Re: [PATCH v1 5/7] net: ipa: Add support for IPA on MSM8998

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> MSM8998 features IPA v3.1 (GSI v1.0): add the required configuration
> data for it.
> 
> Signed-off-by: AngeloGioacchino Del Regno 
> 

As of today, I have not looked at this file in detail.  You
probably see that the intent is to have this file define
pretty much everything that varies across platforms.  A lot
of this is found in "ipa_utils.c" in downstream code, and
it is organized differently there.

I have reworked the way resources are represented (also
not yet posted for review, but "soon").  I see you included
the additional ones but I'm not completely sure they'll
get programmed properly (but again, I haven't looked very
closely yet).

Interconnects are done differently upstream than downstream,
and to be honest I'm not completely on top of which platforms
require which interconnects.  I'm gathering information about
them as I can.

Jakub pointed out a compile problem, so you should definitely
avoid ever having those in your patches, but sometimes it
happens.

When I'm ready to post my IPA v3.1 data file for review
I'll take another, closer look at what you have here.

-Alex
> ---
>  drivers/net/ipa/Makefile   |   3 +-
>  drivers/net/ipa/ipa_data-msm8998.c | 407 +
>  drivers/net/ipa/ipa_data.h |   5 +
>  drivers/net/ipa/ipa_main.c |   4 +
>  4 files changed, 418 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ipa/ipa_data-msm8998.c
> 
> diff --git a/drivers/net/ipa/Makefile b/drivers/net/ipa/Makefile
> index afe5df1e6eee..4a6f4053dce2 100644
> --- a/drivers/net/ipa/Makefile
> +++ b/drivers/net/ipa/Makefile
> @@ -9,4 +9,5 @@ ipa-y :=  ipa_main.o ipa_clock.o 
> ipa_reg.o ipa_mem.o \
>   ipa_endpoint.o ipa_cmd.o ipa_modem.o \
>   ipa_qmi.o ipa_qmi_msg.o
>  
> -ipa-y+=  ipa_data-sdm845.o ipa_data-sc7180.o
> +ipa-y+=  ipa_data-msm8998.o ipa_data-sdm845.o \
> + ipa_data-sc7180.o
> diff --git a/drivers/net/ipa/ipa_data-msm8998.c 
> b/drivers/net/ipa/ipa_data-msm8998.c
> new file mode 100644
> index ..90e724468e40
> --- /dev/null
> +++ b/drivers/net/ipa/ipa_data-msm8998.c
> @@ -0,0 +1,407 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/* Copyright (c) 2012-2018, The Linux Foundation. All rights reserved.
> + * Copyright (C) 2019-2020 Linaro Ltd.
> + * Coypright (C) 2021, AngeloGioacchino Del Regno
> + * 
> + */
> +
> +#include 
> +
> +#include "gsi.h"
> +#include "ipa_data.h"
> +#include "ipa_endpoint.h"
> +#include "ipa_mem.h"
> +
> +/* Endpoint configuration for the MSM8998 SoC. */
> +static const struct ipa_gsi_endpoint_data ipa_gsi_endpoint_data[] = {
> + [IPA_ENDPOINT_AP_COMMAND_TX] = {
> + .ee_id  = GSI_EE_AP,
> + .channel_id = 6,
> + .endpoint_id= 22,
> + .toward_ipa = true,
> + .channel = {
> + .tre_count  = 256,
> + .event_count= 256,
> + .tlv_count  = 18,
> + },
> + .endpoint = {
> + .seq_type   = IPA_SEQ_DMA_ONLY,
> + .config = {
> + .resource_group = 1,
> + .dma_mode   = true,
> + .dma_endpoint   = IPA_ENDPOINT_AP_LAN_RX,
> + },
> + },
> + },
> + [IPA_ENDPOINT_AP_LAN_RX] = {
> + .ee_id  = GSI_EE_AP,
> + .channel_id = 7,
> + .endpoint_id= 15,
> + .toward_ipa = false,
> + .channel = {
> + .tre_count  = 256,
> + .event_count= 256,
> + .tlv_count  = 8,
> + },
> + .endpoint = {
> + .seq_type   = IPA_SEQ_INVALID,
> + .config = {
> + .resource_group = 1,
> + .aggregation= true,
> + .status_enable  = true,
> + .rx = {
> + .pad_align  = ilog2(sizeof(u32)),
> + },
> + },
> + },
> + },
> + [IPA_ENDPOINT_AP_MODEM_TX] = {
> + .ee_id  = GSI_EE_AP,
> + .channel_id = 5,
> + .endpoint_id= 3,
> + .toward_ipa = true,
> + .channel = {
> + .tre_count  = 512,
> + .event_count= 512,
> + .tlv_count  = 16,
> + },
> + .endpoint = {
> + .filter_support = true,
> + .seq_type

Re: [PATCH v1 6/7] dt-bindings: net: qcom-ipa: Document qcom,sc7180-ipa compatible

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> The driver supports SC7180, but the binding was not documented.
> Just add it.

I hadn't noticed that!

I'm trying to get through reviewing your series
today and this will take another hour or so to
go validate to my satisfaction.

Would you be willing to submit just this patch
as a fix, and when you do I will give it a proper
review?

-Alex

> Signed-off-by: AngeloGioacchino Del Regno 
> 
> ---
>  Documentation/devicetree/bindings/net/qcom,ipa.yaml | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/qcom,ipa.yaml 
> b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> index 8a2d12644675..b063c6c1077a 100644
> --- a/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> +++ b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> @@ -43,7 +43,11 @@ description:
>  
>  properties:
>compatible:
> -const: "qcom,sdm845-ipa"
> +oneOf:
> +  - items:
> +  - enum:
> +  - "qcom,sdm845-ipa"
> +  - "qcom,sc7180-ipa"
>  
>reg:
>  items:
> 



Re: [PATCH v1 7/7] dt-bindings: net: qcom-ipa: Document qcom,msm8998-ipa compatible

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> MSM8998 support has been added: document the new compatible.
> 
> Signed-off-by: AngeloGioacchino Del Regno 
> 

With the previous patch in place, this becomes almost
automatic.

But I don't want to claim support for a platform
until things actually *work*.  I don't just mean
we can compile and load (and load firmware), I
want to be able to say we can actually carry LTE
data over IPA before we advertise the compatible
string here.

Maybe I'm being picky, but that's my preference.
It adds some motivation for getting the user space
tools squared away.

Thank you again very much for your patches.

-Alex
> ---
>  Documentation/devicetree/bindings/net/qcom,ipa.yaml | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/qcom,ipa.yaml 
> b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> index b063c6c1077a..9dacd224b606 100644
> --- a/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> +++ b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
> @@ -46,6 +46,7 @@ properties:
>  oneOf:
>- items:
>- enum:
> +  - "qcom,msm8998-ipa"
>- "qcom,sdm845-ipa"
>- "qcom,sc7180-ipa"
>  
> 



Re: [PATCH v1 1/7] net: ipa: Add support for IPA v3.1 with GSI v1.0

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> In preparation for adding support for the MSM8998 SoC's IPA,
> add the necessary bits for IPA version 3.1 featuring GSI 1.0,
> found on at least MSM8998.
> 
> Signed-off-by: AngeloGioacchino Del Regno 
> 

Overall, this looks good.  As I mentioned, I've
implemented a very similar set of changes in my
private development tree.  It's part of a much
larger set of changes intended to allow many
IPA versions to be supported.

A few minor comments, below.

-Alex

> ---
>  drivers/net/ipa/gsi.c  |  8 
>  drivers/net/ipa/ipa_endpoint.c | 17 +
>  drivers/net/ipa/ipa_main.c |  8 ++--
>  drivers/net/ipa/ipa_reg.h  |  3 +++
>  drivers/net/ipa/ipa_version.h  |  1 +
>  5 files changed, 23 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
> index 14d9a791924b..6315336b3ca8 100644
> --- a/drivers/net/ipa/gsi.c
> +++ b/drivers/net/ipa/gsi.c
> @@ -794,14 +794,14 @@ static void gsi_channel_program(struct gsi_channel 
> *channel, bool doorbell)
>  
>   /* Max prefetch is 1 segment (do not set MAX_PREFETCH_FMASK) */
>  
> - /* We enable the doorbell engine for IPA v3.5.1 */
> - if (gsi->version == IPA_VERSION_3_5_1 && doorbell)
> + /* We enable the doorbell engine for IPA v3.x */
> + if (gsi->version < IPA_VERSION_4_0 && doorbell)

My version:
if (gsi->version < IPA_VERSION_4_0 && doorbell)

So... You're doing the right thing.  Almost all changes I made
like this were identical to yours; others were (I think all)
equivalent.

>   val |= USE_DB_ENG_FMASK;
>  
>   /* v4.0 introduces an escape buffer for prefetch.  We use it
>* on all but the AP command channel.
>*/
> - if (gsi->version != IPA_VERSION_3_5_1 && !channel->command) {
> + if (gsi->version >= IPA_VERSION_4_0 && !channel->command) {
>   /* If not otherwise set, prefetch buffers are used */
>   if (gsi->version < IPA_VERSION_4_5)
>   val |= USE_ESCAPE_BUF_ONLY_FMASK;

. . .

> diff --git a/drivers/net/ipa/ipa_main.c b/drivers/net/ipa/ipa_main.c
> index 84bb8ae92725..be191993fbec 100644
> --- a/drivers/net/ipa/ipa_main.c
> +++ b/drivers/net/ipa/ipa_main.c

. . .

> @@ -276,6 +276,7 @@ static void ipa_hardware_config_qsb(struct ipa *ipa)
>  
>   max1 = 12;
>   switch (version) {
> + case IPA_VERSION_3_1:

I do this a little differently now.  These values will be
found in the "ipa_data" file for the platform.

Also I think you'd need different values for IPA v3.1 than
for IPA v3.5.1.

>   case IPA_VERSION_3_5_1:
>   max0 = 8;
>   break;
> @@ -404,6 +405,9 @@ static void ipa_hardware_config(struct ipa *ipa)
>   /* Enable open global clocks (not needed for IPA v4.5) */
>   val = GLOBAL_FMASK;
>   val |= GLOBAL_2X_CLK_FMASK;
> + if (version == IPA_VERSION_3_1)
> + val |= MISC_FMASK;

I see this being set for a workaround for IPA v3.1 in the
msm-4.4 tree, but the other two flags aren't set in that
case.  So this might not be quite right.

> +
>   iowrite32(val, ipa->reg_virt + IPA_REG_CLKON_CFG_OFFSET);
>  
>   /* Disable PA mask to allow HOLB drop */

. . .


Re: [PATCH v1 4/7] net: ipa: gsi: Use right masks for GSI v1.0 channels hw param

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> In GSI v1.0 the register GSI_HW_PARAM_2_OFFSET has different layout
> so the number of channels and events per EE are, of course, laid out
> in 8 bits each (0-7, 8-15 respectively).
> 
> Signed-off-by: AngeloGioacchino Del Regno 
> 
> ---
>  drivers/net/ipa/gsi.c | 16 +---
>  drivers/net/ipa/gsi_reg.h |  5 +
>  2 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ipa/gsi.c b/drivers/net/ipa/gsi.c
> index b5460cbb085c..3311ffe514c9 100644
> --- a/drivers/net/ipa/gsi.c
> +++ b/drivers/net/ipa/gsi.c
> @@ -1790,7 +1790,7 @@ static void gsi_channel_teardown(struct gsi *gsi)
>  int gsi_setup(struct gsi *gsi)
>  {
>   struct device *dev = gsi->dev;
> - u32 val;
> + u32 val, mask;
>   int ret;
>  
>   /* Here is where we first touch the GSI hardware */
> @@ -1804,7 +1804,12 @@ int gsi_setup(struct gsi *gsi)
>  
>   val = ioread32(gsi->virt + GSI_GSI_HW_PARAM_2_OFFSET);
>  
> - gsi->channel_count = u32_get_bits(val, NUM_CH_PER_EE_FMASK);
> + if (gsi->version == IPA_VERSION_3_1)
> + mask = GSIV1_NUM_CH_PER_EE_FMASK;
> + else
> + mask = NUM_CH_PER_EE_FMASK;
> +
> + gsi->channel_count = u32_get_bits(val, mask);

I have a different way of doing this, at least for
encoding, and I'd rather use a similar convention in
this case.  At some point it might become obvious
that "there's got to be a better way" and I might have
to consider something else, but for now I've been
doing what I describe below.

Anyway, what I'd ask for here is to create a static
inline function in "ipa_reg.h" (or "gsi_reg.h") to
extract these values.  In this case it might look
like this:

static inline u32 num_ev_per_ee_get(enum ipa_version version,
                                    u32 val)
{
        if (version == IPA_VERSION_3_0 || version == IPA_VERSION_3_1)
                return u32_get_bits(val, GENMASK(8, 0));

        return u32_get_bits(val, GENMASK(7, 3));
}

(I'm not sure if the above is correct for all versions...)

Then the caller would do:
gsi->evt_ring_count = num_ev_per_ee_get(ipa->version, val);

I'd want the same general thing for the channel count.
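
For the channel count, the same convention could be sketched in user
space as below.  This is only an illustration: FIELD_MASK() and
field_get() are hypothetical stand-ins for the kernel's GENMASK() and
u32_get_bits(), the enum is made up for the example, and the two masks
are simply the GSIV1_NUM_CH_PER_EE_FMASK and NUM_CH_PER_EE_FMASK values
from the quoted patch, not something verified against hardware.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's GENMASK(h, l). */
#define FIELD_MASK(h, l) \
	((uint32_t)((~UINT32_C(0) >> (31 - (h))) & (~UINT32_C(0) << (l))))

/* Hypothetical stand-in for u32_get_bits(): extract the masked field. */
static inline uint32_t field_get(uint32_t val, uint32_t mask)
{
	/* mask & -mask isolates the lowest set bit of the mask, so the
	 * division shifts the field down to bit 0.
	 */
	return (val & mask) / (mask & (~mask + 1));
}

/* Made-up version enum, for illustration only. */
enum ipa_version_sketch {
	SKETCH_IPA_V3_1,
	SKETCH_IPA_V3_5_1,
};

/* Pick the field layout by version so callers never see the masks. */
static inline uint32_t num_ch_per_ee_get(enum ipa_version_sketch version,
					 uint32_t val)
{
	if (version == SKETCH_IPA_V3_1)
		return field_get(val, FIELD_MASK(15, 8));	/* GSI v1.0 */

	return field_get(val, FIELD_MASK(7, 3));		/* v3.5.1+ */
}
```

The caller then only passes the version and the raw register value, and
the layout difference stays in one place in the register header.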

-Alex

>   if (!gsi->channel_count) {
>   dev_err(dev, "GSI reports zero channels supported\n");
>   return -EINVAL;
> @@ -1816,7 +1821,12 @@ int gsi_setup(struct gsi *gsi)
>   gsi->channel_count = GSI_CHANNEL_COUNT_MAX;
>   }
>  
> - gsi->evt_ring_count = u32_get_bits(val, NUM_EV_PER_EE_FMASK);
> + if (gsi->version == IPA_VERSION_3_1)
> + mask = GSIV1_NUM_EV_PER_EE_FMASK;
> + else
> + mask = NUM_EV_PER_EE_FMASK;
> +
> + gsi->evt_ring_count = u32_get_bits(val, mask);
>   if (!gsi->evt_ring_count) {
>   dev_err(dev, "GSI reports zero event rings supported\n");
>   return -EINVAL;
> diff --git a/drivers/net/ipa/gsi_reg.h b/drivers/net/ipa/gsi_reg.h
> index 0e138bbd8205..4ba579fa21c2 100644
> --- a/drivers/net/ipa/gsi_reg.h
> +++ b/drivers/net/ipa/gsi_reg.h
> @@ -287,6 +287,11 @@ enum gsi_generic_cmd_opcode {
>   GSI_EE_N_GSI_HW_PARAM_2_OFFSET(GSI_EE_AP)
>  #define GSI_EE_N_GSI_HW_PARAM_2_OFFSET(ee) \
>   (0x0001f040 + 0x4000 * (ee))
> +
> +/* Fields below are present for IPA v3.1 with GSI version 1 */
> +#define GSIV1_NUM_EV_PER_EE_FMASKGENMASK(8, 0)
> +#define GSIV1_NUM_CH_PER_EE_FMASKGENMASK(15, 8)
> +/* Fields below are present for IPA v3.5.1 and above */
>  #define IRAM_SIZE_FMASK  GENMASK(2, 0)
>  #define NUM_CH_PER_EE_FMASK  GENMASK(7, 3)
>  #define NUM_EV_PER_EE_FMASK  GENMASK(12, 8)
> 



Re: [PATCH v1 2/7] net: ipa: endpoint: Don't read unexistant register on IPAv3.1

2021-03-02 Thread Alex Elder
On 2/11/21 11:50 AM, AngeloGioacchino Del Regno wrote:
> On IPAv3.1 there is no such FLAVOR_0 register so it is impossible
> to read tx/rx channel masks and we have to rely on the correctness
> on the provided configuration.

This works, and is simple.

I think I would rather populate the available mask here
with a mask containing the actual version-specific available
endpoints.  On the other hand, looking at the downstream code,
it looks like almost any of these endpoints could be used.

So, while I don't know for sure the all-1's value here is
*correct*, it's more of a validation check anyway, so it's
probably fine.

-Alex

> Signed-off-by: AngeloGioacchino Del Regno 
> 
> ---
>  drivers/net/ipa/ipa_endpoint.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/net/ipa/ipa_endpoint.c b/drivers/net/ipa/ipa_endpoint.c
> index 06d8aa34276e..10c477e1bb90 100644
> --- a/drivers/net/ipa/ipa_endpoint.c
> +++ b/drivers/net/ipa/ipa_endpoint.c
> @@ -1659,6 +1659,15 @@ int ipa_endpoint_config(struct ipa *ipa)
>   u32 max;
>   u32 val;
>  
> + /* Some IPA versions don't provide a FLAVOR register and we cannot
> +  * check the rx/tx masks hence we have to rely on the correctness
> +  * of the provided configuration.
> +  */
> + if (ipa->version == IPA_VERSION_3_1) {
> + ipa->available = U32_MAX;
> + return 0;
> + }
> +
>   /* Find out about the endpoints supplied by the hardware, and ensure
>* the highest one doesn't exceed the number we support.
>*/
> 



RE: [PATCH net v4 1/1] can: can_skb_set_owner(): fix ref counting if socket was closed before setting skb ownership

2021-03-02 Thread Joakim Zhang

> -Original Message-
> From: Joakim Zhang 
> Sent: 2021年3月1日 18:57
> To: Oleksij Rempel ; m...@pengutronix.de; David S.
> Miller ; Jakub Kicinski ; Oliver
> Hartkopp ; Robin van der Gracht
> 
> Cc: Andre Naujoks ; Eric Dumazet
> ; ker...@pengutronix.de; linux-...@vger.kernel.org;
> netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: RE: [PATCH net v4 1/1] can: can_skb_set_owner(): fix ref counting if
> socket was closed before setting skb ownership
> 
> 
> > -Original Message-
> > From: Oleksij Rempel 
> > Sent: 2021年2月26日 17:25
> > To: m...@pengutronix.de; David S. Miller ; Jakub
> > Kicinski ; Oliver Hartkopp ;
> > Robin van der Gracht 
> > Cc: Oleksij Rempel ; Andre Naujoks
> > ; Eric Dumazet ;
> > ker...@pengutronix.de; linux-...@vger.kernel.org;
> > netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> > Subject: [PATCH net v4 1/1] can: can_skb_set_owner(): fix ref counting
> > if socket was closed before setting skb ownership
> >
> > There are two ref count variables controlling the free()ing of a socket:
> > - struct sock::sk_refcnt - which is changed by sock_hold()/sock_put()
> > - struct sock::sk_wmem_alloc - which accounts the memory allocated by
> >   the skbs in the send path.
> >
> > In case there are still TX skbs on the fly and the socket() is closed,
> > the struct sock::sk_refcnt reaches 0. In the TX-path the CAN stack
> > clones an "echo" skb, calls sock_hold() on the original socket and
> > references it. This produces the following back trace:
> >
> > | WARNING: CPU: 0 PID: 280 at lib/refcount.c:25
> > | refcount_warn_saturate+0x114/0x134
> > | refcount_t: addition on 0; use-after-free.
> > | Modules linked in: coda_vpu(E) v4l2_jpeg(E) videobuf2_vmalloc(E)
> > imx_vdoa(E)
> > | CPU: 0 PID: 280 Comm: test_can.sh Tainted: GE
> > 5.11.0-04577-gf8ff6603c617 #203
> > | Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> > | Backtrace:
> > | [<80bafea4>] (dump_backtrace) from [<80bb0280>]
> > | (show_stack+0x20/0x24)
> > | r7: r6:600f0113 r5: r4:81441220 [<80bb0260>]
> > | (show_stack) from [<80bb593c>] (dump_stack+0xa0/0xc8) [<80bb589c>]
> > | (dump_stack) from [<8012b268>] (__warn+0xd4/0x114) r9:0019
> > | r8:80f4a8c2 r7:83e4150c r6: r5:0009 r4:80528f90
> > | [<8012b194>] (__warn) from [<80bb09c4>]
> > | (warn_slowpath_fmt+0x88/0xc8)
> > | r9:83f26400 r8:80f4a8d1 r7:0009 r6:80528f90 r5:0019
> > | r4:80f4a8c2 [<80bb0940>] (warn_slowpath_fmt) from [<80528f90>]
> > | (refcount_warn_saturate+0x114/0x134) r8: r7:
> > | r6:82b44000 r5:834e5600 r4:83f4d540 [<80528e7c>]
> > | (refcount_warn_saturate) from [<8079a4c8>]
> > | (__refcount_add.constprop.0+0x4c/0x50)
> > | [<8079a47c>] (__refcount_add.constprop.0) from [<8079a57c>]
> > | (can_put_echo_skb+0xb0/0x13c) [<8079a4cc>] (can_put_echo_skb) from
> > | [<8079ba98>] (flexcan_start_xmit+0x1c4/0x230) r9:0010
> > | r8:83f48610
> > | r7:0fdc r6:0c08 r5:82b44000 r4:834e5600 [<8079b8d4>]
> > | (flexcan_start_xmit) from [<80969078>] (netdev_start_xmit+0x44/0x70)
> > | r9:814c0ba0 r8:80c8790c r7: r6:834e5600 r5:82b44000
> > | r4:82ab1f00 [<80969034>] (netdev_start_xmit) from [<809725a4>]
> > | (dev_hard_start_xmit+0x19c/0x318) r9:814c0ba0 r8:
> > | r7:82ab1f00
> > | r6:82b44000 r5: r4:834e5600 [<80972408>]
> > | (dev_hard_start_xmit) from [<809c6584>] (sch_direct_xmit+0xcc/0x264)
> > | r10:834e5600
> > | r9: r8: r7:82b44000 r6:82ab1f00 r5:834e5600
> > | r4:83f27400 [<809c64b8>] (sch_direct_xmit) from [<809c6c0c>]
> > | (__qdisc_run+0x4f0/0x534)
> >
> > To fix this problem, only set skb ownership to sockets which have
> > still a ref count > 0.
> >
> > Cc: Oliver Hartkopp 
> > Cc: Andre Naujoks 
> > Suggested-by: Eric Dumazet 
> > Fixes: 0ae89beb283a ("can: add destructor for self generated skbs")
> > Signed-off-by: Oleksij Rempel 
> 
> I will give out a test result tomorrow when the board is available. 😊

I also met this issue in the past and this patch indeed fixes it. Thanks,
Oleksij Rempel.

Tested-by: Joakim Zhang 

Best Regards,
Joakim Zhang
> Best Regards,
> Joakim Zhang
> > ---
> >  include/linux/can/skb.h | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/can/skb.h b/include/linux/can/skb.h index
> > 685f34cfba20..d82018cc0d0b 100644
> > --- a/include/linux/can/skb.h
> > +++ b/include/linux/can/skb.h
> > @@ -65,8 +65,12 @@ static inline void can_skb_reserve(struct sk_buff
> > *skb)
> >
> >  static inline void can_skb_set_owner(struct sk_buff *skb, struct sock *sk)
> {
> > -   if (sk) {
> > -   sock_hold(sk);
> > +   /*
> > +* If the socket has already been closed by user space, the refcount may
> > +* already be 0 (and the socket will be freed after the last TX skb has
> > +* been freed). So only increase socket refcount if the refcount is > 0.
> > +*/
> > +   if (sk && refcount_inc_not_zero(&sk->sk_refcnt)) {
> >
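
The "only take a reference if the count is still above zero" rule that
the patch relies on can be illustrated outside the kernel with C11
atomics.  This is a minimal sketch, not the kernel's
refcount_inc_not_zero() implementation: ref_inc_not_zero() is a
hypothetical name, and it omits the saturation checks the kernel's
refcount_t performs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (count > 0).
 * Returns true if the count was incremented, false if it was already 0.
 */
static bool ref_inc_not_zero(atomic_uint *ref)
{
	unsigned int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value,
		 * so the loop re-checks for zero before retrying.
		 */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}

	return false;
}
```

A plain increment would resurrect a count that has already dropped to
zero, which is the "addition on 0; use-after-free" case in the warning
above; the compare-and-exchange loop refuses to do that.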

[Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP

2021-03-02 Thread Cong Wang
From: Cong Wang 

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets. After they are moved into a VM, we have to
add a proxy to forward these communications from the VM to the host,
because rewriting thousands of them is not practical. This proxy uses
an AF_UNIX socket connected to the services and a UDP socket to connect
to the host. It is inefficient because data is copied between kernel
space and user space twice, and we cannot use splice(), which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user space at all (after the initial setup).

Currently sockmap fully supports only TCP. UDP is partially supported:
UDP sockets can only be added to a sockmap. This patchset, as the second
part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.

At a high level, ->sendmsg_locked() and ->read_sock() are required
for each protocol to support sockmap redirection. In order to do the
sock proto update, a new op ->update_proto() is introduced, which each
protocol must also implement. A BPF ->recvmsg() is also needed to
replace the original ->recvmsg() to retrieve skmsg. Please see each
patch for more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
this patchset is also available:
https://github.com/congwang/linux/tree/sockmap2

---
v2: separate from the original large patchset
rebase to the latest bpf-next
split UDP test case
move inet_csk_has_ulp() check to tcp_bpf.c
clean up udp_read_sock()

Cong Wang (9):
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk_prot->update_proto()
  udp: implement ->sendmsg_locked()
  udp: implement ->read_sock() for sockmap
  udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap

 include/linux/skmsg.h |  25 ++--
 include/net/ipv6.h|   1 +
 include/net/sock.h|   3 +
 include/net/tcp.h |   3 +-
 include/net/udp.h |   4 +
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/syscall.c  |   1 +
 net/core/skmsg.c  | 113 +-
 net/core/sock_map.c   |  52 ---
 net/ipv4/af_inet.c|   2 +
 net/ipv4/tcp_bpf.c| 129 +++-
 net/ipv4/tcp_ipv4.c   |   3 +
 net/ipv4/udp.c|  68 -
 net/ipv4/udp_bpf.c|  78 +-
 net/ipv6/af_inet6.c   |   2 +
 net/ipv6/tcp_ipv6.c   |   3 +
 net/ipv6/udp.c|  30 +++-
 net/tls/tls_sw.c  |   4 +-
 tools/bpf/bpftool/common.c|   1 +
 tools/bpf/bpftool/prog.c  |   1 +
 tools/include/uapi/linux/bpf.h|   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 23 files changed, 517 insertions(+), 170 deletions(-)

-- 
2.25.1



[Patch bpf-next v2 3/9] udp: implement ->sendmsg_locked()

2021-03-02 Thread Cong Wang
From: Cong Wang 

UDP already has udp_sendmsg() which takes lock_sock() inside.
We have to build ->sendmsg_locked() on top of it, by adding
a new parameter for whether the sock has been locked.

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/net/udp.h  |  1 +
 net/ipv4/af_inet.c |  1 +
 net/ipv4/udp.c | 30 +++---
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index df7cc1edc200..5264ba1439f9 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -292,6 +292,7 @@ int udp_get_port(struct sock *sk, unsigned short snum,
 int udp_err(struct sk_buff *, u32);
 int udp_abort(struct sock *sk, int err);
 int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len);
+int udp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
 int udp_push_pending_frames(struct sock *sk);
 void udp_flush_pending_frames(struct sock *sk);
 int udp_cmsg_send(struct sock *sk, struct msghdr *msg, u16 *gso_size);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a02ce89b56b5..d8c73a848c53 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1071,6 +1071,7 @@ const struct proto_ops inet_dgram_ops = {
.setsockopt= sock_common_setsockopt,
.getsockopt= sock_common_getsockopt,
.sendmsg   = inet_sendmsg,
+   .sendmsg_locked= udp_sendmsg_locked,
.recvmsg   = inet_recvmsg,
.mmap  = sock_no_mmap,
.sendpage  = inet_sendpage,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index dbd25b59ce0e..93db853601d7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1024,7 +1024,7 @@ int udp_cmsg_send(struct sock *sk, struct msghdr *msg, 
u16 *gso_size)
 }
 EXPORT_SYMBOL_GPL(udp_cmsg_send);
 
-int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+static int __udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool 
locked)
 {
struct inet_sock *inet = inet_sk(sk);
struct udp_sock *up = udp_sk(sk);
@@ -1063,15 +1063,18 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
 * There are pending frames.
 * The socket lock must be held while it's corked.
 */
-   lock_sock(sk);
+   if (!locked)
+   lock_sock(sk);
if (likely(up->pending)) {
if (unlikely(up->pending != AF_INET)) {
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
return -EINVAL;
}
goto do_append_data;
}
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
}
ulen += sizeof(struct udphdr);
 
@@ -1241,11 +1244,13 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
goto out;
}
 
-   lock_sock(sk);
+   if (!locked)
+   lock_sock(sk);
if (unlikely(up->pending)) {
/* The socket is already corked while preparing it. */
/* ... which is an evident application bug. --ANK */
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
 
net_dbg_ratelimited("socket already corked\n");
err = -EINVAL;
@@ -1272,7 +1277,8 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t len)
err = udp_push_pending_frames(sk);
else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
up->pending = 0;
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
 
 out:
ip_rt_put(rt);
@@ -1302,8 +1308,18 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
err = 0;
goto out;
 }
+
+int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+   return __udp_sendmsg(sk, msg, len, false);
+}
 EXPORT_SYMBOL(udp_sendmsg);
 
+int udp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len)
+{
+   return __udp_sendmsg(sk, msg, len, true);
+}
+
 int udp_sendpage(struct sock *sk, struct page *page, int offset,
 size_t size, int flags)
 {
-- 
2.25.1
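
The split of udp_sendmsg() above follows a common pattern: one shared
body takes a "locked" flag, and two thin wrappers pick who owns the
socket lock. Below is a minimal user-space sketch of that pattern. All
toy_* names are hypothetical stand-ins, not kernel APIs, and the lock is
modelled as a plain flag so the invariant can be checked with assert().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock; not a kernel type. */
struct toy_sock {
	bool owned;	/* true while the "socket lock" is held */
	int pending;	/* models up->pending (socket already corked) */
	size_t sent;	/* bytes accepted so far */
};

void toy_lock_sock(struct toy_sock *sk)
{
	assert(!sk->owned);	/* taking the lock twice would deadlock */
	sk->owned = true;
}

void toy_release_sock(struct toy_sock *sk)
{
	assert(sk->owned);
	sk->owned = false;
}

/* Shared body: only touches the lock when the caller does not hold it. */
int toy_sendmsg_common(struct toy_sock *sk, size_t len, bool locked)
{
	int err = 0;

	if (!locked)
		toy_lock_sock(sk);
	if (sk->pending)
		err = -EINVAL;	/* corked socket, mirrors the error path */
	else
		sk->sent += len;
	if (!locked)
		toy_release_sock(sk);
	return err;
}

/* Entry point for callers that do not hold the lock. */
int toy_sendmsg(struct toy_sock *sk, size_t len)
{
	return toy_sendmsg_common(sk, len, false);
}

/* Entry point for callers that already hold the lock. */
int toy_sendmsg_locked(struct toy_sock *sk, size_t len)
{
	assert(sk->owned);
	return toy_sendmsg_common(sk, len, true);
}
```

The point of the sketch is that every lock/release pair in the shared
body, including the ones on error paths, has to be guarded by the same
flag, otherwise the locked variant would drop a lock it does not own.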



[Patch bpf-next v2 1/9] sock_map: introduce BPF_SK_SKB_VERDICT

2021-03-02 Thread Cong Wang
From: Cong Wang 

Reusing BPF_SK_SKB_STREAM_VERDICT is possible, but its name is
confusing and, more importantly, we still want to distinguish the two
from user-space. So we can just reuse the stream verdict code but
introduce a new type of eBPF program, skb_verdict. Users are not
allowed to set stream_verdict and skb_verdict at the same time.
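
A minimal user-space sketch of the two rules described above: the
receive path falls back from stream_verdict to skb_verdict, and the two
slots are mutually exclusive. The toy_* names are hypothetical, and the
-EBUSY return value is an assumption made for illustration only; the
error code used by the actual patch is not shown here.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not kernel types. */
struct toy_prog {
	const char *name;
};

struct toy_progs {
	struct toy_prog *stream_verdict;
	struct toy_prog *skb_verdict;
};

enum toy_attach {
	TOY_STREAM_VERDICT,
	TOY_SKB_VERDICT,
};

/* Refuse to fill one verdict slot while the other one is in use. */
int toy_attach_verdict(struct toy_progs *progs, enum toy_attach type,
		       struct toy_prog *prog)
{
	if (type == TOY_STREAM_VERDICT) {
		if (progs->skb_verdict)
			return -EBUSY;	/* assumed error code */
		progs->stream_verdict = prog;
	} else {
		if (progs->stream_verdict)
			return -EBUSY;	/* assumed error code */
		progs->skb_verdict = prog;
	}
	return 0;
}

/* Same fallback order as the sk_psock_verdict_recv() hunk. */
struct toy_prog *toy_pick_verdict(const struct toy_progs *progs)
{
	struct toy_prog *prog = progs->stream_verdict;

	if (!prog)
		prog = progs->skb_verdict;
	return prog;
}
```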

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/linux/skmsg.h  |  3 +++
 include/uapi/linux/bpf.h   |  1 +
 kernel/bpf/syscall.c   |  1 +
 net/core/skmsg.c   |  4 +++-
 net/core/sock_map.c| 23 ++-
 tools/bpf/bpftool/common.c |  1 +
 tools/bpf/bpftool/prog.c   |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 8 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 6c09d94be2e9..451530d41af7 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -58,6 +58,7 @@ struct sk_psock_progs {
struct bpf_prog *msg_parser;
struct bpf_prog *stream_parser;
struct bpf_prog *stream_verdict;
+   struct bpf_prog *skb_verdict;
 };
 
 enum sk_psock_state_bits {
@@ -442,6 +443,7 @@ static inline void psock_progs_drop(struct sk_psock_progs *progs)
psock_set_prog(&progs->msg_parser, NULL);
psock_set_prog(&progs->stream_parser, NULL);
psock_set_prog(&progs->stream_verdict, NULL);
+   psock_set_prog(&progs->skb_verdict, NULL);
 }
 
 int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb);
@@ -489,5 +491,6 @@ static inline void skb_bpf_redirect_clear(struct sk_buff *skb)
 {
skb->_sk_redir = 0;
 }
+
 #endif /* CONFIG_NET_SOCK_MSG */
 #endif /* _LINUX_SKMSG_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b89af20cfa19..1a08ab00a45e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -247,6 +247,7 @@ enum bpf_attach_type {
BPF_XDP_CPUMAP,
BPF_SK_LOOKUP,
BPF_XDP,
+   BPF_SK_SKB_VERDICT,
__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c859bc46d06c..afa803a1553e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2941,6 +2941,7 @@ attach_type_to_prog_type(enum bpf_attach_type attach_type)
return BPF_PROG_TYPE_SK_MSG;
case BPF_SK_SKB_STREAM_PARSER:
case BPF_SK_SKB_STREAM_VERDICT:
+   case BPF_SK_SKB_VERDICT:
return BPF_PROG_TYPE_SK_SKB;
case BPF_LIRC_MODE2:
return BPF_PROG_TYPE_LIRC_MODE2;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 07f54015238a..5efd790f1b47 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -693,7 +693,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
rcu_assign_sk_user_data(sk, NULL);
if (psock->progs.stream_parser)
sk_psock_stop_strp(sk, psock);
-   else if (psock->progs.stream_verdict)
+   else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
sk_psock_stop_verdict(sk, psock);
write_unlock_bh(&sk->sk_callback_lock);
sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
@@ -1010,6 +1010,8 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
}
skb_set_owner_r(skb, sk);
prog = READ_ONCE(psock->progs.stream_verdict);
+   if (!prog)
+   prog = READ_ONCE(psock->progs.skb_verdict);
if (likely(prog)) {
skb_dst_drop(skb);
skb_bpf_redirect_clear(skb);
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index dd53a7771d7e..3bddd9dd2da2 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -155,6 +155,8 @@ static void sock_map_del_link(struct sock *sk,
strp_stop = true;
if (psock->saved_data_ready && 
stab->progs.stream_verdict)
verdict_stop = true;
+   if (psock->saved_data_ready && stab->progs.skb_verdict)
+   verdict_stop = true;
list_del(&link->list);
sk_psock_free_link(link);
}
@@ -227,7 +229,7 @@ static struct sk_psock *sock_map_psock_get_checked(struct sock *sk)
 static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
 struct sock *sk)
 {
-   struct bpf_prog *msg_parser, *stream_parser, *stream_verdict;
+   struct bpf_prog *msg_parser, *stream_parser, *stream_verdict, *skb_verdict;
struct sk_psock *psock;
int ret;
 
@@ -256,6 +258,15 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs,
}
}
 
+   skb_verdict = READ_ONCE(progs->skb_verdict);
+   if (skb_verdict) {
+   s

[Patch bpf-next v2 2/9] sock: introduce sk_prot->update_proto()

2021-03-02 Thread Cong Wang
From: Cong Wang 

Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new op, sk->sk_prot->update_proto(), so each protocol
can implement its own way to replace the struct proto. This also
helps get rid of symbol dependencies on CONFIG_INET.
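
A minimal user-space sketch of the indirection described above: the core
code no longer switches on the socket type, it calls an op that the
protocol itself provides, and remembers that op for the restore path.
The toy_* names are hypothetical, and returning -EINVAL when a protocol
has no such op is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sock;

/* Hypothetical stand-in for struct proto with the new op. */
struct toy_proto {
	const char *name;
	int (*update_proto)(struct toy_sock *sk, bool restore);
};

struct toy_sock {
	const struct toy_proto *prot;		/* ops currently in use */
	const struct toy_proto *saved_prot;	/* original ops */
	int (*saved_update_proto)(struct toy_sock *sk, bool restore);
};

int toy_udp_update_proto(struct toy_sock *sk, bool restore);

const struct toy_proto toy_udp_prot = { "udp", toy_udp_update_proto };
const struct toy_proto toy_udp_bpf_prot = { "udp_bpf", toy_udp_update_proto };

/* Protocol-owned hook: swap in the BPF variant, or put the original back. */
int toy_udp_update_proto(struct toy_sock *sk, bool restore)
{
	if (restore) {
		sk->prot = sk->saved_prot;
		return 0;
	}
	sk->saved_prot = sk->prot;
	sk->prot = &toy_udp_bpf_prot;
	return 0;
}

/* Core side: no per-protocol switch, only an indirect call. */
int toy_sock_map_init_proto(struct toy_sock *sk)
{
	if (!sk->prot->update_proto)
		return -EINVAL;	/* assumed error code */
	sk->saved_update_proto = sk->prot->update_proto;
	return sk->prot->update_proto(sk, false);
}

/* Mirrors the saved_update_proto(sk, true) call in the restore helper. */
int toy_psock_restore_proto(struct toy_sock *sk)
{
	if (sk->saved_update_proto)
		return sk->saved_update_proto(sk, true);
	return 0;
}
```

Because the core side only dereferences a pointer stored in the
protocol's own ops table, it needs no link-time reference to a symbol
that might live in a module.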

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/linux/skmsg.h | 18 +++---
 include/net/sock.h|  3 +++
 include/net/tcp.h |  1 +
 include/net/udp.h |  1 +
 net/core/skmsg.c  |  5 -
 net/core/sock_map.c   | 24 
 net/ipv4/tcp_bpf.c| 23 ---
 net/ipv4/tcp_ipv4.c   |  3 +++
 net/ipv4/udp.c|  3 +++
 net/ipv4/udp_bpf.c| 14 --
 net/ipv6/tcp_ipv6.c   |  3 +++
 net/ipv6/udp.c|  3 +++
 12 files changed, 56 insertions(+), 45 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 451530d41af7..b5df69d5d397 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -98,6 +98,7 @@ struct sk_psock {
void (*saved_close)(struct sock *sk, long timeout);
void (*saved_write_space)(struct sock *sk);
void (*saved_data_ready)(struct sock *sk);
+   int  (*saved_update_proto)(struct sock *sk, bool restore);
struct proto*sk_proto;
struct sk_psock_work_state  work_state;
struct work_struct  work;
@@ -350,25 +351,12 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
}
 }
 
-static inline void sk_psock_update_proto(struct sock *sk,
-struct sk_psock *psock,
-struct proto *ops)
-{
-   /* Pairs with lockless read in sk_clone_lock() */
-   WRITE_ONCE(sk->sk_prot, ops);
-}
-
 static inline void sk_psock_restore_proto(struct sock *sk,
  struct sk_psock *psock)
 {
sk->sk_prot->unhash = psock->saved_unhash;
-   if (inet_csk_has_ulp(sk)) {
-   tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
-   } else {
-   sk->sk_write_space = psock->saved_write_space;
-   /* Pairs with lockless read in sk_clone_lock() */
-   WRITE_ONCE(sk->sk_prot, psock->sk_proto);
-   }
+   if (psock->saved_update_proto)
+   psock->saved_update_proto(sk, true);
 }
 
 static inline void sk_psock_set_state(struct sk_psock *psock,
diff --git a/include/net/sock.h b/include/net/sock.h
index 636810ddcd9b..0e8577c917e8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1184,6 +1184,9 @@ struct proto {
void(*unhash)(struct sock *sk);
void(*rehash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short 
snum);
+#ifdef CONFIG_BPF_SYSCALL
+   int (*update_proto)(struct sock *sk, bool restore);
+#endif
 
/* Keeping track of sockets in use */
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 075de26f449d..2efa4e5ea23d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2203,6 +2203,7 @@ struct sk_psock;
 
 #ifdef CONFIG_BPF_SYSCALL
 struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
+int tcp_bpf_update_proto(struct sock *sk, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 #endif /* CONFIG_BPF_SYSCALL */
 
diff --git a/include/net/udp.h b/include/net/udp.h
index d4d064c59232..df7cc1edc200 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -518,6 +518,7 @@ static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
 #ifdef CONFIG_BPF_SYSCALL
 struct sk_psock;
 struct proto *udp_bpf_get_proto(struct sock *sk, struct sk_psock *psock);
+int udp_bpf_update_proto(struct sock *sk, bool restore);
 #endif
 
 #endif /* _UDP_H */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 5efd790f1b47..7dbd8344ec89 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -563,11 +563,6 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 
write_lock_bh(&sk->sk_callback_lock);
 
-   if (inet_csk_has_ulp(sk)) {
-   psock = ERR_PTR(-EINVAL);
-   goto out;
-   }
-
if (sk->sk_user_data) {
psock = ERR_PTR(-EBUSY);
goto out;
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 3bddd9dd2da2..13d2af5bb81c 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
 
 static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
 {
-   struct proto *prot;
-
-   switch (sk->sk_type) {
-   case SOCK_STREAM:
-   prot = tcp_bpf_get_proto(sk, psock);
-   

[Patch bpf-next v2 4/9] udp: implement ->read_sock() for sockmap

2021-03-02 Thread Cong Wang
From: Cong Wang 

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/net/udp.h  |  2 ++
 net/ipv4/af_inet.c |  1 +
 net/ipv4/udp.c | 34 ++
 3 files changed, 37 insertions(+)

diff --git a/include/net/udp.h b/include/net/udp.h
index 5264ba1439f9..44a94cfc63b5 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -330,6 +330,8 @@ struct sock *__udp6_lib_lookup(struct net *net,
   struct sk_buff *skb);
 struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb,
 __be16 sport, __be16 dport);
+int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor);
 
 /* UDP uses skb->dev_scratch to cache as much information as possible and avoid
  * possibly multiple cache miss on dequeue()
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d8c73a848c53..df8e8e238756 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1072,6 +1072,7 @@ const struct proto_ops inet_dgram_ops = {
.getsockopt= sock_common_getsockopt,
.sendmsg   = inet_sendmsg,
.sendmsg_locked= udp_sendmsg_locked,
+   .read_sock = udp_read_sock,
.recvmsg   = inet_recvmsg,
.mmap  = sock_no_mmap,
.sendpage  = inet_sendpage,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 93db853601d7..54f24b1d4f65 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1798,6 +1798,40 @@ struct sk_buff *__skb_recv_udp(struct sock *sk, unsigned int flags,
 }
 EXPORT_SYMBOL(__skb_recv_udp);
 
+int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
+ sk_read_actor_t recv_actor)
+{
+   int copied = 0;
+
+   while (1) {
+   int offset = 0, err;
+   struct sk_buff *skb;
+
+   skb = __skb_recv_udp(sk, 0, 1, &offset, &err);
+   if (!skb)
+   break;
+   if (offset < skb->len) {
+   int used;
+   size_t len;
+
+   len = skb->len - offset;
+   used = recv_actor(desc, skb, offset, len);
+   if (used <= 0) {
+   if (!copied)
+   copied = used;
+   break;
+   } else if (used <= len) {
+   copied += used;
+   offset += used;
+   }
+   }
+   if (!desc->count)
+   break;
+   }
+
+   return copied;
+}
+
 /*
  * This should be easy, if there is something there we
  * return it, otherwise we block.
-- 
2.25.1
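
The control flow of udp_read_sock() above can be sketched in user space
as a loop that dequeues one datagram at a time, hands it to a callback,
and stops when the callback fails or the descriptor's budget runs out.
The toy_* names are hypothetical, and treating desc->count as a
"datagrams still wanted" counter is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for sk_buff, the receive queue and the descriptor. */
struct toy_dgram {
	const char *data;
	size_t len;
};

struct toy_queue {
	const struct toy_dgram *items;
	size_t n;
	size_t head;	/* index of the next datagram to dequeue */
};

struct toy_desc {
	size_t count;	/* remaining budget; 0 stops the loop */
};

typedef int (*toy_actor_t)(struct toy_desc *desc, const struct toy_dgram *d,
			   size_t offset, size_t len);

int toy_read_sock(struct toy_queue *q, struct toy_desc *desc, toy_actor_t actor)
{
	int copied = 0;

	while (q->head < q->n) {
		const struct toy_dgram *d = &q->items[q->head++];

		if (d->len > 0) {
			int used = actor(desc, d, 0, d->len);

			if (used <= 0) {
				/* Report the error only if nothing was consumed. */
				if (!copied)
					copied = used;
				break;
			}
			copied += used;
		}
		if (!desc->count)
			break;
	}
	return copied;
}

/* Example actor: accepts the whole datagram and spends one unit of budget. */
int toy_count_actor(struct toy_desc *desc, const struct toy_dgram *d,
		    size_t offset, size_t len)
{
	(void)d;
	(void)offset;
	if (desc->count)
		desc->count--;
	return (int)len;
}
```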



[Patch bpf-next v2 5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6

2021-03-02 Thread Cong Wang
From: Cong Wang 

Similarly, udpv6_sendmsg() takes lock_sock() internally, so we
have to build ->sendmsg_locked() on top of it.

For ->read_sock(), we can just reuse udp_read_sock().

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/net/ipv6.h  |  1 +
 net/ipv4/udp.c  |  1 +
 net/ipv6/af_inet6.c |  2 ++
 net/ipv6/udp.c  | 27 +--
 4 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index bd1f396cc9c7..48b6850dae85 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1119,6 +1119,7 @@ int inet6_hash_connect(struct inet_timewait_death_row *death_row,
 int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size);
 int inet6_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
  int flags);
+int udpv6_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
 
 /*
  * reassembly.c
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 54f24b1d4f65..717c543aaec3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1831,6 +1831,7 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
 
return copied;
 }
+EXPORT_SYMBOL(udp_read_sock);
 
 /*
  * This should be easy, if there is something there we
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 1fb75f01756c..634ab3a825d7 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -714,7 +714,9 @@ const struct proto_ops inet6_dgram_ops = {
.setsockopt= sock_common_setsockopt,/* ok   */
.getsockopt= sock_common_getsockopt,/* ok   */
.sendmsg   = inet6_sendmsg, /* retpoline's sake */
+   .sendmsg_locked= udpv6_sendmsg_locked,
.recvmsg   = inet6_recvmsg, /* retpoline's sake */
+   .read_sock = udp_read_sock,
.mmap  = sock_no_mmap,
.sendpage  = sock_no_sendpage,
.set_peek_off  = sk_set_peek_off,
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 105ba0cf739d..4372597bc271 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1272,7 +1272,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
return err;
 }
 
-int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+static int __udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool locked)
 {
struct ipv6_txoptions opt_space;
struct udp_sock *up = udp_sk(sk);
@@ -1361,7 +1361,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 * There are pending frames.
 * The socket lock must be held while it's corked.
 */
-   lock_sock(sk);
+   if (!locked)
+   lock_sock(sk);
if (likely(up->pending)) {
if (unlikely(up->pending != AF_INET6)) {
release_sock(sk);
@@ -1370,7 +1371,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
dst = NULL;
goto do_append_data;
}
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
}
ulen += sizeof(struct udphdr);
 
@@ -1533,11 +1535,13 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
goto out;
}
 
-   lock_sock(sk);
+   if (!locked)
+   lock_sock(sk);
if (unlikely(up->pending)) {
/* The socket is already corked while preparing it. */
/* ... which is an evident application bug. --ANK */
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
 
net_dbg_ratelimited("udp cork app bug 2\n");
err = -EINVAL;
@@ -1562,7 +1566,8 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 
if (err > 0)
err = np->recverr ? net_xmit_errno(err) : 0;
-   release_sock(sk);
+   if (!locked)
+   release_sock(sk);
 
 out:
dst_release(dst);
@@ -1593,6 +1598,16 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
goto out;
 }
 
+int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
+{
+   return __udpv6_sendmsg(sk, msg, len, false);
+}
+
+int udpv6_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len)
+{
+   return __udpv6_sendmsg(sk, msg, len, true);
+}
+
 void udpv6_destroy_sock(struct sock *sk)
 {
struct udp_sock *up = udp_sk(sk);
-- 
2.25.1



[Patch bpf-next v2 6/9] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()

2021-03-02 Thread Cong Wang
From: Cong Wang 

Although these two functions are only used by TCP, they are not
specific to TCP at all: both operate on skmsg and ingress_msg,
so they fit well in net/core/skmsg.c.

We will also need them for non-TCP protocols, so rename them, move
them to skmsg.c and export them to modules.
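
The copy loop at the heart of sk_msg_recvmsg() walks scatterlist
elements, copies at most len bytes, and only consumes the elements when
MSG_PEEK is not set. Below is a user-space sketch of just that loop,
assuming a flat array of elements; the toy_* names are hypothetical, and
memory accounting, page references and multi-message handling are left
out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one scatterlist element. */
struct toy_sge {
	const char *page;	/* backing data */
	size_t offset;		/* first unread byte within the page */
	size_t length;		/* unread bytes left in this element */
};

/* Copy up to len bytes into dst; consume the elements unless peeking. */
size_t toy_copy_elems(struct toy_sge *sg, size_t nelems, char *dst,
		      size_t len, bool peek)
{
	size_t copied = 0;

	for (size_t i = 0; i < nelems && copied < len; i++) {
		size_t copy = sg[i].length;

		/* Clamp the last chunk to the caller's buffer size. */
		if (copied + copy > len)
			copy = len - copied;
		memcpy(dst + copied, sg[i].page + sg[i].offset, copy);
		copied += copy;

		if (!peek) {
			sg[i].offset += copy;
			sg[i].length -= copy;
		}
	}
	return copied;
}
```

Nothing in the loop depends on the transport protocol, which is the
property the commit message relies on.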

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 include/linux/skmsg.h |   4 ++
 include/net/tcp.h |   2 -
 net/core/skmsg.c  | 104 +
 net/ipv4/tcp_bpf.c| 106 +-
 net/tls/tls_sw.c  |   4 +-
 5 files changed, 112 insertions(+), 108 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index b5df69d5d397..8c24495d8d33 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -126,6 +126,10 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
  struct sk_msg *msg, u32 bytes);
 int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 struct sk_msg *msg, u32 bytes);
+int sk_msg_wait_data(struct sock *sk, struct sk_psock *psock, int flags,
+long timeo, int *err);
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+  int len, int flags);
 
 static inline void sk_msg_check_to_free(struct sk_msg *msg, u32 i, u32 bytes)
 {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 2efa4e5ea23d..31b1696c62ba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2209,8 +2209,6 @@ void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 
 int tcp_bpf_sendmsg_redir(struct sock *sk, struct sk_msg *msg, u32 bytes,
  int flags);
-int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
- struct msghdr *msg, int len, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
 
 #if !defined(CONFIG_BPF_SYSCALL) || !defined(CONFIG_NET_SOCK_MSG)
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 7dbd8344ec89..fa10d869a728 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -399,6 +399,110 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
+int sk_msg_wait_data(struct sock *sk, struct sk_psock *psock, int flags,
+long timeo, int *err)
+{
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
+   int ret = 0;
+
+   if (sk->sk_shutdown & RCV_SHUTDOWN)
+   return 1;
+
+   if (!timeo)
+   return ret;
+
+   add_wait_queue(sk_sleep(sk), &wait);
+   sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+   ret = sk_wait_event(sk, &timeo,
+   !list_empty(&psock->ingress_msg) ||
+   !skb_queue_empty(&sk->sk_receive_queue), &wait);
+   sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
+   remove_wait_queue(sk_sleep(sk), &wait);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(sk_msg_wait_data);
+
+/* Receive sk_msg from psock->ingress_msg to @msg. */
+int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
+  int len, int flags)
+{
+   struct iov_iter *iter = &msg->msg_iter;
+   int peek = flags & MSG_PEEK;
+   struct sk_msg *msg_rx;
+   int i, copied = 0;
+
+   msg_rx = list_first_entry_or_null(&psock->ingress_msg,
+ struct sk_msg, list);
+
+   while (copied != len) {
+   struct scatterlist *sge;
+
+   if (unlikely(!msg_rx))
+   break;
+
+   i = msg_rx->sg.start;
+   do {
+   struct page *page;
+   int copy;
+
+   sge = sk_msg_elem(msg_rx, i);
+   copy = sge->length;
+   page = sg_page(sge);
+   if (copied + copy > len)
+   copy = len - copied;
+   copy = copy_page_to_iter(page, sge->offset, copy, iter);
+   if (!copy)
+   return copied ? copied : -EFAULT;
+
+   copied += copy;
+   if (likely(!peek)) {
+   sge->offset += copy;
+   sge->length -= copy;
+   if (!msg_rx->skb)
+   sk_mem_uncharge(sk, copy);
+   msg_rx->sg.size -= copy;
+
+   if (!sge->length) {
+   sk_msg_iter_var_next(i);
+   if (!msg_rx->skb)
+   put_page(page);
+   }
+   } else {
+   /* Lets not optimize peek case if 
copy_page_to_iter
+   

[Patch bpf-next v2 7/9] udp: implement udp_bpf_recvmsg() for sockmap

2021-03-02 Thread Cong Wang
From: Cong Wang 

We have to implement udp_bpf_recvmsg() to replace ->recvmsg(), so
that skmsg can be retrieved from ingress_msg.

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 net/ipv4/udp_bpf.c | 64 +-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index 595836088e85..9a37ba056575 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -4,6 +4,68 @@
 #include 
 #include 
 #include 
+#include 
+
+#include "udp_impl.h"
+
+static struct proto *udpv6_prot_saved __read_mostly;
+
+static int sk_udp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+ int noblock, int flags, int *addr_len)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+   if (sk->sk_family == AF_INET6)
+   return udpv6_prot_saved->recvmsg(sk, msg, len, noblock, flags,
+addr_len);
+#endif
+   return udp_prot.recvmsg(sk, msg, len, noblock, flags, addr_len);
+}
+
+static int udp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
+  int nonblock, int flags, int *addr_len)
+{
+   struct sk_psock *psock;
+   int copied, ret;
+
+   if (unlikely(flags & MSG_ERRQUEUE))
+   return inet_recv_error(sk, msg, len, addr_len);
+
+   psock = sk_psock_get(sk);
+   if (unlikely(!psock))
+   return sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+
+   lock_sock(sk);
+   if (sk_psock_queue_empty(psock)) {
+   ret = sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+   goto out;
+   }
+
+msg_bytes_ready:
+   copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+   if (!copied) {
+   int data, err = 0;
+   long timeo;
+
+   timeo = sock_rcvtimeo(sk, nonblock);
+   data = sk_msg_wait_data(sk, psock, flags, timeo, &err);
+   if (data) {
+   if (!sk_psock_queue_empty(psock))
+   goto msg_bytes_ready;
+   ret = sk_udp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+   goto out;
+   }
+   if (err) {
+   ret = err;
+   goto out;
+   }
+   copied = -EAGAIN;
+   }
+   ret = copied;
+out:
+   release_sock(sk);
+   sk_psock_put(sk, psock);
+   return ret;
+}
 
 enum {
UDP_BPF_IPV4,
@@ -11,7 +73,6 @@ enum {
UDP_BPF_NUM_PROTS,
 };
 
-static struct proto *udpv6_prot_saved __read_mostly;
 static DEFINE_SPINLOCK(udpv6_prot_lock);
 static struct proto udp_bpf_prots[UDP_BPF_NUM_PROTS];
 
@@ -20,6 +81,7 @@ static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base)
*prot= *base;
prot->unhash = sock_map_unhash;
prot->close  = sock_map_close;
+   prot->recvmsg = udp_bpf_recvmsg;
 }
 
 static void udp_bpf_check_v6_needs_rebuild(struct proto *ops)
-- 
2.25.1
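
The dispatch rule in udp_bpf_recvmsg() above is: with no psock, or with
an empty ingress queue, fall through to the protocol's ordinary
recvmsg(); otherwise serve the redirected data first. A user-space
sketch of only that rule follows. The toy_* names are hypothetical, and
the sketch deliberately ignores datagram boundaries, blocking, timeouts
and locking.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical byte source standing in for a queue of messages. */
struct toy_buf {
	const char *data;
	size_t len;
};

struct toy_sock {
	struct toy_buf *ingress;	/* redirected data; NULL models "no psock" */
	struct toy_buf rxq;		/* ordinary receive queue */
};

/* Move up to len bytes out of src into dst. */
static size_t toy_take(struct toy_buf *src, char *dst, size_t len)
{
	size_t n = src->len < len ? src->len : len;

	if (n)
		memcpy(dst, src->data, n);
	src->data += n;
	src->len -= n;
	return n;
}

size_t toy_bpf_recvmsg(struct toy_sock *sk, char *dst, size_t len)
{
	/* No psock or nothing redirected: behave like the plain protocol. */
	if (!sk->ingress || sk->ingress->len == 0)
		return toy_take(&sk->rxq, dst, len);

	/* Redirected data is served before the ordinary queue. */
	return toy_take(sk->ingress, dst, len);
}
```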



[Patch bpf-next v2 9/9] selftests/bpf: add a test case for udp sockmap

2021-03-02 Thread Cong Wang
From: Cong Wang 

Add a test case to ensure redirection between two UDP sockets works.

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 2 files changed, 162 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index c26e6bf05e49..a549ebd3b5a6 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -1563,6 +1563,142 @@ static void test_redir(struct test_sockmap_listen *skel, struct bpf_map *map,
}
 }
 
+static void udp_redir_to_connected(int family, int sotype, int sock_mapfd,
+  int verd_mapfd, enum redir_mode mode)
+{
+   const char *log_prefix = redir_mode_str(mode);
+   struct sockaddr_storage addr;
+   int c0, c1, p0, p1;
+   unsigned int pass;
+   socklen_t len;
+   int err, n;
+   u64 value;
+   u32 key;
+   char b;
+
+   zero_verdict_count(verd_mapfd);
+
+   p0 = socket_loopback(family, sotype | SOCK_NONBLOCK);
+   if (p0 < 0)
+   return;
+   len = sizeof(addr);
+   err = xgetsockname(p0, sockaddr(&addr), &len);
+   if (err)
+   goto close_peer0;
+
+   c0 = xsocket(family, sotype | SOCK_NONBLOCK, 0);
+   if (c0 < 0)
+   goto close_peer0;
+   err = xconnect(c0, sockaddr(&addr), len);
+   if (err)
+   goto close_cli0;
+   err = xgetsockname(c0, sockaddr(&addr), &len);
+   if (err)
+   goto close_cli0;
+   err = xconnect(p0, sockaddr(&addr), len);
+   if (err)
+   goto close_cli0;
+
+   p1 = socket_loopback(family, sotype | SOCK_NONBLOCK);
+   if (p1 < 0)
+   goto close_cli0;
+   err = xgetsockname(p1, sockaddr(&addr), &len);
+   if (err)
+   goto close_cli0;
+
+   c1 = xsocket(family, sotype | SOCK_NONBLOCK, 0);
+   if (c1 < 0)
+   goto close_peer1;
+   err = xconnect(c1, sockaddr(&addr), len);
+   if (err)
+   goto close_cli1;
+   err = xgetsockname(c1, sockaddr(&addr), &len);
+   if (err)
+   goto close_cli1;
+   err = xconnect(p1, sockaddr(&addr), len);
+   if (err)
+   goto close_cli1;
+
+   key = 0;
+   value = p0;
+   err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+   if (err)
+   goto close_cli1;
+
+   key = 1;
+   value = p1;
+   err = xbpf_map_update_elem(sock_mapfd, &key, &value, BPF_NOEXIST);
+   if (err)
+   goto close_cli1;
+
+   n = write(c1, "a", 1);
+   if (n < 0)
+   FAIL_ERRNO("%s: write", log_prefix);
+   if (n == 0)
+   FAIL("%s: incomplete write", log_prefix);
+   if (n < 1)
+   goto close_cli1;
+
+   key = SK_PASS;
+   err = xbpf_map_lookup_elem(verd_mapfd, &key, &pass);
+   if (err)
+   goto close_cli1;
+   if (pass != 1)
+   FAIL("%s: want pass count 1, have %d", log_prefix, pass);
+
+   n = read(mode == REDIR_INGRESS ? p0 : c0, &b, 1);
+   if (n < 0)
+   FAIL_ERRNO("%s: read", log_prefix);
+   if (n == 0)
+   FAIL("%s: incomplete read", log_prefix);
+
+close_cli1:
+   xclose(c1);
+close_peer1:
+   xclose(p1);
+close_cli0:
+   xclose(c0);
+close_peer0:
+   xclose(p0);
+}
+
+static void udp_skb_redir_to_connected(struct test_sockmap_listen *skel,
+  struct bpf_map *inner_map, int family,
+  int sotype)
+{
+   int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+   int verdict_map = bpf_map__fd(skel->maps.verdict_map);
+   int sock_map = bpf_map__fd(inner_map);
+   int err;
+
+   err = xbpf_prog_attach(verdict, sock_map, BPF_SK_SKB_VERDICT, 0);
+   if (err)
+   return;
+
+   skel->bss->test_ingress = false;
+   udp_redir_to_connected(family, sotype, sock_map, verdict_map,
+  REDIR_EGRESS);
+   skel->bss->test_ingress = true;
+   udp_redir_to_connected(family, sotype, sock_map, verdict_map,
+  REDIR_INGRESS);
+
+   xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
+}
+
+static void test_udp_redir(struct test_sockmap_listen *skel, struct bpf_map *map,
+  int family)
+{
+   const char *family_name, *map_name;
+   char s[MAX_TEST_NAME];
+
+   family_name = family_str(family);
+   map_name = map_type_str(map);
+   snprintf(s, sizeof(s), "%s %s %s", map_name, family_name, __func__);
+   if (!test__start_subtest(s))
+   return;
+  

[Patch bpf-next v2 8/9] sock_map: update sock type checks for UDP

2021-03-02 Thread Cong Wang
From: Cong Wang 

Now that UDP supports sockmap and redirection, we can safely update
the sock type checks for it accordingly.

Cc: John Fastabend 
Cc: Daniel Borkmann 
Cc: Jakub Sitnicki 
Cc: Lorenz Bauer 
Signed-off-by: Cong Wang 
---
 net/core/sock_map.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 13d2af5bb81c..f7eee4b7b994 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
 
 static bool sock_map_redirect_allowed(const struct sock *sk)
 {
-   return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
+   if (sk_is_tcp(sk))
+   return sk->sk_state != TCP_LISTEN;
+   else
+   return sk->sk_state == TCP_ESTABLISHED;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
-- 
2.25.1
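
The updated predicate can be restated as a small stand-alone sketch:
TCP sockets may be redirect targets unless they are listening, while
everything else (here, UDP) must be in the established state, which for
UDP corresponds to a connected socket. The toy_* types are hypothetical
and only mirror the shape of the check.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the socket kind and sk->sk_state values. */
enum toy_kind {
	TOY_TCP,
	TOY_UDP,
};

enum toy_state {
	TOY_ESTABLISHED,
	TOY_LISTEN,
	TOY_CLOSE,
};

struct toy_sock {
	enum toy_kind kind;
	enum toy_state state;
};

bool toy_redirect_allowed(const struct toy_sock *sk)
{
	if (sk->kind == TOY_TCP)
		return sk->state != TOY_LISTEN;
	/* Non-TCP: only sockets in the established state qualify. */
	return sk->state == TOY_ESTABLISHED;
}
```

This matches the selftest in 9/9, which connects both the client and
the peer sockets before inserting them into the map.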



Re: [External] Re: [PATCH 0/5] Use obj_cgroup APIs to change kmem pages

2021-03-02 Thread Muchun Song
On Tue, Mar 2, 2021 at 9:12 AM Roman Gushchin  wrote:
>
> Hi Muchun!
>
> On Mon, Mar 01, 2021 at 02:22:22PM +0800, Muchun Song wrote:
> > Since Roman series "The new cgroup slab memory controller" applied. All
> > slab objects are changed via the new APIs of obj_cgroup. This new APIs
> > introduce a struct obj_cgroup instead of using struct mem_cgroup directly
> > to charge slab objects. It prevents long-living objects from pinning the
> > original memory cgroup in the memory. But there are still some corner
> > objects (e.g. allocations larger than order-1 page on SLUB) which are
> > not charged via the API of obj_cgroup. Those objects (include the pages
> > which are allocated from buddy allocator directly) are charged as kmem
> > pages which still hold a reference to the memory cgroup.
>
> Yes, this is a good idea, large kmallocs should be treated the same
> way as small ones.
>
> >
> > E.g. We know that the kernel stack is charged as kmem pages because the
> > size of the kernel stack can be greater than 2 pages (e.g. 16KB on x86_64
> > or arm64). If we create a thread (suppose the thread stack is charged to
> > memory cgroup A) and then move it from memory cgroup A to memory cgroup
> > B. Because the kernel stack of the thread hold a reference to the memory
> > cgroup A. The thread can pin the memory cgroup A in the memory even if
> > we remove the cgroup A. If we want to see this scenario by using the
> > following script. We can see that the system has added 500 dying cgroups.
> >
> >   #!/bin/bash
> >
> >   cat /proc/cgroups | grep memory
> >
> >   cd /sys/fs/cgroup/memory
> >   echo 1 > memory.move_charge_at_immigrate
> >
> >   for i in range{1..500}
> >   do
> >   mkdir kmem_test
> >   echo $$ > kmem_test/cgroup.procs
> >   sleep 3600 &
> >   echo $$ > cgroup.procs
> >   echo `cat kmem_test/cgroup.procs` > cgroup.procs
> >   rmdir kmem_test
> >   done
> >
> >   cat /proc/cgroups | grep memory
>
> Well, moving processes between cgroups always created a lot of issues
> and corner cases and this one is definitely not the worst. So this problem
> looks a bit artificial, unless I'm missing something. But if it doesn't
> introduce any new performance costs and doesn't make the code more complex,
> I have nothing against.

OK. I just want to show that large kmallocs are charged as kmem pages.
So I constructed this test case.

>
> Btw, can you, please, run the spell-checker on commit logs? There are many
> typos (starting from the title of the series, I guess), which make the patchset
> look less appealing.

Sorry for my poor English. I will do that. Thanks for your suggestions.


>
> Thank you!
>
> >
> > This patchset aims to make those kmem pages drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the above test script.
> >
> > Patch 1-3 are using obj_cgroup APIs to charge kmem pages. The remote
> > memory cgroup charing APIs is a mechanism to charge kernel memory to a
> > given memory cgroup. So I also make it use the APIs of obj_cgroup.
> > Patch 4-5 are doing this.
> >
> > Muchun Song (5):
> >   mm: memcontrol: introduce obj_cgroup_{un}charge_page
> >   mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem
> > page
> >   mm: memcontrol: reparent the kmem pages on cgroup removal
> >   mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM
> >   mm: memcontrol: use object cgroup for remote memory cgroup charging
> >
> >  fs/buffer.c  |  10 +-
> >  fs/notify/fanotify/fanotify.c|   6 +-
> >  fs/notify/fanotify/fanotify_user.c   |   2 +-
> >  fs/notify/group.c|   3 +-
> >  fs/notify/inotify/inotify_fsnotify.c |   8 +-
> >  fs/notify/inotify/inotify_user.c |   2 +-
> >  include/linux/bpf.h  |   2 +-
> >  include/linux/fsnotify_backend.h |   2 +-
> >  include/linux/memcontrol.h   | 109 +++---
> >  include/linux/sched.h|   6 +-
> >  include/linux/sched/mm.h |  30 ++--
> >  kernel/bpf/syscall.c |  35 ++---
> >  kernel/fork.c|   4 +-
> >  mm/memcontrol.c  | 276 ++-
> >  mm/page_alloc.c  |   4 +-
> >  15 files changed, 324 insertions(+), 175 deletions(-)
> >
> > --
> > 2.11.0
> >


[PATCH v2] can: c_can: move runtime PM enable/disable to c_can_platform

2021-03-02 Thread Tong Zhang
Currently, running modprobe c_can_pci makes the kernel complain
"Unbalanced pm_runtime_enable!". This is caused by pm_runtime_enable()
being called before pm is initialized.
This fix is similar to 227619c3ff7c: move the pm_runtime enable/disable
code to c_can_platform.

Signed-off-by: Tong Zhang 
---
 drivers/net/can/c_can/c_can.c  | 24 +---
 drivers/net/can/c_can/c_can_platform.c |  6 +-
 2 files changed, 6 insertions(+), 24 deletions(-)

diff --git a/drivers/net/can/c_can/c_can.c b/drivers/net/can/c_can/c_can.c
index ef474bae47a1..6958830cb983 100644
--- a/drivers/net/can/c_can/c_can.c
+++ b/drivers/net/can/c_can/c_can.c
@@ -212,18 +212,6 @@ static const struct can_bittiming_const c_can_bittiming_const = {
.brp_inc = 1,
 };
 
-static inline void c_can_pm_runtime_enable(const struct c_can_priv *priv)
-{
-   if (priv->device)
-   pm_runtime_enable(priv->device);
-}
-
-static inline void c_can_pm_runtime_disable(const struct c_can_priv *priv)
-{
-   if (priv->device)
-   pm_runtime_disable(priv->device);
-}
-
 static inline void c_can_pm_runtime_get_sync(const struct c_can_priv *priv)
 {
if (priv->device)
@@ -1335,7 +1323,6 @@ static const struct net_device_ops c_can_netdev_ops = {
 
 int register_c_can_dev(struct net_device *dev)
 {
-   struct c_can_priv *priv = netdev_priv(dev);
int err;
 
/* Deactivate pins to prevent DRA7 DCAN IP from being
@@ -1345,28 +1332,19 @@ int register_c_can_dev(struct net_device *dev)
 */
pinctrl_pm_select_sleep_state(dev->dev.parent);
 
-   c_can_pm_runtime_enable(priv);
-
dev->flags |= IFF_ECHO; /* we support local echo */
dev->netdev_ops = &c_can_netdev_ops;
 
err = register_candev(dev);
-   if (err)
-   c_can_pm_runtime_disable(priv);
-   else
+   if (!err)
devm_can_led_init(dev);
-
return err;
 }
 EXPORT_SYMBOL_GPL(register_c_can_dev);
 
 void unregister_c_can_dev(struct net_device *dev)
 {
-   struct c_can_priv *priv = netdev_priv(dev);
-
unregister_candev(dev);
-
-   c_can_pm_runtime_disable(priv);
 }
 EXPORT_SYMBOL_GPL(unregister_c_can_dev);
 
diff --git a/drivers/net/can/c_can/c_can_platform.c b/drivers/net/can/c_can/c_can_platform.c
index 05f425ceb53a..47b251b1607c 100644
--- a/drivers/net/can/c_can/c_can_platform.c
+++ b/drivers/net/can/c_can/c_can_platform.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -386,6 +387,7 @@ static int c_can_plat_probe(struct platform_device *pdev)
platform_set_drvdata(pdev, dev);
SET_NETDEV_DEV(dev, &pdev->dev);
 
+   pm_runtime_enable(priv->device);
ret = register_c_can_dev(dev);
if (ret) {
dev_err(&pdev->dev, "registering %s failed (err=%d)\n",
@@ -398,6 +400,7 @@ static int c_can_plat_probe(struct platform_device *pdev)
return 0;
 
 exit_free_device:
+   pm_runtime_disable(priv->device);
free_c_can_dev(dev);
 exit:
dev_err(&pdev->dev, "probe failed\n");
@@ -408,9 +411,10 @@ static int c_can_plat_probe(struct platform_device *pdev)
 static int c_can_plat_remove(struct platform_device *pdev)
 {
struct net_device *dev = platform_get_drvdata(pdev);
+   struct c_can_priv *priv = netdev_priv(dev);
 
unregister_c_can_dev(dev);
-
+   pm_runtime_disable(priv->device);
free_c_can_dev(dev);
 
return 0;
-- 
2.25.1
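
For readers who have not met the warning before, here is a toy userspace model of why a second enable call is reported as unbalanced. It assumes, as suggested in the v1 review, that runtime PM has already been enabled once for the PCI device by the time the driver's own enable runs. The toy_* names are invented for this sketch and are not kernel APIs; the counter only mimics the general idea of a disable depth that must not drop below zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device runtime PM state. */
struct toy_dev_pm {
	unsigned int disable_depth; /* 0 means runtime PM is enabled */
};

/* A freshly initialised device starts with runtime PM disabled. */
static void toy_pm_init(struct toy_dev_pm *pm)
{
	assert(pm);
	pm->disable_depth = 1;
}

/* Returns false for an unbalanced call, i.e. when PM is already enabled. */
static bool toy_pm_runtime_enable(struct toy_dev_pm *pm)
{
	assert(pm);
	if (pm->disable_depth == 0)
		return false; /* this is where a warning would be printed */
	pm->disable_depth--;
	return true;
}

/* Every disable must be paired with exactly one earlier enable. */
static void toy_pm_runtime_disable(struct toy_dev_pm *pm)
{
	assert(pm);
	pm->disable_depth++;
}
```

In this model the first enable succeeds and the second one is rejected, which is the situation the patch avoids by leaving the enable/disable pair to the platform driver only.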



Re: [PATCH] can: c_can: move runtime PM enable/disable to c_can_platform

2021-03-02 Thread Tong Zhang
On Mon, Mar 1, 2021 at 2:49 PM Marc Kleine-Budde  wrote:
>
> On 28.02.2021 23:15:48, Tong Zhang wrote:
> > Currently doing modprobe c_can_pci will make kernel complain
> > "Unbalanced pm_runtime_enable!", this is caused by pm_runtime_enable()
> > called before pm is initialized in register_candev() and doing so will
>
> I don't see where register_candev() is doing any pm related
> initialization.
>
> > also cause it to enable twice.
>
> > This fix is similar to 227619c3ff7c, move those pm_enable/disable code to
> > c_can_platform.
>
> As I understand 227619c3ff7c ("can: m_can: move runtime PM
> enable/disable to m_can_platform"), PCI devices automatically enable PM,
> when the "PCI device is added".

Hi Marc,
Thanks for the comments. I think you are right -- I was misled by the trace --
I have corrected the commit log along with the indent fix in the v2 patch.
Thanks again for your help,
- Tong

>
> Please clarify the above point, otherwise the code looks OK, small
> nitpick inline:


Re: [External] Re: [PATCH 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-02 Thread Muchun Song
On Tue, Mar 2, 2021 at 2:11 AM Shakeel Butt  wrote:
>
> On Sun, Feb 28, 2021 at 10:25 PM Muchun Song  wrote:
> >
> > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > the memcg is offlined. If we do this, we should store an object cgroup
> > pointer to page->memcg_data for the kmem pages.
> >
> > Finally, page->memcg_data can have 3 different meanings.
> >
> >   1) For the slab pages, page->memcg_data points to an object cgroups
> >  vector.
> >
> >   2) For the kmem pages (exclude the slab pages), page->memcg_data
> >  points to an object cgroup.
> >
> >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> >  to a memory cgroup.
> >
> > Currently we always get the memcg associated with a page via page_memcg
> > or page_memcg_rcu. page_memcg_check is special, it has to be used in
> > cases when it's not known if a page has an associated memory cgroup
> > pointer or an object cgroups vector. Because the page->memcg_data of
> > the kmem page is not pointing to a memory cgroup in the later patch,
> > the page_memcg and page_memcg_rcu cannot be applicable for the kmem
> > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > associated with the kmem pages. And make page_memcg and page_memcg_rcu
> > no longer apply to the kmem pages.
> >
> > In the end, there are 4 helpers to get the memcg associated with a
> > page. The usage is as follows.
> >
> >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> >  pages).
> >
> >  - page_memcg()
> >  - page_memcg_rcu()
>
> Can you rename these to page_memcg_lru[_rcu] to make them explicitly
> for LRU pages?

Yes. Will do. Thanks.

>
> >
> >   2) Get the memory cgroup associated with a kmem page (exclude the slab
> >  pages).
> >
> >  - page_memcg_kmem()
> >
> >   3) Get the memory cgroup associated with a page. It has to be used in
> >  cases when it's not known if a page has an associated memory cgroup
> >  pointer or an object cgroups vector. Returns NULL for slab pages or
> >  uncharged pages, otherwise, returns memory cgroup for charged pages
> >  (e.g. kmem pages, LRU pages).
> >
> >  - page_memcg_check()
> >
> > In some places, we use page_memcg to check whether the page is charged.
> > Now we introduce page_memcg_charged helper to do this.
> >
> > This is a preparation for reparenting the kmem pages. To support reparent
> > kmem pages, we just need to adjust page_memcg_kmem and page_memcg_check in
> > the later patch.
> >
> > Signed-off-by: Muchun Song 
> > ---
> [snip]
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> >  int val)
> >  {
> > struct page *head = compound_head(page); /* rmap on tail pages */
> > -   struct mem_cgroup *memcg = page_memcg(head);
> > +   struct mem_cgroup *memcg;
> > pg_data_t *pgdat = page_pgdat(page);
> > struct lruvec *lruvec;
> >
> > +   memcg = PageMemcgKmem(head) ? page_memcg_kmem(head) : page_memcg(head);
>
> Should page_memcg_check() be used here?

Yeah. page_memcg_check() can be used here.
But page_memcg_check() does a READ_ONCE() internally,
which we do not need here. So I use page_memcg()
or page_memcg_kmem() directly. Thanks.

>
> > /* Untracked pages have no memcg, no lruvec. Update only the node */
> > if (!memcg) {
> > __mod_node_page_state(pgdat, idx, val);
> > @@ -3170,12 +3171,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> >   */
> >  void __memcg_kmem_uncharge_page(struct page *page, int order)
> >  {
> > -   struct mem_cgroup *memcg = page_memcg(page);
> > +   struct mem_cgroup *memcg;
> > unsigned int nr_pages = 1 << order;
> >
> > -   if (!memcg)
> > +   if (!page_memcg_charged(page))
> > return;
> >
> > +   memcg = page_memcg_kmem(page);
> > VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
> > __memcg_kmem_uncharge(memcg, nr_pages);
> > page->memcg_data = 0;
> > @@ -6831,24 +6833,25 @@ static void uncharge_batch(const struct uncharge_gather *ug)
> >  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> >  {
> > unsigned long nr_pages;
> > +   struct mem_cgroup *memcg;
> >
> > VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > -   if (!page_memcg(page))
> > +   if (!page_memcg_charged(page))
> > return;
> >
> > /*
> >  * Nobody should be changing or seriously looking at
> > -* page_memcg(page) at this point, we have fully
> > -* exclusive access to the page.
> > +* page memcg at this point, we have fully exclusive
> > +* access to the page.
> >  */
> > -
> > -   if (ug->memcg != page_memcg(page)) {
> > +   memcg = PageMemcgKme

Re: [External] Re: [PATCH 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-02 Thread Shakeel Butt
On Mon, Mar 1, 2021 at 7:03 PM Muchun Song  wrote:
>
> On Tue, Mar 2, 2021 at 2:11 AM Shakeel Butt  wrote:
> >
> > > On Sun, Feb 28, 2021 at 10:25 PM Muchun Song wrote:
> > >
> > > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > > the memcg is offlined. If we do this, we should store an object cgroup
> > > pointer to page->memcg_data for the kmem pages.
> > >
> > > Finally, page->memcg_data can have 3 different meanings.
> > >
> > >   1) For the slab pages, page->memcg_data points to an object cgroups
> > >  vector.
> > >
> > >   2) For the kmem pages (exclude the slab pages), page->memcg_data
> > >  points to an object cgroup.
> > >
> > >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> > >  to a memory cgroup.
> > >
> > > Currently we always get the memcg associated with a page via page_memcg
> > > or page_memcg_rcu. page_memcg_check is special, it has to be used in
> > > cases when it's not known if a page has an associated memory cgroup
> > > pointer or an object cgroups vector. Because the page->memcg_data of
> > > the kmem page is not pointing to a memory cgroup in the later patch,
> > > the page_memcg and page_memcg_rcu cannot be applicable for the kmem
> > > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > > associated with the kmem pages. And make page_memcg and page_memcg_rcu
> > > no longer apply to the kmem pages.
> > >
> > > In the end, there are 4 helpers to get the memcg associated with a
> > > page. The usage is as follows.
> > >
> > >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> > >  pages).
> > >
> > >  - page_memcg()
> > >  - page_memcg_rcu()
> >
> > Can you rename these to page_memcg_lru[_rcu] to make them explicitly
> > for LRU pages?
>
> Yes. Will do. Thanks.
>

Please follow Johannes' suggestion regarding page_memcg_kmem() and
then no need to rename these.


[PATCH] net: ethernet: mtk-star-emac: fix wrong unmap in RX handling

2021-03-02 Thread Biao Huang
mtk_star_dma_unmap_rx() should unmap the dma_addr of the old skb rather
than that of the new skb.
Assign new_dma_addr to desc_data.dma_addr after all handling of the old
skb ends, to avoid unexpected receive-side errors.

Fixes: f96e9641e92b ("net: ethernet: mtk-star-emac: fix error path in RX handling")
Signed-off-by: Biao Huang 
---
 drivers/net/ethernet/mediatek/mtk_star_emac.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_star_emac.c b/drivers/net/ethernet/mediatek/mtk_star_emac.c
index a8641a407c06..96d2891f1675 100644
--- a/drivers/net/ethernet/mediatek/mtk_star_emac.c
+++ b/drivers/net/ethernet/mediatek/mtk_star_emac.c
@@ -1225,8 +1225,6 @@ static int mtk_star_receive_packet(struct mtk_star_priv *priv)
goto push_new_skb;
}
 
-   desc_data.dma_addr = new_dma_addr;
-
/* We can't fail anymore at this point: it's safe to unmap the skb. */
mtk_star_dma_unmap_rx(priv, &desc_data);
 
@@ -1236,6 +1234,9 @@ static int mtk_star_receive_packet(struct mtk_star_priv *priv)
desc_data.skb->dev = ndev;
netif_receive_skb(desc_data.skb);
 
+   /* update dma_addr for new skb */
+   desc_data.dma_addr = new_dma_addr;
+
 push_new_skb:
desc_data.len = skb_tailroom(new_skb);
desc_data.skb = new_skb;
-- 
2.18.0
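
To make the ordering issue concrete, here is a small userspace sketch of the two orderings. It is only an illustration: the toy_* types and helpers are invented, and toy_unmap() merely reports which address would be unmapped instead of calling any DMA API. Only the names mtk_star_dma_unmap_rx() and new_dma_addr in the comments come from the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t toy_dma_addr_t;

/* Hypothetical, cut-down descriptor data: only the field that matters here. */
struct toy_desc_data {
	toy_dma_addr_t dma_addr;
};

/* Stand-in for mtk_star_dma_unmap_rx(): returns the address it would unmap. */
static toy_dma_addr_t toy_unmap(const struct toy_desc_data *d)
{
	assert(d);
	return d->dma_addr;
}

/* Old ordering: dma_addr is overwritten first, so the NEW buffer is unmapped. */
static toy_dma_addr_t toy_rx_before_patch(struct toy_desc_data *d,
					  toy_dma_addr_t new_dma_addr)
{
	d->dma_addr = new_dma_addr;
	return toy_unmap(d);
}

/* Patched ordering: unmap the old buffer, then switch to the new address. */
static toy_dma_addr_t toy_rx_after_patch(struct toy_desc_data *d,
					 toy_dma_addr_t new_dma_addr)
{
	toy_dma_addr_t unmapped = toy_unmap(d);

	d->dma_addr = new_dma_addr;
	return unmapped;
}
```

With an old address of 0x1000 and a new one of 0x2000, the first helper reports 0x2000 as unmapped while the second reports 0x1000, which is the behaviour the commit message asks for.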



Re: [RFC net-next] net: dsa: rtl8366rb: support bridge offloading

2021-03-02 Thread DENG Qingfang
On Mon, Mar 1, 2021 at 9:55 PM Linus Walleij  wrote:
>
> BTW where did you find this register? It's not in any of my
> vendor driver code dumps.

DD-WRT
https://svn.dd-wrt.com/browser/src/linux/universal/linux-4.14/drivers/net/ethernet/ag7100/RTL8366RB_DRIVER/rtl8368s_reg.h#L581

>
> Curious!
>
> Yours,
> Linus Walleij


Re: [External] Re: [PATCH 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-02 Thread Muchun Song
On Tue, Mar 2, 2021 at 3:09 AM Johannes Weiner  wrote:
>
> Muchun, can you please reduce the CC list to mm/memcg folks only for
> the next submission? I think probably 80% of the current recipients
> don't care ;-)

At first, I just used scripts/get_maintainer.pl to get the
CC list. I will reduce the CC list in the next version.
Thanks.

>
> On Mon, Mar 01, 2021 at 10:11:45AM -0800, Shakeel Butt wrote:
> > On Sun, Feb 28, 2021 at 10:25 PM Muchun Song wrote:
> > >
> > > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > > the memcg is offlined. If we do this, we should store an object cgroup
> > > pointer to page->memcg_data for the kmem pages.
> > >
> > > Finally, page->memcg_data can have 3 different meanings.
> > >
> > >   1) For the slab pages, page->memcg_data points to an object cgroups
> > >  vector.
> > >
> > >   2) For the kmem pages (exclude the slab pages), page->memcg_data
> > >  points to an object cgroup.
> > >
> > >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> > >  to a memory cgroup.
> > >
> > > Currently we always get the memcg associated with a page via page_memcg
> > > or page_memcg_rcu. page_memcg_check is special, it has to be used in
> > > cases when it's not known if a page has an associated memory cgroup
> > > pointer or an object cgroups vector. Because the page->memcg_data of
> > > the kmem page is not pointing to a memory cgroup in the later patch,
> > > the page_memcg and page_memcg_rcu cannot be applicable for the kmem
> > > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > > associated with the kmem pages. And make page_memcg and page_memcg_rcu
> > > no longer apply to the kmem pages.
> > >
> > > In the end, there are 4 helpers to get the memcg associated with a
> > > page. The usage is as follows.
> > >
> > >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> > >  pages).
> > >
> > >  - page_memcg()
> > >  - page_memcg_rcu()
> >
> > Can you rename these to page_memcg_lru[_rcu] to make them explicitly
> > for LRU pages?
>
> The next patch removes page_memcg_kmem() again to replace it with
> page_objcg(). That should (luckily) remove the need for this
> distinction and keep page_memcg() simple and obvious.
>
> It would be better to not introduce page_memcg_kmem() in the first
> place in this patch, IMO.

OK. I will follow your suggestion. Thanks.
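
As a side note for readers new to this thread, the "3 different meanings" of page->memcg_data quoted above can be pictured as a tagged pointer. The sketch below is a userspace toy under that assumption: the TOY_* flag names and values, the enum, and the helpers are all invented for illustration and should not be read as the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag bits stored in the low bits of an aligned pointer. */
#define TOY_DATA_OBJCGS  ((uintptr_t)0x1) /* slab page: object cgroups vector */
#define TOY_DATA_KMEM    ((uintptr_t)0x2) /* non-slab kmem page: object cgroup */
#define TOY_FLAGS_MASK   ((uintptr_t)0x3)

enum toy_kind {
	TOY_UNCHARGED,  /* no pointer stored at all */
	TOY_MEMCG,      /* user (e.g. LRU) page: memory cgroup */
	TOY_OBJCG,      /* kmem page: single object cgroup */
	TOY_OBJCG_VEC,  /* slab page: vector of object cgroups */
};

/* Pack a pointer and its tag; the pointer must leave the tag bits free. */
static uintptr_t toy_encode(const void *ptr, uintptr_t flags)
{
	assert(((uintptr_t)ptr & TOY_FLAGS_MASK) == 0);
	return (uintptr_t)ptr | flags;
}

/* Decide which of the three meanings a stored value has. */
static enum toy_kind toy_classify(uintptr_t memcg_data)
{
	if (!memcg_data)
		return TOY_UNCHARGED;
	if (memcg_data & TOY_DATA_OBJCGS)
		return TOY_OBJCG_VEC;
	if (memcg_data & TOY_DATA_KMEM)
		return TOY_OBJCG;
	return TOY_MEMCG;
}

/* Strip the tag bits to recover the original pointer. */
static void *toy_pointer(uintptr_t memcg_data)
{
	return (void *)(memcg_data & ~TOY_FLAGS_MASK);
}
```

A helper that must work for any page has to classify first and only then interpret the pointer, whereas a helper restricted to one page type can skip that step. That is roughly the trade-off being discussed for page_memcg() versus page_memcg_check().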


Re: [External] Re: [PATCH 4/5] mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM

2021-03-02 Thread Muchun Song
On Tue, Mar 2, 2021 at 9:15 AM Roman Gushchin  wrote:
>
> On Mon, Mar 01, 2021 at 02:22:26PM +0800, Muchun Song wrote:
> > The remote memcg charging APIs are a mechanism to charge kernel memory
> > to a given memcg. So we can move the infrastructure to the scope of
> > the CONFIG_MEMCG_KMEM.
>
> This is not a good idea, because there is nothing kmem-specific
> in the idea of remote charging, and we definitely will see cases
> when user memory is charged to the process different from the current.

Got it. Thanks for your reminder.


>
> >
> > As a bonus, on !CONFIG_MEMCG_KMEM build some functions and variables
> > can be compiled out.
> >
> > Signed-off-by: Muchun Song 
> > ---
> >  include/linux/sched.h| 2 ++
> >  include/linux/sched/mm.h | 2 +-
> >  kernel/fork.c| 2 +-
> >  mm/memcontrol.c  | 4 
> >  4 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ee46f5cab95b..c2d488eddf85 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1314,7 +1314,9 @@ struct task_struct {
> >
> >   /* Number of pages to reclaim on returning to userland: */
> >   unsigned intmemcg_nr_pages_over_high;
> > +#endif
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> >   /* Used by memcontrol for targeted memcg charge: */
> >   struct mem_cgroup   *active_memcg;
> >  #endif
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 1ae08b8462a4..64a72975270e 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -294,7 +294,7 @@ static inline void memalloc_nocma_restore(unsigned int flags)
> >  }
> >  #endif
> >
> > -#ifdef CONFIG_MEMCG
> > +#ifdef CONFIG_MEMCG_KMEM
> >  DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> >  /**
> >   * set_active_memcg - Starts the remote memcg charging scope.
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index d66cd1014211..d66718bc82d5 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -942,7 +942,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
> >   tsk->use_memdelay = 0;
> >  #endif
> >
> > -#ifdef CONFIG_MEMCG
> > +#ifdef CONFIG_MEMCG_KMEM
> >   tsk->active_memcg = NULL;
> >  #endif
> >   return tsk;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 39cb8c5bf8b2..092dc4588b43 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -76,8 +76,10 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
> >
> >  struct mem_cgroup *root_mem_cgroup __read_mostly;
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> >  /* Active memory cgroup to use from an interrupt context */
> >  DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> > +#endif
> >
> >  /* Socket memory accounting disabled? */
> >  static bool cgroup_memory_nosocket;
> > @@ -1054,6 +1056,7 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> >  }
> >  EXPORT_SYMBOL(get_mem_cgroup_from_mm);
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> >  static __always_inline struct mem_cgroup *active_memcg(void)
> >  {
> >   if (in_interrupt())
> > @@ -1074,6 +1077,7 @@ static __always_inline bool memcg_kmem_bypass(void)
> >
> >   return false;
> >  }
> > +#endif
> >
> >  /**
> >   * mem_cgroup_iter - iterate over memory cgroup hierarchy
> > --
> > 2.11.0
> >


Re: [External] Re: [PATCH 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-02 Thread Muchun Song
On Tue, Mar 2, 2021 at 11:36 AM Shakeel Butt  wrote:
>
> On Mon, Mar 1, 2021 at 7:03 PM Muchun Song  wrote:
> >
> > On Tue, Mar 2, 2021 at 2:11 AM Shakeel Butt  wrote:
> > >
> > > > On Sun, Feb 28, 2021 at 10:25 PM Muchun Song wrote:
> > > >
> > > > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > > > > the memcg is offlined. If we do this, we should store an object cgroup
> > > > pointer to page->memcg_data for the kmem pages.
> > > >
> > > > Finally, page->memcg_data can have 3 different meanings.
> > > >
> > > >   1) For the slab pages, page->memcg_data points to an object cgroups
> > > >  vector.
> > > >
> > > >   2) For the kmem pages (exclude the slab pages), page->memcg_data
> > > >  points to an object cgroup.
> > > >
> > > >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> > > >  to a memory cgroup.
> > > >
> > > > Currently we always get the memcg associated with a page via page_memcg
> > > > or page_memcg_rcu. page_memcg_check is special, it has to be used in
> > > > cases when it's not known if a page has an associated memory cgroup
> > > > pointer or an object cgroups vector. Because the page->memcg_data of
> > > > the kmem page is not pointing to a memory cgroup in the later patch,
> > > > the page_memcg and page_memcg_rcu cannot be applicable for the kmem
> > > > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > > > associated with the kmem pages. And make page_memcg and page_memcg_rcu
> > > > no longer apply to the kmem pages.
> > > >
> > > > In the end, there are 4 helpers to get the memcg associated with a
> > > > page. The usage is as follows.
> > > >
> > > >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> > > >  pages).
> > > >
> > > >  - page_memcg()
> > > >  - page_memcg_rcu()
> > >
> > > Can you rename these to page_memcg_lru[_rcu] to make them explicitly
> > > for LRU pages?
> >
> > Yes. Will do. Thanks.
> >
>
> Please follow Johannes' suggestion regarding page_memcg_kmem() and
> then no need to rename these.

OK.


Re: [PATCH 4/5] mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM

2021-03-02 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 07:43:27PM -0800, Shakeel Butt wrote:
> On Mon, Mar 1, 2021 at 5:16 PM Roman Gushchin  wrote:
> >
> > On Mon, Mar 01, 2021 at 02:22:26PM +0800, Muchun Song wrote:
> > > The remote memcg charging APIs are a mechanism to charge kernel memory
> > > to a given memcg. So we can move the infrastructure to the scope of
> > > the CONFIG_MEMCG_KMEM.
> >
> > This is not a good idea, because there is nothing kmem-specific
> > in the idea of remote charging, and we definitely will see cases
> > when user memory is charged to the process different from the current.
> >
> 
> Indeed, which reminds me: what happened to the "Charge loop device
> i/o to issuing cgroup" series? That series was doing remote charging
> for user pages.

Yeah, this is exactly what I had in mind. We're using it internally, and as
far as I remember there were no obstacles to upstreaming it.
I'll ping Dan after the merge window.

Thanks!


Re: [RFC net-next] net: dsa: rtl8366rb: support bridge offloading

2021-03-02 Thread DENG Qingfang
On Mon, Mar 1, 2021 at 9:48 PM Linus Walleij  wrote:
> With my minor changes:
> Tested-by: Linus Walleij 

How about using a mutex lock in port_bridge_{join,leave} ?
In my opinion all functions that access multiple registers should be
synchronized.

> Yours,
> Linus Walleij


Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-03-02 Thread Dominique Martinet
Jisheng Zhang wrote on Mon, Mar 01, 2021 at 11:01:57AM +0800:
> Per my understanding of iov_iter, we need to call iov_iter_advance()
> even when the read out count is 0. I believe we can see this common style
> in other fs.

I'm not sure where you see this style, but I don't see exceptions for
0-sized read not advancing the iov in general, and I guess this makes
sense.


Rather than make an exception for 0, how about just removing the if as
follows?

I've checked that the non_zc case (copy_to_iter with 0 size) also works
to the same effect, so I'm not sure why the check got added in the
first place... But then again this is old code so maybe the semantics
changed since 2015.



diff --git a/net/9p/client.c b/net/9p/client.c
index 4f62f299da0c..0a0039255c5b 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1623,11 +1623,6 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
}

p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
-   if (!count) {
-   p9_tag_remove(clnt, req);
-   return 0;
-   }
-
if (non_zc) {
int n = copy_to_iter(dataptr, count, to);




If you're ok with that, would you mind resending that way?

I'd also want the commit message to be reworded a bit, at least the
first line (summary) doesn't make sense right now: I have no idea
what you mean by "free what was emitted".
Just "9p: advance iov on empty read" or something similar would do.


> > cat version? coreutils' doesn't seem to do that on their git)
> 
> busybox cat

Ok, could reproduce with busybox cat, thanks.
As expected I can't reproduce with older kernels so will run a bisect
for the sake of it as time allows

-- 
Dominique


Re: [PATCH 4/5] mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM

2021-03-02 Thread Shakeel Butt
On Mon, Mar 1, 2021 at 5:16 PM Roman Gushchin  wrote:
>
> On Mon, Mar 01, 2021 at 02:22:26PM +0800, Muchun Song wrote:
> > The remote memcg charging APIs are a mechanism to charge kernel memory
> > to a given memcg. So we can move the infrastructure to the scope of
> > the CONFIG_MEMCG_KMEM.
>
> This is not a good idea, because there is nothing kmem-specific
> in the idea of remote charging, and we definitely will see cases
> when user memory is charged to the process different from the current.
>

Indeed, which reminds me: what happened to the "Charge loop device
i/o to issuing cgroup" series? That series was doing remote charging
for user pages.


icmpv6.h:70:2: error: implicit declaration of function '__icmpv6_send'; did you mean 'icmpv6_send'? [-Werror=implicit-function-declaration]

2021-03-02 Thread Naresh Kamboju
Stable rc builds failed on arm64, arm, arc, mips, parisc, ppc, riscv,
sh, s390 and x86_64.
Build failed branches list:
  - Stable-rc Linux 5.4.102-rc2
  - Stable-rc Linux 4.19.178-rc2
  - Stable-rc Linux 4.14.223-rc2
  - Stable-rc Linux 4.9.259-rc1

Failed build set list:
  - arm64 (allnoconfig) with gcc-8, gcc-9 and gcc-10
  - arm64 (tinyconfig) with gcc-8, gcc-9 and gcc-10
  
  - x86_64 (allnoconfig) with gcc-8, gcc-9 and gcc-10
  - x86_64 (tinyconfig) with gcc-8, gcc-9 and gcc-10

# to reproduce this build locally:

tuxmake --target-arch=arm64 --kconfig=allnoconfig --toolchain=gcc-9
--wrapper=sccache --environment=SCCACHE_BUCKET=sccache.tuxbuild.com
--runtime=podman --image=public.ecr.aws/tuxmake/arm64_gcc-9 config
default kernel xipkernel modules dtbs dtbs-legacy debugkernel

make --silent --keep-going --jobs=8
O=/home/tuxbuild/.cache/tuxmake/builds/1/tmp ARCH=arm64
CROSS_COMPILE=aarch64-linux-gnu- 'CC=sccache aarch64-linux-gnu-gcc'
'HOSTCC=sccache gcc' allnoconfig
make --silent --keep-going --jobs=8
O=/home/tuxbuild/.cache/tuxmake/builds/1/tmp ARCH=arm64
CROSS_COMPILE=aarch64-linux-gnu- 'CC=sccache aarch64-linux-gnu-gcc'
'HOSTCC=sccache gcc'
In file included from include/net/ndisc.h:53,
 from include/net/ipv6.h:18,
 from include/linux/sunrpc/clnt.h:28,
 from include/linux/nfs_fs.h:32,
 from init/do_mounts.c:23:
include/linux/icmpv6.h: In function 'icmpv6_ndo_send':
include/linux/icmpv6.h:70:2: error: implicit declaration of function '__icmpv6_send'; did you mean 'icmpv6_send'? [-Werror=implicit-function-declaration]
   70 |  __icmpv6_send(skb_in, type, code, info, &parm);
  |  ^
  |  icmpv6_send
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:261: init/do_mounts.o] Error 1

Reported-by: Naresh Kamboju 

Build log links:
https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc/-/jobs/1064182703#L61
https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc/-/jobs/1064593353#L62

--
Linaro LKFT
https://lkft.linaro.org


Re: [PATCH] vdpa/mlx5: Fix wrong use of bit numbers

2021-03-02 Thread Eli Cohen
On Mon, Mar 01, 2021 at 10:33:14AM -0500, Michael S. Tsirkin wrote:
> On Mon, Mar 01, 2021 at 03:52:45PM +0800, Jason Wang wrote:
> > 
> > On 2021/3/1 2:28 下午, Eli Cohen wrote:
> > > VIRTIO_F_VERSION_1 is a bit number. Use BIT_ULL() with mask
> > > conditionals.
> > > 
> > > Also, in mlx5_vdpa_is_little_endian() use BIT_ULL for consistency with
> > > the rest of the code.
> > > 
> > > Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
> > > Signed-off-by: Eli Cohen 
> > 
> > 
> > Acked-by: Jason Wang 
> 
> And CC stable I guess?

Is this a question or a request? :-)

> 
> > 
> > > ---
> > >   drivers/vdpa/mlx5/net/mlx5_vnet.c | 4 ++--
> > >   1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > index dc7031132fff..7d21b857a94a 100644
> > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > @@ -821,7 +821,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtque
> > >   MLX5_SET(virtio_q, vq_ctx, event_qpn_or_msix, mvq->fwqp.mqp.qpn);
> > >   MLX5_SET(virtio_q, vq_ctx, queue_size, mvq->num_ent);
> > >   MLX5_SET(virtio_q, vq_ctx, virtio_version_1_0,
> > > -  !!(ndev->mvdev.actual_features & VIRTIO_F_VERSION_1));
> > > +  !!(ndev->mvdev.actual_features & BIT_ULL(VIRTIO_F_VERSION_1)));
> > >   MLX5_SET64(virtio_q, vq_ctx, desc_addr, mvq->desc_addr);
> > >   MLX5_SET64(virtio_q, vq_ctx, used_addr, mvq->device_addr);
> > >   MLX5_SET64(virtio_q, vq_ctx, available_addr, mvq->driver_addr);
> > > @@ -1578,7 +1578,7 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev)
> > >   static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev)
> > >   {
> > >   return virtio_legacy_is_little_endian() ||
> > > - (mvdev->actual_features & (1ULL << VIRTIO_F_VERSION_1));
> > > + (mvdev->actual_features & BIT_ULL(VIRTIO_F_VERSION_1));
> > >   }
> > >   static __virtio16 cpu_to_mlx5vdpa16(struct mlx5_vdpa_dev *mvdev, u16 val)
> 
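
For anyone skimming the thread, the bug class fixed by the quoted patch is easy to reproduce in userspace. The sketch below defines local stand-ins for BIT_ULL() and VIRTIO_F_VERSION_1 so that it builds outside the kernel; the two helper functions are invented for illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch compiles without kernel headers. */
#define BIT_ULL(nr)        (1ULL << (nr))
#define VIRTIO_F_VERSION_1 32 /* a bit NUMBER, not a mask */

static_assert(VIRTIO_F_VERSION_1 < 64, "bit number must fit in a 64-bit word");

/* Buggy form: ANDs with the integer 32, which really tests feature bit 5. */
static bool toy_version_1_buggy(uint64_t features)
{
	return !!(features & VIRTIO_F_VERSION_1);
}

/* Fixed form: turn the bit number into a mask before testing it. */
static bool toy_version_1_fixed(uint64_t features)
{
	return !!(features & BIT_ULL(VIRTIO_F_VERSION_1));
}
```

With only bit 32 set, the buggy helper returns false and the fixed one returns true; with only bit 5 set, the results are the other way round.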


Re: [PATCH] e1000e: use proper #include guard name in hw.h

2021-03-02 Thread gre...@linuxfoundation.org
On Tue, Mar 02, 2021 at 01:37:59AM +, Nguyen, Anthony L wrote:
> On Sat, 2021-02-27 at 10:58 +0100, Greg Kroah-Hartman wrote:
> > The include guards for the e1000e and e1000 hw.h files are the same,
> > so add the proper "E" term to the hw.h file for the e1000e driver.
> 
> There's a patch in process that addresses this issue [1].

Thanks, hopefully it gets fixed somehow :)

greg k-h


dsa_master_find_slave()'s time complexity and potential performance hit

2021-03-02 Thread DENG Qingfang
Since commit 7b9a2f4bac68 ("net: dsa: use ports list to find slave"),
dsa_master_find_slave() has been iterating over a linked list instead
of accessing arrays, making its time complexity O(n).
The said function is called frequently in DSA RX path, so it may cause
a performance hit, especially for switches that have many ports (20+)
such as RTL8380/8390/9300 (there is a downstream DSA driver for them, see
https://github.com/openwrt/openwrt/tree/openwrt-21.02/target/linux/realtek/files-5.4/drivers/net/dsa/rtl83xx).
I don't have one of those switches, so I can't test if the performance
impact is huge or not.
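
To illustrate the complexity difference being described, here is a minimal userspace sketch comparing a linked-list walk with a direct array lookup. The toy_port structure and both helpers are hypothetical and much simpler than the real DSA data structures; the sketch shows only the shape of the two lookups and says nothing about the actual performance impact.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PORTS 32

/* Hypothetical port record, linked the way a ports list might be. */
struct toy_port {
	int index;
	struct toy_port *next;
};

/* List walk: may visit every port, so the cost is O(n) in the port count. */
static struct toy_port *toy_find_port_list(struct toy_port *head, int index)
{
	for (struct toy_port *p = head; p; p = p->next)
		if (p->index == index)
			return p;
	return NULL;
}

/* Array lookup: one bounds check and one load, O(1). */
static struct toy_port *toy_find_port_array(struct toy_port **table, int index)
{
	assert(table);
	if (index < 0 || index >= TOY_MAX_PORTS)
		return NULL;
	return table[index];
}
```

On a switch with only a handful of ports the two are probably hard to tell apart; the question raised above is whether the list walk becomes noticeable on the RX path with 20+ ports.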


[PATCH net] net: tcp: don't allocate fast clones for fastopen SYN

2021-03-02 Thread Jakub Kicinski
When receiver does not accept TCP Fast Open it will only ack
the SYN, and not the data. We detect this and immediately queue
the data for (re)transmission in tcp_rcv_fastopen_synack().

In DC networks with very low RTT and without RFS the SYN-ACK
may arrive before NIC driver reported Tx completion on
the original SYN. In which case skb_still_in_host_queue()
returns true and sender will need to wait for the retransmission
timer to fire milliseconds later.

Revert back to non-fast clone skbs, this way
skb_still_in_host_queue() won't prevent the recovery flow
from completing.

Suggested-by: Eric Dumazet 
Fixes: 355a901e6cf1 ("tcp: make connect() mem charging friendly")
Signed-off-by: Neil Spring 
Signed-off-by: Jakub Kicinski 
---
 net/ipv4/tcp_output.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fbf140a770d8..cd9461588539 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3759,9 +3759,16 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
/* limit to order-0 allocations */
space = min_t(size_t, space, SKB_MAX_HEAD(MAX_TCP_HEADER));
 
-   syn_data = sk_stream_alloc_skb(sk, space, sk->sk_allocation, false);
+   syn_data = alloc_skb(MAX_TCP_HEADER + space, sk->sk_allocation);
if (!syn_data)
goto fallback;
+   if (!sk_wmem_schedule(sk, syn_data->truesize)) {
+   __kfree_skb(syn_data);
+   goto fallback;
+   }
+   skb_reserve(syn_data, MAX_TCP_HEADER);
+   INIT_LIST_HEAD(&syn_data->tcp_tsorted_anchor);
+
syn_data->ip_summed = CHECKSUM_PARTIAL;
memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
if (space) {
-- 
2.26.2
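
As a reading aid, a toy userspace model of why the allocation type matters
here. This is not kernel code: struct toy_skb and both helpers are
hypothetical, and the sketch assumes that only the original half of a
fast-clone pair can be reported as still sitting in the host queue.

```c
#include <stdbool.h>

/* Toy model, not kernel code: all names below are hypothetical. */
struct toy_skb {
	bool is_fclone_orig; /* allocated as the first half of a fast-clone pair */
	int  pair_refs;      /* references held on the pair (original + clone) */
};

/* Models the idea behind skb_still_in_host_queue(): only a fast-clone
 * original can be seen as busy, i.e. its clone is still held by the
 * driver because Tx completion has not been reported yet.
 */
static bool toy_still_in_host_queue(const struct toy_skb *skb)
{
	return skb->is_fclone_orig && skb->pair_refs > 1;
}

/* Returns true if the data can be retransmitted right away. */
static bool toy_can_retransmit_now(const struct toy_skb *skb)
{
	return !toy_still_in_host_queue(skb);
}
```

Under that assumption, an skb obtained from plain alloc_skb() never takes the
busy path, which is the effect the patch is after.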



Re: [PATCH] iwlwifi: fix ARCH=i386 compilation warnings

2021-03-02 Thread Kalle Valo
Pierre-Louis Bossart  writes:

> An unsigned long variable should rely on '%lu' format strings, not '%zd'
>
> Fixes: a1a6a4cf49ece ("iwlwifi: pnvm: implement reading PNVM from UEFI")
> Signed-off-by: Pierre-Louis Bossart 
> ---
> warnings found with v5.12-rc1 and next-20210301

Luca, can I take this to wireless-drivers?

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches


[PATCH 14/44] net: caif: inline register_ldisc

2021-03-02 Thread Jiri Slaby
register_ldisc only calls tty_register_ldisc. Inline register_ldisc into
the only caller of register_ldisc, i.e. caif_ser_init. Now,
caif_ser_init is symmetric to caif_ser_exit in this regard.

Signed-off-by: Jiri Slaby 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: netdev@vger.kernel.org
---
 drivers/net/caif/caif_serial.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/caif/caif_serial.c b/drivers/net/caif/caif_serial.c
index 675c374b32ee..da6fffb4d5a8 100644
--- a/drivers/net/caif/caif_serial.c
+++ b/drivers/net/caif/caif_serial.c
@@ -389,18 +389,6 @@ static struct tty_ldisc_ops caif_ldisc = {
.write_wakeup = ldisc_tx_wakeup
 };
 
-static int register_ldisc(void)
-{
-   int result;
-
-   result = tty_register_ldisc(N_CAIF, &caif_ldisc);
-   if (result < 0) {
-   pr_err("cannot register CAIF ldisc=%d err=%d\n", N_CAIF,
-   result);
-   return result;
-   }
-   return result;
-}
 static const struct net_device_ops netdev_ops = {
.ndo_open = caif_net_open,
.ndo_stop = caif_net_close,
@@ -443,7 +431,10 @@ static int __init caif_ser_init(void)
 {
int ret;
 
-   ret = register_ldisc();
+   ret = tty_register_ldisc(N_CAIF, &caif_ldisc);
+   if (ret < 0)
+   pr_err("cannot register CAIF ldisc=%d err=%d\n", N_CAIF, ret);
+
debugfsdir = debugfs_create_dir("caif_serial", NULL);
return ret;
 }
-- 
2.30.1



[PATCH 15/44] net: nfc: nci: remove memset of nci_uart_drivers

2021-03-02 Thread Jiri Slaby
nci_uart_drivers is a global definition, so there is no need to
initialize its memory to zero during module load.

Signed-off-by: Jiri Slaby 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: netdev@vger.kernel.org
---
 net/nfc/nci/uart.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/net/nfc/nci/uart.c b/net/nfc/nci/uart.c
index 16d009c9b6a0..c9987d1cccdf 100644
--- a/net/nfc/nci/uart.c
+++ b/net/nfc/nci/uart.c
@@ -468,7 +468,6 @@ static struct tty_ldisc_ops nci_uart_ldisc = {
 
 static int __init nci_uart_init(void)
 {
-   memset(nci_uart_drivers, 0, sizeof(nci_uart_drivers));
return tty_register_ldisc(N_NCI, &nci_uart_ldisc);
 }
 
-- 
2.30.1



[PATCH 17/44] net: nfc: nci: drop nci_uart_default_recv

2021-03-02 Thread Jiri Slaby
nci_uart_register returns -EINVAL immediately when nu->ops.recv is not
set. So the same 'if' later never triggers, and nci_uart_default_recv is
never used. Drop it.

Signed-off-by: Jiri Slaby 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: netdev@vger.kernel.org
---
 net/nfc/nci/uart.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/net/nfc/nci/uart.c b/net/nfc/nci/uart.c
index 5cf7d3729d5f..9958b37d8f9d 100644
--- a/net/nfc/nci/uart.c
+++ b/net/nfc/nci/uart.c
@@ -387,12 +387,6 @@ static int nci_uart_send(struct nci_uart *nu, struct sk_buff *skb)
return 0;
 }
 
-/* -- Default recv handler -- */
-static int nci_uart_default_recv(struct nci_uart *nu, struct sk_buff *skb)
-{
-   return nci_recv_frame(nu->ndev, skb);
-}
-
 int nci_uart_register(struct nci_uart *nu)
 {
if (!nu || !nu->ops.open ||
@@ -402,10 +396,6 @@ int nci_uart_register(struct nci_uart *nu)
/* Set the send callback */
nu->ops.send = nci_uart_send;
 
-   /* Install default handlers if not overridden */
-   if (!nu->ops.recv)
-   nu->ops.recv = nci_uart_default_recv;
-
/* Add this driver in the driver list */
if (nci_uart_drivers[nu->driver]) {
pr_err("driver %d is already registered\n", nu->driver);
-- 
2.30.1
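
The dead-code argument can be sketched in userspace as follows. struct
toy_ops and the toy_* functions are hypothetical, reduced stand-ins for
struct nci_uart_ops and nci_uart_register(), kept only to show the control
flow described in the commit message.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct nci_uart_ops. */
struct toy_ops {
	int (*open)(void);
	int (*recv)(void);
};

static int toy_open(void) { return 0; }
static int toy_default_recv(void) { return 0; }

static int toy_register(struct toy_ops *ops)
{
	/* Early validation: a missing recv hook is rejected here ... */
	if (!ops || !ops->open || !ops->recv)
		return -EINVAL;

	/* ... so this fallback can never run: recv is known to be non-NULL. */
	if (!ops->recv)
		ops->recv = toy_default_recv;

	return 0;
}
```

Registering with recv left NULL fails with -EINVAL before the fallback is
reached, so the default handler is never installed.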



[PATCH 16/44] net: nfc: nci: drop nci_uart_ops::recv_buf

2021-03-02 Thread Jiri Slaby
No one sets nci_uart_ops::recv_buf, so the default one
(nci_uart_default_recv_buf) is always used. So drop this hook, move
nci_uart_default_recv_buf before its use in nci_uart_tty_receive, and
remove the unused parameter flags.

Signed-off-by: Jiri Slaby 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: netdev@vger.kernel.org
---
 include/net/nfc/nci_core.h |   2 -
 net/nfc/nci/uart.c | 136 ++---
 2 files changed, 67 insertions(+), 71 deletions(-)

diff --git a/include/net/nfc/nci_core.h b/include/net/nfc/nci_core.h
index 43c9c5d2bedb..bd76e8e082c0 100644
--- a/include/net/nfc/nci_core.h
+++ b/include/net/nfc/nci_core.h
@@ -430,8 +430,6 @@ struct nci_uart_ops {
int (*open)(struct nci_uart *nci_uart);
void (*close)(struct nci_uart *nci_uart);
int (*recv)(struct nci_uart *nci_uart, struct sk_buff *skb);
-   int (*recv_buf)(struct nci_uart *nci_uart, const u8 *data, char *flags,
-   int count);
int (*send)(struct nci_uart *nci_uart, struct sk_buff *skb);
void (*tx_start)(struct nci_uart *nci_uart);
void (*tx_done)(struct nci_uart *nci_uart);
diff --git a/net/nfc/nci/uart.c b/net/nfc/nci/uart.c
index c9987d1cccdf..5cf7d3729d5f 100644
--- a/net/nfc/nci/uart.c
+++ b/net/nfc/nci/uart.c
@@ -229,6 +229,72 @@ static void nci_uart_tty_wakeup(struct tty_struct *tty)
nci_uart_tx_wakeup(nu);
 }
 
+/* -- Default recv_buf handler --
+ *
+ * This handler supposes that NCI frames are sent over UART link without any
+ * framing. It reads NCI header, retrieve the packet size and once all packet
+ * bytes are received it passes it to nci_uart driver for processing.
+ */
+static int nci_uart_default_recv_buf(struct nci_uart *nu, const u8 *data,
+int count)
+{
+   int chunk_len;
+
+   if (!nu->ndev) {
+   nfc_err(nu->tty->dev,
+   "receive data from tty but no NCI dev is attached yet, drop buffer\n");
+   return 0;
+   }
+
+   /* Decode all incoming data in packets
+* and enqueue then for processing.
+*/
+   while (count > 0) {
+   /* If this is the first data of a packet, allocate a buffer */
+   if (!nu->rx_skb) {
+   nu->rx_packet_len = -1;
+   nu->rx_skb = nci_skb_alloc(nu->ndev,
+  NCI_MAX_PACKET_SIZE,
+  GFP_ATOMIC);
+   if (!nu->rx_skb)
+   return -ENOMEM;
+   }
+
+   /* Eat byte after byte till full packet header is received */
+   if (nu->rx_skb->len < NCI_CTRL_HDR_SIZE) {
+   skb_put_u8(nu->rx_skb, *data++);
+   --count;
+   continue;
+   }
+
+   /* Header was received but packet len was not read */
+   if (nu->rx_packet_len < 0)
+   nu->rx_packet_len = NCI_CTRL_HDR_SIZE +
+   nci_plen(nu->rx_skb->data);
+
+   /* Compute how many bytes are missing and how many bytes can
+* be consumed.
+*/
+   chunk_len = nu->rx_packet_len - nu->rx_skb->len;
+   if (count < chunk_len)
+   chunk_len = count;
+   skb_put_data(nu->rx_skb, data, chunk_len);
+   data += chunk_len;
+   count -= chunk_len;
+
+   /* Chcek if packet is fully received */
+   if (nu->rx_packet_len == nu->rx_skb->len) {
+   /* Pass RX packet to driver */
+   if (nu->ops.recv(nu, nu->rx_skb) != 0)
+   nfc_err(nu->tty->dev, "corrupted RX packet\n");
+   /* Next packet will be a new one */
+   nu->rx_skb = NULL;
+   }
+   }
+
+   return 0;
+}
+
 /* nci_uart_tty_receive()
  *
  * Called by tty low level driver when receive data is
@@ -250,7 +316,7 @@ static void nci_uart_tty_receive(struct tty_struct *tty, const u8 *data,
return;
 
spin_lock(&nu->rx_lock);
-   nu->ops.recv_buf(nu, (void *)data, flags, count);
+   nci_uart_default_recv_buf(nu, data, count);
spin_unlock(&nu->rx_lock);
 
tty_unthrottle(tty);
@@ -321,72 +387,6 @@ static int nci_uart_send(struct nci_uart *nu, struct sk_buff *skb)
return 0;
 }
 
-/* -- Default recv_buf handler --
- *
- * This handler supposes that NCI frames are sent over UART link without any
- * framing. It reads NCI header, retrieve the packet size and once all packet
- * bytes are received it passes it to nci_uart driver for processing.
- */
-static int nci_uart_default_recv_buf(struct nci_uart *nu, const u8 *data,
-char *flags, int count)
-{
-   in

[RFC net-next 0/9] net: hns3: refactor and new features for flow director

2021-03-02 Thread Huazhong Tan
This patchset refactors some functions and adds some new features for
flow director.

patch 1~3: refactor large functions
patch 4, 7: add traffic class and user-def field support for ethtool
patch 5: use asynchronous configuration
patch 6: clean up hns3_del_all_fd_entries()
patch 8, 9: add support for queue bonding mode

Jian Shen (9):
  net: hns3: refactor out hclge_add_fd_entry()
  net: hns3: refactor out hclge_fd_get_tuple()
  net: hns3: refactor for function hclge_fd_convert_tuple
  net: hns3: add support for traffic class tuple support for flow
director by ethtool
  net: hns3: refactor flow director configuration
  net: hns3: refine for hns3_del_all_fd_entries()
  net: hns3: add support for user-def data of flow director
  net: hns3: add support for queue bonding mode of flow director
  net: hns3: add queue bonding mode support for VF

 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h|8 +
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|9 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c |4 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c|   91 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|   14 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |   13 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |2 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   21 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 1564 ++--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   63 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c   |2 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  |   74 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |7 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   |   16 +
 14 files changed, 1407 insertions(+), 481 deletions(-)

-- 
2.7.4



[RFC net-next 1/9] net: hns3: refactor out hclge_add_fd_entry()

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

The logic of hclge_add_fd_entry() is complex and long-winded.
To make it more readable, extract the handling of
fs->ring_cookie into a separate function.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 67 +-
 1 file changed, 40 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 34b744d..3491698 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5980,6 +5980,42 @@ static bool hclge_is_cls_flower_active(struct hnae3_handle *handle)
return hdev->fd_active_type == HCLGE_FD_TC_FLOWER_ACTIVE;
 }
 
+static int hclge_fd_parse_ring_cookie(struct hclge_dev *hdev, u64 ring_cookie,
+ u16 *vport_id, u8 *action, u16 *queue_id)
+{
+   struct hclge_vport *vport = hdev->vport;
+
+   if (ring_cookie == RX_CLS_FLOW_DISC) {
+   *action = HCLGE_FD_ACTION_DROP_PACKET;
+   } else {
+   u32 ring = ethtool_get_flow_spec_ring(ring_cookie);
+   u8 vf = ethtool_get_flow_spec_ring_vf(ring_cookie);
+   u16 tqps;
+
+   if (vf > hdev->num_req_vfs) {
+   dev_err(&hdev->pdev->dev,
+   "Error: vf id (%u) > max vf num (%u)\n",
+   vf, hdev->num_req_vfs);
+   return -EINVAL;
+   }
+
+   *vport_id = vf ? hdev->vport[vf].vport_id : vport->vport_id;
+   tqps = hdev->vport[vf].nic.kinfo.num_tqps;
+
+   if (ring >= tqps) {
+   dev_err(&hdev->pdev->dev,
+   "Error: queue id (%u) > max tqp num (%u)\n",
+   ring, tqps - 1);
+   return -EINVAL;
+   }
+
+   *action = HCLGE_FD_ACTION_SELECT_QUEUE;
+   *queue_id = ring;
+   }
+
+   return 0;
+}
+
 static int hclge_add_fd_entry(struct hnae3_handle *handle,
  struct ethtool_rxnfc *cmd)
 {
@@ -6016,33 +6052,10 @@ static int hclge_add_fd_entry(struct hnae3_handle *handle,
if (ret)
return ret;
 
-   if (fs->ring_cookie == RX_CLS_FLOW_DISC) {
-   action = HCLGE_FD_ACTION_DROP_PACKET;
-   } else {
-   u32 ring = ethtool_get_flow_spec_ring(fs->ring_cookie);
-   u8 vf = ethtool_get_flow_spec_ring_vf(fs->ring_cookie);
-   u16 tqps;
-
-   if (vf > hdev->num_req_vfs) {
-   dev_err(&hdev->pdev->dev,
-   "Error: vf id (%u) > max vf num (%u)\n",
-   vf, hdev->num_req_vfs);
-   return -EINVAL;
-   }
-
-   dst_vport_id = vf ? hdev->vport[vf].vport_id : vport->vport_id;
-   tqps = vf ? hdev->vport[vf].alloc_tqps : vport->alloc_tqps;
-
-   if (ring >= tqps) {
-   dev_err(&hdev->pdev->dev,
-   "Error: queue id (%u) > max tqp num (%u)\n",
-   ring, tqps - 1);
-   return -EINVAL;
-   }
-
-   action = HCLGE_FD_ACTION_SELECT_QUEUE;
-   q_index = ring;
-   }
+   ret = hclge_fd_parse_ring_cookie(hdev, fs->ring_cookie, &dst_vport_id,
+&action, &q_index);
+   if (ret)
+   return ret;
 
rule = kzalloc(sizeof(*rule), GFP_KERNEL);
if (!rule)
-- 
2.7.4
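
For readers unfamiliar with the ring_cookie encoding, here is a userspace
sketch of the decoding that the new helper performs. The mask and shift
values follow the ethtool UAPI header, redefined under hypothetical TOY_
names so the sketch builds on its own. The function signature and the single
tqps limit are simplifications and do not match the driver, which looks up
the queue count per vport.

```c
#include <stdint.h>

/* Values as in the ethtool UAPI header, under hypothetical TOY_ names. */
#define TOY_RX_CLS_FLOW_DISC      0xffffffffffffffffULL
#define TOY_FLOW_SPEC_RING        0x00000000ffffffffULL
#define TOY_FLOW_SPEC_RING_VF     0x000000ff00000000ULL
#define TOY_FLOW_SPEC_RING_VF_OFF 32

enum toy_action { TOY_DROP_PACKET, TOY_SELECT_QUEUE };

/* Returns 0 on success, -1 if the VF id or queue id is out of range. */
static int toy_parse_ring_cookie(uint64_t cookie, unsigned int num_vfs,
				 unsigned int tqps, enum toy_action *action,
				 uint8_t *vf, uint32_t *queue)
{
	uint32_t ring;
	uint8_t vf_id;

	/* The all-ones cookie means "discard matching packets". */
	if (cookie == TOY_RX_CLS_FLOW_DISC) {
		*action = TOY_DROP_PACKET;
		return 0;
	}

	/* Low 32 bits: queue index. Bits 32..39: VF id (0 means the PF). */
	ring = (uint32_t)(cookie & TOY_FLOW_SPEC_RING);
	vf_id = (uint8_t)((cookie & TOY_FLOW_SPEC_RING_VF) >>
			  TOY_FLOW_SPEC_RING_VF_OFF);

	if (vf_id > num_vfs || ring >= tqps)
		return -1;

	*action = TOY_SELECT_QUEUE;
	*vf = vf_id;
	*queue = ring;
	return 0;
}
```

Keeping this parsing in one helper means the caller only has to check a
single return value before allocating the rule.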



[RFC net-next 6/9] net: hns3: refine for hns3_del_all_fd_entries()

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

Since only the PF driver can configure flow director rules, it's
better to call hclge_del_all_fd_entries() directly in the hclge
layer, rather than call hns3_del_all_fd_entries() in the hns3
layer. Then we can remove ae_algo->ops.del_all_fd_entries.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h |  2 --
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 10 --
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 10 +++---
 3 files changed, 3 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index e9e60a9..f2eec5c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -608,8 +608,6 @@ struct hnae3_ae_ops {
struct ethtool_rxnfc *cmd);
int (*del_fd_entry)(struct hnae3_handle *handle,
struct ethtool_rxnfc *cmd);
-   void (*del_all_fd_entries)(struct hnae3_handle *handle,
-  bool clear_list);
int (*get_fd_rule_cnt)(struct hnae3_handle *handle,
   struct ethtool_rxnfc *cmd);
int (*get_fd_rule_info)(struct hnae3_handle *handle,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index bf4302a..44b775e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -4143,14 +4143,6 @@ static void hns3_uninit_phy(struct net_device *netdev)
h->ae_algo->ops->mac_disconnect_phy(h);
 }
 
-static void hns3_del_all_fd_rules(struct net_device *netdev, bool clear_list)
-{
-   struct hnae3_handle *h = hns3_get_handle(netdev);
-
-   if (h->ae_algo->ops->del_all_fd_entries)
-   h->ae_algo->ops->del_all_fd_entries(h, clear_list);
-}
-
 static int hns3_client_start(struct hnae3_handle *handle)
 {
if (!handle->ae_algo->ops->client_start)
@@ -4337,8 +4329,6 @@ static void hns3_client_uninit(struct hnae3_handle *handle, bool reset)
 
hns3_nic_uninit_irq(priv);
 
-   hns3_del_all_fd_rules(netdev, true);
-
hns3_clear_all_ring(handle, true);
 
hns3_nic_uninit_vector_data(priv);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 8ba07cf..bbeb541 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -6271,13 +6271,9 @@ static void hclge_clear_fd_rules_in_list(struct hclge_dev *hdev,
}
 }
 
-static void hclge_del_all_fd_entries(struct hnae3_handle *handle,
-bool clear_list)
+static void hclge_del_all_fd_entries(struct hclge_dev *hdev)
 {
-   struct hclge_vport *vport = hclge_get_vport(handle);
-   struct hclge_dev *hdev = vport->back;
-
-   hclge_clear_fd_rules_in_list(hdev, clear_list);
+   hclge_clear_fd_rules_in_list(hdev, true);
 }
 
 static int hclge_restore_fd_entries(struct hnae3_handle *handle)
@@ -11334,6 +11330,7 @@ static void hclge_uninit_ae_dev(struct hnae3_ae_dev *ae_dev)
hclge_misc_affinity_teardown(hdev);
hclge_state_uninit(hdev);
hclge_uninit_mac_table(hdev);
+   hclge_del_all_fd_entries(hdev);
 
if (mac->phydev)
mdiobus_unregister(mac->mdio_bus);
@@ -12157,7 +12154,6 @@ static const struct hnae3_ae_ops hclge_ops = {
.get_link_mode = hclge_get_link_mode,
.add_fd_entry = hclge_add_fd_entry,
.del_fd_entry = hclge_del_fd_entry,
-   .del_all_fd_entries = hclge_del_all_fd_entries,
.get_fd_rule_cnt = hclge_get_fd_rule_cnt,
.get_fd_rule_info = hclge_get_fd_rule_info,
.get_fd_all_rules = hclge_get_all_rules,
-- 
2.7.4



[RFC net-next 8/9] net: hns3: add support for queue bonding mode of flow director

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

Device version V3 supports queue bonding, which can identify the
tuple information of a TCP stream and create flow director rules
automatically, in order to keep the tx and rx packets in the same
queue pair. The driver sets the FD_ADD field of the TX BD for a TCP
SYN packet, and sets the FD_DEL field for a TCP FIN or RST packet.
The hardware creates or removes an fd rule according to the TX BD,
and it can also age out a rule that has not been hit for a long time.

The queue bonding mode is disabled by default, and can be
enabled/disabled with the ethtool priv-flags command.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   7 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c |   4 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c|  81 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  14 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_ethtool.c |  13 ++-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |   2 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |   7 ++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 119 -
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   3 +
 9 files changed, 241 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index f2eec5c..9145272 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -466,6 +466,10 @@ struct hnae3_ae_dev {
  *   Check if any cls flower rule exist
  * dbg_read_cmd
  *   Execute debugfs read command.
+ * request_flush_qb_config
+ *   Request to update queue bonding configuration
+ * query_fd_qb_state
+ *   Query whether hw queue bonding enabled
  */
 struct hnae3_ae_ops {
int (*init_ae_dev)(struct hnae3_ae_dev *ae_dev);
@@ -647,6 +651,8 @@ struct hnae3_ae_ops {
int (*del_cls_flower)(struct hnae3_handle *handle,
  struct flow_cls_offload *cls_flower);
bool (*cls_flower_active)(struct hnae3_handle *handle);
+   void (*request_flush_qb_config)(struct hnae3_handle *handle);
+   bool (*query_fd_qb_state)(struct hnae3_handle *handle);
 };
 
 struct hnae3_dcb_ops {
@@ -735,6 +741,7 @@ struct hnae3_roce_private_info {
 
 enum hnae3_pflag {
HNAE3_PFLAG_LIMIT_PROMISC,
+   HNAE3_PFLAG_FD_QB_ENABLE,
HNAE3_PFLAG_MAX
 };
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
index dd11c57..e97da2a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
@@ -243,8 +243,8 @@ static int hns3_dbg_bd_info(struct hnae3_handle *h, const char *cmd_buf)
dev_info(dev, "(TX)vlan_tag: %u\n",
 le16_to_cpu(tx_desc->tx.outer_vlan_tag));
dev_info(dev, "(TX)tv: %u\n", le16_to_cpu(tx_desc->tx.tv));
-   dev_info(dev, "(TX)paylen_ol4cs: %u\n",
-le32_to_cpu(tx_desc->tx.paylen_ol4cs));
+   dev_info(dev, "(TX)paylen_fdop_ol4cs: %u\n",
+le32_to_cpu(tx_desc->tx.paylen_fdop_ol4cs));
dev_info(dev, "(TX)vld_ra_ri: %u\n",
 le16_to_cpu(tx_desc->tx.bdtp_fe_sc_vld_ra_ri));
dev_info(dev, "(TX)mss_hw_csum: %u\n", mss_hw_csum);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 44b775e..76dcf82 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1061,6 +1061,73 @@ static int hns3_handle_vtags(struct hns3_enet_ring *tx_ring,
return 0;
 }
 
+static bool hns3_query_fd_qb_state(struct hnae3_handle *handle)
+{
+   const struct hnae3_ae_ops *ops = handle->ae_algo->ops;
+
+   if (!test_bit(HNAE3_PFLAG_FD_QB_ENABLE, &handle->priv_flags))
+   return false;
+
+   if (!ops->query_fd_qb_state)
+   return false;
+
+   return ops->query_fd_qb_state(handle);
+}
+
+/* fd_op is the field of tx bd indicates hw whether to add or delete
+ * a qb rule or do nothing.
+ */
+static u8 hns3_fd_qb_handle(struct hns3_enet_ring *ring, struct sk_buff *skb)
+{
+   struct hnae3_handle *handle = ring->tqp->handle;
+   union l4_hdr_info l4;
+   union l3_hdr_info l3;
+   u8 l4_proto_tmp = 0;
+   __be16 frag_off;
+   u8 ip_version;
+   u8 fd_op = 0;
+
+   if (!hns3_query_fd_qb_state(handle))
+   return 0;
+
+   if (skb->encapsulation) {
+   ip_version = inner_ip_hdr(skb)->version;
+   l3.hdr = skb_inner_network_header(skb);
+   l4.hdr = skb_inner_transport_header(skb);
+   } else {
+   ip_version = ip_hdr(skb)->version;
+   l3.hdr = skb_network_header(skb);
+   l4.hdr = skb_transport_header(skb);
+   }
+
+   if (ip_version == IP_VERSION_IPV6) {
+  
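
The diff above is cut off in the archive. As a reading aid only, the policy
stated in the commit message (FD_ADD on SYN, FD_DEL on FIN or RST) can be
sketched as below. The enum values, struct toy_tcp_flags and the precedence
of FIN/RST over SYN are assumptions, not taken from the driver.

```c
#include <stdbool.h>

/* Hypothetical encodings; the real TX BD field values are not shown above. */
enum toy_fd_op { TOY_FD_NONE = 0, TOY_FD_ADD, TOY_FD_DEL };

/* Hypothetical, reduced view of the TCP header flags. */
struct toy_tcp_flags {
	bool syn;
	bool fin;
	bool rst;
};

/* Pick the flow director operation to request for one outgoing segment.
 * Assumption: a teardown flag wins if it is combined with SYN.
 */
static enum toy_fd_op toy_fd_op_for(struct toy_tcp_flags flags)
{
	if (flags.fin || flags.rst)
		return TOY_FD_DEL;  /* connection is ending: remove the rule */
	if (flags.syn)
		return TOY_FD_ADD;  /* connection is starting: add a rule */
	return TOY_FD_NONE;         /* ordinary data segment: do nothing */
}
```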

[RFC net-next 7/9] net: hns3: add support for user-def data of flow director

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

For DEVICE_VERSION_V3, the hardware supports matching specified
data at a specified offset of the packet payload. Each layer can
have one offset, which can't be masked when configuring a flow
director rule with the ethtool command. The layer is chosen according
to the flow-type: ether for L2, ip4/ip6 for L3, and tcp4/tcp6/udp4/udp6
for L4. For example, tcp4/tcp6/udp4/udp6 rules share the same
user-def offset, but each rule can have its own user-def value.

The user-def field of the ethtool -N/U command is 64 bits long.
In the HNS3 driver, bits 0~15 are used for the user-def value, and
bits 32~47 for the user-def offset.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  14 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 301 -
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  36 +++
 3 files changed, 337 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index ff52a65..03eca23 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -243,6 +243,7 @@ enum hclge_opcode_type {
HCLGE_OPC_FD_KEY_CONFIG = 0x1202,
HCLGE_OPC_FD_TCAM_OP= 0x1203,
HCLGE_OPC_FD_AD_OP  = 0x1204,
+   HCLGE_OPC_FD_USER_DEF_OP= 0x1207,
 
/* MDIO command */
HCLGE_OPC_MDIO_CONFIG   = 0x1900,
@@ -1075,6 +1076,19 @@ struct hclge_fd_ad_config_cmd {
u8 rsv2[8];
 };
 
+#define HCLGE_FD_USER_DEF_OFT_S 0
+#define HCLGE_FD_USER_DEF_OFT_M GENMASK(14, 0)
+#define HCLGE_FD_USER_DEF_EN_B 15
+struct hclge_fd_user_def_cfg_cmd {
+   __le16 ol2_cfg;
+   __le16 l2_cfg;
+   __le16 ol3_cfg;
+   __le16 l3_cfg;
+   __le16 ol4_cfg;
+   __le16 l4_cfg;
+   u8 rsv[12];
+};
+
 struct hclge_get_m7_bd_cmd {
__le32 bd_num;
u8 rsv[20];
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index bbeb541..15998fc 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -414,7 +414,9 @@ static const struct key_info tuple_key_info[] = {
{ INNER_ETH_TYPE, 16, KEY_OPT_LE16,
  offsetof(struct hclge_fd_rule, tuples.ether_proto),
  offsetof(struct hclge_fd_rule, tuples_mask.ether_proto) },
-   { INNER_L2_RSV, 16, KEY_OPT_LE16, -1, -1 },
+   { INNER_L2_RSV, 16, KEY_OPT_LE16,
+ offsetof(struct hclge_fd_rule, tuples.l2_user_def),
+ offsetof(struct hclge_fd_rule, tuples_mask.l2_user_def) },
{ INNER_IP_TOS, 8, KEY_OPT_U8,
  offsetof(struct hclge_fd_rule, tuples.ip_tos),
  offsetof(struct hclge_fd_rule, tuples_mask.ip_tos) },
@@ -427,14 +429,18 @@ static const struct key_info tuple_key_info[] = {
{ INNER_DST_IP, 32, KEY_OPT_IP,
  offsetof(struct hclge_fd_rule, tuples.dst_ip),
  offsetof(struct hclge_fd_rule, tuples_mask.dst_ip) },
-   { INNER_L3_RSV, 16, KEY_OPT_LE16, -1, -1 },
+   { INNER_L3_RSV, 16, KEY_OPT_LE16,
+ offsetof(struct hclge_fd_rule, tuples.l3_user_def),
+ offsetof(struct hclge_fd_rule, tuples_mask.l3_user_def) },
{ INNER_SRC_PORT, 16, KEY_OPT_LE16,
  offsetof(struct hclge_fd_rule, tuples.src_port),
  offsetof(struct hclge_fd_rule, tuples_mask.src_port) },
{ INNER_DST_PORT, 16, KEY_OPT_LE16,
  offsetof(struct hclge_fd_rule, tuples.dst_port),
  offsetof(struct hclge_fd_rule, tuples_mask.dst_port) },
-   { INNER_L4_RSV, 32, KEY_OPT_LE32, -1, -1 },
+   { INNER_L4_RSV, 32, KEY_OPT_LE32,
+ offsetof(struct hclge_fd_rule, tuples.l4_user_def),
+ offsetof(struct hclge_fd_rule, tuples_mask.l4_user_def) },
 };
 
 static int hclge_mac_update_stats_defective(struct hclge_dev *hdev)
@@ -5110,15 +5116,75 @@ static void hclge_fd_insert_rule_node(struct hlist_head *hlist,
hlist_add_head(&rule->rule_node, hlist);
 }
 
+static int hclge_fd_inc_user_def_refcnt(struct hclge_dev *hdev,
+   struct hclge_fd_rule *rule)
+{
+   struct hclge_fd_user_def_info *info;
+   struct hclge_fd_user_def_cfg *cfg;
+
+   if (!rule || rule->rule_type != HCLGE_FD_EP_ACTIVE ||
+   rule->ep.user_def.layer == HCLGE_FD_USER_DEF_NONE)
+   return 0;
+
+   /* for valid layer is start from 1, so need minus 1 to get the cfg */
+   cfg = &hdev->fd_cfg.user_def_cfg[rule->ep.user_def.layer - 1];
+   info = &rule->ep.user_def;
+
+   if (cfg->ref_cnt && cfg->offset != info->offset) {
+   dev_err(&hdev->pdev->dev,
+   "No available offset for layer%d fd rule, each layer only support one user def offset.\n",
+
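
The diff above is cut off in the archive. Independently of it, the 64-bit
user-def layout described in the commit message (bits 0~15 for the value,
bits 32~47 for the offset) can be sketched in userspace as below. The toy_*
helper names are hypothetical and are not the driver's own.

```c
#include <stdint.h>

/* Bits 0..15 of the 64-bit user-def field carry the match value. */
static uint16_t toy_user_def_value(uint64_t user_def)
{
	return (uint16_t)(user_def & 0xffff);
}

/* Bits 32..47 carry the payload offset at which to match. */
static uint16_t toy_user_def_offset(uint64_t user_def)
{
	return (uint16_t)((user_def >> 32) & 0xffff);
}

/* Build a user-def field from an offset and a value. */
static uint64_t toy_user_def_pack(uint16_t offset, uint16_t value)
{
	return ((uint64_t)offset << 32) | value;
}
```

For example, offset 0x10 with value 0xabcd packs to 0x000000100000abcd.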

[RFC net-next 9/9] net: hns3: add queue bonding mode support for VF

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

For device version V3, the hardware supports queue bonding
mode. A VF cannot enable queue bonding mode unless the PF enables it.
So the VF needs to query whether the PF supports queue bonding mode
when initializing, and to query periodically whether the PF has
enabled it. Because resources are limited, and to avoid a VF
occupying too much FD rule space, only a trusted VF is allowed to
enable queue bonding mode.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h|  8 +++
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 46 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  2 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_cmd.c   |  2 +
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.c  | 74 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_main.h  |  7 ++
 .../ethernet/hisilicon/hns3/hns3vf/hclgevf_mbx.c   | 16 +
 7 files changed, 154 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
index 33defa4..797adc9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hclge_mbx.h
@@ -46,6 +46,8 @@ enum HCLGE_MBX_OPCODE {
HCLGE_MBX_PUSH_PROMISC_INFO,/* (PF -> VF) push vf promisc info */
HCLGE_MBX_VF_UNINIT,/* (VF -> PF) vf is unintializing */
HCLGE_MBX_HANDLE_VF_TBL,/* (VF -> PF) store/clear hw table */
+   HCLGE_MBX_SET_QB = 0x28,/* (VF -> PF) set queue bonding */
+   HCLGE_MBX_PUSH_QB_STATE,/* (PF -> VF) push qb state */
 
HCLGE_MBX_GET_VF_FLR_STATUS = 200, /* (M7 -> PF) get vf flr status */
HCLGE_MBX_PUSH_LINK_STATUS, /* (M7 -> PF) get port link status */
@@ -75,6 +77,12 @@ enum hclge_mbx_tbl_cfg_subcode {
HCLGE_MBX_VPORT_LIST_CLEAR,
 };
 
+enum hclge_mbx_qb_cfg_subcode {
+   HCLGE_MBX_QB_CHECK_CAPS = 0,/* query whether support qb */
+   HCLGE_MBX_QB_ENABLE,/* request pf enable qb */
+   HCLGE_MBX_QB_GET_STATE  /* query whether qb enabled */
+};
+
 #define HCLGE_MBX_MAX_MSG_SIZE 14
 #define HCLGE_MBX_MAX_RESP_DATA_SIZE   8U
 #define HCLGE_MBX_MAX_RING_CHAIN_PARAM_NUM 4
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index ee5881d..8852f2f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -4170,10 +4170,33 @@ static int hclge_sync_pf_qb_mode(struct hclge_dev *hdev)
return ret;
 }
 
+static int hclge_sync_vf_qb_mode(struct hclge_vport *vport)
+{
+   struct hclge_dev *hdev = vport->back;
+   bool request_enable = false;
+   int ret;
+
+   if (!test_and_clear_bit(HCLGE_VPORT_STATE_QB_CHANGE, &vport->state))
+   return 0;
+
+   if (vport->vf_info.trusted && vport->vf_info.request_qb_en &&
+   test_bit(HCLGE_STATE_HW_QB_ENABLE, &hdev->state))
+   request_enable = true;
+
+   ret = hclge_set_fd_qb(hdev, vport->vport_id, request_enable);
+   if (ret)
+   set_bit(HCLGE_VPORT_STATE_QB_CHANGE, &vport->state);
+   vport->vf_info.qb_en = request_enable ? 1 : 0;
+
+   return ret;
+}
+
 static int hclge_disable_fd_qb_mode(struct hclge_dev *hdev)
 {
struct hnae3_ae_dev *ae_dev = hdev->ae_dev;
+   struct hclge_vport *vport;
int ret;
+   u16 i;
 
if (!test_bit(HNAE3_DEV_SUPPORT_QB_B, ae_dev->caps) ||
!test_bit(HCLGE_STATE_HW_QB_ENABLE, &hdev->state))
@@ -4185,17 +4208,35 @@ static int hclge_disable_fd_qb_mode(struct hclge_dev *hdev)
 
clear_bit(HCLGE_STATE_HW_QB_ENABLE, &hdev->state);
 
+   for (i = 1; i < hdev->num_alloc_vport; i++) {
+   vport = &hdev->vport[i];
+   set_bit(HCLGE_VPORT_STATE_QB_CHANGE, &vport->state);
+   }
+
return 0;
 }
 
 static void hclge_sync_fd_qb_mode(struct hclge_dev *hdev)
 {
struct hnae3_ae_dev *ae_dev = hdev->ae_dev;
+   struct hclge_vport *vport;
+   int ret;
+   u16 i;
 
if (!test_bit(HNAE3_DEV_SUPPORT_QB_B, ae_dev->caps))
return;
 
-   hclge_sync_pf_qb_mode(hdev);
+   ret = hclge_sync_pf_qb_mode(hdev);
+   if (ret)
+   return;
+
+   for (i = 1; i < hdev->num_alloc_vport; i++) {
+   vport = &hdev->vport[i];
+
+   ret = hclge_sync_vf_qb_mode(vport);
+   if (ret)
+   return;
+   }
 }
 
 static void hclge_periodic_service_task(struct hclge_dev *hdev)
@@ -11485,6 +11526,9 @@ static int hclge_set_vf_trust(struct hnae3_handle *handle, int vf, bool enable)
 
vport->vf_info.trusted = new_trusted;
 
+   set_bit(HCLGE_VPORT_STATE_QB_CHANGE, &vport->state);
+   hclge_task_schedule(hdev, 0);
+
return 0;
 }
 
diff --git a/drivers/net/ethernet/hisilicon/

[RFC net-next 3/9] net: hns3: refactor for function hclge_fd_convert_tuple

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

Currently, hclge_fd_convert_tuple() has too many branches, and
there will be more as new tuples are added. Refactor it by sorting the
tuples according to their length, so that it only needs a few KEY_OPT
types now and new tuples are easy to add.
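
The table-driven idea can be sketched in userspace as below: each table
entry names a conversion type and an offsetof() into the rule, so one small
switch replaces a branch per tuple. All toy_* names are hypothetical, and
the two conversion types are a reduced stand-in for the KEY_OPT set in the
patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum toy_key_opt { TOY_OPT_U8, TOY_OPT_LE16 };

/* Hypothetical, reduced rule with one 8-bit and one 16-bit tuple. */
struct toy_rule {
	uint8_t ip_tos;
	uint16_t src_port;
};

struct toy_key_info {
	enum toy_key_opt opt; /* how to serialize the field */
	int offset;           /* offsetof() into struct toy_rule, -1 if unused */
};

static const struct toy_key_info toy_keys[] = {
	{ TOY_OPT_U8, offsetof(struct toy_rule, ip_tos) },
	{ TOY_OPT_LE16, offsetof(struct toy_rule, src_port) },
	{ TOY_OPT_LE16, -1 },
};

/* Serialize one tuple into out; returns bytes written (0 if unused). */
static size_t toy_convert(const struct toy_key_info *key,
			  const struct toy_rule *rule, uint8_t *out)
{
	const uint8_t *field;
	uint16_t v;

	if (key->offset < 0)
		return 0;

	field = (const uint8_t *)rule + key->offset;
	switch (key->opt) {
	case TOY_OPT_U8:
		out[0] = *field;
		return 1;
	case TOY_OPT_LE16:
		memcpy(&v, field, sizeof(v));
		out[0] = (uint8_t)(v & 0xff); /* little-endian: low byte first */
		out[1] = (uint8_t)(v >> 8);
		return 2;
	}
	return 0;
}
```

Adding a tuple then means adding one table row rather than another case in
the conversion function.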

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 189 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  12 ++
 2 files changed, 97 insertions(+), 104 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 8d313d5..8a3a2eb 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -384,36 +384,56 @@ static const struct key_info meta_data_key_info[] = {
 };
 
 static const struct key_info tuple_key_info[] = {
-   { OUTER_DST_MAC, 48},
-   { OUTER_SRC_MAC, 48},
-   { OUTER_VLAN_TAG_FST, 16},
-   { OUTER_VLAN_TAG_SEC, 16},
-   { OUTER_ETH_TYPE, 16},
-   { OUTER_L2_RSV, 16},
-   { OUTER_IP_TOS, 8},
-   { OUTER_IP_PROTO, 8},
-   { OUTER_SRC_IP, 32},
-   { OUTER_DST_IP, 32},
-   { OUTER_L3_RSV, 16},
-   { OUTER_SRC_PORT, 16},
-   { OUTER_DST_PORT, 16},
-   { OUTER_L4_RSV, 32},
-   { OUTER_TUN_VNI, 24},
-   { OUTER_TUN_FLOW_ID, 8},
-   { INNER_DST_MAC, 48},
-   { INNER_SRC_MAC, 48},
-   { INNER_VLAN_TAG_FST, 16},
-   { INNER_VLAN_TAG_SEC, 16},
-   { INNER_ETH_TYPE, 16},
-   { INNER_L2_RSV, 16},
-   { INNER_IP_TOS, 8},
-   { INNER_IP_PROTO, 8},
-   { INNER_SRC_IP, 32},
-   { INNER_DST_IP, 32},
-   { INNER_L3_RSV, 16},
-   { INNER_SRC_PORT, 16},
-   { INNER_DST_PORT, 16},
-   { INNER_L4_RSV, 32},
+   { OUTER_DST_MAC, 48, KEY_OPT_MAC, -1, -1 },
+   { OUTER_SRC_MAC, 48, KEY_OPT_MAC, -1, -1 },
+   { OUTER_VLAN_TAG_FST, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_VLAN_TAG_SEC, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_ETH_TYPE, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_L2_RSV, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_IP_TOS, 8, KEY_OPT_U8, -1, -1 },
+   { OUTER_IP_PROTO, 8, KEY_OPT_U8, -1, -1 },
+   { OUTER_SRC_IP, 32, KEY_OPT_IP, -1, -1 },
+   { OUTER_DST_IP, 32, KEY_OPT_IP, -1, -1 },
+   { OUTER_L3_RSV, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_SRC_PORT, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_DST_PORT, 16, KEY_OPT_LE16, -1, -1 },
+   { OUTER_L4_RSV, 32, KEY_OPT_LE32, -1, -1 },
+   { OUTER_TUN_VNI, 24, KEY_OPT_VNI, -1, -1 },
+   { OUTER_TUN_FLOW_ID, 8, KEY_OPT_U8, -1, -1 },
+   { INNER_DST_MAC, 48, KEY_OPT_MAC,
+ offsetof(struct hclge_fd_rule, tuples.dst_mac),
+ offsetof(struct hclge_fd_rule, tuples_mask.dst_mac) },
+   { INNER_SRC_MAC, 48, KEY_OPT_MAC,
+ offsetof(struct hclge_fd_rule, tuples.src_mac),
+ offsetof(struct hclge_fd_rule, tuples_mask.src_mac) },
+   { INNER_VLAN_TAG_FST, 16, KEY_OPT_LE16,
+ offsetof(struct hclge_fd_rule, tuples.vlan_tag1),
+ offsetof(struct hclge_fd_rule, tuples_mask.vlan_tag1) },
+   { INNER_VLAN_TAG_SEC, 16, KEY_OPT_LE16, -1, -1 },
+   { INNER_ETH_TYPE, 16, KEY_OPT_LE16,
+ offsetof(struct hclge_fd_rule, tuples.ether_proto),
+ offsetof(struct hclge_fd_rule, tuples_mask.ether_proto) },
+   { INNER_L2_RSV, 16, KEY_OPT_LE16, -1, -1 },
+   { INNER_IP_TOS, 8, KEY_OPT_U8,
+ offsetof(struct hclge_fd_rule, tuples.ip_tos),
+ offsetof(struct hclge_fd_rule, tuples_mask.ip_tos) },
+   { INNER_IP_PROTO, 8, KEY_OPT_U8,
+ offsetof(struct hclge_fd_rule, tuples.ip_proto),
+ offsetof(struct hclge_fd_rule, tuples_mask.ip_proto) },
+   { INNER_SRC_IP, 32, KEY_OPT_IP,
+ offsetof(struct hclge_fd_rule, tuples.src_ip),
+ offsetof(struct hclge_fd_rule, tuples_mask.src_ip) },
+   { INNER_DST_IP, 32, KEY_OPT_IP,
+ offsetof(struct hclge_fd_rule, tuples.dst_ip),
+ offsetof(struct hclge_fd_rule, tuples_mask.dst_ip) },
+   { INNER_L3_RSV, 16, KEY_OPT_LE16, -1, -1 },
+   { INNER_SRC_PORT, 16, KEY_OPT_LE16,
+ offsetof(struct hclge_fd_rule, tuples.src_port),
+ offsetof(struct hclge_fd_rule, tuples_mask.src_port) },
+   { INNER_DST_PORT, 16, KEY_OPT_LE16,
+ offsetof(struct hclge_fd_rule, tuples.dst_port),
+ offsetof(struct hclge_fd_rule, tuples_mask.dst_port) },
+   { INNER_L4_RSV, 32, KEY_OPT_LE32, -1, -1 },
 };
 
 static int hclge_mac_update_stats_defective(struct hclge_dev *hdev)
@@ -5225,96 +5245,57 @@ static int hclge_fd_ad_config(struct hclge_dev *hdev, 
u8 stage, int loc,
 static bool hclge_fd_convert_tuple(u32 tuple_bit, u8 *key_x, u8 *key_y,
   struct hclge_fd_rule *rule)
 {
+   int offset, moffset, ip_offset;
+   enum HCLGE_FD_KEY_OPT key_opt;
u16 tmp_x_s, tmp_y_s;
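
The table-driven idea in this patch can be illustrated with a small
standalone sketch. Everything named demo_* below is hypothetical and only
mirrors the shape of tuple_key_info[]; it is not the driver's code, and the
real conversion also handles byte order and the x/y key encoding, which
this sketch leaves out. Each table entry stores a key option plus the
offsetof() of the value and of the mask, so one generic routine can fetch
both without a branch per tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the driver's types; not the real layout. */
enum demo_key_opt { DEMO_OPT_U8, DEMO_OPT_U16, DEMO_OPT_U32 };

struct demo_tuples {
	uint32_t src_ip;
	uint16_t src_port;
	uint8_t ip_tos;
};

struct demo_rule {
	struct demo_tuples tuples;
	struct demo_tuples tuples_mask;
};

struct demo_key_info {
	enum demo_key_opt opt;
	int offset;	/* offset of the value in demo_rule, -1 if unused */
	int moffset;	/* offset of the mask in demo_rule, -1 if unused */
};

static const struct demo_key_info demo_table[] = {
	{ DEMO_OPT_U8,
	  offsetof(struct demo_rule, tuples.ip_tos),
	  offsetof(struct demo_rule, tuples_mask.ip_tos) },
	{ DEMO_OPT_U16,
	  offsetof(struct demo_rule, tuples.src_port),
	  offsetof(struct demo_rule, tuples_mask.src_port) },
	{ DEMO_OPT_U32,
	  offsetof(struct demo_rule, tuples.src_ip),
	  offsetof(struct demo_rule, tuples_mask.src_ip) },
	{ DEMO_OPT_U16, -1, -1 },	/* reserved tuple, never matched */
};

/* Read a field of the width implied by opt at a byte offset in the rule. */
static uint32_t demo_read(const struct demo_rule *rule, int off,
			  enum demo_key_opt opt)
{
	const unsigned char *p = (const unsigned char *)rule + off;
	uint8_t v8;
	uint16_t v16;
	uint32_t v32 = 0;

	switch (opt) {
	case DEMO_OPT_U8:
		memcpy(&v8, p, sizeof(v8));
		v32 = v8;
		break;
	case DEMO_OPT_U16:
		memcpy(&v16, p, sizeof(v16));
		v32 = v16;
		break;
	case DEMO_OPT_U32:
		memcpy(&v32, p, sizeof(v32));
		break;
	}
	return v32;
}

/* Returns 1 and stores (value & mask) in *out; returns 0 for unused tuples. */
int demo_masked_value(const struct demo_rule *rule, size_t idx, uint32_t *out)
{
	const struct demo_key_info *info;

	assert(idx < sizeof(demo_table) / sizeof(demo_table[0]));
	info = &demo_table[idx];
	if (info->offset < 0 || info->moffset < 0)
		return 0;

	*out = demo_read(rule, info->offset, info->opt) &
	       demo_read(rule, info->moffset, info->opt);
	return 1;
}
```

Adding a tuple in this scheme means adding one table row; the switch only
grows when a new field width appears.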
 

[RFC net-next 2/9] net: hns3: refactor out hclge_fd_get_tuple()

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

hclge_fd_get_tuple() is long and hard to follow. To make it more
readable, move the handling of each flow type into its own
function.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 220 +++--
 1 file changed, 117 insertions(+), 103 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 3491698..8d313d5 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5789,144 +5789,158 @@ static int hclge_fd_update_rule_list(struct hclge_dev 
*hdev,
return 0;
 }
 
-static int hclge_fd_get_tuple(struct hclge_dev *hdev,
- struct ethtool_rx_flow_spec *fs,
- struct hclge_fd_rule *rule)
+static void hclge_fd_get_tcpip4_tuple(struct hclge_dev *hdev,
+ struct ethtool_rx_flow_spec *fs,
+ struct hclge_fd_rule *rule, u8 ip_proto)
 {
-   u32 flow_type = fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT);
+   rule->tuples.src_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->h_u.tcp_ip4_spec.ip4src);
+   rule->tuples_mask.src_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->m_u.tcp_ip4_spec.ip4src);
 
-   switch (flow_type) {
-   case SCTP_V4_FLOW:
-   case TCP_V4_FLOW:
-   case UDP_V4_FLOW:
-   rule->tuples.src_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->h_u.tcp_ip4_spec.ip4src);
-   rule->tuples_mask.src_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->m_u.tcp_ip4_spec.ip4src);
+   rule->tuples.dst_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->h_u.tcp_ip4_spec.ip4dst);
+   rule->tuples_mask.dst_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->m_u.tcp_ip4_spec.ip4dst);
 
-   rule->tuples.dst_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->h_u.tcp_ip4_spec.ip4dst);
-   rule->tuples_mask.dst_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->m_u.tcp_ip4_spec.ip4dst);
+   rule->tuples.src_port = be16_to_cpu(fs->h_u.tcp_ip4_spec.psrc);
+   rule->tuples_mask.src_port = be16_to_cpu(fs->m_u.tcp_ip4_spec.psrc);
 
-   rule->tuples.src_port = be16_to_cpu(fs->h_u.tcp_ip4_spec.psrc);
-   rule->tuples_mask.src_port =
-   be16_to_cpu(fs->m_u.tcp_ip4_spec.psrc);
+   rule->tuples.dst_port = be16_to_cpu(fs->h_u.tcp_ip4_spec.pdst);
+   rule->tuples_mask.dst_port = be16_to_cpu(fs->m_u.tcp_ip4_spec.pdst);
 
-   rule->tuples.dst_port = be16_to_cpu(fs->h_u.tcp_ip4_spec.pdst);
-   rule->tuples_mask.dst_port =
-   be16_to_cpu(fs->m_u.tcp_ip4_spec.pdst);
+   rule->tuples.ip_tos = fs->h_u.tcp_ip4_spec.tos;
+   rule->tuples_mask.ip_tos = fs->m_u.tcp_ip4_spec.tos;
 
-   rule->tuples.ip_tos = fs->h_u.tcp_ip4_spec.tos;
-   rule->tuples_mask.ip_tos = fs->m_u.tcp_ip4_spec.tos;
+   rule->tuples.ether_proto = ETH_P_IP;
+   rule->tuples_mask.ether_proto = 0xFFFF;
 
-   rule->tuples.ether_proto = ETH_P_IP;
-   rule->tuples_mask.ether_proto = 0xFFFF;
+   rule->tuples.ip_proto = ip_proto;
+   rule->tuples_mask.ip_proto = 0xFF;
+}
 
-   break;
-   case IP_USER_FLOW:
-   rule->tuples.src_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->h_u.usr_ip4_spec.ip4src);
-   rule->tuples_mask.src_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->m_u.usr_ip4_spec.ip4src);
+static void hclge_fd_get_ip4_tuple(struct hclge_dev *hdev,
+  struct ethtool_rx_flow_spec *fs,
+  struct hclge_fd_rule *rule)
+{
+   rule->tuples.src_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->h_u.usr_ip4_spec.ip4src);
+   rule->tuples_mask.src_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->m_u.usr_ip4_spec.ip4src);
 
-   rule->tuples.dst_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->h_u.usr_ip4_spec.ip4dst);
-   rule->tuples_mask.dst_ip[IPV4_INDEX] =
-   be32_to_cpu(fs->m_u.usr_ip4_spec.ip4dst);
+   rule->tuples.dst_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->h_u.usr_ip4_spec.ip4dst);
+   rule->tuples_mask.dst_ip[IPV4_INDEX] =
+   be32_to_cpu(fs->m_u.usr_ip4_spec.ip4dst);
 
-   rule->tuples.ip_tos = fs->h_u.usr_ip4_spec.tos;
-   rule->tuples_mask.ip_tos = fs->m_u.usr_ip4_spec.tos;
+   rule->tuples.ip_tos = fs->h_u.usr_ip4_spec.tos;
+   rule->tuples_mask.ip_tos = fs->m_u.usr_ip4_spec.tos;
 
-   rule->tu

Re: [PATCH] net/mlx5e: fix mlx5e_tc_tun_update_header_ipv6 dummy definition

2021-03-02 Thread Saeed Mahameed
On Mon, 2021-03-01 at 11:57 +0200, Vlad Buslov wrote:
> On Thu 25 Feb 2021 at 14:54, Arnd Bergmann  wrote:
> > From: Arnd Bergmann 
> > 
> > The alternative implementation of this function in a header file
> > is declared as a global symbol, and gets added to every .c file
> > that includes it, which leads to a link error:
> > 
> > arm-linux-gnueabi-ld:
> > drivers/net/ethernet/mellanox/mlx5/core/en_rx.o: in function
> > `mlx5e_tc_tun_update_header_ipv6':
> > en_rx.c:(.text+0x0): multiple definition of
> > `mlx5e_tc_tun_update_header_ipv6';
> > drivers/net/ethernet/mellanox/mlx5/core/en_main.o:en_main.c:(.text+
> > 0x0): first defined here
> > 
> > Mark it 'static inline' like the other functions here.
> > 
> > Fixes: c7b9038d8af6 ("net/mlx5e: TC preparation refactoring for
> > routing update event")
> > Signed-off-by: Arnd Bergmann 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h | 10 ++---
> > -
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
> > b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
> > index 67de2bf36861..89d5ca91566e 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/tc_tun.h
> > @@ -76,10 +76,12 @@ int mlx5e_tc_tun_update_header_ipv6(struct
> > mlx5e_priv *priv,
> >  static inline int
> >  mlx5e_tc_tun_create_header_ipv6(struct mlx5e_priv *priv,
> > struct net_device *mirred_dev,
> > -   struct mlx5e_encap_entry *e) {
> > return -EOPNOTSUPP; }
> > -int mlx5e_tc_tun_update_header_ipv6(struct mlx5e_priv *priv,
> > -   struct net_device *mirred_dev,
> > -   struct mlx5e_encap_entry *e)
> > +   struct mlx5e_encap_entry *e)
> > +{ return -EOPNOTSUPP; }
> > +static inline int
> > +mlx5e_tc_tun_update_header_ipv6(struct mlx5e_priv *priv,
> > +   struct net_device *mirred_dev,
> > +   struct mlx5e_encap_entry *e)
> >  { return -EOPNOTSUPP; }
> >  #endif
> >  int mlx5e_tc_tun_route_lookup(struct mlx5e_priv *priv,
> 
> Thanks Arnd!
> 
> Reviewed-by: Vlad Buslov 

Applied to net-mlx5, 

Thanks.




[RFC net-next 5/9] net: hns3: refactor flow director configuration

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

Currently, there are 3 flow director work modes in HNS3 driver,
include EP(ethtool), tc flower and aRFS. The flow director rules
are configured synchronously and need holding spin lock. With this
limitation, all the commands with firmware are also needed to use
spin lock.

To eliminate the limitation, configure flow director rules
asynchronously. The rules are still kept in the fd_rule_list
with below states.
TO_ADD: the rule is waiting to add to hardware
TO_DEL: the rule is waiting to remove from hardware
ADDING: the rule is adding to hardware
ACTIVE: the rule is already added in hardware

When receive a new request to add or delete flow director rule,
check whether the rule location is existent, update the rule
content and state, and request to schedule the service task to
finish the configuration.
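
As a rough illustration of the four states, here is a standalone sketch of
how a delete request could be dispatched on the state of an existing rule.
It follows the comment in hclge_update_fd_rule_node() in the diff below;
the demo_* names and the action enum are hypothetical and are not taken
from the driver.

```c
#include <assert.h>

/* Hypothetical mirror of the four rule states named in the commit message. */
enum demo_fd_state {
	DEMO_FD_TO_ADD,
	DEMO_FD_TO_DEL,
	DEMO_FD_ADDING,
	DEMO_FD_ACTIVE,
};

/* What the caller should do with the existing list node. */
enum demo_fd_action {
	DEMO_FD_KEEP,		/* nothing to do */
	DEMO_FD_MARK_TO_DEL,	/* let the periodic task remove it from hardware */
	DEMO_FD_UNLINK,		/* drop the node from the list right away */
};

/* Decide how a TO_DEL request affects a rule already in the list. */
enum demo_fd_action demo_on_delete_request(enum demo_fd_state old_state)
{
	switch (old_state) {
	case DEMO_FD_TO_DEL:
		/* already scheduled for removal; deletion is by location */
		return DEMO_FD_KEEP;
	case DEMO_FD_ACTIVE:
		/* rule is in hardware, so removal must go through the task */
		return DEMO_FD_MARK_TO_DEL;
	case DEMO_FD_TO_ADD:
		/* never reached hardware; just forget it */
	case DEMO_FD_ADDING:
		/* in flight; the add path deals with the result later */
		return DEMO_FD_UNLINK;
	}
	assert(0 && "unknown rule state");
	return DEMO_FD_KEEP;
}
```

Keeping the decision in one pure function like this makes the transitions
easy to test without any list or lock handling.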

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 629 ++---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|  10 +
 2 files changed, 420 insertions(+), 219 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 0a121ee..8ba07cf 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -62,7 +62,7 @@ static void hclge_sync_vlan_filter(struct hclge_dev *hdev);
 static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev);
 static bool hclge_get_hw_reset_stat(struct hnae3_handle *handle);
 static void hclge_rfs_filter_expire(struct hclge_dev *hdev);
-static void hclge_clear_arfs_rules(struct hnae3_handle *handle);
+static void hclge_clear_arfs_rules(struct hclge_dev *hdev);
 static enum hnae3_reset_type hclge_get_reset_level(struct hnae3_ae_dev *ae_dev,
   unsigned long *addr);
 static int hclge_set_default_loopback(struct hclge_dev *hdev);
@@ -70,6 +70,7 @@ static int hclge_set_default_loopback(struct hclge_dev *hdev);
 static void hclge_sync_mac_table(struct hclge_dev *hdev);
 static void hclge_restore_hw_table(struct hclge_dev *hdev);
 static void hclge_sync_promisc_mode(struct hclge_dev *hdev);
+static void hclge_sync_fd_table(struct hclge_dev *hdev);
 
 static struct hnae3_ae_algo ae_algo;
 
@@ -4115,6 +4116,7 @@ static void hclge_periodic_service_task(struct hclge_dev 
*hdev)
hclge_update_link_status(hdev);
hclge_sync_mac_table(hdev);
hclge_sync_promisc_mode(hdev);
+   hclge_sync_fd_table(hdev);
 
if (time_is_after_jiffies(hdev->last_serv_processed + HZ)) {
delta = jiffies - hdev->last_serv_processed;
@@ -5016,6 +5018,198 @@ static void hclge_request_update_promisc_mode(struct 
hnae3_handle *handle)
set_bit(HCLGE_STATE_PROMISC_CHANGED, &hdev->state);
 }
 
+static void hclge_sync_fd_state(struct hclge_dev *hdev)
+{
+   if (hlist_empty(&hdev->fd_rule_list))
+   hdev->fd_active_type = HCLGE_FD_RULE_NONE;
+}
+
+static void hclge_update_fd_rule_node(struct hclge_dev *hdev,
+ struct hclge_fd_rule *old_rule,
+ struct hclge_fd_rule *new_rule,
+ enum HCLGE_FD_NODE_STATE state)
+{
+   switch (state) {
+   case HCLGE_FD_TO_ADD:
+   /* if new request is TO_ADD, we should configure the
+* new rule to hardware, no matter what the state of
+* old rule is. Even though the old rule is already
+* configured in the hardware, the new rule will replace
+* it.
+*/
+   new_rule->rule_node.next = old_rule->rule_node.next;
+   new_rule->rule_node.pprev = old_rule->rule_node.pprev;
+   memcpy(old_rule, new_rule, sizeof(*old_rule));
+   kfree(new_rule);
+   break;
+   case HCLGE_FD_TO_DEL:
+   /* if new request is TO_DEL, and old rule is existent
+* 1) the state of old rule is TO_DEL, we need do nothing,
+* because we delete rule by location, other rule content
+* is unnecessary.
+* 2) the state of old rule is ACTIVE, we need to change its
+* state to TO_DEL, so the rule will be deleted when periodic
+* task being scheduled.
+* 3) the state of old rule is TO_ADD, it means the rule hasn't
+* been added to hardware, so we just delete the rule node from
+* fd_rule_list directly.
+* 4) the state of old rule is ADDING, it means the rule is
+* being configured to hardware. We also delete the rule node
+* from fd_rule_list directly, and will handle configuration
+* result of old rule in hclge_fd_sync_from_add_list().
+*/
+   if (old_rule->state 

Re: [RFC v4 01/11] eventfd: Increase the recursion depth of eventfd_signal()

2021-03-02 Thread Jason Wang



On 2021/2/23 7:50 PM, Xie Yongji wrote:

Increase the recursion depth of eventfd_signal() to 1. This
is the maximum recursion depth we have found so far.

Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 

It might be useful to explain how/when we can reach this condition.

Thanks



---
  fs/eventfd.c| 2 +-
  include/linux/eventfd.h | 5 -
  2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..cc7cd1dbedd3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 * it returns true, the eventfd_signal() call should be deferred to a
 * safe context.
 */
-   if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+   if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
return 0;
  
  	spin_lock_irqsave(&ctx->wqh.lock, flags);

diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..886d99cd38ef 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
  
+/* Maximum recursion depth */

+#define EFD_WAKE_DEPTH 1
+
  struct eventfd_ctx;
  struct file;
  
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
  
  static inline bool eventfd_signal_count(void)

  {
-   return this_cpu_read(eventfd_wake_count);
+   return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
  }
  
  #else /* CONFIG_EVENTFD */
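
For readers who want to see what "recursion depth of 1" means in practice,
here is a userspace analogue of the check. It is only a sketch: demo_signal()
and DEMO_WAKE_DEPTH are made-up names, a thread-local counter stands in for
the per-CPU eventfd_wake_count, and the nested call stands in for a wakeup
callback that signals again.

```c
#include <assert.h>

#define DEMO_WAKE_DEPTH 1	/* maximum nesting allowed, like EFD_WAKE_DEPTH */

static _Thread_local int demo_wake_count;

/*
 * Run one "signal" and let it try to recurse 'nest' more times.
 * Returns the number of levels that actually ran.
 */
int demo_signal(int nest)
{
	int ran;

	assert(demo_wake_count >= 0);
	if (demo_wake_count > DEMO_WAKE_DEPTH)
		return 0;	/* too deep: refuse, as the patched check does */

	demo_wake_count++;
	ran = 1;
	if (nest > 0)
		ran += demo_signal(nest - 1);	/* callback signalling again */
	demo_wake_count--;
	return ran;
}
```

With the old "count != 0" test only the outermost call would run; with
"count > 1" exactly one nested level is allowed and anything deeper is
rejected.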




Re: [RFC v4 02/11] vhost-vdpa: protect concurrent access to vhost device iotlb

2021-03-02 Thread Jason Wang



On 2021/2/23 7:50 PM, Xie Yongji wrote:

Use vhost_dev->mutex to protect vhost device iotlb from
concurrent access.

Fixes: 4c8cf318("vhost: introduce vDPA-based backend")
Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index c50079dfb281..5500e3bf05c1 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -723,6 +723,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
if (r)
return r;
  
+	mutex_lock(&dev->mutex);



I think this should be done before the vhost_dev_check_owner() above.

Thanks



switch (msg->type) {
case VHOST_IOTLB_UPDATE:
r = vhost_vdpa_process_iotlb_update(v, msg);
@@ -742,6 +743,7 @@ static int vhost_vdpa_process_iotlb_msg(struct vhost_dev 
*dev,
r = -EINVAL;
break;
}
+   mutex_unlock(&dev->mutex);
  
  	return r;

  }




Re: [RFC v4 04/11] vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()

2021-03-02 Thread Jason Wang



On 2021/2/23 7:50 PM, Xie Yongji wrote:

Add an opaque pointer for DMA mapping.

Suggested-by: Jason Wang 
Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 



---
  drivers/vdpa/vdpa_sim/vdpa_sim.c | 6 +++---
  drivers/vhost/vdpa.c | 2 +-
  include/linux/vdpa.h | 2 +-
  3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index d5942842432d..5cfc262ce055 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -512,14 +512,14 @@ static int vdpasim_set_map(struct vdpa_device *vdpa,
  }
  
  static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,

-  u64 pa, u32 perm)
+  u64 pa, u32 perm, void *opaque)
  {
struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
int ret;
  
  	spin_lock(&vdpasim->iommu_lock);

-   ret = vhost_iotlb_add_range(vdpasim->iommu, iova, iova + size - 1, pa,
-   perm);
+   ret = vhost_iotlb_add_range_ctx(vdpasim->iommu, iova, iova + size - 1,
+   pa, perm, opaque);
spin_unlock(&vdpasim->iommu_lock);
  
  	return ret;

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 5500e3bf05c1..70857fe3263c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -544,7 +544,7 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
return r;
  
  	if (ops->dma_map) {

-   r = ops->dma_map(vdpa, iova, size, pa, perm);
+   r = ops->dma_map(vdpa, iova, size, pa, perm, NULL);
} else if (ops->set_map) {
if (!v->in_batch)
r = ops->set_map(vdpa, dev->iotlb);
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 4ab5494503a8..93dca2c328ae 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -241,7 +241,7 @@ struct vdpa_config_ops {
/* DMA ops */
int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
-  u64 pa, u32 perm);
+  u64 pa, u32 perm, void *opaque);
int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
  
  	/* Free device resources */




Re: [PATCH net-next RFC v4] net: hdlc_x25: Queue outgoing LAPB frames

2021-03-02 Thread Martin Schiller

On 2021-03-01 09:56, Xie He wrote:

On Sun, Feb 28, 2021 at 10:56 PM Martin Schiller  wrote:


>> Also, I have a hard time assessing if such a wrap is really
>> enforceable.
>
> Sorry. I don't understand what you mean. What "wrap" are you referring
> to?

I mean the change from only one hdlc interface to both hdlc and
hdlc_x25.

I can't estimate how many users are out there or what their setups look
like.

I'm also thinking about solving this issue by adding new APIs to the
HDLC subsystem (hdlc_stop_queue / hdlc_wake_queue) for hardware
drivers to call instead of netif_stop_queue / netif_wake_queue. This
way we can preserve backward compatibility.

However I'm reluctant to change the code of all the hardware drivers
because I'm afraid of introducing bugs, etc. When I look at the code
of "wan/lmc/lmc_main.c", I feel I'm not able to make sure there are no
bugs (related to stop_queue / wake_queue) after my change (and even
before my change, actually). There are even serious style problems:
the majority of its lines are indented by spaces.

So I don't want to mess with all the hardware drivers. Hardware driver
developers (if they wish to properly support hdlc_x25) should do the
change themselves. This is not a problem for me, because I use my own
out-of-tree hardware driver. However if I add APIs with no user code
in the kernel, other developers may think these APIs are not
necessary.


I don't think a change that affects the entire HDLC subsystem is
justified, since the actual problem only affects the hdlc_x25 area.

The approach with the additional hdlc_x25 is clean and purposeful and
I personally could live with it.

I just don't see myself in the position to decide such a change at the
moment.

@Jakub: What is your opinion on this?


Re: [PATCH] iwlwifi: fix ARCH=i386 compilation warnings

2021-03-02 Thread Coelho, Luciano
On Tue, 2021-03-02 at 07:58 +0200, Kalle Valo wrote:
> Pierre-Louis Bossart  writes:
> 
> > An unsigned long variable should rely on '%lu' format strings, not '%zd'
> > 
> > Fixes: a1a6a4cf49ece ("iwlwifi: pnvm: implement reading PNVM from UEFI")
> > Signed-off-by: Pierre-Louis Bossart 
> > ---
> > warnings found with v5.12-rc1 and next-20210301
> 
> Luca, can I take this to wireless-drivers?

Yes, please.

Acked-by: Luca Coelho 

--
Cheers,
Luca.


Re: [PATCH v2] can: c_can: move runtime PM enable/disable to c_can_platform

2021-03-02 Thread Marc Kleine-Budde
On 3/2/21 3:55 AM, Tong Zhang wrote:
> Currently doing modprobe c_can_pci will make kernel complain
> "Unbalanced pm_runtime_enable!", this is caused by pm_runtime_enable()
> called before pm is initialized.
> This fix is similar to 227619c3ff7c, move those pm_enable/disable code to
> c_can_platform.
> 
> Signed-off-by: Tong Zhang 

Applied to linux-can/testing.

Thanks,
Marc

-- 
Pengutronix e.K. | Marc Kleine-Budde   |
Embedded Linux   | https://www.pengutronix.de  |
Vertretung West/Dortmund | Phone: +49-231-2826-924 |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917- |





Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-03-02 Thread Jisheng Zhang
On Tue, 2 Mar 2021 13:38:08 +0900 Dominique Martinet wrote:

> 
> 
> Jisheng Zhang wrote on Mon, Mar 01, 2021 at 11:01:57AM +0800:
> > Per my understanding of iov_iter, we need to call iov_iter_advance()
> > even when the read out count is 0. I believe we can see this common style
> > in other fs.  
> 
> I'm not sure where you see this style, but I don't see exceptions for
> 0-sized read not advancing the iov in general, and I guess this makes
> sense.

For example, see dio_refill_pages() in fs/direct-io.c, and the code below
from net/core/datagram.c:

copied = iov_iter_get_pages(from, pages, length,
MAX_SKB_FRAGS - frag, &start);
if (copied < 0)
return -EFAULT;

iov_iter_advance(from, copied);

As can be seen, in the "copied >= 0" case, iov_iter_advance() is still called.

> 
> 
> Rather than make an exception for 0, how about just removing the if as
> follow ?

IMHO, we may need to keep the "if" in the current logic. When count
reaches zero, we need to break out of the "while (iov_iter_count(to))" loop,
so removing the "if" would change the logic.

> 
> I've checked that the non_zc case (copy_to_iter with 0 size) also works
> to the same effect, so I'm not sure why the check got added in the
> first place... But then again this is old code so maybe the semantics
> changed since 2015.
> 
> 
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index 4f62f299da0c..0a0039255c5b 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1623,11 +1623,6 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, 
> struct iov_iter *to,
> }
> 
> p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
> -   if (!count) {
> -   p9_tag_remove(clnt, req);
> -   return 0;
> -   }
> -
> if (non_zc) {
> int n = copy_to_iter(dataptr, count, to);
> 
> 
> 
> 
> If you're ok with that, would you mind resending that way?
> 
> I'd also want the commit message to be reworded a bit, at least the
> first line (summary) doesn't make sense right now: I have no idea
> what you mean by "free what was emitted".
> Just "9p: advance iov on empty read" or something similar would do.

Thanks for the suggestion. I will send a v2 to update the commit msg but
keep the patch as is, if you agree with keeping the "if" as argued above.
> 
> 
> > > cat version? coreutils' doesn't seem to do that on their git)  
> >
> > busybox cat  
> 
> Ok, could reproduce with busybox cat, thanks.
> As expected I can't reproduce with older kernels so will run a bisect
> for the sake of it as time allows
> 

Thanks


[RFC net-next 4/9] net: hns3: add support for traffic class tuple support for flow director by ethtool

2021-03-02 Thread Huazhong Tan
From: Jian Shen 

The hardware can parse and match the traffic class field of IPv6
packets for the flow director, using the same tuple as IP TOS.
So remove the driver's restriction on configuring 'tclass'.

Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 27 --
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 8a3a2eb..0a121ee 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -5519,8 +5519,7 @@ static int hclge_fd_check_tcpip6_tuple(struct 
ethtool_tcpip6_spec *spec,
if (!spec || !unused_tuple)
return -EINVAL;
 
-   *unused_tuple |= BIT(INNER_SRC_MAC) | BIT(INNER_DST_MAC) |
-   BIT(INNER_IP_TOS);
+   *unused_tuple |= BIT(INNER_SRC_MAC) | BIT(INNER_DST_MAC);
 
/* check whether src/dst ip address used */
if (ipv6_addr_any((struct in6_addr *)spec->ip6src))
@@ -5535,8 +5534,8 @@ static int hclge_fd_check_tcpip6_tuple(struct 
ethtool_tcpip6_spec *spec,
if (!spec->pdst)
*unused_tuple |= BIT(INNER_DST_PORT);
 
-   if (spec->tclass)
-   return -EOPNOTSUPP;
+   if (!spec->tclass)
+   *unused_tuple |= BIT(INNER_IP_TOS);
 
return 0;
 }
@@ -5548,7 +5547,7 @@ static int hclge_fd_check_ip6_tuple(struct 
ethtool_usrip6_spec *spec,
return -EINVAL;
 
*unused_tuple |= BIT(INNER_SRC_MAC) | BIT(INNER_DST_MAC) |
-   BIT(INNER_IP_TOS) | BIT(INNER_SRC_PORT) | BIT(INNER_DST_PORT);
+   BIT(INNER_SRC_PORT) | BIT(INNER_DST_PORT);
 
/* check whether src/dst ip address used */
if (ipv6_addr_any((struct in6_addr *)spec->ip6src))
@@ -5560,8 +5559,8 @@ static int hclge_fd_check_ip6_tuple(struct 
ethtool_usrip6_spec *spec,
if (!spec->l4_proto)
*unused_tuple |= BIT(INNER_IP_PROTO);
 
-   if (spec->tclass)
-   return -EOPNOTSUPP;
+   if (!spec->tclass)
+   *unused_tuple |= BIT(INNER_IP_TOS);
 
if (spec->l4_4_bytes)
return -EOPNOTSUPP;
@@ -5847,6 +5846,9 @@ static void hclge_fd_get_tcpip6_tuple(struct hclge_dev 
*hdev,
rule->tuples.ether_proto = ETH_P_IPV6;
rule->tuples_mask.ether_proto = 0xFFFF;
 
+   rule->tuples.ip_tos = fs->h_u.tcp_ip6_spec.tclass;
+   rule->tuples_mask.ip_tos = fs->m_u.tcp_ip6_spec.tclass;
+
rule->tuples.ip_proto = ip_proto;
rule->tuples_mask.ip_proto = 0xFF;
 }
@@ -5868,6 +5870,9 @@ static void hclge_fd_get_ip6_tuple(struct hclge_dev *hdev,
rule->tuples.ip_proto = fs->h_u.usr_ip6_spec.l4_proto;
rule->tuples_mask.ip_proto = fs->m_u.usr_ip6_spec.l4_proto;
 
+   rule->tuples.ip_tos = fs->h_u.tcp_ip6_spec.tclass;
+   rule->tuples_mask.ip_tos = fs->m_u.tcp_ip6_spec.tclass;
+
rule->tuples.ether_proto = ETH_P_IPV6;
rule->tuples_mask.ether_proto = 0xFFFF;
 }
@@ -6277,6 +6282,10 @@ static void hclge_fd_get_tcpip6_info(struct 
hclge_fd_rule *rule,
cpu_to_be32_array(spec_mask->ip6dst, rule->tuples_mask.dst_ip,
  IPV6_SIZE);
 
+   spec->tclass = rule->tuples.ip_tos;
+   spec_mask->tclass = rule->unused_tuple & BIT(INNER_IP_TOS) ?
+   0 : rule->tuples_mask.ip_tos;
+
spec->psrc = cpu_to_be16(rule->tuples.src_port);
spec_mask->psrc = rule->unused_tuple & BIT(INNER_SRC_PORT) ?
0 : cpu_to_be16(rule->tuples_mask.src_port);
@@ -6304,6 +6313,10 @@ static void hclge_fd_get_ip6_info(struct hclge_fd_rule 
*rule,
cpu_to_be32_array(spec_mask->ip6dst,
  rule->tuples_mask.dst_ip, IPV6_SIZE);
 
+   spec->tclass = rule->tuples.ip_tos;
+   spec_mask->tclass = rule->unused_tuple & BIT(INNER_IP_TOS) ?
+   0 : rule->tuples_mask.ip_tos;
+
spec->l4_proto = rule->tuples.ip_proto;
spec_mask->l4_proto = rule->unused_tuple & BIT(INNER_IP_PROTO) ?
0 : rule->tuples_mask.ip_proto;
-- 
2.7.4
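
The unused_tuple handling in this patch can be sketched in isolation. The
demo_* names, the struct layout and the bit numbering below are
hypothetical; the sketch only shows the pattern the diff uses: a zero field
in the spec marks the tuple as unused instead of being rejected, and an
unused tuple is reported back with a zero mask.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BIT(n) (1u << (n))

/* Hypothetical tuple bit positions; the driver's numbering differs. */
enum {
	DEMO_INNER_IP_TOS,
	DEMO_INNER_SRC_PORT,
	DEMO_INNER_DST_PORT,
};

static_assert(DEMO_INNER_DST_PORT < 32, "tuple bits must fit in a u32 mask");

struct demo_ip6_spec {
	uint16_t psrc;
	uint16_t pdst;
	uint8_t tclass;
};

/* Mark every zero field as unused; returns -1 on bad arguments. */
int demo_check_tuple(const struct demo_ip6_spec *spec, uint32_t *unused_tuple)
{
	if (!spec || !unused_tuple)
		return -1;

	if (!spec->psrc)
		*unused_tuple |= DEMO_BIT(DEMO_INNER_SRC_PORT);
	if (!spec->pdst)
		*unused_tuple |= DEMO_BIT(DEMO_INNER_DST_PORT);
	/* tclass is now accepted; zero just means "do not match on it" */
	if (!spec->tclass)
		*unused_tuple |= DEMO_BIT(DEMO_INNER_IP_TOS);

	return 0;
}

/* When reporting a rule back, an unused tuple yields a zero mask. */
uint8_t demo_report_tclass_mask(uint32_t unused_tuple, uint8_t stored_mask)
{
	return (unused_tuple & DEMO_BIT(DEMO_INNER_IP_TOS)) ? 0 : stored_mask;
}
```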



Re: [RFC v4 03/11] vhost-iotlb: Add an opaque pointer for vhost IOTLB

2021-03-02 Thread Jason Wang



On 2021/2/23 7:50 PM, Xie Yongji wrote:

Add an opaque pointer for vhost IOTLB. And introduce
vhost_iotlb_add_range_ctx() to accept it.

Suggested-by: Jason Wang 
Signed-off-by: Xie Yongji 



Acked-by: Jason Wang 



---
  drivers/vhost/iotlb.c   | 20 
  include/linux/vhost_iotlb.h |  3 +++
  2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/iotlb.c b/drivers/vhost/iotlb.c
index 0fd3f87e913c..5c99e1112cbb 100644
--- a/drivers/vhost/iotlb.c
+++ b/drivers/vhost/iotlb.c
@@ -36,19 +36,21 @@ void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
  EXPORT_SYMBOL_GPL(vhost_iotlb_map_free);
  
  /**

- * vhost_iotlb_add_range - add a new range to vhost IOTLB
+ * vhost_iotlb_add_range_ctx - add a new range to vhost IOTLB
   * @iotlb: the IOTLB
   * @start: start of the IOVA range
   * @last: last of IOVA range
   * @addr: the address that is mapped to @start
   * @perm: access permission of this range
+ * @opaque: the opaque pointer for the new mapping
   *
   * Returns an error last is smaller than start or memory allocation
   * fails
   */
-int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
- u64 start, u64 last,
- u64 addr, unsigned int perm)
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb,
+ u64 start, u64 last,
+ u64 addr, unsigned int perm,
+ void *opaque)
  {
struct vhost_iotlb_map *map;
  
@@ -71,6 +73,7 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,

map->last = last;
map->addr = addr;
map->perm = perm;
+   map->opaque = opaque;
  
  	iotlb->nmaps++;

vhost_iotlb_itree_insert(map, &iotlb->root);
@@ -80,6 +83,15 @@ int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
  
  	return 0;

  }
+EXPORT_SYMBOL_GPL(vhost_iotlb_add_range_ctx);
+
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
+ u64 start, u64 last,
+ u64 addr, unsigned int perm)
+{
+   return vhost_iotlb_add_range_ctx(iotlb, start, last,
+addr, perm, NULL);
+}
  EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
  
  /**

diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
index 6b09b786a762..2d0e2f52f938 100644
--- a/include/linux/vhost_iotlb.h
+++ b/include/linux/vhost_iotlb.h
@@ -17,6 +17,7 @@ struct vhost_iotlb_map {
u32 perm;
u32 flags_padding;
u64 __subtree_last;
+   void *opaque;
  };
  
  #define VHOST_IOTLB_FLAG_RETIRE 0x1

@@ -29,6 +30,8 @@ struct vhost_iotlb {
unsigned int flags;
  };
  
+int vhost_iotlb_add_range_ctx(struct vhost_iotlb *iotlb, u64 start, u64 last,

+ u64 addr, unsigned int perm, void *opaque);
  int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
  u64 addr, unsigned int perm);
  void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
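
The pattern used here, a new *_ctx() entry point that takes the extra
pointer plus a thin wrapper that keeps the old signature, can be shown with
a toy structure. All demo_* names are hypothetical, and a fixed array stands
in for the interval tree; this is a sketch of the API shape only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_MAPS 4

struct demo_map {
	uint64_t start;
	uint64_t last;
	uint64_t addr;
	unsigned int perm;
	void *opaque;	/* caller-owned cookie, NULL when unused */
};

struct demo_iotlb {
	struct demo_map maps[DEMO_MAX_MAPS];
	size_t nmaps;
};

/* New entry point: same as before, plus an opaque pointer per mapping. */
int demo_add_range_ctx(struct demo_iotlb *tlb, uint64_t start, uint64_t last,
		       uint64_t addr, unsigned int perm, void *opaque)
{
	struct demo_map *map;

	assert(tlb != NULL);
	if (last < start || tlb->nmaps == DEMO_MAX_MAPS)
		return -1;

	map = &tlb->maps[tlb->nmaps++];
	map->start = start;
	map->last = last;
	map->addr = addr;
	map->perm = perm;
	map->opaque = opaque;
	return 0;
}

/* Old entry point kept as a wrapper so existing callers do not change. */
int demo_add_range(struct demo_iotlb *tlb, uint64_t start, uint64_t last,
		   uint64_t addr, unsigned int perm)
{
	return demo_add_range_ctx(tlb, start, last, addr, perm, NULL);
}
```

Existing callers keep compiling unchanged, while new callers can attach
per-mapping context.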




Re: [Intel-wired-lan] [PATCH net 1/2] e1000e: Fix duplicate include guard

2021-03-02 Thread Dvora Fuxbrumer

On 22/02/2021 06:00, Tom Seewald wrote:

The include guard "_E1000_HW_H_" is used by header files in three
different drivers (e1000/e1000_hw.h, e1000e/hw.h, and igb/e1000_hw.h).
Using the same include guard macro in more than one header file may
cause unexpected behavior from the compiler. Fix the duplicate include
guard in the e1000e driver by renaming it.

Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver (currently for ICH9 
devices only)")
Signed-off-by: Tom Seewald 
---
  drivers/net/ethernet/intel/e1000e/hw.h | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/hw.h 
b/drivers/net/ethernet/intel/e1000e/hw.h
index 69a2329ea463..db79c4e6413e 100644
--- a/drivers/net/ethernet/intel/e1000e/hw.h
+++ b/drivers/net/ethernet/intel/e1000e/hw.h
@@ -1,8 +1,8 @@
  /* SPDX-License-Identifier: GPL-2.0 */
  /* Copyright(c) 1999 - 2018 Intel Corporation. */
  
-#ifndef _E1000_HW_H_
-#define _E1000_HW_H_
+#ifndef _E1000E_HW_H_
+#define _E1000E_HW_H_
  
  #include "regs.h"

  #include "defines.h"
@@ -714,4 +714,4 @@ struct e1000_hw {
  #include "80003es2lan.h"
  #include "ich8lan.h"
  
-#endif
+#endif /* _E1000E_HW_H_ */


Tested-by: Dvora Fuxbrumer 
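
To illustrate the failure mode the commit message describes, here is a minimal C sketch (not taken from the driver sources). Two header bodies share the guard macro _E1000_HW_H_ and a third uses the renamed _E1000E_HW_H_. The header bodies are pasted inline so the file is self-contained; the HAVE_* macros and both helper functions are hypothetical names used only for this illustration.

```c
/* --- stand-in for e1000/e1000_hw.h (hypothetical content) --- */
#ifndef _E1000_HW_H_
#define _E1000_HW_H_
#define HAVE_E1000_DEFS 1
#endif

/* --- stand-in for e1000e/hw.h before the patch: same guard macro --- */
#ifndef _E1000_HW_H_
#define _E1000_HW_H_
#define HAVE_E1000E_DEFS_OLD 1	/* never defined: guard is already set */
#endif

/* --- stand-in for e1000e/hw.h after the patch: unique guard macro --- */
#ifndef _E1000E_HW_H_
#define _E1000E_HW_H_
#define HAVE_E1000E_DEFS_NEW 1
#endif

/* Returns 1 if the body behind the duplicated guard was compiled in. */
int old_guard_body_visible(void)
{
#ifdef HAVE_E1000E_DEFS_OLD
	return 1;
#else
	return 0;
#endif
}

/* Returns 1 if the body behind the renamed guard was compiled in. */
int new_guard_body_visible(void)
{
#ifdef HAVE_E1000E_DEFS_NEW
	return 1;
#else
	return 0;
#endif
}
```

When both headers end up in one translation unit, the preprocessor silently skips the second body that reuses the guard, so old_guard_body_visible() returns 0 while new_guard_body_visible() returns 1.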


Re: [PATCH bpf-next 1/2] xsk: update rings for load-acquire/store-release semantics

2021-03-02 Thread Björn Töpel





On 2021-03-01 17:08, Toke Høiland-Jørgensen wrote:

Björn Töpel  writes:


From: Björn Töpel 

Currently, the AF_XDP rings use smp_{r,w,}mb() fences on the
kernel-side. By updating the rings for load-acquire/store-release
semantics, the full barrier on the consumer side can be replaced, with
improved performance as a nice side-effect.

Note that this change does *not* require similar changes on the
libbpf/userland side, however it is recommended [1].

On x86-64 systems, by removing the smp_mb() on the Rx and Tx side, the
l2fwd AF_XDP xdpsock sample performance increases by
1%. Weakly-ordered platforms, such as ARM64 might benefit even more.

[1] https://lore.kernel.org/bpf/20200316184423.GA14143@willie-the-truck/

Signed-off-by: Björn Töpel 
---
  net/xdp/xsk_queue.h | 27 +++
  1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 2823b7c3302d..e24279d8d845 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -47,19 +47,18 @@ struct xsk_queue {
u64 queue_empty_descs;
  };
  
-/* The structure of the shared state of the rings are the same as the
- * ring buffer in kernel/events/ring_buffer.c. For the Rx and completion
- * ring, the kernel is the producer and user space is the consumer. For
- * the Tx and fill rings, the kernel is the consumer and user space is
- * the producer.
+/* The structure of the shared state of the rings are a simple
+ * circular buffer, as outlined in
+ * Documentation/core-api/circular-buffers.rst. For the Rx and
+ * completion ring, the kernel is the producer and user space is the
+ * consumer. For the Tx and fill rings, the kernel is the consumer and
+ * user space is the producer.
   *
   * producer consumer
   *
- * if (LOAD ->consumer) {   LOAD ->producer
- *(A)   smp_rmb()   (C)
+ * if (LOAD ->consumer) {  (A)  LOAD.acq ->producer  (C)


Why is LOAD.acq not needed on the consumer side?



You mean why LOAD.acq is not needed on the *producer* side, i.e. the
->consumer? The ->consumer is a control dependency for the store, so
there is no ordering constraint for ->consumer at producer side. If
there's no space, no data is written. So, no barrier is needed there --
at least that has been my perspective.

This is very similar to the buffer in
Documentation/core-api/circular-buffers.rst. Roping in Paul for some
guidance.


Björn


-Toke
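
For readers following the ordering argument, here is a user-space sketch of a single-producer/single-consumer ring using C11 atomics. It is an illustration under stated assumptions, not the xsk_queue.h code: struct ring, RING_SIZE, ring_produce() and ring_consume() are hypothetical names. One deliberate difference from the kernel: per the ordering comment in the quoted patch, the kernel producer uses a plain LOAD of ->consumer and relies on the control dependency before its store. The C11 memory model does not give control dependencies that ordering, so this sketch uses an acquire load there.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical size; must be a power of two */

struct ring {
	_Atomic uint32_t producer;	/* written only by the producer */
	_Atomic uint32_t consumer;	/* written only by the consumer */
	uint32_t data[RING_SIZE];
};

/* Producer side: returns false if the ring is full. */
bool ring_produce(struct ring *r, uint32_t value)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	/* Acquire here stands in for the kernel's control dependency. */
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return false;

	r->data[prod & (RING_SIZE - 1)] = value;
	/* STORE.rel ->producer: publish the entry after the data is written. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_consume(struct ring *r, uint32_t *out)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	/* LOAD.acq ->producer: pairs with the producer's release store. */
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (cons == prod)
		return false;

	*out = r->data[cons & (RING_SIZE - 1)];
	/* STORE.rel ->consumer: hand the slot back after the data is read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return true;
}
```

The acquire load of ->producer pairs with the release store of ->producer, which is the pairing that replaces the explicit fences on the consumer side.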



Re: [PATCH bpf-next 2/2] libbpf, xsk: add libbpf_smp_store_release libbpf_smp_load_acquire

2021-03-02 Thread Björn Töpel

On 2021-03-01 17:10, Toke Høiland-Jørgensen wrote:

Björn Töpel  writes:


From: Björn Töpel 

Now that the AF_XDP rings have load-acquire/store-release semantics,
move libbpf to that as well.

The library-internal libbpf_smp_{load_acquire,store_release} are only
valid for 32-bit words on ARM64.

Also, remove the barriers that are no longer in use.


So what happens if an updated libbpf is paired with an older kernel (or
vice versa)?



"This is fine." ;-) This was briefly discussed in [1], outlined by the
previous commit!

...even on POWER.


Björn

[1] https://lore.kernel.org/bpf/20200316184423.GA14143@willie-the-truck/



-Toke



Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-03-02 Thread Dominique Martinet
Jisheng Zhang wrote on Tue, Mar 02, 2021 at 03:39:40PM +0800:
> > Rather than make an exception for 0, how about just removing the if as
> > follows?
> 
> IMHO, we may need to keep the "if" in current logic. When count
> reaches zero, we need to break the "while(iov_iter_count(to))" loop, so
> removing the "if" would modify the logic.

We're not looking at the same loop, the break will happen properly
without the if because it's the return value of p9_client_read_once()
now.

In the old code I remember what you're saying and it makes sense, I
guess that was the reason for the special case.
It's no longer required, let's remove it.

-- 
Dominique


Re: BUG: soft lockup in ieee80211_tasklet_handler

2021-03-02 Thread syzbot
syzbot has found a reproducer for the following issue on:

HEAD commit:    7a7fd0de Merge branch 'kmap-conversion-for-5.12' of git://..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=14df34ead0
kernel config:  https://syzkaller.appspot.com/x/.config?x=e0da2d01cc636e2c
dashboard link: https://syzkaller.appspot.com/bug?extid=27df43cf7ae73de7d8ee
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=154a476cd0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1152fb82d0

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+27df43cf7ae73de7d...@syzkaller.appspotmail.com

watchdog: BUG: soft lockup - CPU#0 stuck for 123s! [syz-executor290:22312]
Modules linked in:
irq event stamp: 18402725
hardirqs last  enabled at (18402724): [] 
asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:658
hardirqs last disabled at (18402725): [] 
sysvec_apic_timer_interrupt+0xb/0xc0 arch/x86/kernel/apic/apic.c:1100
softirqs last  enabled at (18165196): [] invoke_softirq 
kernel/softirq.c:221 [inline]
softirqs last  enabled at (18165196): [] __irq_exit_rcu 
kernel/softirq.c:422 [inline]
softirqs last  enabled at (18165196): [] 
irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
softirqs last disabled at (18165199): [] invoke_softirq 
kernel/softirq.c:221 [inline]
softirqs last disabled at (18165199): [] __irq_exit_rcu 
kernel/softirq.c:422 [inline]
softirqs last disabled at (18165199): [] 
irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
CPU: 0 PID: 22312 Comm: syz-executor290 Not tainted 5.12.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:write_comp_data kernel/kcov.c:218 [inline]
RIP: 0010:__sanitizer_cov_trace_switch+0x63/0xf0 kernel/kcov.c:320
Code: 4d 8b 10 31 c9 65 4c 8b 24 25 00 f0 01 00 4d 85 d2 74 6b 4c 89 e6 bf 03 
00 00 00 4c 8b 4c 24 20 49 8b 6c c8 10 e8 2d ff ff ff <84> c0 74 47 49 8b 84 24 
b8 14 00 00 41 8b bc 24 b4 14 00 00 48 8b
RSP: 0018:c90078d8 EFLAGS: 0246
RAX:  RBX: 0003 RCX: 0006
RDX:  RSI: 88801c37 RDI: 0003
RBP: 00b0 R08: 8a84bea0 R09: 885fcfcf
R10: 0008 R11: 0080 R12: 88801c37
R13: 0080 R14: 888012b6a450 R15: 
FS:  () GS:8880b9c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 004d0110 CR3: 27282000 CR4: 00350ef0
Call Trace:
 
 ieee80211_rx_h_mgmt net/mac80211/rx.c:3588 [inline]
 ieee80211_rx_handlers+0x89ef/0xae60 net/mac80211/rx.c:3793
 ieee80211_invoke_rx_handlers net/mac80211/rx.c:3823 [inline]
 ieee80211_prepare_and_rx_handle+0x22ad/0x5070 net/mac80211/rx.c:4537
 __ieee80211_rx_handle_packet net/mac80211/rx.c:4635 [inline]
 ieee80211_rx_list+0x930/0x2680 net/mac80211/rx.c:4819
 ieee80211_rx_napi+0xf7/0x3d0 net/mac80211/rx.c:4842
 ieee80211_rx include/net/mac80211.h:4524 [inline]
 ieee80211_tasklet_handler+0xd4/0x130 net/mac80211/main.c:235
 tasklet_action_common.constprop.0+0x1d7/0x2d0 kernel/softirq.c:557
 __do_softirq+0x29b/0x9f6 kernel/softirq.c:345
 invoke_softirq kernel/softirq.c:221 [inline]
 __irq_exit_rcu kernel/softirq.c:422 [inline]
 irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1100
 
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
RIP: 0010:mm_update_next_owner+0x432/0x7a0 kernel/exit.c:388
Code: 8d ad b0 fb ff ff 48 81 fd 50 c8 cb 8b 0f 84 65 01 00 00 e8 90 e6 2e 00 
48 8d bd dc fb ff ff 48 89 f8 48 c1 e8 03 0f b6 14 18 <48> 89 f8 83 e0 07 83 c0 
03 38 d0 7c 08 84 d2 0f 85 b5 02 00 00 44
RSP: 0018:c9000ab77b18 EFLAGS: 0217
RAX: 1110041046f5 RBX: dc00 RCX: 
RDX:  RSI: 814470e0 RDI: 8880208237ac
RBP: 888020823bd0 R08:  R09: 8bc0a083
R10: 8144711f R11: 0001 R12: 888018b0
R13: 888020823780 R14: 0020 R15: 888011520010
 exit_mm kernel/exit.c:500 [inline]
 do_exit+0xb02/0x2a60 kernel/exit.c:812
 do_group_exit+0x125/0x310 kernel/exit.c:922
 get_signal+0x42c/0x2100 kernel/signal.c:2773
 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:208
 __syscall_exit_to_user_mode_work kernel/entry/common.c:290 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x453dd9
Code: Unable to access opcode bytes at RIP 0x453daf.
RSP: 002b:7fcbbf2d5218 EFLAGS: 0246 ORIG_RAX: 00ca
RAX: fe00 RBX: 004d8268 RCX: 00453dd9
RDX: 00

Re: [Intel-wired-lan] [PATCH net 1/2] e1000e: Fix duplicate include guard

2021-03-02 Thread Neftin, Sasha

On 2/22/2021 06:00, Tom Seewald wrote:

The include guard "_E1000_HW_H_" is used by header files in three
different drivers (e1000/e1000_hw.h, e1000e/hw.h, and igb/e1000_hw.h).
Using the same include guard macro in more than one header file may
cause unexpected behavior from the compiler. Fix the duplicate include
guard in the e1000e driver by renaming it.

Fixes: bc7f75fa9788 ("[E1000E]: New pci-express e1000 driver (currently for ICH9 devices only)")
Signed-off-by: Tom Seewald 
---
  drivers/net/ethernet/intel/e1000e/hw.h | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/hw.h 
b/drivers/net/ethernet/intel/e1000e/hw.h
index 69a2329ea463..db79c4e6413e 100644
--- a/drivers/net/ethernet/intel/e1000e/hw.h
+++ b/drivers/net/ethernet/intel/e1000e/hw.h
@@ -1,8 +1,8 @@
  /* SPDX-License-Identifier: GPL-2.0 */
  /* Copyright(c) 1999 - 2018 Intel Corporation. */
  
-#ifndef _E1000_HW_H_
-#define _E1000_HW_H_
+#ifndef _E1000E_HW_H_
+#define _E1000E_HW_H_
  
  #include "regs.h"

  #include "defines.h"
@@ -714,4 +714,4 @@ struct e1000_hw {
  #include "80003es2lan.h"
  #include "ich8lan.h"
  
-#endif
+#endif /* _E1000E_HW_H_ */


Acked-by: Sasha Neftin 


Re: [PATCH] net: ethernet: mtk-star-emac: fix wrong unmap in RX handling

2021-03-02 Thread Bartosz Golaszewski
On Tue, Mar 2, 2021 at 4:33 AM Biao Huang  wrote:
>
> mtk_star_dma_unmap_rx() should unmap the dma_addr of old skb rather than
> that of new skb.
> Assign new_dma_addr to desc_data.dma_addr after all handling of old skb
> ends to avoid unexpected receive side error.
>
> Fixes: f96e9641e92b ("net: ethernet: mtk-star-emac: fix error path in RX 
> handling")
> Signed-off-by: Biao Huang 
> ---
>  drivers/net/ethernet/mediatek/mtk_star_emac.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mediatek/mtk_star_emac.c 
> b/drivers/net/ethernet/mediatek/mtk_star_emac.c
> index a8641a407c06..96d2891f1675 100644
> --- a/drivers/net/ethernet/mediatek/mtk_star_emac.c
> +++ b/drivers/net/ethernet/mediatek/mtk_star_emac.c
> @@ -1225,8 +1225,6 @@ static int mtk_star_receive_packet(struct mtk_star_priv 
> *priv)
> goto push_new_skb;
> }
>
> -   desc_data.dma_addr = new_dma_addr;
> -
> /* We can't fail anymore at this point: it's safe to unmap the skb. */
> mtk_star_dma_unmap_rx(priv, &desc_data);
>
> @@ -1236,6 +1234,9 @@ static int mtk_star_receive_packet(struct mtk_star_priv 
> *priv)
> desc_data.skb->dev = ndev;
> netif_receive_skb(desc_data.skb);
>
> +   /* update dma_addr for new skb */
> +   desc_data.dma_addr = new_dma_addr;
> +
>  push_new_skb:
> desc_data.len = skb_tailroom(new_skb);
> desc_data.skb = new_skb;
> --
> 2.18.0
>

Thanks for spotting that. Maybe also update the comment so that it
says: "it's safe to unmap the old skb"? Would make the thing clearer
IMO.

In any case:

Acked-by: Bartosz Golaszewski 

Bartosz
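
The ordering issue in the patch can be shown with a small user-space sketch. Everything here is a hypothetical stand-in (struct rx_desc, fake_unmap_rx() and the two receive_* functions are not the driver's code): the "unmap" helper only records which address it was asked to release, so the two orderings can be compared.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver's per-descriptor data. */
struct rx_desc {
	uint64_t dma_addr;	/* DMA address of the buffer in the descriptor */
	int skb_id;		/* placeholder for the skb pointer */
};

/* Records the address passed to the fake unmap helper. */
static uint64_t last_unmapped;

static void fake_unmap_rx(const struct rx_desc *d)
{
	last_unmapped = d->dma_addr;
}

/* Pre-patch ordering: dma_addr is overwritten before the unmap. */
uint64_t receive_overwrite_first(struct rx_desc *d, uint64_t new_dma_addr)
{
	d->dma_addr = new_dma_addr;
	fake_unmap_rx(d);	/* releases the NEW mapping by mistake */
	return last_unmapped;
}

/* Patched ordering: unmap the old buffer, then record the new address. */
uint64_t receive_unmap_first(struct rx_desc *d, uint64_t new_dma_addr)
{
	fake_unmap_rx(d);	/* releases the OLD mapping, as intended */
	d->dma_addr = new_dma_addr;
	return last_unmapped;
}
```

With an old address of 0x1000 and a new one of 0x2000, the first variant reports 0x2000 as unmapped and the second reports 0x1000, which matches the behaviour the fix is after.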


Re: [PATCH] netdevsim: init u64 stats for 32bit hardware

2021-03-02 Thread Dmitry Vyukov
On Fri, Jan 29, 2021 at 6:36 AM Hillf Danton  wrote:
>
> On 29 Jan 2021 2:58:22 Jakub Kicinski wrote:
> >On Thu, 28 Jan 2021 10:43:16 +0800 Hillf Danton wrote:
> >> Init the u64 stats in order to avoid the lockdep prints on the 32bit
> >> hardware like
> >
> >Thanks for the fix!
>
> Hi Jakub,
>
> >Unless it's my poor eyesight I think this didn't get into patchwork:
> >
> >https://patchwork.kernel.org/project/netdevbpf/list/
>
> You are right.
> And the reason is that my inbox never survived certain check
> at @vger.kernel.org.
>
> Hillf

Hi,

What happened with this patch?
I hoped this would get at least into 5.12. syzbot can't start testing
arm32 because of this.


Re: [PATCH] iwlwifi: fix ARCH=i386 compilation warnings

2021-03-02 Thread Kalle Valo
"Coelho, Luciano"  writes:

> On Tue, 2021-03-02 at 07:58 +0200, Kalle Valo wrote:
>> Pierre-Louis Bossart  writes:
>> 
>> > An unsigned long variable should rely on '%lu' format strings, not '%zd'
>> > 
>> > Fixes: a1a6a4cf49ece ("iwlwifi: pnvm: implement reading PNVM from UEFI")
>> > Signed-off-by: Pierre-Louis Bossart 
>> > ---
>> > warnings found with v5.12-rc1 and next-20210301
>> 
>> Luca, can I take this to wireless-drivers?
>
> Yes, please.
>
> Acked-by: Luca Coelho 

Thanks. I don't see this in patchwork yet (I guess vger is slow again)
so I cannot assign it to me at the moment, will do it later.

-- 
https://patchwork.kernel.org/project/linux-wireless/list/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
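
As a user-space illustration of the format-string rule in the patch description: the specifier has to match the argument type, not just its width. On i386 size_t is unsigned int, so passing an unsigned long to a 'z' conversion triggers a -Wformat warning even though both types are 32 bits wide there. The two helper names below are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* unsigned long pairs with %lu on every target. */
int format_ulong(char *buf, size_t buflen, unsigned long len)
{
	return snprintf(buf, buflen, "len=%lu", len);
}

/* size_t pairs with %zu (and ssize_t with %zd). */
int format_size(char *buf, size_t buflen, size_t len)
{
	return snprintf(buf, buflen, "len=%zu", len);
}
```

Swapping the specifiers compiles cleanly on x86-64, where both types are 64-bit unsigned long, which is one reason such warnings tend to show up only in ARCH=i386 builds.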


[PATCH net 1/1] stmmac: intel: Fix mdio bus registration issue for TGL-H/ADL-S

2021-03-02 Thread Wong Vee Khee
On Intel platforms with two Ethernet controllers, such as TGL-H and
ADL-S, each controller needs a unique MDIO bus id for its MDIO bus to
be registered successfully:

[   13.076133] sysfs: cannot create duplicate filename 
'/class/mdio_bus/stmmac-1'
[   13.083404] CPU: 8 PID: 1898 Comm: systemd-udevd Tainted: G U
5.11.0-net-next #106
[   13.092410] Hardware name: Intel Corporation Alder Lake Client 
Platform/AlderLake-S ADP-S DRR4 CRB, BIOS ADLIFSI1.R00.1494.B00.2012031421 
12/03/2020
[   13.105709] Call Trace:
[   13.108176]  dump_stack+0x64/0x7c
[   13.111553]  sysfs_warn_dup+0x56/0x70
[   13.115273]  sysfs_do_create_link_sd.isra.2+0xbd/0xd0
[   13.120371]  device_add+0x4df/0x840
[   13.123917]  ? complete_all+0x2a/0x40
[   13.127636]  __mdiobus_register+0x98/0x310 [libphy]
[   13.132572]  stmmac_mdio_register+0x1c5/0x3f0 [stmmac]
[   13.137771]  ? stmmac_napi_add+0xa5/0xf0 [stmmac]
[   13.142493]  stmmac_dvr_probe+0x806/0xee0 [stmmac]
[   13.147341]  intel_eth_pci_probe+0x1cb/0x250 [dwmac_intel]
[   13.152884]  pci_device_probe+0xd2/0x150
[   13.156897]  really_probe+0xf7/0x4d0
[   13.160527]  driver_probe_device+0x5d/0x140
[   13.164761]  device_driver_attach+0x4f/0x60
[   13.168996]  __driver_attach+0xa2/0x140
[   13.172891]  ? device_driver_attach+0x60/0x60
[   13.177300]  bus_for_each_dev+0x76/0xc0
[   13.181188]  bus_add_driver+0x189/0x230
[   13.185083]  ? 0xc0795000
[   13.188446]  driver_register+0x5b/0xf0
[   13.192249]  ? 0xc0795000
[   13.195577]  do_one_initcall+0x4d/0x210
[   13.199467]  ? kmem_cache_alloc_trace+0x2ff/0x490
[   13.204228]  do_init_module+0x5b/0x21c
[   13.208031]  load_module+0x2a0c/0x2de0
[   13.211838]  ? __do_sys_finit_module+0xb1/0x110
[   13.216420]  __do_sys_finit_module+0xb1/0x110
[   13.220825]  do_syscall_64+0x33/0x40
[   13.224451]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   13.229515] RIP: 0033:0x7fc2b1919ccd
[   13.233113] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 93 31 0c 00 f7 d8 64 89 01 48
[   13.251912] RSP: 002b:7ffcea2e5b98 EFLAGS: 0246 ORIG_RAX: 
0139
[   13.259527] RAX: ffda RBX: 560558920f10 RCX: 7fc2b1919ccd
[   13.266706] RDX:  RSI: 7fc2b1a881e3 RDI: 0012
[   13.273887] RBP: 0002 R08:  R09: 
[   13.281036] R10: 0012 R11: 0246 R12: 7fc2b1a881e3
[   13.288183] R13:  R14:  R15: 7ffcea2e5d58
[   13.295389] libphy: mii_bus stmmac-1 failed to register

Fixes: 88af9bd4efbd ("stmmac: intel: Add ADL-S 1Gbps PCI IDs")
Fixes: 8450e23f142f ("stmmac: intel: Add PCI IDs for TGL-H platform")
Signed-off-by: Wong Vee Khee 
---
 .../net/ethernet/stmicro/stmmac/dwmac-intel.c | 54 ++-
 1 file changed, 41 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
index 74b14d647619..e6eaf378e8e7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
@@ -462,8 +462,8 @@ static int tgl_common_data(struct pci_dev *pdev,
return intel_mgbe_common_data(pdev, plat);
 }
 
-static int tgl_sgmii_data(struct pci_dev *pdev,
- struct plat_stmmacenet_data *plat)
+static int tgl_sgmii_phy0_data(struct pci_dev *pdev,
+  struct plat_stmmacenet_data *plat)
 {
plat->bus_id = 1;
plat->phy_interface = PHY_INTERFACE_MODE_SGMII;
@@ -472,12 +472,26 @@ static int tgl_sgmii_data(struct pci_dev *pdev,
return tgl_common_data(pdev, plat);
 }
 
-static struct stmmac_pci_info tgl_sgmii1g_info = {
-   .setup = tgl_sgmii_data,
+static struct stmmac_pci_info tgl_sgmii1g_phy0_info = {
+   .setup = tgl_sgmii_phy0_data,
 };
 
-static int adls_sgmii_data(struct pci_dev *pdev,
-  struct plat_stmmacenet_data *plat)
+static int tgl_sgmii_phy1_data(struct pci_dev *pdev,
+  struct plat_stmmacenet_data *plat)
+{
+   plat->bus_id = 2;
+   plat->phy_interface = PHY_INTERFACE_MODE_SGMII;
+   plat->serdes_powerup = intel_serdes_powerup;
+   plat->serdes_powerdown = intel_serdes_powerdown;
+   return tgl_common_data(pdev, plat);
+}
+
+static struct stmmac_pci_info tgl_sgmii1g_phy1_info = {
+   .setup = tgl_sgmii_phy1_data,
+};
+
+static int adls_sgmii_phy0_data(struct pci_dev *pdev,
+   struct plat_stmmacenet_data *plat)
 {
plat->bus_id = 1;
plat->phy_interface = PHY_INTERFACE_MODE_SGMII;
@@ -487,10 +501,24 @@ static int adls_sgmii_data(struct pci_dev *pdev,
return tgl_common_data(pdev, plat);
 }
 
-static struct stmmac_pci_info adls_sgmii1g_info = {
-   .setup = adls_sgmii_data,
+static struct stmmac
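
The collision in the log above can be sketched in user space. This assumes the bus name is built as "<prefix>-<bus_id>", which matches the 'stmmac-1' name in the sysfs warning; the exact format string and the helper names make_bus_name() and bus_names_collide() are assumptions made for illustration, not the stmmac code.

```c
#include <stdio.h>
#include <string.h>

#define BUS_NAME_LEN 32	/* hypothetical buffer size for this sketch */

/* Build a bus name from a prefix and a numeric bus id (assumed format). */
void make_bus_name(char *out, size_t len, const char *prefix, int bus_id)
{
	snprintf(out, len, "%s-%x", prefix, (unsigned int)bus_id);
}

/* Return 1 if two controllers with these bus ids would get the same name. */
int bus_names_collide(int bus_id_a, int bus_id_b)
{
	char a[BUS_NAME_LEN], b[BUS_NAME_LEN];

	make_bus_name(a, sizeof(a), "stmmac", bus_id_a);
	make_bus_name(b, sizeof(b), "stmmac", bus_id_b);
	return strcmp(a, b) == 0;
}
```

If both controllers are set up with bus_id = 1 they end up with the same name, and the second registration fails as in the trace. Giving the second controller bus_id = 2, as tgl_sgmii_phy1_data() does in the patch, avoids the duplicate.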

Re: [net 01/15] net/mlx5e: E-switch, Fix rate calculation for overflow

2021-03-02 Thread Arnd Bergmann
On Tue, Mar 2, 2021 at 1:52 AM Saeed Mahameed  wrote:
> On Sat, 2021-02-27 at 13:14 +0100, Arnd Bergmann wrote:
> > On Fri, Feb 12, 2021 at 3:59 AM Saeed Mahameed 
> > wrote:
> > >
> > > From: Parav Pandit 
> > >
> > > rate_bytes_ps is a 64-bit field. It is passed as a 32-bit field to
> > > apply_police_params(). Due to this, when the police rate is higher
> > > than 4Gbps, the 32-bit calculation ignores the carry. This results
> > > in an incorrect rate configuration in the device.
> > >
> > > Fix it by performing 64-bit calculation.
> >
> > I just stumbled over this commit while looking at an unrelated
> > problem.
> >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > > b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > > index dd0bfbacad47..717fbaa6ce73 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> > > @@ -5040,7 +5040,7 @@ static int apply_police_params(struct
> > > mlx5e_priv *priv, u64 rate,
> > >  */
> > > if (rate) {
> > > rate = (rate * BITS_PER_BYTE) + 50;
> > > -   rate_mbps = max_t(u32, do_div(rate, 100), 1);
> > > +   rate_mbps = max_t(u64, do_div(rate, 100), 1);
> >
> > I think there are still multiple issues with this line:
> >
> > - Before commit 1fe3e3166b35 ("net/mlx5e: E-switch, Fix rate
> > calculation for
> >   overflow"), it was trying to calculate rate divided by 100, but
> > now
> >   it uses the remainder of the division rather than the quotient. I
> > assume
> >   this was meant to use div_u64() instead of do_div().
> >
>
> Yes, I already have a patch lined up to fix this issue.

ok

> > - Both div_u64() and do_div() return a 32-bit number, and '1' is a
> > constant
> >   that also comfortably fits into a 32-bit number, so changing the
> > max_t
> >   to return a 64-bit type has no effect on the result
> >
>
> as of the above comment, we shouldn't be using the return value of
> do_div().

Ok, I was confused here because do_div() returns a 32-bit type,
and is called by div_u64(). Of course that was nonsense because
do_div() returns the 32-bit remainder, while the division result
remains 64-bit.

> > - The maximum of an arbitrary unsigned integer and '1' is either one
> > or zero,
> >so there doesn't need to be an expensive division here at all.
> > From the
> >comment it sounds like the intention was to use 'min_t()' instead
> > of 'max_t()'.
> >It has however used 'max_t' since the code was first introduced.
> >
>
> if the input rate is less than 1 Mbps then the quotient will be 0,
> otherwise we want the quotient, and we don't allow 0, so max_t(rate, 1)
> should be used. What am I missing?

And I have no idea what I was thinking here, of course you are right
and there is no other bug.

   Arnd
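
The remainder-versus-quotient point can be checked with a user-space sketch. The helpers fake_do_div() and fake_div_u64() are hypothetical stand-ins that only mimic the calling conventions discussed above (do_div() divides in place and returns the remainder, div_u64() returns the quotient). The constants 500000 and 1000000 are an assumption for a rounded bits-per-second to Mbps conversion and are not copied from the quoted diff.

```c
#include <stdint.h>

#define BITS_PER_BYTE 8
#define MBPS_DIVISOR 1000000u	/* assumed: bits per second -> Mbps */
#define MBPS_ROUNDING 500000u	/* assumed: half the divisor, for rounding */

/* Mimics do_div(): divides *n in place and returns the remainder. */
static uint32_t fake_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Mimics div_u64(): returns the 64-bit quotient. */
static uint64_t fake_div_u64(uint64_t n, uint32_t base)
{
	return n / base;
}

/* Uses the return value of the do_div-style helper, i.e. the remainder. */
uint64_t rate_mbps_remainder(uint64_t rate_bytes_ps)
{
	uint64_t rate = rate_bytes_ps * BITS_PER_BYTE + MBPS_ROUNDING;
	uint64_t rem = fake_do_div(&rate, MBPS_DIVISOR);

	return rem > 1 ? rem : 1;
}

/* Uses the quotient and clamps it to a minimum of 1, like max_t(..., 1). */
uint64_t rate_mbps_quotient(uint64_t rate_bytes_ps)
{
	uint64_t q = fake_div_u64(rate_bytes_ps * BITS_PER_BYTE + MBPS_ROUNDING,
				  MBPS_DIVISOR);

	return q > 1 ? q : 1;
}
```

For 125000000 bytes per second (1 Gbps) the quotient variant yields 1000, while the remainder variant yields 500000, which is the rounding term rather than a rate. Inputs below 1 Mbps give a quotient of 0, which the clamp turns into 1.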


Re: [PATCH bpf-next 2/2] libbpf, xsk: add libbpf_smp_store_release libbpf_smp_load_acquire

2021-03-02 Thread Daniel Borkmann

On 3/2/21 9:05 AM, Björn Töpel wrote:

On 2021-03-01 17:10, Toke Høiland-Jørgensen wrote:

Björn Töpel  writes:

From: Björn Töpel 

Now that the AF_XDP rings have load-acquire/store-release semantics,
move libbpf to that as well.

The library-internal libbpf_smp_{load_acquire,store_release} are only
valid for 32-bit words on ARM64.

Also, remove the barriers that are no longer in use.


So what happens if an updated libbpf is paired with an older kernel (or
vice versa)?


"This is fine." ;-) This was briefly discussed in [1], outlined by the
previous commit!

...even on POWER.


Could you put a summary or quote of that discussion on 'why it is okay and
does not cause /forward or backward/ compat issues with user space' directly
into patch 1's commit message?

I feel just referring to a link is probably less suitable in this case as it
should rather be part of the commit message that contains the justification
on why it is waterproof - at least it feels that specific area may be a bit
under-documented, so having it as direct part certainly doesn't hurt.

Would also be great to get Will's ACK on that when you have a v2. :)

Thanks,
Daniel


[1] https://lore.kernel.org/bpf/20200316184423.GA14143@willie-the-truck/


[PATCH] vhost-vdpa: honor CAP_IPC_LOCK

2021-03-02 Thread Jason Wang
When CAP_IPC_LOCK is set, we should not check locked memory against
the rlimit, matching what mlock() already does.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index ef688c8c0e0e..e93572e2e344 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -638,7 +638,8 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
mmap_read_lock(dev->mm);
 
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
+   if (!capable(CAP_IPC_LOCK) &&
+   (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit)) {
ret = -ENOMEM;
goto unlock;
}
-- 
2.18.1
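
The condition introduced by the patch can be expressed as a small pure function, which makes the three cases easy to check. This is a user-space sketch: check_pin_limit() and its parameters are hypothetical, with has_cap_ipc_lock standing in for capable(CAP_IPC_LOCK) and pinned for the current pinned_vm count.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns 0 if pinning npages more pages is allowed, -ENOMEM otherwise.
 * A caller holding the capability bypasses the limit check entirely.
 */
int check_pin_limit(bool has_cap_ipc_lock, uint64_t npages,
		    uint64_t pinned, uint64_t lock_limit)
{
	if (!has_cap_ipc_lock && npages + pinned > lock_limit)
		return -ENOMEM;
	return 0;
}
```

With a limit of 100 pages and 95 already pinned, a request for 10 more fails without the capability and succeeds with it, while a request for 5 succeeds either way.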



Re: [PATCH bpf-next 2/2] libbpf, xsk: add libbpf_smp_store_release libbpf_smp_load_acquire

2021-03-02 Thread Björn Töpel

On 2021-03-02 10:13, Daniel Borkmann wrote:

On 3/2/21 9:05 AM, Björn Töpel wrote:

On 2021-03-01 17:10, Toke Høiland-Jørgensen wrote:

Björn Töpel  writes:

From: Björn Töpel 

Now that the AF_XDP rings have load-acquire/store-release semantics,
move libbpf to that as well.

The library-internal libbpf_smp_{load_acquire,store_release} are only
valid for 32-bit words on ARM64.

Also, remove the barriers that are no longer in use.


So what happens if an updated libbpf is paired with an older kernel (or
vice versa)?


"This is fine." ;-) This was briefly discussed in [1], outlined by the
previous commit!

...even on POWER.


Could you put a summary or quote of that discussion on 'why it is okay and
does not cause /forward or backward/ compat issues with user space' directly
into patch 1's commit message?

I feel just referring to a link is probably less suitable in this case as it
should rather be part of the commit message that contains the justification
on why it is waterproof - at least it feels that specific area may be a bit
under-documented, so having it as direct part certainly doesn't hurt.



I agree; It's enough in the weed as it is already.

I wonder if it's possible to cook a LKMM litmus test for this...?



Would also be great to get Will's ACK on that when you have a v2. :)



Yup! :-)


Björn



Thanks,
Daniel


[1] https://lore.kernel.org/bpf/20200316184423.GA14143@willie-the-truck/


[PATCH v2] net: 9p: advance iov on empty read

2021-03-02 Thread Jisheng Zhang
I hit the warning below when running cat on a small (about 80 bytes) text
file on 9pfs (with msize=2097152 passed as a 9p mount option). The reason
is that we miss iov_iter_advance() if the read count is 0 in the zerocopy
case, so the pipe is not truncated and iov_iter_pipe() then thinks the pipe
is full. Fix it by removing the exception for 0, so that iov_iter_advance()
is called even on an empty read in the zerocopy case.

[8.279568] WARNING: CPU: 0 PID: 39 at lib/iov_iter.c:1203 
iov_iter_pipe+0x31/0x40
[8.280028] Modules linked in:
[8.280561] CPU: 0 PID: 39 Comm: cat Not tainted 5.11.0+ #6
[8.281260] RIP: 0010:iov_iter_pipe+0x31/0x40
[8.281974] Code: 2b 42 54 39 42 5c 76 22 c7 07 20 00 00 00 48 89 57 18 8b 
42 50 48 c7 47 08 b
[8.283169] RSP: 0018:888000cbbd80 EFLAGS: 0246
[8.283512] RAX: 0010 RBX: 888000117d00 RCX: 
[8.283876] RDX: 88800031d600 RSI:  RDI: 888000cbbd90
[8.284244] RBP: 888000cbbe38 R08:  R09: 8880008d2058
[8.284605] R10: 0002 R11: 888000375510 R12: 0050
[8.284964] R13: 888000cbbe80 R14: 0050 R15: 88800031d600
[8.285439] FS:  7f24fd8af600() GS:88803ec0() 
knlGS:
[8.285844] CS:  0010 DS:  ES:  CR0: 80050033
[8.286150] CR2: 7f24fd7d7b90 CR3: 00c97000 CR4: 000406b0
[8.286710] Call Trace:
[8.288279]  generic_file_splice_read+0x31/0x1a0
[8.289273]  ? do_splice_to+0x2f/0x90
[8.289511]  splice_direct_to_actor+0xcc/0x220
[8.289788]  ? pipe_to_sendpage+0xa0/0xa0
[8.290052]  do_splice_direct+0x8b/0xd0
[8.290314]  do_sendfile+0x1ad/0x470
[8.290576]  do_syscall_64+0x2d/0x40
[8.290818]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[8.291409] RIP: 0033:0x7f24fd7dca0a
[8.292511] Code: c3 0f 1f 80 00 00 00 00 4c 89 d2 4c 89 c6 e9 bd fd ff ff 
0f 1f 44 00 00 31 8
[8.293360] RSP: 002b:7ffc20932818 EFLAGS: 0206 ORIG_RAX: 
0028
[8.293800] RAX: ffda RBX: 0100 RCX: 7f24fd7dca0a
[8.294153] RDX:  RSI: 0003 RDI: 0001
[8.294504] RBP: 0003 R08:  R09: 
[8.294867] R10: 0100 R11: 0206 R12: 0003
[8.295217] R13: 0001 R14: 0001 R15: 
[8.295782] ---[ end trace 63317af81b3ca24b ]---

Signed-off-by: Jisheng Zhang 
---
Since v1:
 - reword the commit msg
 - fix the issue by removing the exception for the 0 code path, thanks Dominique!

 net/9p/client.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index 4f62f299da0c..0a9019da18f3 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1623,10 +1623,6 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, 
struct iov_iter *to,
}
 
p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
-   if (!count) {
-   p9_tag_remove(clnt, req);
-   return 0;
-   }
 
if (non_zc) {
int n = copy_to_iter(dataptr, count, to);
-- 
2.30.1



Re: [PATCH] net: 9p: free what was emitted when read count is 0

2021-03-02 Thread Jisheng Zhang
On Tue, 2 Mar 2021 17:08:13 +0900 Dominique Martinet wrote:

> Jisheng Zhang wrote on Tue, Mar 02, 2021 at 03:39:40PM +0800:
> > > Rather than make an exception for 0, how about just removing the if as
> > > follows?
> >
> > IMHO, we may need to keep the "if" in current logic. When count
> > reaches zero, we need to break the "while(iov_iter_count(to))" loop, so
> > removing the "if" would modify the logic.
> 
> We're not looking at the same loop, the break will happen properly

I was reading the old code because I switched to linux-5.4 longterm tree
for other development ;)

> without the if because it's the return value of p9_client_read_once()
> now.
> 
> In the old code I remember what you're saying and it makes sense, I
> guess that was the reason for the special case.
> It's no longer required, let's remove it.

Thank you. patch v2 is sent out.


Re: [PATCH bpf-next 2/2] libbpf, xsk: add libbpf_smp_store_release libbpf_smp_load_acquire

2021-03-02 Thread Daniel Borkmann

On 3/2/21 10:16 AM, Björn Töpel wrote:

On 2021-03-02 10:13, Daniel Borkmann wrote:

On 3/2/21 9:05 AM, Björn Töpel wrote:

On 2021-03-01 17:10, Toke Høiland-Jørgensen wrote:

Björn Töpel  writes:

From: Björn Töpel 

Now that the AF_XDP rings have load-acquire/store-release semantics,
move libbpf to that as well.

The library-internal libbpf_smp_{load_acquire,store_release} are only
valid for 32-bit words on ARM64.

Also, remove the barriers that are no longer in use.


So what happens if an updated libbpf is paired with an older kernel (or
vice versa)?


"This is fine." ;-) This was briefly discussed in [1], outlined by the
previous commit!

...even on POWER.


Could you put a summary or quote of that discussion on 'why it is okay and
does not cause /forward or backward/ compat issues with user space' directly
into patch 1's commit message?

I feel just referring to a link is probably less suitable in this case as it
should rather be part of the commit message that contains the justification
on why it is waterproof - at least it feels that specific area may be a bit
under-documented, so having it as direct part certainly doesn't hurt.


I agree; It's enough in the weed as it is already.

I wonder if it's possible to cook a LKMM litmus test for this...?


That would be amazing! :-)

(Another option which can be done independently could be to update [0] with
 outlining a pairing scenario as we have here describing the forward/backward
 compatibility on the barriers used, I think that would be quite useful as
 well.)

  [0] Documentation/memory-barriers.txt


Would also be great to get Will's ACK on that when you have a v2. :)


Yup! :-)


Björn



Thanks,
Daniel


[1] https://lore.kernel.org/bpf/20200316184423.GA14143@willie-the-truck/




KMSAN: uninit-value in ieee802154_hdr_push

2021-03-02 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:    29ad81a1 arch/x86: add missing include to sparsemem.h
git tree:   https://github.com/google/kmsan.git master
console output: https://syzkaller.appspot.com/x/log.txt?x=1756eff2d0
kernel config:  https://syzkaller.appspot.com/x/.config?x=c8e3b38ca92283e
dashboard link: https://syzkaller.appspot.com/bug?extid=4f6e279a71100e94ae65
compiler:   Debian clang version 11.0.1-2
userspace arch: i386

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+4f6e279a71100e94a...@syzkaller.appspotmail.com

=
BUG: KMSAN: uninit-value in ieee802154_hdr_push_sechdr 
net/ieee802154/header_ops.c:54 [inline]
BUG: KMSAN: uninit-value in ieee802154_hdr_push+0xd68/0xdd0 
net/ieee802154/header_ops.c:108
CPU: 1 PID: 15015 Comm: syz-executor.3 Not tainted 5.11.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x21c/0x280 lib/dump_stack.c:120
 kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x5f/0xa0 mm/kmsan/kmsan_instr.c:197
 ieee802154_hdr_push_sechdr net/ieee802154/header_ops.c:54 [inline]
 ieee802154_hdr_push+0xd68/0xdd0 net/ieee802154/header_ops.c:108
 ieee802154_header_create+0xd07/0x1070 net/mac802154/iface.c:404
 wpan_dev_hard_header include/net/cfg802154.h:374 [inline]
 dgram_sendmsg+0xf48/0x15c0 net/ieee802154/socket.c:670
 ieee802154_sock_sendmsg+0xec/0x130 net/ieee802154/socket.c:97
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 sys_sendmsg+0xcfc/0x12f0 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x714/0x830 net/socket.c:2432
 __compat_sys_sendmsg net/compat.c:347 [inline]
 __do_compat_sys_sendmsg net/compat.c:354 [inline]
 __se_compat_sys_sendmsg+0xa7/0xc0 net/compat.c:351
 __ia32_compat_sys_sendmsg+0x4a/0x70 net/compat.c:351
 do_syscall_32_irqs_on arch/x86/entry/common.c:79 [inline]
 __do_fast_syscall_32+0x102/0x160 arch/x86/entry/common.c:141
 do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:166
 do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:209
 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
RIP: 0023:0xf7f65549
Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 
03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 
8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
RSP: 002b:f555f5fc EFLAGS: 0296 ORIG_RAX: 0172
RAX: ffda RBX: 0005 RCX: 23c0
RDX:  RSI:  RDI: 
RBP:  R08:  R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 

Local variable hdr@ieee802154_header_create created at:
 ieee802154_header_create+0xc9/0x1070 net/mac802154/iface.c:368
 ieee802154_header_create+0xc9/0x1070 net/mac802154/iface.c:368
=


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


Re: [PATCH ipsec 0/2] vti(6): fix ipv4 pmtu check to honor ip header df

2021-03-02 Thread Sabrina Dubroca
2021-02-26, 23:35:04 +0200, Eyal Birger wrote:
> This series aligns vti(6) handling of non-df IPv4 packets exceeding
> the size of the tunnel MTU to avoid sending "Frag needed" and instead
> fragment the packets after encapsulation.
> 
> Eyal Birger (2):
>   vti: fix ipv4 pmtu check to honor ip header df
>   vti6: fix ipv4 pmtu check to honor ip header df

Thanks Eyal.
Reviewed-by: Sabrina Dubroca 

Steffen, that's going to conflict with commit 4372339efc06 ("net:
always use icmp{,v6}_ndo_send from ndo_start_xmit") from net.
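
For readers following the thread, the behaviour described in the cover letter
can be sketched roughly as below. This is only an illustration of the decision
the series describes, not the kernel code: the names pmtu_decide, pmtu_action
and EX_IP_DF are hypothetical, and the flag is assumed to be passed in host
byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: Don't Fragment bit of the IPv4 frag_off field,
 * assumed here to be already converted to host byte order. */
#define EX_IP_DF 0x4000

/* Hypothetical outcome of the PMTU check on the tunnel transmit path. */
enum pmtu_action {
	PMTU_SEND,                 /* packet fits, transmit as-is */
	PMTU_FRAG_NEEDED,          /* DF set and too big: report "Frag needed" */
	PMTU_FRAGMENT_AFTER_ENCAP, /* DF clear and too big: fragment after encapsulation */
};

/* Decide what to do with an IPv4 packet of pkt_len bytes given the tunnel MTU. */
static enum pmtu_action pmtu_decide(size_t pkt_len, size_t mtu, uint16_t frag_off)
{
	if (pkt_len <= mtu)
		return PMTU_SEND;

	/* Only packets that carry DF should trigger the ICMP error. */
	if (frag_off & EX_IP_DF)
		return PMTU_FRAG_NEEDED;

	/* Non-DF packets are left to be fragmented after encapsulation. */
	return PMTU_FRAGMENT_AFTER_ENCAP;
}
```

With this sketch, an oversized packet with DF clear yields
PMTU_FRAGMENT_AFTER_ENCAP rather than PMTU_FRAG_NEEDED, which is the change in
behaviour the series describes for non-DF packets.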

-- 
Sabrina



Triggering WARN in net/wireless/nl80211.c

2021-03-02 Thread Christian Brauner
Hey everyone,

I get the following WARN triggered in net/wireless/nl80211.c during boot
on v5.12-rc1:

[   36.749643] [ cut here ]
[   36.749645] WARNING: CPU: 7 PID: 829 at net/wireless/nl80211.c:7746 nl80211_get_reg_do+0x215/0x250 [cfg80211]
[   36.749683] Modules linked in: bnep ccm algif_aead des_generic libdes arc4 
algif_skcipher cmac md4 algif_hash af_alg typec_displayport binfmt_misc 
snd_hda_codec_hdmi nls_iso8859_1 joydev mei_hdcp intel_rapl_msr 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm 
snd_hda_codec_realtek snd_hda_codec_generic rapl iwlmvm intel_cstate mac80211 
snd_hda_intel snd_intel_dspcfg input_leds snd_seq_midi snd_hda_codec rmi_smbus 
libarc4 snd_seq_midi_event rmi_core serio_raw snd_hda_core uvcvideo snd_rawmidi 
efi_pstore iwlwifi snd_hwdep intel_wmi_thunderbolt videobuf2_vmalloc snd_pcm 
wmi_bmof btusb videobuf2_memops thinkpad_acpi videobuf2_v4l2 btrtl 
videobuf2_common processor_thermal_device btbcm processor_thermal_rfim nvram 
snd_seq btintel platform_profile videodev processor_thermal_mbox ucsi_acpi 
bluetooth ledtrig_audio processor_thermal_rapl mei_me snd_seq_device 
intel_rapl_common typec_ucsi mc ecdh_generic cfg80211 ecc intel_pch_thermal mei 
intel_soc_dts_iosf intel_xhci_usb_role_switch
[   36.749713]  typec snd_timer snd soundcore int3403_thermal 
int340x_thermal_zone acpi_pad mac_hid int3400_thermal acpi_thermal_rel 
sch_fq_codel pkcs8_key_parser ip_tables x_tables autofs4 btrfs blake2b_generic 
xor zstd_compress raid6_pq libcrc32c dm_crypt uas usb_storage i915 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i2c_algo_bit aesni_intel 
crypto_simd drm_kms_helper cryptd syscopyarea nvme sysfillrect psmouse 
sysimgblt fb_sys_fops nvme_core cec e1000e i2c_i801 drm i2c_smbus wmi video
[   36.749735] CPU: 7 PID: 829 Comm: iwd Tainted: G U 5.12.0-rc1-brauner-v5.12-rc1 #316
[   36.749737] Hardware name: LENOVO 20KHCTO1WW/20KHCTO1WW, BIOS N23ET75W (1.50 ) 10/13/2020
[   36.749738] RIP: 0010:nl80211_get_reg_do+0x215/0x250 [cfg80211]
[   36.749763] Code: 00 e8 8f f0 07 d0 85 c0 0f 84 ee fe ff ff eb a3 4c 89 e7 
48 89 45 c0 e8 49 6d 44 d0 e8 64 09 47 d0 48 8b 45 c0 e9 36 ff ff ff <0f> 0b 4c 
89 e7 e8 31 6d 44 d0 e8 4c 09 47 d0 b8 ea ff ff ff e9 1d
[   36.749765] RSP: 0018:a71800b7bb10 EFLAGS: 00010202
[   36.749766] RAX:  RBX: 0001 RCX: 
[   36.749767] RDX: 8bef4beb0008 RSI:  RDI: 8bef4beb0300
[   36.749768] RBP: a71800b7bb50 R08: 8bef4beb0300 R09: 8bef4b7ba014
[   36.749770] R10:  R11: 001f R12: 8bef4d38e800
[   36.749771] R13: a71800b7bb78 R14: 8bef4b7ba014 R15: 
[   36.749772] FS:  7f89333c4740() GS:8bf2d17c() 
knlGS:
[   36.749773] CS:  0010 DS:  ES:  CR0: 80050033
[   36.749774] CR2: 7ffdda038ca8 CR3: 000108ca0002 CR4: 003706e0
[   36.749776] DR0:  DR1:  DR2: 
[   36.749777] DR3:  DR6: fffe0ff0 DR7: 0400
[   36.749778] Call Trace:
[   36.749781]  ? rtnl_unlock+0xe/0x10
[   36.749786]  genl_family_rcv_msg_doit.isra.0+0xec/0x150
[   36.749791]  genl_rcv_msg+0xe5/0x1e0
[   36.749793]  ? __cfg80211_rdev_from_attrs+0x1c0/0x1c0 [cfg80211]
[   36.749821]  ? nl80211_send_regdom.constprop.0+0x1a0/0x1a0 [cfg80211]
[   36.749847]  ? genl_family_rcv_msg_doit.isra.0+0x150/0x150
[   36.749850]  netlink_rcv_skb+0x55/0x100
[   36.749853]  genl_rcv+0x29/0x40
[   36.749855]  netlink_unicast+0x1a8/0x250
[   36.749858]  netlink_sendmsg+0x233/0x460
[   36.749860]  sock_sendmsg+0x65/0x70
[   36.749863]  __sys_sendto+0x113/0x190
[   36.749865]  ? __secure_computing+0x42/0xe0
[   36.749867]  __x64_sys_sendto+0x29/0x30
[   36.749869]  do_syscall_64+0x38/0x90
[   36.749872]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   36.749874] RIP: 0033:0x7f89334e16c0
[   36.749875] Code: c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 
04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 68 c3 0f 1f 80 00 00 00 00 55 48 83 ec 20 48
[   36.749877] RSP: 002b:7ffdda03da08 EFLAGS: 0246 ORIG_RAX: 
002c
[   36.749878] RAX: ffda RBX: 561f2c3c7b00 RCX: 7f89334e16c0
[   36.749879] RDX: 001c RSI: 561f2c3e66c0 RDI: 0004
[   36.749880] RBP: 561f2c3d28e0 R08:  R09: 
[   36.749880] R10:  R11: 0246 R12: 7ffdda03da6c
[   36.749881] R13: 7ffdda03da68 R14: 561f2c3d1790 R15: 
[   36.749883] ---[ end trace 7cf430797302f3ab ]---

> dmesg | grep -i wifi
[   32.842573] Intel(R) Wireless WiFi driver for Linux
[   32.869098] iwlwifi :02:00.0: Found debug destination: EXTERNAL_DRAM
[   32.869102] iwlwifi :02:00.0: Found debug configuration: 0
[   32.869688] iwlwifi :02:00.0: 
