Re: [PATCH v2 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode
> On 20 Jun 2017, at 2:08 AM, Pravin Shelar wrote: > > On Mon, Jun 19, 2017 at 6:13 AM, 严海双 > wrote: >> >> >>> On 19 Jun 2017, at 1:43 PM, Pravin Shelar wrote: >>> >>> On Fri, Jun 16, 2017 at 8:27 PM, Haishuang Yan >>> wrote: >>>> In collect_md mode, if the tun dev is down, it still can call >>>> ip_tunnel_rcv to receive on packets, and the rx statistics increase >>>> improperly. >>>> >>>> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.") >>>> Cc: Pravin B Shelar >>>> Signed-off-by: Haishuang Yan >>>> >>>> --- >>>> Change since v2: >>>> * Fix wrong recipient addresss >>>> --- >>>> net/ipv4/ip_tunnel.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c >>>> index 0f1d876..a3caba1 100644 >>>> --- a/net/ipv4/ip_tunnel.c >>>> +++ b/net/ipv4/ip_tunnel.c >>>> @@ -176,7 +176,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct >>>> ip_tunnel_net *itn, >>>> return cand; >>>> >>>> t = rcu_dereference(itn->collect_md_tun); >>>> - if (t) >>>> + if (t && (t->dev->flags & IFF_UP)) >>>> return t; >>>> >>> It would be nice if we could increment drop count if tunnel device is not >>> up. >>> >> Hi Pravin >> >> I think it’s not necessary, for example as gre tunnel, if ipgre_rcv fails, >> it would trigger send an icmp unreachable >> message: >> >>if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD) >>return 0; >> >>icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); >> >> Since the tunnel device didn’t touch the packets, so increase drop >> statistics is not necessary. >> > icmp err packets are not reliable on all networks. device stats are > much more convenient during debugging connectivity issues. > Okay, if the tunnel device is not up, packets will transfer to fallback tunnel, and if the fallback device is up, the RX drops will be increased as expected: gre0: flags=193 mtu 1476 inet 172.16.20.1 netmask 255.255.255.0 unspec 00-00-00-00-00-00-F0-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC) RX packets 105 bytes 4522 (4.4 KiB) RX errors 0 dropped 105 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 My question is whether we should increase the drops of original tunnel device instead of fallback device?
Re: [PATCH 2/2] ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path.
> On May 21, 2016, at 1:48 AM, David Miller wrote: > > From: Haishuang Yan > Date: Wed, 18 May 2016 18:05:52 +0800 > >> In gre6 xmit path, we are sending a GRE packet, so set fl6 proto >> to IPPROTO_GRE properly. >> >> Signed-off-by: Haishuang Yan > > I think it would be a lot better to initialize the flow protocol field > properly in ip6gre_tnl_link_config() instead of fixing it up every > single transmit. > > I agree, I will modify the changes in v2. Thanks Haishuang
Re: [PATCH v2] ip_tunnel: enclose a code block in macro IS_ENABLED(CONFIG_IPV6)
> On May 24, 2016, at 11:14 AM, Eric Dumazet wrote: > > On Tue, 2016-05-24 at 10:39 +0800, Haishuang Yan wrote: >> For ipv6 case, enclose the code block in macro IS_ENABLED(CONFIG_IPV6). >> >> --- >> Changes in v2: >> - Place the "#if IS_ENABLED" block before the "} else if >> (..) {" piece and the "#endif" before the closing brace and this >> becomes much easier to look at. > > _Why_ is this patch needed ? > > Please describe in the changelog what _actual_ problem you are trying to > address. > > We have many points in the kernel using ipv6_get_dsfield() even if > CONFIG_IPV6=n, and it seems fine to me at least. > > Thanks. > > > Yes, you’re right, but I think add this patch seems more reasonable in coding. Thanks for your reviewing.
Re: [PATCH] geneve: fix max_mtu setting
> On Jun 26, 2016, at 8:35 PM, zhuyj wrote: > > + if (geneve->remote.sa.sa_family == AF_INET) > + max_mtu -= sizeof(struct iphdr); > + else > + max_mtu -= sizeof(struct ipv6hdr); > > Sorry, if sa_family is not AF_NET, it is AF_INET6? > > There is a lot of macros in include/linux/socket.h. > > Zhu Yanjun > There are only two enumerations AF_INET and AF_INET6 have been assigned in geneve_newlink: if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6]) return -EINVAL; if (data[IFLA_GENEVE_REMOTE]) { remote.sa.sa_family = AF_INET; remote.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_GENEVE_REMOTE]); } if (data[IFLA_GENEVE_REMOTE6]) { if (!IS_ENABLED(CONFIG_IPV6)) return -EPFNOSUPPORT; remote.sa.sa_family = AF_INET6; remote.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]); if (ipv6_addr_type(&remote.sin6.sin6_addr) & IPV6_ADDR_LINKLOCAL) { netdev_dbg(dev, "link-local remote is unsupported\n"); return -EINVAL; } } So I think the else case is for AF_INET6 only.
Re: [PATCH 1/2] ip_gre: fix potential memory leak in erspan_rcv
> On 2017年12月15日, at 上午2:47, William Tu wrote: > > On Thu, Dec 14, 2017 at 7:15 AM, Haishuang Yan > wrote: >> If md is NULL, tun_dst must be freed, otherwise it will cause memory >> leak. >> >> Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") >> Cc: William Tu >> Signed-off-by: Haishuang Yan >> --- >> net/ipv4/ip_gre.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c >> index d828821..9253d6f 100644 >> --- a/net/ipv4/ip_gre.c >> +++ b/net/ipv4/ip_gre.c >> @@ -304,8 +304,10 @@ static int erspan_rcv(struct sk_buff *skb, struct >> tnl_ptk_info *tpi, >>return PACKET_REJECT; >> >>md = ip_tunnel_info_opts(&tun_dst->u.tun_info); >> - if (!md) >> + if (!md) { >> + dst_release((struct dst_entry *)tun_dst); >>return PACKET_REJECT; >> + } > I'm not sure about this. Maybe we don't even need to check "if (!md)" > since ip_tun_rx_dst does the memory allocation. > William > Hi, William I think we need to check “if (!md)”, if md is okay, ip_tunnel_rcv will be responsible to free tun_dst: 448 drop: 449 if (tun_dst) 450 dst_release((struct dst_entry *)tun_dst); Thanks.
Re: [PATCH] ip6_gre: fix a pontential issue in ip6erspan_rcv
> On 2017年12月16日, at 上午3:11, David Miller wrote: > > From: Haishuang Yan > Date: Fri, 15 Dec 2017 10:46:38 +0800 > >> pskb_may_pull() can change skb->data, so we need to load ipv6h/ershdr at >> the right place. >> >> Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") >> Cc: William Tu >> Signed-off-by: Haishuang Yan > > The mentioned commit ID only exists in net-next, and this patch does not apply > cleanly there. > Okay, I will send v2 commit rebased on latest master and with prefix [net-next]. Thanks.
Re: [PATCH] ipv4: Namespaceify tcp_fastopen knob
> On 2017年9月13日, at 上午11:57, David Miller wrote: > > From: Haishuang Yan > Date: Tue, 12 Sep 2017 18:30:57 +0800 > >> Different namespace application might require enable TCP Fast Open >> feature independently of the host. >> >> Reported-by: Luca BRUNO >> Signed-off-by: Haishuang Yan > ... >> diff --git a/samples/bpf/test_ipip.sh b/samples/bpf/test_ipip.sh >> index 1969254..7bbc521 100755 >> --- a/samples/bpf/test_ipip.sh >> +++ b/samples/bpf/test_ipip.sh >> @@ -173,6 +173,8 @@ function cleanup { >> cleanup >> echo "Testing IP tunnels..." >> test_ipip >> +sleep 1 >> test_ipip6 >> +sleep 1 >> test_ip6ip6 >> echo "*** PASS ***" > > This seems like a completely unrelated change. > Sorry, I make a mistake for including my local test changes in this patch. I will remove this change in v2 commit. Thanks David for reviewing.
Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode
> On 2017年9月13日, at 上午7:43, Pravin Shelar wrote: > > On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan > wrote: >> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata >> mode, tos should also fallback to ip{4,6}_dst_hoplimit. >> >> Signed-off-by: Haishuang Yan >> >> --- >> Changes since v2: >> * Make the commit message more clearer. >> --- >> drivers/net/geneve.c | 6 ++ >> 1 file changed, 2 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c >> index f640407..d52a65f 100644 >> --- a/drivers/net/geneve.c >> +++ b/drivers/net/geneve.c >> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct >> net_device *dev, >>sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true); >>if (geneve->collect_md) { >>tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb); >> - ttl = key->ttl; >>} else { >>tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb); >> - ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst); >>} >> + ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst); >>df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0; >> > This changes user API of Geneve collect-metadata mode. I do not see > good reason for this. Why user can not set right TTL for the flow? > For example, I test this case with script samples/bpf/test_tunnel_bpf.sh, and modify samples/bpf/tcbpf2_kern.c with following patch: diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c index 370b749..16b09ed 100644 --- a/samples/bpf/tcbpf2_kern.c +++ b/samples/bpf/tcbpf2_kern.c @@ -204,7 +204,6 @@ int _geneve_set_tunnel(struct __sk_buff *skb) key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */ key.tunnel_id = 2; key.tunnel_tos = 0; - key.tunnel_ttl = 64; __builtin_memset(&gopt, 0x0, sizeof(gopt)); gopt.opt_class = 0x102; /* Open Virtual Networking (OVN) */ The user didn’t set key.tunnel_ttl and then key.tunnel_ttl is 0, so tcpdump output on veth1 would have ttl value of 0 in ip header: PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data. 18:59:27.347373 82:81:85:8a:72:62 > 7e:99:fb:3b:15:73, ethertype IPv4 (0x0800), length 104: (tos 0x0, id 7386, offset 0, flags [none], proto UDP (17), length 90) 64 bytes from 10.1.1.100: icmp_seq=1 ttl=64 time=0.083 ms 172.16.1.200.57776 > 172.16.1.100.6081: [no cksum] UDP, length 62 --- 10.1.1.100 ping statistics — After my patch, tcpdump output on veth1 would have ttl value of 64 in ip header: PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data. 19:12:10.870410 f2:56:13:68:e9:a9 > 8a:1e:93:dc:59:27, ethertype IPv4 (0x0800), length 104: (tos 0x0, ttl 64, id 48595, offset 0, flags [none], p roto UDP (17), length 90) 64 bytes from 10.1.1.100: icmp_seq=1 ttl=64 time=0.091 ms 172.16.1.200.57776 > 172.16.1.100.6081: [no cksum] UDP, length 62 --- 10.1.1.100 ping statistics ---
Re: [PATCH v2] ipv4: Namespaceify tcp_fastopen knob
> On 2017年9月13日, at 下午9:02, Eric Dumazet wrote: > > On Wed, 2017-09-13 at 05:44 -0700, Eric Dumazet wrote: >> On Wed, 2017-09-13 at 19:19 +0800, Haishuang Yan wrote: >>> Different namespace application might require enable TCP Fast Open >>> feature independently of the host. >>> >> >> Poor changelog, no actual description / list of sysctls that are moved >> to per netns. >> >> And looking at the patch, it seems your conversion is not complete. >> >> So I will ask you to provide more evidence that you tested your patch >> next time you submit it. > > I suggest you move one sysctl at a time, in a patch series. > > It will be easier to document and test for you, and review for us. > > Thanks. > Okay, I will split my patch for each sysctl change. Thanks.
Re: [PATCH] be2net: Fix some u16 fields appropriately
> On 2017年8月29日, at 上午7:19, David Miller wrote: > > From: Haishuang Yan > Date: Sun, 27 Aug 2017 15:24:45 +0800 > >> In be_tx_compl_process, frag_index declared as u32, so it's better to >> declare last_index as u32 also. >> >> CC: Ajit Khaparde >> Fixes: b0fd2eb28bd4 ("be2net: Declare some u16 fields as u32 to improve >> performance") >> Signed-off-by: Haishuang Yan > > That is not a legitimate reason for making this change. > >> @@ -255,7 +255,7 @@ struct be_tx_stats { >> /* Structure to hold some data of interest obtained from a TX CQE */ >> struct be_tx_compl_info { >> u8 status; /* Completion status */ >> -u16 end_index; /* Completed TXQ Index */ >> +u32 end_index; /* Completed TXQ Index */ >> }; >> >> struct be_tx_obj { > > The ->end_index comes solely from: > > txcp->end_index = GET_TX_COMPL_BITS(wrb_index, compl); > > Which is precisely a 16-bit value. > > I'm not applying this, sorry. > Hi David, The GET_TX_COMPL_BITS comes from amap_get which also returns a 32-bit value: 6 static inline u32 amap_get(void *ptr, u32 dw_offset, u32 mask, u32 offset) 5 { 4 u32 *dw = (u32 *) ptr; 3 return mask & (*(dw + dw_offset) >> offset); 2 } 1 869 #define AMAP_GET_BITS(_struct, field, ptr) \ 1 amap_get(ptr, \ 2 offsetof(_struct, field)/32,\ 3 amap_mask(sizeof(((_struct *)0)->field)), \ 4 AMAP_BIT_OFFSET(_struct, field))
Re: [PATCH] ipv4: Namespaceify tcp_max_orphans knob
> On 2017年9月9日, at 上午6:13, Cong Wang wrote: > > On Wed, Sep 6, 2017 at 8:10 PM, Haishuang Yan > wrote: >> Different namespace application might require different maximal number >> of TCP sockets independently of the host. > > So after your patch we could have N * net->ipv4.sysctl_tcp_max_orphans > in a whole system, right? This just makes OOM easier to trigger. > >From my understanding, before the patch, we had N * >net->ipv4.sysctl_tcp_max_orphans, and after the patch, we could have ns1.sysctl_tcp_max_orphans + ns2.sysctl_tcp_max_orphans + ns3.sysctl_tcp_max_orphans, is that right? Thanks for your reviewing.
Re: [PATCH] ipv4: Namespaceify tcp_max_orphans knob
> On 2017年9月9日, at 下午12:35, Cong Wang wrote: > > On Fri, Sep 8, 2017 at 6:25 PM, 严海双 wrote: >> >> >>> On 2017年9月9日, at 上午6:13, Cong Wang wrote: >>> >>> On Wed, Sep 6, 2017 at 8:10 PM, Haishuang Yan >>> wrote: >>>> Different namespace application might require different maximal number >>>> of TCP sockets independently of the host. >>> >>> So after your patch we could have N * net->ipv4.sysctl_tcp_max_orphans >>> in a whole system, right? This just makes OOM easier to trigger. >>> >> >> From my understanding, before the patch, we had N * >> net->ipv4.sysctl_tcp_max_orphans, >> and after the patch, we could have ns1.sysctl_tcp_max_orphans + >> ns2.sysctl_tcp_max_orphans >> + ns3.sysctl_tcp_max_orphans, is that right? Thanks for your reviewing. > > Nope, by N I mean the number of containers. Before your patch, the limit > is global, after your patch it is per container. > Yeah, for example, if there is N containers, before the patch, I mean the limit is: N * net->ipv4.sysctl_tcp_max_orphans After the patch, the limit is: ns1. net->ipv4.sysctl_tcp_max_orphans + ns2. net->ipv4.sysctl_tcp_max_orphans + …
Re: [PATCH] ipv4: Namespaceify tcp_max_orphans knob
> On 2017年9月9日, at 下午1:16, David Miller wrote: > > From: 严海双 > Date: Sat, 9 Sep 2017 13:09:57 +0800 > >> >> >>> On 2017年9月9日, at 下午12:35, Cong Wang wrote: >>> >>> On Fri, Sep 8, 2017 at 6:25 PM, 严海双 >>> wrote: >>>> >>>> >>>>> On 2017年9月9日, at 上午6:13, Cong Wang wrote: >>>>> >>>>> On Wed, Sep 6, 2017 at 8:10 PM, Haishuang Yan >>>>> wrote: >>>>>> Different namespace application might require different maximal number >>>>>> of TCP sockets independently of the host. >>>>> >>>>> So after your patch we could have N * net->ipv4.sysctl_tcp_max_orphans >>>>> in a whole system, right? This just makes OOM easier to trigger. >>>>> >>>> >>>> From my understanding, before the patch, we had N * >>>> net->ipv4.sysctl_tcp_max_orphans, >>>> and after the patch, we could have ns1.sysctl_tcp_max_orphans + >>>> ns2.sysctl_tcp_max_orphans >>>> + ns3.sysctl_tcp_max_orphans, is that right? Thanks for your reviewing. >>> >>> Nope, by N I mean the number of containers. Before your patch, the limit >>> is global, after your patch it is per container. >>> >> >> Yeah, for example, if there is N containers, before the patch, I mean the >> limit is: >> >> N * net->ipv4.sysctl_tcp_max_orphans >> >> After the patch, the limit is: >> >> ns1. net->ipv4.sysctl_tcp_max_orphans + ns2. >> net->ipv4.sysctl_tcp_max_orphans + … > > Not true. > > Please remove "N" from your equation of the current situation. > > "sysctl_tcp_max_orphans" applies to entire system, it is a global limit, > comparing one limit against all orphans in the system, there is no N. Yes, it’s right. I browse the source code and found that it’s a global limit, sorry for my mistake. Thanks David and Cong.
Re: [PATCH v3 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode
> On 2017年9月6日, at 上午11:14, Alexei Starovoitov wrote: > > On 9/4/17 1:36 AM, Haishuang Yan wrote: >> In collect_md mode, if the tun dev is down, it still can call >> __ip6_tnl_rcv to receive on packets, and the rx statistics increase >> improperly. >> >> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") >> Cc: Alexei Starovoitov >> Signed-off-by: Haishuang Yan >> >> --- >> Change since v3: >> * Increment rx_dropped if tunnel device is not up, suggested by >> Pravin B Shelar >> * Fix wrong recipient address >> --- >> net/ipv6/ip6_tunnel.c | 7 +-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c >> index 10a693a..e91d3b6 100644 >> --- a/net/ipv6/ip6_tunnel.c >> +++ b/net/ipv6/ip6_tunnel.c >> @@ -171,8 +171,11 @@ static struct net_device_stats *ip6_get_stats(struct >> net_device *dev) >> } >> >> t = rcu_dereference(ip6n->collect_md_tun); >> -if (t) >> -return t; >> +if (t) { >> +if (t->dev->flags & IFF_UP) >> +return t; >> +t->dev->stats.rx_dropped++; >> +} > > Why increment the stats only for this drop case? Because It was suggested by Pravin on v2 commit of the patch. > There are plenty of other conditions where packet > will be dropped in ip6 tunnel. I think it's important > to present consistent behavior to the users, > so I'd increment drop stats either for all drop cases > or for none. And today it's none. > The ! IFF_UP case should probably be return NULL too > >
Re: [PATCH] ip6_tunnel: fix potential issue in __ip6_tnl_rcv
> On 13 Jun 2017, at 4:56 PM, Haishuang Yan > wrote: > > When __ip6_tnl_rcv fails, the tun_dst won't be freed, so call > dst_release to free it in error code path. > > CC: Alexei Starovoitov > Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") > Signed-off-by: Haishuang Yan > Tested-by: Zhang Shengju > > --- > v4: > - Add tester information > v3: > - Free tun_dst from error code path > v2: > - Add the the missing Fixes information > --- > net/ipv6/ip6_tunnel.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c > index 9b37f97..ef99d59 100644 > --- a/net/ipv6/ip6_tunnel.c > +++ b/net/ipv6/ip6_tunnel.c > @@ -859,6 +859,8 @@ static int __ip6_tnl_rcv(struct ip6_tnl *tunnel, struct > sk_buff *skb, > return 0; > > drop: > + if (tun_dst) > + dst_release((struct dst_entry *)tun_dst); > kfree_skb(skb); > return 0; > } > -- > 1.8.3.1 > > Hi David, Please ignore the patches, I forgot to add subject prefix, sorry.
Re: [PATCH] ip6_tunnel: Correct tos value in collect_md mode
> On 14 Jun 2017, at 10:48 AM, Haishuang Yan > wrote: > > Same as ip_gre, geneve and vxlan, use key->tos as tos value. > > CC: Peter Dawson > Fixes: 0e9a709560db ("ip6_tunnel, ip6_gre: fix setting of DSCP on > encapsulated packets”) > Suggested-by: Daniel Borkmann > Signed-off-by: Haishuang Yan > > --- > Changes since v2: > * Add fixes information > * mask key->tos with RT_TOS() suggested by Daniel > --- > net/ipv6/ip6_tunnel.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c > index ef99d59..6400726 100644 > --- a/net/ipv6/ip6_tunnel.c > +++ b/net/ipv6/ip6_tunnel.c > @@ -1249,7 +1249,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device > *dev, __u8 dsfield, > fl6.flowi6_proto = IPPROTO_IPIP; > fl6.daddr = key->u.ipv6.dst; > fl6.flowlabel = key->label; > - dsfield = ip6_tclass(key->label); > + dsfield = RT_TOS(key->tos); > } else { > if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) > encap_limit = t->parms.encap_limit; > @@ -1320,7 +1320,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device > *dev, __u8 dsfield, > fl6.flowi6_proto = IPPROTO_IPV6; > fl6.daddr = key->u.ipv6.dst; > fl6.flowlabel = key->label; > - dsfield = ip6_tclass(key->label); > + dsfield = RT_TOS(key->tos); > } else { > offset = ip6_tnl_parse_tlv_enc_lim(skb, > skb_network_header(skb)); > /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head > */ > -- > 1.8.3.1 > > Sorry, I forgot to add subject prefix, please ignore this patch.
Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob
> On 2017年9月26日, at 上午7:24, David Miller wrote: > > From: Haishuang Yan > Date: Fri, 22 Sep 2017 21:48:43 +0800 > >> @@ -9,13 +9,18 @@ >> #include >> #include >> >> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx; >> - >> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock); >> - >> -void tcp_fastopen_init_key_once(bool publish) >> +void tcp_fastopen_init_key_once(struct net *net) > > Why did you remove the 'publish' logic from this function? > I think this logic is not necessary now, in proc_tcp_fastopen_key, I have removed tcp_fastopen_init_key_once(false) where the ‘publish’ is false: - /* Generate a dummy secret but don't publish it. This -* is needed so we don't regenerate a new key on the -* first invocation of tcp_fastopen_cookie_gen -*/ - tcp_fastopen_init_key_once(false); - tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH); + tcp_fastopen_reset_cipher(net, user_key, TCP_FASTOPEN_KEY_LENGTH); It said we don't regenerate a new key on first invocation of tcp_fastopen_cookie_gen, but in tcp_fastopen_cookie_gen,it didn’t call tcp_fastopen_init_key_once since from commit dfea2aa654243 (tcp: Do not call tcp_fastopen_reset_cipher from interrupt context): And in other places where call tcp_fastopen_init_key_once, the ‘publish’ is always true: --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -222,7 +222,7 @@ int inet_listen(struct socket *sock, int backlog) (tcp_fastopen & TFO_SERVER_ENABLE) && !inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) { fastopen_queue_tune(sk, backlog); - tcp_fastopen_init_key_once(true); + tcp_fastopen_init_key_once(sock_net(sk)); } --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2749,7 +2749,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, case TCP_FASTOPEN: if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) { - tcp_fastopen_init_key_once(true); + tcp_fastopen_init_key_once(net); fastopen_queue_tune(sk, val); } else { So I deleted ‘publish’ logic to ensure it was always true.
Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob
> On 2017年9月27日, at 上午2:18, David Miller wrote: > > From: 严海双 > Date: Tue, 26 Sep 2017 09:25:51 +0800 > >>> On 2017年9月26日, at 上午7:24, David Miller wrote: >>> >>> From: Haishuang Yan >>> Date: Fri, 22 Sep 2017 21:48:43 +0800 >>> >>>> @@ -9,13 +9,18 @@ >>>> #include >>>> #include >>>> >>>> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx; >>>> - >>>> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock); >>>> - >>>> -void tcp_fastopen_init_key_once(bool publish) >>>> +void tcp_fastopen_init_key_once(struct net *net) >>> >>> Why did you remove the 'publish' logic from this function? >>> >> >> I think this logic is not necessary now, in proc_tcp_fastopen_key, I have >> removed >> tcp_fastopen_init_key_once(false) where the ‘publish’ is false: >> >> -/* Generate a dummy secret but don't publish it. This >> - * is needed so we don't regenerate a new key on the >> - * first invocation of tcp_fastopen_cookie_gen >> - */ >> -tcp_fastopen_init_key_once(false); >> -tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH); >> +tcp_fastopen_reset_cipher(net, user_key, >> TCP_FASTOPEN_KEY_LENGTH); >> >> It said we don't regenerate a new key on first invocation of >> tcp_fastopen_cookie_gen, >> but in tcp_fastopen_cookie_gen,it didn’t call tcp_fastopen_init_key_once >> since >> from commit dfea2aa654243 (tcp: Do not call tcp_fastopen_reset_cipher from >> interrupt context): >> >> And in other places where call tcp_fastopen_init_key_once, the ‘publish’ is >> always true: > > Ok, this simplification seems legitimate. > > But it is unrelated to this namespacification. So it should be in a separate > patch, > and should be documented well in the commit message using the great > explanation you > gave to me above. > > Please respin this series, with this patch #2 split up into two changes. > > Thank you. Okay, thanks David for advise. I will split the patch #2 in next commit.
Re: [Patch v3 1/3] ipv4: Namespaceify tcp_fastopen knob
> On 2017年9月21日, at 上午5:22, David Miller wrote: > > From: Haishuang Yan > Date: Tue, 19 Sep 2017 17:38:14 +0800 > >> -if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) && >> -(sysctl_tcp_fastopen & TFO_SERVER_ENABLE) && >> +tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen; > ^^ > > Please change that to one space. > > And also please provide an appropriate "[PATCH vX 0/3] " header > posting when you respin this series. Sorry, it’s my mistake, thanks David. > >> @@ -282,18 +280,19 @@ struct sock *tcp_try_fastopen(struct sock *sk, struct >> sk_buff *skb, >> struct tcp_fastopen_cookie valid_foc = { .len = -1 }; >> bool syn_data = TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq + 1; >> struct sock *child; >> +int tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen; > > Please order local variables from longest to shortest line (aka. reverse > christmas tree format). > Okay, I’ll take care of such coding style in next commit, thanks!
Re: [PATCH 1/2] ip_tunnel: fix potential issue in ip_tunnel_rcv
> On 7 Jun 2017, at 10:48 PM, Eric Dumazet wrote: > > On Wed, 2017-06-07 at 22:16 +0800, Haishuang Yan wrote: >> When ip_tunnel_rcv fails, the tun_dst won't be freed, so move >> skb_dst_set to begin and tun_dst would be freed by kfree_skb. >> >> Signed-off-by: Haishuang Yan >> --- > > Please add the missing Fixes: tag and CC author of the patch that added > this bug, so that he has a chance to comment and avoid future similar > bugs. > > Thanks. > > > Ok, I will add these information in v2 commit. Thanks.
Re: [PATCH v2 1/2] ip_tunnel: fix potential issue in ip_tunnel_rcv
> On 8 Jun 2017, at 10:13 AM, Pravin Shelar wrote: > > On Wed, Jun 7, 2017 at 5:57 PM, Haishuang Yan > wrote: >> When ip_tunnel_rcv fails, the tun_dst won't be freed, so move >> skb_dst_set to begin and tun_dst would be freed by kfree_skb. >> >> CC: Pravin B Shelar >> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.") >> Signed-off-by: Haishuang Yan >> --- >> net/ipv4/ip_tunnel.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c >> index b878ecb..27fc20f 100644 >> --- a/net/ipv4/ip_tunnel.c >> +++ b/net/ipv4/ip_tunnel.c >> @@ -386,6 +386,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct >> sk_buff *skb, >>const struct iphdr *iph = ip_hdr(skb); >>int err; >> >> + if (tun_dst) >> + skb_dst_set(skb, (struct dst_entry *)tun_dst); >> + > If dst is set so early, skb_scrub_packet() would remove the tunnel dst > reference. > It is better to call skb_dst_drop() from error code path. Yes, I will change it in v3 commit, thanks! > >> #ifdef CONFIG_NET_IPGRE_BROADCAST >>if (ipv4_is_multicast(iph->daddr)) { >>tunnel->dev->stats.multicast++; >> @@ -439,9 +442,6 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct >> sk_buff *skb, >>skb->dev = tunnel->dev; >>} >> >> - if (tun_dst) >> - skb_dst_set(skb, (struct dst_entry *)tun_dst); >> - >>gro_cells_receive(&tunnel->gro_cells, skb); >>return 0; >> >> -- >> 1.8.3.1 >> >> >> >
Re: [PATCH] ip6_tunnel: Correct tos value in collect_md mode
> On 14 Jun 2017, at 1:28 PM, Peter Dawson wrote: > > On Wed, 14 Jun 2017 10:54:31 +0800 > 严海双 wrote: > > >>> Changes since v2: >>> * mask key->tos with RT_TOS() suggested by Daniel > > Can you help me understand the rationale for this change? Is there are bug > introduced by dsfield = ip6_tclass(key->label); ? > > The RT_TOS masks out 4bits of the 8bit tos field in accordance with RFC1349 > (obsoleted by RFC2474). IPv6 does not have a TOS field. So it dosen't make > sense to apply a TOS value to the outer header of an IPv6 tunnel. > > Hi, Peter Here the tos also means Traffic Class in IPv6, see the define in struct ip_tunnel_key: u8 tos;/* TOS for IPv4, TC for IPv6 */ RT_TOS mask is suggested by Daniel, please refer to the implement in vxlan or geneve code: fl6.flowlabel = ip6_make_flowinfo(RT_TOS(tos), label); Thanks.
Re: [PATCH v2] ip6_tunnel: Correct tos value in collect_md mode
> On 16 Jun 2017, at 10:44 PM, Daniel Borkmann wrote: > > On 06/15/2017 05:54 AM, Peter Dawson wrote: >> On Thu, 15 Jun 2017 10:30:29 +0800 >> Haishuang Yan wrote: >> >>> Same as ip_gre, geneve and vxlan, use key->tos as tos value. >>> >>> CC: Peter Dawson >>> Fixes: 0e9a709560db ("ip6_tunnel, ip6_gre: fix setting of DSCP on >>> encapsulated packets”) >>> Suggested-by: Daniel Borkmann >>> Signed-off-by: Haishuang Yan >>> >>> --- >>> Changes since v2: >>> * Add fixes information >>> * mask key->tos with RT_TOS() suggested by Daniel >>> --- >>> net/ipv6/ip6_tunnel.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c >>> index ef99d59..6400726 100644 >>> --- a/net/ipv6/ip6_tunnel.c >>> +++ b/net/ipv6/ip6_tunnel.c >>> @@ -1249,7 +1249,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct >>> net_device *dev, __u8 dsfield, >>> fl6.flowi6_proto = IPPROTO_IPIP; >>> fl6.daddr = key->u.ipv6.dst; >>> fl6.flowlabel = key->label; >>> - dsfield = ip6_tclass(key->label); >>> + dsfield = RT_TOS(key->tos); >>> } else { >>> if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT)) >>> encap_limit = t->parms.encap_limit; >>> @@ -1320,7 +1320,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct >>> net_device *dev, __u8 dsfield, >>> fl6.flowi6_proto = IPPROTO_IPV6; >>> fl6.daddr = key->u.ipv6.dst; >>> fl6.flowlabel = key->label; >>> - dsfield = ip6_tclass(key->label); >>> + dsfield = RT_TOS(key->tos); >>> } else { >>> offset = ip6_tnl_parse_tlv_enc_lim(skb, >>> skb_network_header(skb)); >>> /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head >>> */ >> >> I don't think it is correct to apply RT_TOS >> >> Here is my understanding based on the RFCs. >> >> IPv4/6 Header:0 |0 1 2 3 |0 1 2 3 |0 1 2 3 |0 1 2 3 | >> RFC2460(IPv6) |Version | Traffic Class || >> RFC2474(IPv6) |Version | DSCP|ECN|| >> RFC2474(IPv4) |Version | IHL |DSCP |ECN| >> RFC1349(IPv4) |Version | IHL | PREC | TOS |X| >> RFC791 (IPv4) |Version | IHL | TOS| >> >> u8 key->tos stores the full 8bits of Traffic class from an IPv6 header and; >> u8 key->tos stores the full 8bits of TOS(RFC791) from an IPv4 header >> u8 ip6_tclass will return the full 8bits of Traffic Class from an IPv6 >> flowlabel >> >> RT_TOS will return the RFC1349 4bit TOS field. >> >> Applying RT_TOS to a key->tos will result in lost information and the >> inclusion of 1 bit of ECN if the original field was a DSCP+ECN. >> >> Based on this understanding of the RFCs (but not years of experience) and >> since RFC1349 has been obsoleted by RFC2474 I think the use of RT_TOS should >> be deprecated. >> >> This being said, dsfield = ip6_tclass(key->label) = key->tos isn't fully >> correct either because the result will contain the ECN bits as well as the >> DSCP. >> >> I agree that code should be consistent, but not where there is a potential >> issue. > > Yeah, you're right. Looks like initial dsfield = key->tos diff was > the better choice then, sorry for my confusing comment. > > For example, bpf_skb_set_tunnel_key() helper that populates the collect > metadata as one user of this infra masks the key->label so that it really > only holds the label meaning previous dsfield = ip6_tclass(key->label) > will always be 0 in that case unlike key->tos that actually gets populated > and would propagate it. > Okay, I will change the commit back to initial version, thanks everyone.
Re: [PATCH v2 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode
> On 19 Jun 2017, at 1:43 PM, Pravin Shelar wrote: > > On Fri, Jun 16, 2017 at 8:27 PM, Haishuang Yan > wrote: >> In collect_md mode, if the tun dev is down, it still can call >> ip_tunnel_rcv to receive on packets, and the rx statistics increase >> improperly. >> >> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.") >> Cc: Pravin B Shelar >> Signed-off-by: Haishuang Yan >> >> --- >> Change since v2: >> * Fix wrong recipient addresss >> --- >> net/ipv4/ip_tunnel.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c >> index 0f1d876..a3caba1 100644 >> --- a/net/ipv4/ip_tunnel.c >> +++ b/net/ipv4/ip_tunnel.c >> @@ -176,7 +176,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct ip_tunnel_net >> *itn, >>return cand; >> >>t = rcu_dereference(itn->collect_md_tun); >> - if (t) >> + if (t && (t->dev->flags & IFF_UP)) >>return t; >> > It would be nice if we could increment drop count if tunnel device is not up. > Hi Pravin I think it’s not necessary, for example as gre tunnel, if ipgre_rcv fails, it would trigger send an icmp unreachable message: if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD) return 0; icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); Since the tunnel device didn’t touch the packets, so increase drop statistics is not necessary. Thanks