Re: [PATCH v2 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode

2017-06-21 Thread

> On 20 Jun 2017, at 2:08 AM, Pravin Shelar  wrote:
> 
> On Mon, Jun 19, 2017 at 6:13 AM, 严海双  
> wrote:
>> 
>> 
>>> On 19 Jun 2017, at 1:43 PM, Pravin Shelar  wrote:
>>> 
>>> On Fri, Jun 16, 2017 at 8:27 PM, Haishuang Yan
>>>  wrote:
>>>> In collect_md mode, if the tun dev is down, it still can call
>>>> ip_tunnel_rcv to receive on packets, and the rx statistics increase
>>>> improperly.
>>>> 
>>>> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
>>>> Cc: Pravin B Shelar 
>>>> Signed-off-by: Haishuang Yan 
>>>> 
>>>> ---
>>>> Change since v2:
>>>> * Fix wrong recipient addresss
>>>> ---
>>>> net/ipv4/ip_tunnel.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>> 
>>>> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
>>>> index 0f1d876..a3caba1 100644
>>>> --- a/net/ipv4/ip_tunnel.c
>>>> +++ b/net/ipv4/ip_tunnel.c
>>>> @@ -176,7 +176,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct 
>>>> ip_tunnel_net *itn,
>>>>   return cand;
>>>> 
>>>>   t = rcu_dereference(itn->collect_md_tun);
>>>> -   if (t)
>>>> +   if (t && (t->dev->flags & IFF_UP))
>>>>   return t;
>>>> 
>>> It would be nice if we could increment drop count if tunnel device is not 
>>> up.
>>> 
>> Hi Pravin
>> 
>> I think it’s not necessary, for example as gre tunnel, if ipgre_rcv fails, 
>> it would trigger send an icmp unreachable
>> message:
>> 
>>if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
>>return 0;
>> 
>>icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
>> 
>> Since the tunnel device didn’t touch the packets, so increase drop 
>> statistics is not necessary.
>> 
> icmp err packets are not reliable on all networks. device stats are
> much more convenient during debugging connectivity issues.
> 

Okay, if the tunnel device is not up, packets will transfer to fallback tunnel, 
and if 
the fallback device is up, the RX drops will be increased as expected:

gre0: flags=193  mtu 1476
inet 172.16.20.1  netmask 255.255.255.0
unspec 00-00-00-00-00-00-F0-00-00-00-00-00-00-00-00-00  txqueuelen 1000 
 (UNSPEC)
RX packets 105  bytes 4522 (4.4 KiB)
RX errors 0  dropped 105  overruns 0  frame 0
TX packets 0  bytes 0 (0.0 B)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

My question is  whether we should increase the drops of original tunnel device 
instead of fallback device?






Re: [PATCH 2/2] ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path.

2016-05-20 Thread

> On May 21, 2016, at 1:48 AM, David Miller  wrote:
> 
> From: Haishuang Yan 
> Date: Wed, 18 May 2016 18:05:52 +0800
> 
>> In gre6 xmit path, we are sending a GRE packet, so set fl6 proto
>> to IPPROTO_GRE properly.
>> 
>> Signed-off-by: Haishuang Yan 
> 
> I think it would be a lot better to initialize the flow protocol field
> properly in ip6gre_tnl_link_config() instead of fixing it up every
> single transmit.
> 
> 

I agree, I will modify the changes in v2.

Thanks

Haishuang




Re: [PATCH v2] ip_tunnel: enclose a code block in macro IS_ENABLED(CONFIG_IPV6)

2016-05-23 Thread

> On May 24, 2016, at 11:14 AM, Eric Dumazet  wrote:
> 
> On Tue, 2016-05-24 at 10:39 +0800, Haishuang Yan wrote:
>> For ipv6 case, enclose the code block in macro IS_ENABLED(CONFIG_IPV6).
>> 
>> ---
>> Changes in v2:
>>  - Place the "#if IS_ENABLED" block before the "} else if
>> (..) {" piece and the "#endif" before the closing brace and this
>> becomes much easier to look at.
> 
> _Why_ is this patch needed ?
> 
> Please describe in the changelog what _actual_ problem you are trying to
> address.
> 
> We have many points in the kernel using ipv6_get_dsfield() even if
> CONFIG_IPV6=n, and it seems fine to me at least.
> 
> Thanks.
> 
> 
> 

Yes, you’re right, but I think add this patch seems more reasonable in coding.

Thanks for your reviewing.



Re: [PATCH] geneve: fix max_mtu setting

2016-06-26 Thread

> On Jun 26, 2016, at 8:35 PM, zhuyj  wrote:
> 
> +   if (geneve->remote.sa.sa_family == AF_INET)
> +   max_mtu -= sizeof(struct iphdr);
> +   else
> +   max_mtu -= sizeof(struct ipv6hdr);
> 
> Sorry, if sa_family is not AF_NET, it is AF_INET6?
> 
> There is a lot of macros in include/linux/socket.h.
> 
> Zhu Yanjun
> 

There are only two enumerations AF_INET and AF_INET6 have been assigned in 
geneve_newlink:

if (data[IFLA_GENEVE_REMOTE] && data[IFLA_GENEVE_REMOTE6])
return -EINVAL;

if (data[IFLA_GENEVE_REMOTE]) {
remote.sa.sa_family = AF_INET;
remote.sin.sin_addr.s_addr =
nla_get_in_addr(data[IFLA_GENEVE_REMOTE]);
}

if (data[IFLA_GENEVE_REMOTE6]) {
if (!IS_ENABLED(CONFIG_IPV6))
return -EPFNOSUPPORT;

remote.sa.sa_family = AF_INET6;
remote.sin6.sin6_addr =
nla_get_in6_addr(data[IFLA_GENEVE_REMOTE6]);

if (ipv6_addr_type(&remote.sin6.sin6_addr) &
IPV6_ADDR_LINKLOCAL) {
netdev_dbg(dev, "link-local remote is unsupported\n");
return -EINVAL;
}
}

So I think the else case is for AF_INET6 only.



Re: [PATCH 1/2] ip_gre: fix potential memory leak in erspan_rcv

2017-12-14 Thread


> On 2017年12月15日, at 上午2:47, William Tu  wrote:
> 
> On Thu, Dec 14, 2017 at 7:15 AM, Haishuang Yan
>  wrote:
>> If md is NULL, tun_dst must be freed, otherwise it will cause memory
>> leak.
>> 
>> Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
>> Cc: William Tu 
>> Signed-off-by: Haishuang Yan 
>> ---
>> net/ipv4/ip_gre.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
>> index d828821..9253d6f 100644
>> --- a/net/ipv4/ip_gre.c
>> +++ b/net/ipv4/ip_gre.c
>> @@ -304,8 +304,10 @@ static int erspan_rcv(struct sk_buff *skb, struct 
>> tnl_ptk_info *tpi,
>>return PACKET_REJECT;
>> 
>>md = ip_tunnel_info_opts(&tun_dst->u.tun_info);
>> -   if (!md)
>> +   if (!md) {
>> +   dst_release((struct dst_entry *)tun_dst);
>>return PACKET_REJECT;
>> +   }
> I'm not sure about this. Maybe we don't even need to check "if (!md)"
> since ip_tun_rx_dst does the memory allocation.
> William
> 


Hi, William

I think we need to check “if (!md)”, if md is okay, ip_tunnel_rcv will be 
responsible to free
tun_dst:

 448 drop:
 449 if (tun_dst)
 450 dst_release((struct dst_entry *)tun_dst);

Thanks.









Re: [PATCH] ip6_gre: fix a pontential issue in ip6erspan_rcv

2017-12-15 Thread


> On 2017年12月16日, at 上午3:11, David Miller  wrote:
> 
> From: Haishuang Yan 
> Date: Fri, 15 Dec 2017 10:46:38 +0800
> 
>> pskb_may_pull() can change skb->data, so we need to load ipv6h/ershdr at
>> the right place.
>> 
>> Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support")
>> Cc: William Tu 
>> Signed-off-by: Haishuang Yan 
> 
> The mentioned commit ID only exists in net-next, and this patch does not apply
> cleanly there.
> 

Okay, I will send v2 commit rebased on latest master and with prefix [net-next].

Thanks.





Re: [PATCH] ipv4: Namespaceify tcp_fastopen knob

2017-09-13 Thread


> On 2017年9月13日, at 上午11:57, David Miller  wrote:
> 
> From: Haishuang Yan 
> Date: Tue, 12 Sep 2017 18:30:57 +0800
> 
>> Different namespace application might require enable TCP Fast Open
>> feature independently of the host.
>> 
>> Reported-by: Luca BRUNO 
>> Signed-off-by: Haishuang Yan 
> ...
>> diff --git a/samples/bpf/test_ipip.sh b/samples/bpf/test_ipip.sh
>> index 1969254..7bbc521 100755
>> --- a/samples/bpf/test_ipip.sh
>> +++ b/samples/bpf/test_ipip.sh
>> @@ -173,6 +173,8 @@ function cleanup {
>> cleanup
>> echo "Testing IP tunnels..."
>> test_ipip
>> +sleep 1
>> test_ipip6
>> +sleep 1
>> test_ip6ip6
>> echo "*** PASS ***"
> 
> This seems like a completely unrelated change.
> 

Sorry, I make a mistake for including my local test changes in this patch.

I will remove this change in v2 commit. Thanks David for reviewing.



Re: [PATCH v2] geneve: Fix setting ttl value in collect metadata mode

2017-09-13 Thread


> On 2017年9月13日, at 上午7:43, Pravin Shelar  wrote:
> 
> On Tue, Sep 12, 2017 at 12:05 AM, Haishuang Yan
>  wrote:
>> Similar to vxlan/ipip tunnel, if key->tos is zero in collect metadata
>> mode, tos should also fallback to ip{4,6}_dst_hoplimit.
>> 
>> Signed-off-by: Haishuang Yan 
>> 
>> ---
>> Changes since v2:
>>  * Make the commit message more clearer.
>> ---
>> drivers/net/geneve.c | 6 ++
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
>> index f640407..d52a65f 100644
>> --- a/drivers/net/geneve.c
>> +++ b/drivers/net/geneve.c
>> @@ -834,11 +834,10 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
>> net_device *dev,
>>sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
>>if (geneve->collect_md) {
>>tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
>> -   ttl = key->ttl;
>>} else {
>>tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
>> -   ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
>>}
>> +   ttl = key->ttl ? : ip4_dst_hoplimit(&rt->dst);
>>df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
>> 
> This changes user API of Geneve collect-metadata mode. I do not see
> good reason for this. Why user can not set right TTL for the flow?
> 

For example, I test this case with script samples/bpf/test_tunnel_bpf.sh, 
and modify samples/bpf/tcbpf2_kern.c with following patch:

diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 370b749..16b09ed 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -204,7 +204,6 @@ int _geneve_set_tunnel(struct __sk_buff *skb)
key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
key.tunnel_id = 2;
key.tunnel_tos = 0;
-   key.tunnel_ttl = 64;

__builtin_memset(&gopt, 0x0, sizeof(gopt));
gopt.opt_class = 0x102; /* Open Virtual Networking (OVN) */

The user didn’t set key.tunnel_ttl and then key.tunnel_ttl is 0, so tcpdump 
output on veth1 would have ttl value of 0 in ip header:

PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data.
18:59:27.347373 82:81:85:8a:72:62 > 7e:99:fb:3b:15:73, ethertype IPv4 (0x0800), 
length 104: (tos 0x0, id 7386, offset 0, flags [none], proto UDP
(17), length 90)
64 bytes from 10.1.1.100: icmp_seq=1 ttl=64 time=0.083 ms
172.16.1.200.57776 > 172.16.1.100.6081: [no cksum] UDP, length 62

--- 10.1.1.100 ping statistics —

After my patch, tcpdump output on veth1 would have ttl value of 64 in 
ip header:

PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data.
19:12:10.870410 f2:56:13:68:e9:a9 > 8a:1e:93:dc:59:27, ethertype IPv4 (0x0800), 
length 104: (tos 0x0, ttl 64, id 48595, offset 0, flags [none], p
roto UDP (17), length 90)
64 bytes from 10.1.1.100: icmp_seq=1 ttl=64 time=0.091 ms
172.16.1.200.57776 > 172.16.1.100.6081: [no cksum] UDP, length 62

--- 10.1.1.100 ping statistics ---









Re: [PATCH v2] ipv4: Namespaceify tcp_fastopen knob

2017-09-13 Thread


> On 2017年9月13日, at 下午9:02, Eric Dumazet  wrote:
> 
> On Wed, 2017-09-13 at 05:44 -0700, Eric Dumazet wrote:
>> On Wed, 2017-09-13 at 19:19 +0800, Haishuang Yan wrote:
>>> Different namespace application might require enable TCP Fast Open
>>> feature independently of the host.
>>> 
>> 
>> Poor changelog, no actual description / list of sysctls that are moved
>> to per netns.
>> 
>> And looking at the patch, it seems your conversion is not complete.
>> 
>> So I will ask you to provide more evidence that you tested your patch
>> next time you submit it.
> 
> I suggest you move one sysctl at a time, in a patch series.
> 
> It will be easier to document and test for you, and review for us.
> 
> Thanks.
> 

Okay, I will split my patch for each sysctl change. Thanks.





Re: [PATCH] be2net: Fix some u16 fields appropriately

2017-08-28 Thread

> On 2017年8月29日, at 上午7:19, David Miller  wrote:
> 
> From: Haishuang Yan 
> Date: Sun, 27 Aug 2017 15:24:45 +0800
> 
>> In be_tx_compl_process, frag_index declared as u32, so it's better to
>> declare last_index as u32 also.
>> 
>> CC: Ajit Khaparde 
>> Fixes: b0fd2eb28bd4 ("be2net: Declare some u16 fields as u32 to improve
>> performance")
>> Signed-off-by: Haishuang Yan 
> 
> That is not a legitimate reason for making this change.
> 
>> @@ -255,7 +255,7 @@ struct be_tx_stats {
>> /* Structure to hold some data of interest obtained from a TX CQE */
>> struct be_tx_compl_info {
>>  u8 status;  /* Completion status */
>> -u16 end_index;  /* Completed TXQ Index */
>> +u32 end_index;  /* Completed TXQ Index */
>> };
>> 
>> struct be_tx_obj {
> 
> The ->end_index comes solely from:
> 
>   txcp->end_index = GET_TX_COMPL_BITS(wrb_index, compl);
> 
> Which is precisely a 16-bit value.
> 
> I'm not applying this, sorry.
> 

Hi David,

The GET_TX_COMPL_BITS comes from amap_get which also returns a 32-bit value:

  6 static inline u32 amap_get(void *ptr, u32 dw_offset, u32 mask, u32 offset)
  5 {
  4 u32 *dw = (u32 *) ptr;
  3 return mask & (*(dw + dw_offset) >> offset);
  2 }
  1
869 #define AMAP_GET_BITS(_struct, field, ptr)  \
  1 amap_get(ptr,   \
  2 offsetof(_struct, field)/32,\
  3 amap_mask(sizeof(((_struct *)0)->field)),   \
  4 AMAP_BIT_OFFSET(_struct, field))





Re: [PATCH] ipv4: Namespaceify tcp_max_orphans knob

2017-09-08 Thread


> On 2017年9月9日, at 上午6:13, Cong Wang  wrote:
> 
> On Wed, Sep 6, 2017 at 8:10 PM, Haishuang Yan
>  wrote:
>> Different namespace application might require different maximal number
>> of TCP sockets independently of the host.
> 
> So after your patch we could have N * net->ipv4.sysctl_tcp_max_orphans
> in a whole system, right? This just makes OOM easier to trigger.
> 

>From my understanding, before the patch, we had N * 
>net->ipv4.sysctl_tcp_max_orphans,
and after the patch, we could have ns1.sysctl_tcp_max_orphans + 
ns2.sysctl_tcp_max_orphans
+ ns3.sysctl_tcp_max_orphans, is that right? Thanks for your reviewing.



Re: [PATCH] ipv4: Namespaceify tcp_max_orphans knob

2017-09-08 Thread


> On 2017年9月9日, at 下午12:35, Cong Wang  wrote:
> 
> On Fri, Sep 8, 2017 at 6:25 PM, 严海双  wrote:
>> 
>> 
>>> On 2017年9月9日, at 上午6:13, Cong Wang  wrote:
>>> 
>>> On Wed, Sep 6, 2017 at 8:10 PM, Haishuang Yan
>>>  wrote:
>>>> Different namespace application might require different maximal number
>>>> of TCP sockets independently of the host.
>>> 
>>> So after your patch we could have N * net->ipv4.sysctl_tcp_max_orphans
>>> in a whole system, right? This just makes OOM easier to trigger.
>>> 
>> 
>> From my understanding, before the patch, we had N * 
>> net->ipv4.sysctl_tcp_max_orphans,
>> and after the patch, we could have ns1.sysctl_tcp_max_orphans + 
>> ns2.sysctl_tcp_max_orphans
>> + ns3.sysctl_tcp_max_orphans, is that right? Thanks for your reviewing.
> 
> Nope, by N I mean the number of containers. Before your patch, the limit
> is global, after your patch it is per container.
> 

Yeah, for example, if there is N containers, before the patch, I mean the limit 
is:

N * net->ipv4.sysctl_tcp_max_orphans

After the patch, the limit is:

ns1. net->ipv4.sysctl_tcp_max_orphans + ns2. 
net->ipv4.sysctl_tcp_max_orphans + …






Re: [PATCH] ipv4: Namespaceify tcp_max_orphans knob

2017-09-09 Thread


> On 2017年9月9日, at 下午1:16, David Miller  wrote:
> 
> From: 严海双 
> Date: Sat, 9 Sep 2017 13:09:57 +0800
> 
>> 
>> 
>>> On 2017年9月9日, at 下午12:35, Cong Wang  wrote:
>>> 
>>> On Fri, Sep 8, 2017 at 6:25 PM, 严海双  
>>> wrote:
>>>> 
>>>> 
>>>>> On 2017年9月9日, at 上午6:13, Cong Wang  wrote:
>>>>> 
>>>>> On Wed, Sep 6, 2017 at 8:10 PM, Haishuang Yan
>>>>>  wrote:
>>>>>> Different namespace application might require different maximal number
>>>>>> of TCP sockets independently of the host.
>>>>> 
>>>>> So after your patch we could have N * net->ipv4.sysctl_tcp_max_orphans
>>>>> in a whole system, right? This just makes OOM easier to trigger.
>>>>> 
>>>> 
>>>> From my understanding, before the patch, we had N * 
>>>> net->ipv4.sysctl_tcp_max_orphans,
>>>> and after the patch, we could have ns1.sysctl_tcp_max_orphans + 
>>>> ns2.sysctl_tcp_max_orphans
>>>> + ns3.sysctl_tcp_max_orphans, is that right? Thanks for your reviewing.
>>> 
>>> Nope, by N I mean the number of containers. Before your patch, the limit
>>> is global, after your patch it is per container.
>>> 
>> 
>> Yeah, for example, if there is N containers, before the patch, I mean the 
>> limit is:
>> 
>>  N * net->ipv4.sysctl_tcp_max_orphans
>> 
>> After the patch, the limit is:
>> 
>>  ns1. net->ipv4.sysctl_tcp_max_orphans + ns2. 
>> net->ipv4.sysctl_tcp_max_orphans + …
> 
> Not true.
> 
> Please remove "N" from your equation of the current situation.
> 
> "sysctl_tcp_max_orphans" applies to entire system, it is a global limit,
> comparing one limit against all orphans in the system, there is no N.

Yes, it’s right. I browse the source code and found that it’s a global limit, 
sorry for my mistake.

Thanks David and Cong.






Re: [PATCH v3 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode

2017-09-05 Thread
> On 2017年9月6日, at 上午11:14, Alexei Starovoitov  wrote:
> 
> On 9/4/17 1:36 AM, Haishuang Yan wrote:
>> In collect_md mode, if the tun dev is down, it still can call
>> __ip6_tnl_rcv to receive on packets, and the rx statistics increase
>> improperly.
>> 
>> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
>> Cc: Alexei Starovoitov 
>> Signed-off-by: Haishuang Yan 
>> 
>> ---
>> Change since v3:
>>  * Increment rx_dropped if tunnel device is not up, suggested by
>>  Pravin B Shelar
>>  * Fix wrong recipient address
>> ---
>> net/ipv6/ip6_tunnel.c | 7 +--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
>> index 10a693a..e91d3b6 100644
>> --- a/net/ipv6/ip6_tunnel.c
>> +++ b/net/ipv6/ip6_tunnel.c
>> @@ -171,8 +171,11 @@ static struct net_device_stats *ip6_get_stats(struct 
>> net_device *dev)
>>  }
>> 
>>  t = rcu_dereference(ip6n->collect_md_tun);
>> -if (t)
>> -return t;
>> +if (t) {
>> +if (t->dev->flags & IFF_UP)
>> +return t;
>> +t->dev->stats.rx_dropped++;
>> +}
> 
> Why increment the stats only for this drop case?
Because It was suggested by Pravin on v2 commit of the patch.

> There are plenty of other conditions where packet
> will be dropped in ip6 tunnel. I think it's important
> to present consistent behavior to the users,
> so I'd increment drop stats either for all drop cases
> or for none. And today it's none.
> The ! IFF_UP case should probably be return NULL too
> 
> 





Re: [PATCH] ip6_tunnel: fix potential issue in __ip6_tnl_rcv

2017-06-13 Thread

> On 13 Jun 2017, at 4:56 PM, Haishuang Yan  
> wrote:
> 
> When __ip6_tnl_rcv fails, the tun_dst won't be freed, so call
> dst_release to free it in error code path.
> 
> CC: Alexei Starovoitov 
> Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
> Signed-off-by: Haishuang Yan 
> Tested-by: Zhang Shengju 
> 
> ---
> v4:
>  - Add tester information
> v3:
>  - Free tun_dst from error code path
> v2:
>  - Add the the missing Fixes information
> ---
> net/ipv6/ip6_tunnel.c | 2 ++
> 1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 9b37f97..ef99d59 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -859,6 +859,8 @@ static int __ip6_tnl_rcv(struct ip6_tnl *tunnel, struct 
> sk_buff *skb,
>   return 0;
> 
> drop:
> + if (tun_dst)
> + dst_release((struct dst_entry *)tun_dst);
>   kfree_skb(skb);
>   return 0;
> }
> -- 
> 1.8.3.1
> 
> 

Hi David, Please ignore the patches, I forgot to add subject prefix, sorry.



Re: [PATCH] ip6_tunnel: Correct tos value in collect_md mode

2017-06-13 Thread


> On 14 Jun 2017, at 10:48 AM, Haishuang Yan 
>  wrote:
> 
> Same as ip_gre, geneve and vxlan, use key->tos as tos value.
> 
> CC: Peter Dawson 
> Fixes: 0e9a709560db ("ip6_tunnel, ip6_gre: fix setting of DSCP on
> encapsulated packets”)
> Suggested-by: Daniel Borkmann 
> Signed-off-by: Haishuang Yan 
> 
> ---
> Changes since v2:
>  * Add fixes information
>  * mask key->tos with RT_TOS() suggested by Daniel
> ---
> net/ipv6/ip6_tunnel.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index ef99d59..6400726 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -1249,7 +1249,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
> *dev, __u8 dsfield,
>   fl6.flowi6_proto = IPPROTO_IPIP;
>   fl6.daddr = key->u.ipv6.dst;
>   fl6.flowlabel = key->label;
> - dsfield = ip6_tclass(key->label);
> + dsfield =  RT_TOS(key->tos);
>   } else {
>   if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
>   encap_limit = t->parms.encap_limit;
> @@ -1320,7 +1320,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
> *dev, __u8 dsfield,
>   fl6.flowi6_proto = IPPROTO_IPV6;
>   fl6.daddr = key->u.ipv6.dst;
>   fl6.flowlabel = key->label;
> - dsfield = ip6_tclass(key->label);
> + dsfield = RT_TOS(key->tos);
>   } else {
>   offset = ip6_tnl_parse_tlv_enc_lim(skb, 
> skb_network_header(skb));
>   /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head 
> */
> -- 
> 1.8.3.1
> 
> 

Sorry, I forgot to add subject prefix, please ignore this patch.





Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob

2017-09-25 Thread


> On 2017年9月26日, at 上午7:24, David Miller  wrote:
> 
> From: Haishuang Yan 
> Date: Fri, 22 Sep 2017 21:48:43 +0800
> 
>> @@ -9,13 +9,18 @@
>> #include 
>> #include 
>> 
>> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
>> -
>> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
>> -
>> -void tcp_fastopen_init_key_once(bool publish)
>> +void tcp_fastopen_init_key_once(struct net *net)
> 
> Why did you remove the 'publish' logic from this function?
> 

I think this logic is not necessary now, in proc_tcp_fastopen_key, I have 
removed 
tcp_fastopen_init_key_once(false) where the ‘publish’ is false:

-   /* Generate a dummy secret but don't publish it. This
-* is needed so we don't regenerate a new key on the
-* first invocation of tcp_fastopen_cookie_gen
-*/
-   tcp_fastopen_init_key_once(false);
-   tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH);
+   tcp_fastopen_reset_cipher(net, user_key, 
TCP_FASTOPEN_KEY_LENGTH);

It said we don't regenerate a new key on first invocation of 
tcp_fastopen_cookie_gen, 
but in tcp_fastopen_cookie_gen,it didn’t  call tcp_fastopen_init_key_once since
from commit dfea2aa654243 (tcp: Do not call tcp_fastopen_reset_cipher from 
interrupt context):

And in other places where call tcp_fastopen_init_key_once, the ‘publish’ is 
always true:

--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -222,7 +222,7 @@ int inet_listen(struct socket *sock, int backlog)
(tcp_fastopen & TFO_SERVER_ENABLE) &&
!inet_csk(sk)->icsk_accept_queue.fastopenq.max_qlen) {
fastopen_queue_tune(sk, backlog);
-   tcp_fastopen_init_key_once(true);
+   tcp_fastopen_init_key_once(sock_net(sk));
}

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2749,7 +2749,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_FASTOPEN:
if (val >= 0 && ((1 << sk->sk_state) & (TCPF_CLOSE |
TCPF_LISTEN))) {
-   tcp_fastopen_init_key_once(true);
+   tcp_fastopen_init_key_once(net);

fastopen_queue_tune(sk, val);
} else {


So I deleted ‘publish’ logic to ensure it was always true.










Re: [PATCH v4 2/3] ipv4: Namespaceify tcp_fastopen_key knob

2017-09-26 Thread


> On 2017年9月27日, at 上午2:18, David Miller  wrote:
> 
> From: 严海双 
> Date: Tue, 26 Sep 2017 09:25:51 +0800
> 
>>> On 2017年9月26日, at 上午7:24, David Miller  wrote:
>>> 
>>> From: Haishuang Yan 
>>> Date: Fri, 22 Sep 2017 21:48:43 +0800
>>> 
>>>> @@ -9,13 +9,18 @@
>>>> #include 
>>>> #include 
>>>> 
>>>> -struct tcp_fastopen_context __rcu *tcp_fastopen_ctx;
>>>> -
>>>> -static DEFINE_SPINLOCK(tcp_fastopen_ctx_lock);
>>>> -
>>>> -void tcp_fastopen_init_key_once(bool publish)
>>>> +void tcp_fastopen_init_key_once(struct net *net)
>>> 
>>> Why did you remove the 'publish' logic from this function?
>>> 
>> 
>> I think this logic is not necessary now, in proc_tcp_fastopen_key, I have 
>> removed 
>> tcp_fastopen_init_key_once(false) where the ‘publish’ is false:
>> 
>> -/* Generate a dummy secret but don't publish it. This
>> - * is needed so we don't regenerate a new key on the
>> - * first invocation of tcp_fastopen_cookie_gen
>> - */
>> -tcp_fastopen_init_key_once(false);
>> -tcp_fastopen_reset_cipher(user_key, TCP_FASTOPEN_KEY_LENGTH);
>> +tcp_fastopen_reset_cipher(net, user_key, 
>> TCP_FASTOPEN_KEY_LENGTH);
>> 
>> It said we don't regenerate a new key on first invocation of 
>> tcp_fastopen_cookie_gen, 
>> but in tcp_fastopen_cookie_gen,it didn’t  call tcp_fastopen_init_key_once 
>> since
>> from commit dfea2aa654243 (tcp: Do not call tcp_fastopen_reset_cipher from 
>> interrupt context):
>> 
>> And in other places where call tcp_fastopen_init_key_once, the ‘publish’ is 
>> always true:
> 
> Ok, this simplification seems legitimate.
> 
> But it is unrelated to this namespacification.  So it should be in a separate 
> patch,
> and should be documented well in the commit message using the great 
> explanation you
> gave to me above.
> 
> Please respin this series, with this patch #2 split up into two changes.
> 
> Thank you.

Okay, thanks David for advise. I will split the patch #2 in next commit.





Re: [Patch v3 1/3] ipv4: Namespaceify tcp_fastopen knob

2017-09-20 Thread


> On 2017年9月21日, at 上午5:22, David Miller  wrote:
> 
> From: Haishuang Yan 
> Date: Tue, 19 Sep 2017 17:38:14 +0800
> 
>> -if ((sysctl_tcp_fastopen & TFO_SERVER_WO_SOCKOPT1) &&
>> -(sysctl_tcp_fastopen & TFO_SERVER_ENABLE) &&
>> +tcp_fastopen =  sock_net(sk)->ipv4.sysctl_tcp_fastopen;
>  ^^
> 
> Please change that to one space.
> 
> And also please provide an appropriate "[PATCH vX 0/3] " header
> posting when you respin this series.

Sorry, it’s my mistake, thanks David.
> 
>> @@ -282,18 +280,19 @@ struct sock *tcp_try_fastopen(struct sock *sk, struct 
>> sk_buff *skb,
>>  struct tcp_fastopen_cookie valid_foc = { .len = -1 };
>>  bool syn_data = TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq + 1;
>>  struct sock *child;
>> +int tcp_fastopen = sock_net(sk)->ipv4.sysctl_tcp_fastopen;
> 
> Please order local variables from longest to shortest line (aka. reverse
> christmas tree format).
> 

Okay, I’ll take care of such coding style in next commit, thanks!





Re: [PATCH 1/2] ip_tunnel: fix potential issue in ip_tunnel_rcv

2017-06-07 Thread

> On 7 Jun 2017, at 10:48 PM, Eric Dumazet  wrote:
> 
> On Wed, 2017-06-07 at 22:16 +0800, Haishuang Yan wrote:
>> When ip_tunnel_rcv fails, the tun_dst won't be freed, so move
>> skb_dst_set to begin and tun_dst would be freed by kfree_skb.
>> 
>> Signed-off-by: Haishuang Yan 
>> ---
> 
> Please add the missing Fixes: tag and CC author of the patch that added
> this bug, so that he has a chance to comment and avoid future similar
> bugs.
> 
> Thanks.
> 
> 
> 

Ok, I will add these information in v2 commit.
Thanks.




Re: [PATCH v2 1/2] ip_tunnel: fix potential issue in ip_tunnel_rcv

2017-06-07 Thread


> On 8 Jun 2017, at 10:13 AM, Pravin Shelar  wrote:
> 
> On Wed, Jun 7, 2017 at 5:57 PM, Haishuang Yan
>  wrote:
>> When ip_tunnel_rcv fails, the tun_dst won't be freed, so move
>> skb_dst_set to begin and tun_dst would be freed by kfree_skb.
>> 
>> CC: Pravin B Shelar 
>> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
>> Signed-off-by: Haishuang Yan 
>> ---
>> net/ipv4/ip_tunnel.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
>> index b878ecb..27fc20f 100644
>> --- a/net/ipv4/ip_tunnel.c
>> +++ b/net/ipv4/ip_tunnel.c
>> @@ -386,6 +386,9 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct 
>> sk_buff *skb,
>>const struct iphdr *iph = ip_hdr(skb);
>>int err;
>> 
>> +   if (tun_dst)
>> +   skb_dst_set(skb, (struct dst_entry *)tun_dst);
>> +
> If dst is set so early, skb_scrub_packet() would remove the tunnel dst
> reference.
> It is better to call skb_dst_drop() from error code path.

Yes, I will change it in v3 commit, thanks!

> 
>> #ifdef CONFIG_NET_IPGRE_BROADCAST
>>if (ipv4_is_multicast(iph->daddr)) {
>>tunnel->dev->stats.multicast++;
>> @@ -439,9 +442,6 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct 
>> sk_buff *skb,
>>skb->dev = tunnel->dev;
>>}
>> 
>> -   if (tun_dst)
>> -   skb_dst_set(skb, (struct dst_entry *)tun_dst);
>> -
>>gro_cells_receive(&tunnel->gro_cells, skb);
>>return 0;
>> 
>> --
>> 1.8.3.1
>> 
>> 
>> 
> 






Re: [PATCH] ip6_tunnel: Correct tos value in collect_md mode

2017-06-14 Thread

> On 14 Jun 2017, at 1:28 PM, Peter Dawson  wrote:
> 
> On Wed, 14 Jun 2017 10:54:31 +0800
> 严海双  wrote:
> 
> 
>>> Changes since v2:
>>> * mask key->tos with RT_TOS() suggested by Daniel
> 
> Can you help me understand the rationale for this change? Is there are bug 
> introduced by dsfield = ip6_tclass(key->label); ?
> 
> The RT_TOS masks out 4bits of the 8bit tos field in accordance with RFC1349 
> (obsoleted by RFC2474). IPv6 does not have a TOS field. So it dosen't make 
> sense to apply a TOS value to the outer header of an IPv6 tunnel.
> 
> 

Hi, Peter

Here the tos also means Traffic Class in IPv6, see the define in struct 
ip_tunnel_key:

u8  tos;/* TOS for IPv4, TC for IPv6 */

RT_TOS mask is suggested by Daniel, please refer to the implement in vxlan or 
geneve code:

fl6.flowlabel = ip6_make_flowinfo(RT_TOS(tos), label);

Thanks.



Re: [PATCH v2] ip6_tunnel: Correct tos value in collect_md mode

2017-06-16 Thread


> On 16 Jun 2017, at 10:44 PM, Daniel Borkmann  wrote:
> 
> On 06/15/2017 05:54 AM, Peter Dawson wrote:
>> On Thu, 15 Jun 2017 10:30:29 +0800
>> Haishuang Yan  wrote:
>> 
>>> Same as ip_gre, geneve and vxlan, use key->tos as tos value.
>>> 
>>> CC: Peter Dawson 
>>> Fixes: 0e9a709560db ("ip6_tunnel, ip6_gre: fix setting of DSCP on
>>> encapsulated packets”)
>>> Suggested-by: Daniel Borkmann 
>>> Signed-off-by: Haishuang Yan 
>>> 
>>> ---
>>> Changes since v2:
>>>   * Add fixes information
>>>   * mask key->tos with RT_TOS() suggested by Daniel
>>> ---
>>>  net/ipv6/ip6_tunnel.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
>>> index ef99d59..6400726 100644
>>> --- a/net/ipv6/ip6_tunnel.c
>>> +++ b/net/ipv6/ip6_tunnel.c
>>> @@ -1249,7 +1249,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct 
>>> net_device *dev, __u8 dsfield,
>>> fl6.flowi6_proto = IPPROTO_IPIP;
>>> fl6.daddr = key->u.ipv6.dst;
>>> fl6.flowlabel = key->label;
>>> -   dsfield = ip6_tclass(key->label);
>>> +   dsfield =  RT_TOS(key->tos);
>>> } else {
>>> if (!(t->parms.flags & IP6_TNL_F_IGN_ENCAP_LIMIT))
>>> encap_limit = t->parms.encap_limit;
>>> @@ -1320,7 +1320,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct 
>>> net_device *dev, __u8 dsfield,
>>> fl6.flowi6_proto = IPPROTO_IPV6;
>>> fl6.daddr = key->u.ipv6.dst;
>>> fl6.flowlabel = key->label;
>>> -   dsfield = ip6_tclass(key->label);
>>> +   dsfield = RT_TOS(key->tos);
>>> } else {
>>> offset = ip6_tnl_parse_tlv_enc_lim(skb, 
>>> skb_network_header(skb));
>>> /* ip6_tnl_parse_tlv_enc_lim() might have reallocated skb->head 
>>> */
>> 
>> I don't think it is correct to apply RT_TOS
>> 
>> Here is my understanding based on the RFCs.
>> 
>> IPv4/6 Header:0 |0 1 2 3 |0 1 2 3 |0 1 2 3 |0 1 2 3 |
>> RFC2460(IPv6)   |Version | Traffic Class   ||
>> RFC2474(IPv6)   |Version | DSCP|ECN||
>> RFC2474(IPv4)   |Version |  IHL   |DSCP |ECN|
>> RFC1349(IPv4)   |Version |  IHL   | PREC |  TOS   |X|
>> RFC791 (IPv4)   |Version |  IHL   |  TOS|
>> 
>> u8 key->tos stores the full 8bits of Traffic class from an IPv6 header and;
>> u8 key->tos stores the full 8bits of TOS(RFC791) from an IPv4 header
>> u8 ip6_tclass will return the full 8bits of Traffic Class from an IPv6 
>> flowlabel
>> 
>> RT_TOS will return the RFC1349 4bit TOS field.
>> 
>> Applying RT_TOS to a key->tos will result in lost information and the 
>> inclusion of 1 bit of ECN if the original field was a DSCP+ECN.
>> 
>> Based on this understanding of the RFCs (but not years of experience) and 
>> since RFC1349 has been obsoleted by RFC2474 I think the use of RT_TOS should 
>> be deprecated.
>> 
>> This being said, dsfield = ip6_tclass(key->label) = key->tos isn't fully 
>> correct either because the result will contain the ECN bits as well as the 
>> DSCP.
>> 
>> I agree that code should be consistent, but not where there is a potential 
>> issue.
> 
> Yeah, you're right. Looks like initial dsfield = key->tos diff was
> the better choice then, sorry for my confusing comment.
> 
> For example, bpf_skb_set_tunnel_key() helper that populates the collect
> metadata as one user of this infra masks the key->label so that it really
> only holds the label meaning previous dsfield = ip6_tclass(key->label)
> will always be 0 in that case unlike key->tos that actually gets populated
> and would propagate it.
> 
Okay, I will change the commit back to initial version, thanks everyone.





Re: [PATCH v2 1/2] ip_tunnel: fix ip tunnel lookup in collect_md mode

2017-06-19 Thread


> On 19 Jun 2017, at 1:43 PM, Pravin Shelar  wrote:
> 
> On Fri, Jun 16, 2017 at 8:27 PM, Haishuang Yan
>  wrote:
>> In collect_md mode, if the tun dev is down, it still can call
>> ip_tunnel_rcv to receive on packets, and the rx statistics increase
>> improperly.
>> 
>> Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
>> Cc: Pravin B Shelar 
>> Signed-off-by: Haishuang Yan 
>> 
>> ---
>> Change since v2:
>>  * Fix wrong recipient addresss
>> ---
>> net/ipv4/ip_tunnel.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
>> index 0f1d876..a3caba1 100644
>> --- a/net/ipv4/ip_tunnel.c
>> +++ b/net/ipv4/ip_tunnel.c
>> @@ -176,7 +176,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct ip_tunnel_net 
>> *itn,
>>return cand;
>> 
>>t = rcu_dereference(itn->collect_md_tun);
>> -   if (t)
>> +   if (t && (t->dev->flags & IFF_UP))
>>return t;
>> 
> It would be nice if we could increment drop count if tunnel device is not up.
> 
Hi Pravin

I think it’s not necessary, for example as gre tunnel, if ipgre_rcv fails, it 
would trigger send an icmp unreachable
message:

if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
return 0;

icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);

Since the tunnel device didn’t touch the packets, so increase drop statistics 
is not necessary.

Thanks