[PATCH] netfilter: fix missing dependencies for NETFILTER_XT_MATCH_CONNLABEL

2013-02-03 Thread Florian Westphal
It was possible to set
NF_CONNTRACK=n
NF_CONNTRACK_LABELS=y

via NETFILTER_XT_MATCH_CONNLABEL=y:

warning: (NETFILTER_XT_MATCH_CONNLABEL) selects NF_CONNTRACK_LABELS which has
unmet direct dependencies (NET && INET && NETFILTER && NF_CONNTRACK)

Reported-by: Randy Dunlap 
Signed-off-by: Florian Westphal 
---
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index eb2c8eb..d4dd702 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -860,6 +860,7 @@ config NETFILTER_XT_MATCH_CONNBYTES
 config NETFILTER_XT_MATCH_CONNLABEL
tristate '"connlabel" match support'
select NF_CONNTRACK_LABELS
+   depends on NF_CONNTRACK
depends on NETFILTER_ADVANCED
---help---
  This match allows you to test and assign userspace-defined labels 
names
-- 
1.7.8.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Regression, bisected: hyperv shutdown panics guest

2013-02-07 Thread Florian Westphal
With 3.7, hyperv guest shutdown no longer works.
Instead, guest kernel throws a bunch of "BUG: scheduling-while-atomic"
errors and then dies.

reverting

commit 6c0c0d4d1080840eabb3d055d2fd8191c5fd
Author: hongfeng 
Date:   Thu Oct 4 17:12:25 2012 -0700
poweroff: fix bug in orderly_poweroff()

fixes this problem.

Greping for users of orderly_poweroff() shows that hyperv isn't
the only caller that invokes the function from irq context.

In fact, kdoc for orderly_poweroff says:

* This may be called from any context to trigger a system shutdown.
* If the orderly shutdown fails, it will force an immediate shutdown.

Any suggestions on how to properly fix this?

Thanks,
Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: xt_nat_init: BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0

2012-09-13 Thread Florian Westphal
Fengguang Wu  wrote:
> Hi Patrick,
> 
> This happens in today's linux-next tree and is pretty reproducible.
> [1.834544] nf_conntrack version 0.5.0 (1786 buckets, 7144 max)
> [1.835406] ctnetlink v0.93: registering with nfnetlink.
> [1.836202] BUG: unable to handle kernel NULL pointer dereference at 
> 00e0
> [1.837539] IP: [] mutex_lock_interruptible+0x23/0x70

Should be fixed by
commit 00545bec9412d130c77f72a08d6c8b6ad21d4a1e
Author: Pablo Neira Ayuso 
Subject: netfilter: fix crash during boot if NAT has been compiled built-in

But seems its not yet in net-next.  Pablo?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


"WRITE SAME failed. Manually zeroing" with mptsas driver

2013-08-26 Thread Florian Westphal
Hi.

I get repeated "WRITE SAME" failed errors with
"SAS1064ET" Controller (mptsas driver).  Excerpt:

[ 5898.784829] Sense Key : 0x5 [current]
[ 5898.784833] sd 6:1:0:0: [sda]
[ 5898.784835] ASC=0x0 ASCQ=0x0
[ 5898.784837] sd 6:1:0:0: [sda] CDB:
[ 5898.784838] cdb[0]=0x41: 41 00 07 4f db 12 00 00 08 00
[ 5898.784858] sda6: WRITE SAME failed. Manually zeroing.
[ 5898.74] sd 6:1:0:0: [sda]
[ 5898.78] Result: hostbyte=0x00 driverbyte=0x08
[ 5898.788891] sd 6:1:0:0: [sda]
[ 5898.788893] Sense Key : 0x5 [current]
[ 5898.788896] sd 6:1:0:0: [sda]
[ 5898.788898] ASC=0x0 ASCQ=0x0
[ 5898.788900] sd 6:1:0:0: [sda] CDB:
[ 5898.788902] cdb[0]=0x41: 41 00 07 4f db 1a 00 00 10 00
[ 5898.788922] sda6: WRITE SAME failed. Manually zeroing.
[ 5898.792943] sd 6:1:0:0: [sda]

I tested with commit 66c28f97120e8a ("[SCSI] sd: Update WRITE SAME
heuristics") but that had no effect.

I would be grateful for any other ideas/fixes to try.

For the time being, I applied following hack which suppresses
the message spew:

--- linux-3.8.6.orig/block/blk-lib.c
+++ linux-3.8.6/block/blk-lib.c
@@ -285,7 +285,9 @@ int __blkdev_issue_zeroout(struct block_
 int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 sector_t nr_sects, gfp_t gfp_mask)
 {
-   if (bdev_write_same(bdev)) {
+   static int failcnt;
+
+   if (failcnt < 16 && bdev_write_same(bdev)) {
unsigned char bdn[BDEVNAME_SIZE];
 
if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
@@ -294,6 +296,7 @@ int blkdev_issue_zeroout(struct block_de
 
bdevname(bdev, bdn);
pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
+   WARN_ON(++failcnt == 16);
}
 
return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);

In case it helps, the WARN yiels following backtrace:

[  641.039233] WARNING: at block/blk-lib.c:299 blkdev_issue_zeroout+0xc4/0xe6()
[  641.039234] Hardware name: PRIMERGY RX100 S7p
[  641.039236] Modules linked in: ifb xt_TCPMSS xt_REDIRECT ipt_MASQUERADE 
xt_policy xt_nat xt_length2(O) xt_CLASSIFY xt_hashlimit
xt_TPROXY nf_tproxy_core xt_socket xt_NFQUEUE xt_connmark xt_limit xt_mark 
xt_set xt_addrtype xt_tcpudp ip_set_hash_ip nfnetlink_queue nf_nat_ftp 
nf_conntrack_ftp
af_packet iptable_mangle iptable_nat nf_nat_ipv4 nf_nat xt_NFLOG 
xt_condition(O) xt_logmark xt_owner ipt_REJECT xt_state ip_set ip_scheduler 
nfnetlink_log
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter iptable_raw xt_CT 
nf_conntrack_netlink nfnetlink nf_conntrack ip_tables x_tables acpi_cpufreq 
mperf
crc32c_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 aes_generic xts 
gf128mul ehci_pci coretemp rtc_cmos ehci_hcd sg i2c_i801 evdev ac sr_mod cdrom
microcode button acpi_power_meter e1000e(O) container sd_mod fan thermal 
processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas
ahci libahci libata scsi_mod edd [last unloaded: ifb]
[  641.039321] Pid: 2425, comm: flush-8:0 Tainted: G   O 
3.8.6-22.geaf5f75-smp64 #1
[  641.039322] Call Trace:
[  641.039327]  [] ? blkdev_issue_zeroout+0xc4/0xe6
[  641.039331]  [] ? warn_slowpath_common+0x78/0x8d
[  641.039334]  [] ? blkdev_issue_zeroout+0xc4/0xe6
[  641.039339]  [] ? ext4_ext_zeroout+0x4a/0x55
[  641.039343]  [] ? ext4_ext_map_blocks+0x78a/0x1660
[  641.039348]  [] ? blk_recount_segments+0x1b/0x2c
[  641.039353]  [] ? mempool_alloc+0x54/0x132
[  641.039358]  [] ? __blk_segment_map_sg+0x115/0x153
[  641.039362]  [] ? ext4_map_blocks+0x149/0x23c
[  641.039366]  [] ? mpage_da_map_and_submit+0x9c/0x79d
[  641.039376]  [] ? mptscsih_qcmd+0x4e4/0x513 [mptscsih]
[  641.039387]  [] ? scsi_finish_command+0xb4/0xb4 [scsi_mod]
[  641.039391]  [] ? blk_peek_request+0x17c/0x18e
[  641.039395]  [] ? start_this_handle+0x418/0x429
[  641.039399]  [] ? find_get_pages_tag+0xf5/0x136
[  641.039404]  [] ? ext4_da_writepages+0x7f4/0x932
[  641.039409]  [] ? __writeback_single_inode+0x39/0xd1
[  641.039412]  [] ? writeback_sb_inodes+0x22e/0x3af
[  641.039416]  [] ? __writeback_inodes_wb+0x65/0xa1
[  641.039419]  [] ? wb_writeback+0x108/0x18c
[  641.039423]  [] ? lock_timer_base+0x26/0x4c
[  641.039426]  [] ? wb_do_writeback+0x141/0x1ac
[  641.039429]  [] ? del_timer+0x7c/0x7c
[  641.039433]  [] ? bdi_writeback_thread+0x82/0x140
[  641.039436]  [] ? wb_do_writeback+0x1ac/0x1ac
[  641.039440]  [] ? kthread+0xad/0xb7
[  641.039444]  [] ? kthread_freezable_should_stop+0x51/0x51
[  641.039450]  [] ? ret_from_fork+0x7c/0xb0
[  641.039453]  [] ? kthread_freezable_should_stop+0x51/0x51
[  641.039456] ---[ end trace d195e97e48c008d0 ]---

Thanks, Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: linux-next: Tree for Sep 4 (netfilter: xt_TPROXY)

2013-09-04 Thread Florian Westphal
Randy Dunlap  wrote:
> On 09/04/13 01:13, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Please do not add any code for v3.13 to your linux-next included branches
> > until after v3.12-rc1 is released.
> > 
> > Changes since 20130902:
> > 
> 
> on x86_64:
> 
> when CONFIG_IPV6=m
> and CONFIG_NETFILTER_XT_TARGET_TPROXY=y:
> 
> net/built-in.o: In function `tproxy_tg6_v1':
> xt_TPROXY.c:(.text+0x5dc05): undefined reference to `udp6_lib_lookup'
> xt_TPROXY.c:(.text+0x5e32f): undefined reference to `udp6_lib_lookup'
> xt_TPROXY.c:(.text+0x5e432): undefined reference to `udp6_lib_lookup'
> net/built-in.o: In function `tproxy_tg_init':
> xt_TPROXY.c:(.init.text+0x1540): undefined reference to 
> `nf_defrag_ipv6_enable'

Thanks for reporting.

I can reproduce same error with 3.10.6 stable tree, so its not
a recent problem.

As always, the tempting solution is to just forbid TPROXY=y with IPV6=m but
it would be better to get rid of the ipv6 link time deps..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests

2013-04-10 Thread Florian Westphal
CAI Qian  wrote:

[ CC'd nf-devel ]

> Just hit this very often during IPv6 tests in both the latest stable
> and mainline kernel.
> 
> [ 3597.206166] Modules linked in:
[..]
> nf_nat_ipv4(F-)
[..]

> [ 3597.804861] RIP: 0010:[]  [] 
> nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat] 
> [ 3597.855207] RSP: 0018:880202c63d40  EFLAGS: 00010246 
> [ 3597.881350] RAX:  RBX: 8801ac7bec28 RCX: 
> 8801d0eedbe0 
> [ 3597.917226] RDX: dead00200200 RSI: 0011 RDI: 
> a03265b8 
[..]

> [ 3598.421036]
> [ 3598.430467]  [] __nf_ct_ext_destroy+0x44/0x60 
> [nf_conntrack] 
> [ 3598.499191]  [] nf_conntrack_free+0x2e/0x70 
> [nf_conntrack] 
> [ 3598.534121]  [] destroy_conntrack+0xbd/0x110 
> [nf_conntrack] 
> [ 3598.569981]  [] nf_conntrack_destroy+0x17/0x20 
> [ 3598.599579]  [] death_by_timeout+0xdc/0x1b0 
> [nf_conntrack]
[..]
> [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 
> 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 
> 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7  
> [ 3599.337037] RIP  [] nf_nat_cleanup_conntrack+0x42/0x70 
> [nf_nat] 

Looks like we tried to remove bysource hash twice (rdx is
LIST_POISON_2).

I wonder if this would explain it:

static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto)
{
[..]
  /* Step 1 - remove from bysource hash */
  clean.hash = true;
  for_each_net(net)
nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean);

A nfct->timer fires and a conntrack is free'd before step 2 memsets the
nat extension.  In that case, we would try to delete nat->bysource
again?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] net: frag, fix race conditions in LRU list maintenance

2013-05-05 Thread Florian Westphal
Konstantin Khlebnikov  wrote:
> This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add()
> which was introduced in commit 3ef0eb0db4bf92c6d2510fe5c4dc51852746f206
> ("net: frag, move LRU list maintenance outside of rwlock")
> 
> One cpu already added new fragment queue into hash but not into LRU.
> Other cpu found it in hash and tries to move it to the end of LRU.
> This leads to NULL pointer dereference inside of list_move_tail().
>
> Another possible race condition is between inet_frag_lru_move() and
> inet_frag_lru_del(): move can happens after deletion.

True, thanks for fixing this problem.

> This patch initializes LRU list head before adding fragment into hash and
> inet_frag_lru_move() doesn't touches it if it's empty.

Acked-by: Florian Westphal 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


"WRITE SAME failed. Manually zeroing" with 3w-xxxx driver

2013-04-29 Thread Florian Westphal
After update to 3.8 dmesg is spammed with:

kernel: [ 280.272094] 3w-: scsi8: Unknown scsi opcode: 0x41
kernel: [ 280.272107] sd 8:0:0:0: [sda] Unhandled error code
kernel: [ 280.272110] sd 8:0:0:0: [sda]
kernel: [ 280.272112] Result: hostbyte=0x04 driverbyte=0x00
kernel: [ 280.272114] sd 8:0:0:0: [sda] CDB:
kernel: [ 280.272116] cdb[0]=0x41: 41 00 10 09 aa 22 00 00 08 00
kernel: [ 280.272123] end_request: I/O error, dev sda, sector 269068834
kernel: [ 280.272130] sda6: WRITE SAME failed. Manually zeroing.
[..]

This goes on and on.

I am not familiar with block layer or scsi drivers, so I don't know if this
is an issue with the 3w- driver or the block layer.

I've hacked around this by reverting "block: Make blkdev_issue_zeroout use 
WRITE SAME"
(579e8f3c7b2ecf7db91398d942d76457a3ddba21), as this hw doesn't support
write_same anyway.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: "WRITE SAME failed. Manually zeroing" with 3w-xxxx driver

2013-04-29 Thread Florian Westphal
Martin K. Petersen  wrote:
> Florian> After update to 3.8 dmesg is spammed with: kernel: [
> Florian> 280.272094] 3w-: scsi8: Unknown scsi opcode: 0x41 kernel: [
> Florian> 280.272107] sd 8:0:0:0: [sda] Unhandled error code kernel: 
> 
> Interesting. It looks like the 3ware handles this at the driver level
> instead of passing the command through to the disk and letting it
> fail. That in turn means that the logic we have in place to disable WS
> when the disk does not support it does not get triggered.
> 
> The driver should really fill out the sense buffer in that case.
> 
> Could you please test the patch below?

Sure, I'll report back tomorrow.

Thanks for the quick response,
Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: "WRITE SAME failed. Manually zeroing" with 3w-xxxx driver

2013-04-30 Thread Florian Westphal
Martin K. Petersen  wrote:
> >>>>> "Florian" == Florian Westphal  writes:
> 
> Florian> After update to 3.8 dmesg is spammed with: kernel: [
> Florian> 280.272094] 3w-: scsi8: Unknown scsi opcode: 0x41 kernel: [
> Florian> 280.272107] sd 8:0:0:0: [sda] Unhandled error code kernel: 
> 
> Could you please test the patch below?

Works. Only one WRITE_SAME error at boot, max_write_same_blocks in sysfs
is 0, which wasn't the case before.

> The second question is what it is that's issuing these zeroouts at boot?
> Which filesystem are you using? What's your DM/MD config?

ext4, no DM/MD is used.  I guess the zeroouts are from postgres, but i'm
not sure.

> 3w-: Create sense buffer for unsupported commands
> 
> Make the driver return appropriate sense data when an unsupported
> operation is queued. This will cause the SCSI layer to stop issuing the
> offending command.
> 
> Reported-by: Florian Westphal 
> CC: adam radford 
> Signed-off-by: Martin K. Petersen 

Tested-by: Florian Westphal 

Thanks again.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH net-next] x86: bpf_jit_comp: secure bpf jit against spraying attacks

2013-05-20 Thread Florian Westphal
Eric Dumazet  wrote:
> From: Eric Dumazet 
> 
> hpa bringed into my attention some security related issues
> with BPF JIT on x86.
> 
> This patch makes sure the bpf generated code is marked read only,
> as other kernel text sections.
> 
> It also splits the unused space (we vmalloc() and only use a fraction of
> the page) in two parts, so that the generated bpf code not starts at a
> known offset in the page, but a pseudo random one.
> 
> Refs:
> http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html

What about emitting additional instructions at random locations in the
generated code itself?

Eg., after every instruction, have random chance to insert
'xor $0xcc,%al; xor $0xcc,%al', etc?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: sk_page_frag_refill OOM killing spree

2013-05-22 Thread Florian Westphal
David Rientjes  wrote:
> > On Tue, 2013-05-21 at 14:28 +0200, Florian Westphal wrote:
> > > seems like sk_page_frag_refill() can cause oom-killer invocation:
> > > 
> > > postgres invoked oom-killer: gfp_mask=0x42d0, order=3, oom_score_adj=0
> > > Pid: 10551, comm: postgres Tainted: G   O 3.8.6-5.g613ca40-smp #1
> > > Call Trace:
> > >  [] ? dump_header+0x60/0x191
> > >  [] ? ___ratelimit+0xb2/0xc4
> > >  [] ? oom_kill_process+0x61/0x2d1
> > >  [] ? has_capability_noaudit+0x1c/0x23
> > >  [] ? oom_badness+0x8c/0xef
> > >  [] ? out_of_memory+0x203/0x247
> > >  [] ? __alloc_pages_nodemask+0x42b/0x4c3
> > >  [] ? sk_page_frag_refill+0x6a/0xd2
> > >  [] ? tcp_sendmsg+0x3e8/0x7c6
> > >  [] ? inet_sendmsg+0x6b/0x75
> > >  [] ? sock_sendmsg+0x8d/0xa6
> > >  [] ? sys_sendto+0x105/0x130
> > >  [] ? __kunmap_atomic+0x62/0x8a
> > >  [] ? __kunmap_atomic+0x7b/0x8a
> > >  [] ? __lru_cache_add+0x18/0x47
> > >  [] ? handle_pte_fault+0x745/0x751
> > >  [] ? kmap_atomic_prot+0xd3/0xf1
> > >  [] ? handle_mm_fault+0x112/0x121
> > >  [] ? sys_send+0x37/0x3b
> > > 
> > > The system is busy, so, order-3 alloc failure doesn't strike me as odd.
> > > 
> > > There are no allocation failures with order != 3.
> > > 
> > > Sometimes this can happen in very short sucession, i.e.
> > > and oom-killer did end up zapping 30 processes or so.
> 
> Aside from the __GFP_NORETRY issue, could you post the full oom killer log 
> where it kills more than one process with /proc/sys/vm/oom_dump_tasks 
> enabled?  That shouldn't happen unless you have a memory leak.

http://strlen.de/fw/oom.txt

HOWEVER, before you spend time on this:
I don't think there is an issue with oom killer, I only noticed
that after kernel update oom killer invocations are
 a) more frequent than before
 b) often show above backtrace (sk_page_frag_refill).

I'm not even saying "sk_page_frag_refill is broken", since I don't know if
adding GFP_NORETRY might add silent performance degradation, etc.

Its very well possible that everything is working as intended :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] kvm: disable stealtime via reboot notifier to avoid mem corruption

2012-08-10 Thread Florian Westphal
else, we get memory corruption on reboot; found when tracking down
initramfs unpack error on initial reboot (with qemu-kvm -smp 2,
no problem with single-core).

problem with doing it via kvm_shutdown() is that this file
depends on CONFIG_KVM_CLOCK, also its not enough to call it for one
cpu only.

Signed-off-by: Florian Westphal 
---
 not subscribed, please CC on replies.
 Also, I don't know much about kvm or kexec, so its possible
 that i missed something.

 In any case, this seems to fix the initramfs corruption for me.
 patch is against virt/kvm/kvm.git.

 arch/x86/kernel/kvm.c  |1 +
 arch/x86/kernel/kvmclock.c |1 -
 2 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..1596cc8 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -354,6 +354,7 @@ static void kvm_pv_guest_cpu_reboot(void *unused)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
kvm_pv_disable_apf();
+   kvm_disable_steal_time();
 }
 
 static int kvm_pv_reboot_notify(struct notifier_block *nb,
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index f1b42b3..5a2fa7d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -191,7 +191,6 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
 static void kvm_shutdown(void)
 {
native_write_msr(msr_kvm_system_time, 0, 0);
-   kvm_disable_steal_time();
native_machine_shutdown();
 }
 
-- 
1.7.8.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kvm: disable stealtime via reboot notifier to avoid mem corruption

2012-08-14 Thread Florian Westphal
Marcelo Tosatti  wrote:
> On Fri, Aug 10, 2012 at 12:36:22PM +0200, Florian Westphal wrote:
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -191,7 +191,6 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
> >  static void kvm_shutdown(void)
> >  {
> > native_write_msr(msr_kvm_system_time, 0, 0);
> > -   kvm_disable_steal_time();
> > native_machine_shutdown();
> >  }
> This part below will introduce a bug for shutdown. Can you retest with
> the addition of kvm_disable_steal_time to kvm_pv_guest_cpu_reboot only,
> retest and resend please?

I can, but the problem with kvm_disable_steal_time() in
kvmclock.c is that with CONFIG_KVM_CLOCK=n the entire
file won't be compiled.

And steal time doesn't depend on CONFIG_KVM_CLOCK=y.
So if removing it there is a bug leaving it in only avoids
that bug for CONFIG_KVM_CLOCK=y.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH V2] KVM: x86: disable stealtime on reboot to avoid mem corruption

2012-08-15 Thread Florian Westphal
else, host continues to update stealtime after reboot,
which can corrupt e.g. initramfs area.
found when tracking down initramfs unpack error on initial reboot
(with qemu-kvm -smp 2, no problem with single-core).

Signed-off-by: Florian Westphal 
---
 arch/x86/kernel/kvm.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

Changes since v1:
- leave arch/x86/kernel/kvmclock.c as-is (requested by Marcelo)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1d61ee..1596cc8 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -354,6 +354,7 @@ static void kvm_pv_guest_cpu_reboot(void *unused)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
wrmsrl(MSR_KVM_PV_EOI_EN, 0);
kvm_pv_disable_apf();
+   kvm_disable_steal_time();
 }
 
 static int kvm_pv_reboot_notify(struct notifier_block *nb,
-- 
1.7.8.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression, bisected: openpty fails from 3.7 onwards without devpts

2013-01-10 Thread Florian Westphal
Alan Cox  wrote:
> On Thu, 10 Jan 2013 15:46:26 +0100
> Florian Westphal  wrote:
> > Frank Lichtenheld discovered that openpty() doesn't work anymore when
> > /dev/pts is not present.
> > 
> > We bisected this down to
> > 
> > commit bbb63c514a3464342967237a51a21ea8f61ab951
> > Author: Wanlong Gao 
> > Subject: drivers:tty:fix up ENOIOCTLCMD error handling

[..]

> > #include 
> > #include 
> > int main(void) {
> >  int pty_fd, tty_fd;
> >  if (openpty(&pty_fd, &tty_fd, NULL, NULL, NULL) != 0) {
> >   perror("openpty");
> >   return 1;
> >  }
> > return 0;
> > }
> > 
> > [ compile with cc -lutil pty.c -o pty ]
> > 
> > If devpts is available or above commit reverted openpty works again.
> 
> The commit is fairly general - what we need to do here is to figure out
> which specific thing trips up openpty so we can put the error on that
> back as it was (or find a better way) so it still works.
> 
> Can you attach an strace of the working/failing cases without /dev/pts 

Sure, attached.  /dev/pts is not present.
Both traces are from the same machine, with same kernel version (except
above commit reverted).
execve("./ptytest", ["./ptytest"], [/* 7 vars */]) = 0
brk(0)  = 0x1d04000
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fe6784d
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=15432, ...}) = 0
mmap(NULL, 15432, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe6784cc000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libutil.so.1", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\16\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=10640, ...}) = 0
mmap(NULL, 2105608, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7fe6780b
mprotect(0x7fe6780b2000, 2093056, PROT_NONE) = 0
mmap(0x7fe6782b1000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7fe6782b1000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3
read(3, 
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\357\1\0\0\0\0\0"..., 832) = 
832
fstat(3, {st_mode=S_IFREG|0755, st_size=1595408, ...}) = 0
mmap(NULL, 3709016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7fe677d26000
mprotect(0x7fe677ea6000, 2097152, PROT_NONE) = 0
mmap(0x7fe6780a6000, 20480, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18) = 0x7fe6780a6000
mmap(0x7fe6780ab000, 18520, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe6780ab000
close(3)= 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fe6784cb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fe6784ca000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fe6784c9000
arch_prctl(ARCH_SET_FS, 0x7fe6784ca700) = 0
mprotect(0x7fe6780a6000, 16384, PROT_READ) = 0
mprotect(0x7fe6782b1000, 4096, PROT_READ) = 0
mprotect(0x7fe6784d2000, 4096, PROT_READ) = 0
munmap(0x7fe6784cc000, 15432)   = 0
open("/dev/ptmx", O_RDWR)   = -1 ENOSPC (No space left on device)
open("/dev/ptyp0", O_RDWR)  = 3
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(3, TIOCGPTN, [0]) = -1 EINVAL (Invalid argument)
fstat(3, {st_mode=S_IFCHR|S_ISVTX|0666, st_rdev=makedev(2, 0), ...}) = 0
stat("/dev/ttyp0", {st_mode=S_IFCHR|S_ISVTX|0660, st_rdev=makedev(3, 0), ...}) 
= 0
getuid()= 0
socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT 
(No such file or directory)
close(4)= 0
socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT 
(No such file or directory)
close(4)= 0
brk(0)  = 0x1d04000
brk(0x1d25000)  = 0x1d25000
open("/etc/nsswitch.conf", O_RDONLY)= 4
fstat(4, {st_mode=S_IFREG|0644, st_size=475, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_AN

Re: regression, bisected: openpty fails from 3.7 onwards without devpts

2013-01-11 Thread Florian Westphal
Jiri Slaby  wrote:
> On 01/10/2013 11:51 PM, Jiri Slaby wrote:
> > On 01/10/2013 11:45 PM, Alan Cox wrote:
> >> So we should just fix TIOCGPTN on a pty with no suitable name answer to
> >> return -EINVAL
> > 
> > Yes, I agree as I'm expressed in my second mail. Sorry for the confusion.
> 
> Does the attached patch help?

Yes, it does.  Thanks for fixing this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


regression, bisected: openpty fails from 3.7 onwards without devpts

2013-01-10 Thread Florian Westphal
Frank Lichtenheld discovered that openpty() doesn't work anymore when
/dev/pts is not present.

We bisected this down to

commit bbb63c514a3464342967237a51a21ea8f61ab951
Author: Wanlong Gao 
Subject: drivers:tty:fix up ENOIOCTLCMD error handling

The original program triggering the error was pptpd, but
the test program below is sufficient:

#include 
#include 
int main(void) {
 int pty_fd, tty_fd;
 if (openpty(&pty_fd, &tty_fd, NULL, NULL, NULL) != 0) {
  perror("openpty");
  return 1;
 }
return 0;
}

[ compile with cc -lutil pty.c -o pty ]

If devpts is available or above commit reverted openpty works again.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1 linux-next] netfilter: conntrack: fix kmemleak false positive

2016-09-21 Thread Florian Westphal
Fabian Frederick  wrote:
> Since commit f330a7fdbe16
> ("netfilter: conntrack: get rid of conntrack timer")
> 
> closed connections remain longer in /proc/net/nf_conntrack
> 
> Running current kernel; just after boot:
> cat /proc/net/nf_conntrack | wc -l = 5
> 4 minutes required to clean up the table.

We should reap the stale entries while iterating, just like
we do for ctnetlink interface.

Can you try this patch?

diff --git a/net/netfilter/nf_conntrack_standalone.c 
b/net/netfilter/nf_conntrack_standalone.c
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -212,6 +212,11 @@ static int ct_seq_show(struct seq_file *s, void *v)
if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use)))
return 0;
 
+   if (nf_ct_should_gc(ct)) {
+   nf_ct_kill(ct);
+   goto release;
+   }
+
/* we only want to print DIR_ORIGINAL */
if (NF_CT_DIRECTION(hash))
goto release;

> Going back to kernel version before commit above there are
> no connections after some seconds.
> 
> Referring to the commit changelog this was an expected behaviour but
> it results in temporary kmemleak reports:

I don't see kmemleak complaints on my test vm, I'm reluctant to
turn it off.

Can you explain why we see such 'false positive'?

The conntracks should still be referenced, as they
are in main table.

> unreferenced object 0x88003b0e6600 (size 248):
>   comm "rsyslogd", pid 1595, jiffies 4294741312 (age 7.343s)
>   ...
>   backtrace:
> [] kmemleak_alloc+0x23/0x40
> [] kmem_cache_alloc+0xd9/0x180
> [] __nf_conntrack_alloc.isra.50+0x48/0x170
> [] nf_conntrack_in+0x3a2/0x5f0
> [] ipv4_conntrack_local+0x40/0x50
> [] nf_iterate+0x5d/0x70
> [] nf_hook_slow+0x5f/0xb0
> [] __ip_local_out+0xad/0xe0
> [] ip_local_out+0x17/0x40
> [] ip_send_skb+0x14/0x40
> [] udp_send_skb+0x91/0x260
> [] udp_sendmsg+0x2f5/0x950
> [] inet_sendmsg+0x60/0x90
> [] sock_sendmsg+0x33/0x40
> [] SYSC_sendto+0xee/0x160
> [] SyS_sendto+0x9/0x10
> 
> (248 bytes being an nf_conn structure)
> 
> Those structures being cleared in gc_worker() later on we can't talk
> about unreferenced object so this patch uses kmemleak_not_leak() to
> prevent those warnings.

If thats the case, why is kmemleak complaining? Are you sure this
is a false positive?


Re: [PATCH 1/1 linux-next] netfilter: conntrack: fix kmemleak false positive

2016-09-22 Thread Florian Westphal
Fabian Frederick  wrote:
> Hello Florian,
> 
>         First problem is solved: table gets cleared 3 minutes earlier
> but I still have kmemleak before running the following:
> 
> echo scan > /sys/kernel/debug/kmemleak
> cat /sys/kernel/debug/kmemleak
> Nothing
> echo scan > /sys/kernel/debug/kmemleak
> cat /sys/kernel/debug/kmemleak
> -> rsyslogd
> 
> I talked about false positive because everything is cleared later.

Hmm, I fear this is a real bug and not false positive.

Should be possible to confirm this via slabinfo:

grep nf_conntrack /proc/slabinfo

The active objects should match the conntrack count.
(conntrack -C, or wc -l < /proc/).

> > > unreferenced object 0x88003b0e6600 (size 248):
> > >   comm "rsyslogd", pid 1595, jiffies 4294741312 (age 7.343s)
> > >   ...
> > >   backtrace:
> > >     [] kmemleak_alloc+0x23/0x40
> > >     [] kmem_cache_alloc+0xd9/0x180
> > >     [] __nf_conntrack_alloc.isra.50+0x48/0x170
> > >     [] nf_conntrack_in+0x3a2/0x5f0
> > >     [] ipv4_conntrack_local+0x40/0x50
> > >     [] nf_iterate+0x5d/0x70
> > >     [] nf_hook_slow+0x5f/0xb0
> > >     [] __ip_local_out+0xad/0xe0
> > >     [] ip_local_out+0x17/0x40
> > >     [] ip_send_skb+0x14/0x40
> > >     [] udp_send_skb+0x91/0x260
> > >     [] udp_sendmsg+0x2f5/0x950
> > >     [] inet_sendmsg+0x60/0x90
> > >     [] sock_sendmsg+0x33/0x40
> > >     [] SYSC_sendto+0xee/0x160
> > >     [] SyS_sendto+0x9/0x10

Hmm, so we leak when allocating conntrack for outgoing packet.
Do you do any filtering (DROP) in output/postrouting?

> > > (248 bytes being an nf_conn structure)
> > >
> > > Those structures being cleared in gc_worker() later on we can't talk
> > > about unreferenced object so this patch uses kmemleak_not_leak() to
> > > prevent those warnings.
> >
> > If thats the case, why is kmemleak complaining? Are you sure this
> > is a false positive?

Looks like a real bug to me, but I don't see anything obvious so far.
I'll look at this again tomorrow.


Re: [RFC PATCH 1/3] net: Remove unnecessary memset in __snmp6_fill_stats64

2016-08-08 Thread Florian Westphal
Jia He  wrote:
> buff[] will be assigned later, so memset is not necessary.
> 
> Signed-off-by: Jia He 
> Cc: "David S. Miller" 
> Cc: Alexey Kuznetsov 
> Cc: James Morris 
> Cc: Hideaki YOSHIFUJI 
> Cc: Patrick McHardy 
> ---
>  net/ipv6/addrconf.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index ab3e796..43fa8d0 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4967,7 +4967,6 @@ static inline void __snmp6_fill_stats64(u64 *stats, 
> void __percpu *mib,
>  
>   BUG_ON(pad < 0);
>  
> - memset(buff, 0, sizeof(buff));
>   buff[0] = IPSTATS_MIB_MAX;
>  
>   for_each_possible_cpu(c) {
for (i = 1; i < IPSTATS_MIB_MAX; i++)
buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);

Without memset result of buff[i] += ... is undefined.


Re: KASAN: stack-out-of-bounds Read in xfrm_state_find (2)

2017-11-01 Thread Florian Westphal
syzbot  
wrote:

[ cc Thomas Egerer ]

> syzkaller hit the following crash on
> 36ef71cae353f88fd6e095e2aaa3e5953af1685d
> git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x303d/0x3170
> net/xfrm/xfrm_state.c:1051
> Read of size 4 at addr 88003adb7760 by task syzkaller429801/2969

Seems this was added in
commit 8444cf712c5f71845cba9dc30d8f530ff0d5ff83
("xfrm: Allow different selector family in temporary state").

No idea how to fix this:

struct xfrm_state *
xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr,
const struct flowi *fl, struct xfrm_tmpl *tmpl,
struct xfrm_policy *pol, int *err,
unsigned short family)  // AF_INET
{
[..]
unsigned short encap_family = tmpl->encap_family; // AF_INET6
[..]
h = xfrm_dst_hash(net, daddr, saddr, tmpl->reqid, encap_family);

saddr, daddr point to ipv4 addresses inside an on-stack flowi4 struct,
i.e. they get hashed as ipv6 addresses which then results in invalid stack 
access.

What is this supposed to do if family != encap_family?

I also don't understand how address comparision is supposed to work in this 
case,
it seems that if saddr/daddr are v4 and template v6 we compare full ipv6 
addresses
(how would that succeed...?) and, if saddr/daddr is v6 add template is v4 we 
just
compare the first 32bit of the ipv6 addresses...?

This fix silences the reproducer, but I am not sure about it, it looks like it
papers over the real problem...

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1359,16 +1359,19 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const 
struct flowi *fl,
  struct xfrm_state **xfrm, unsigned short family)
 {
struct net *net = xp_net(policy);
+   xfrm_address_t tmp, daddr, saddr;
int nx;
int i, error;
-   xfrm_address_t *daddr = xfrm_flowi_daddr(fl, family);
-   xfrm_address_t *saddr = xfrm_flowi_saddr(fl, family);
-   xfrm_address_t tmp;
+
+   memset(&saddr, 0, sizeof(saddr));
+   memset(&daddr, 0, sizeof(daddr));
+
+   xfrm_flowi_addr_get(fl, &saddr, &daddr, family);
 
for (nx = 0, i = 0; i < policy->xfrm_nr; i++) {
struct xfrm_state *x;
-   xfrm_address_t *remote = daddr;
-   xfrm_address_t *local  = saddr;
+   xfrm_address_t *remote = &daddr;
+   xfrm_address_t *local  = &saddr;
struct xfrm_tmpl *tmpl = &policy->xfrm_vec[i];
 
if (tmpl->mode == XFRM_MODE_TUNNEL ||
@@ -1389,8 +1392,8 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const 
struct flowi *fl,
 
if (x && x->km.state == XFRM_STATE_VALID) {
xfrm[nx++] = x;
-   daddr = remote;
-   saddr = local;
+   daddr = *remote;
+   saddr = *local;
continue;
}
if (x) {


Re: KASAN: stack-out-of-bounds Read in xfrm_state_find (2)

2017-11-02 Thread Florian Westphal
Steffen Klassert  wrote:
> On Wed, Nov 01, 2017 at 11:06:08PM +0100, Florian Westphal wrote:
> > I also don't understand how address comparision is supposed to work in this 
> > case,
> > it seems that if saddr/daddr are v4 and template v6 we compare full ipv6 
> > addresses
> > (how would that succeed...?) and, if saddr/daddr is v6 add template is v4 
> > we just
> > compare the first 32bit of the ipv6 addresses...?
> 
> When we do tunnel or beet mode, we pass saddr and daddr from the
> template to xfrm_state_find(), this should be ok. On transport
> mode, we pass the addresses from the flowi, assuming that the
> IP addresses (and address family) don't change during transformation.
> This assumption is wrong in the IPv4 mapped IPv6 case, packet
> is IPv4 and template is IPv6.

Right, sendto() uses ipv4 address on ipv6 socket.

> I'd propose to use the addresses from the template unconditionally,
> like the (untested) patch below does.
> 
> Unfortunalely the reproducer does not work with my config,
> sendto returns EAGAIN. Could anybody try this patch?

The reproducer no longer causes KASAN spew with your patch,
but i don't have a test case that actually creates/uses a tunnel.


Re: suspicious RCU usage at ./include/linux/inetdevice.h:LINE

2017-11-02 Thread Florian Westphal
Cong Wang  wrote:
> > CPU: 0 PID: 23859 Comm: syz-executor2 Not tainted 4.14.0-rc5+ #140
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:16 [inline]
> >  dump_stack+0x194/0x257 lib/dump_stack.c:52
> >  lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4665
> >  __in_dev_get_rtnl include/linux/inetdevice.h:230 [inline]
> >  fib_dump_info+0x1136/0x13d0 net/ipv4/fib_semantics.c:1377
> >  inet_rtm_getroute+0xf97/0x2d70 net/ipv4/route.c:2785
> 
> This is introduced by:
> 
> commit 394f51abb3d04f33fb798f04b16ae6b0491ea4ec
> Author: Florian Westphal 
> Date:   Tue Aug 15 16:34:44 2017 +0200
> 
>     ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl
> 
> Signed-off-by: Florian Westphal 
> Signed-off-by: David S. Miller 
> 
> Looks like we need a wrapper for rcu_dereference_protected(dev->ip_ptr).

Yes, thats the alternative to
https://patchwork.ozlabs.org/patch/833401/

which switches to _rcu version.


Re: [PATCH] Net: netfilter: Moved vmalloc call to kmalloc call

2017-11-02 Thread Florian Westphal
Charlie Sale  wrote:
> Fixed FIXME comment in code my changing a vmalloc call
> to a kmalloc call. Thought it would be a good place to
> start for a first patch.

Please at least compile test your patches.

> - /* FIXME: don't use vmalloc() here or anywhere else -HW */
> - hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) +
> - sizeof(struct hlist_head) * size);
> +
> + hinfo = kmalloc(sizeof(*hinfo) +
> + sizeof(struct hlist_head) * size, GPT_KERNEL);

If anything this should be switched to kvmalloc, not kmalloc.

Also, hinfo cannot be free'd via vfree after this change, so you need to
adjust all free operations too.


Re: [PATCH] Net: netfilter: vmalloc/vfree to kvmalloc/kvfree

2017-11-03 Thread Florian Westphal
Charlie Sale  wrote:
> + hinfo = kvmalloc(sizeof(*hinfo) + sizeof(struct hlist_head) * size,
> +  GPT_KERNEL);

Looks like you did not even compile test this.  Again. :-(


[PATCH] fault-inject: fix inverted interval/probability values in printk

2015-10-17 Thread Florian Westphal
interval displays the probability and vice versa.

Fixes: 6adc4a22f20bb ("fault-inject: add ratelimit option")
Signed-off-by: Florian Westphal 
---
 lib/fault-inject.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/fault-inject.c b/lib/fault-inject.c
index f1cdeb0..6a823a5 100644
--- a/lib/fault-inject.c
+++ b/lib/fault-inject.c
@@ -44,7 +44,7 @@ static void fail_dump(struct fault_attr *attr)
printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n"
   "name %pd, interval %lu, probability %lu, "
   "space %d, times %d\n", attr->dname,
-  attr->probability, attr->interval,
+  attr->interval, attr->probability,
   atomic_read(&attr->space),
   atomic_read(&attr->times));
if (attr->verbose > 1)
-- 
2.0.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netlink: fix memory leak of dump

2018-07-23 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
> > --- a/net/netfilter/nf_tables_api.c
> > +++ b/net/netfilter/nf_tables_api.c
> > @@ -5010,6 +5013,22 @@ nft_obj_filter_alloc(const struct nlattr * const 
> > nla[])
> > return filter;
> >  }
> >  
> > +static int nf_tables_dump_obj_start(struct netlink_callback *cb)
> > +{
> > +   const struct nlattr * const *nla = cb->data;

On-Stack input.
I can't see how its wrong, ->start() happens from same context as
netlink_dump_start so its valid.

> > +   struct nft_obj_filter *filter = NULL;
> > +
> > +   if (nla[NFTA_OBJ_TABLE] ||
> > +   nla[NFTA_OBJ_TYPE]) {
> > +   filter = nft_obj_filter_alloc(nla);
> > +   if (IS_ERR(filter))
> > +   return -ENOMEM;
> > +   }
> > +
> > +   cb->data = filter;

And this replaced the on-stack input with dynamically
allocated one, which will be free'd via ->done().

> >  /* called with rcu_read_lock held */
> >  static int nf_tables_getobj(struct net *net, struct sock *nlsk,
> > struct sk_buff *skb, const struct nlmsghdr *nlh,
> > @@ -5028,21 +5047,13 @@ static int nf_tables_getobj(struct net *net, struct 
> > sock *nlsk,
> >  
> > if (nlh->nlmsg_flags & NLM_F_DUMP) {
> > struct netlink_dump_control c = {
> > +   .start = nf_tables_dump_obj_start,
> > .dump = nf_tables_dump_obj,
> > .done = nf_tables_dump_obj_done,
> > .module = THIS_MODULE,
> > +   .data = (void *)nla,
> 
> You cannot do this.
> 
> nla is allocated in this stack.

Yes.

> the nla will not be available in the second recv(), it won't be there.

Its replaced in ->start().

As David pointed out, once ->start() returns 0 we set cb_running, i.e.
only after successful ->start() netlink core will call ->dump() again.

So I see no problem setting ->data to onstack cookie and then
duplicating it to heap via kmemdup in ->start().

As far as I can see netlink core offers all functionality already,
so we only need to switch netfilter to make use of it.

If you disagree please let me know, otherwise I will cook up
a patch along this pattern for net/netfilter/*.

Thanks.


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Florian Westphal
Michal Hocko  wrote:
> On Mon 29-01-18 23:35:22, Florian Westphal wrote:
> > Kirill A. Shutemov  wrote:
> [...]
> > > I hate what I'm saying, but I guess we need some tunable here.
> > > Not sure what exactly.
> > 
> > Would memcg help?
> 
> That really depends. I would have to check whether vmalloc path obeys
> __GFP_ACCOUNT (I suspect it does except for page tables allocations but
> that shouldn't be a big deal). But then the other potential problem is
> the life time of the xt_table_info (or other potentially large) data
> structures. Are they bound to any process life time.

No.

> Because if they are
> not then the OOM killer will not help. The OOM panic earlier in this
> thread suggests it doesn't because the test case managed to eat all the
> available memory and killed all the eligible tasks which didn't help.

Yes, which is why we do not want any OOM killer invocation in first
place...

> So in some sense the memcg would help to stop the excessive allocation,
> but it wouldn't resolve it other than kill all tasks in the affected
> memcg/container. Whether this is sufficient or not, I dunno. It sounds
> quite suboptimal to me. But it is true this would be less tricky then
> adding a global knob...

Global knob doesn't really help at all, I can add multiple large
iptables rulesets (so we would have to account), and we have same issue
in virtually all of networking, so we need limits for interface count,
tunnel count, ipsec policies/SAs, nftables, tc, etc etc.


Re: possible deadlock in xt_find_table_lock

2018-01-30 Thread Florian Westphal
#syz dup: possible deadlock in do_ip_getsockopt


Re: possible deadlock in xt_find_revision

2018-01-30 Thread Florian Westphal
#syz dup: possible deadlock in do_ip_getsockopt


Re: possible deadlock in do_ip_setsockopt

2018-01-30 Thread Florian Westphal
#syz dup: possible deadlock in do_ip_getsockopt


Re: possible deadlock in do_ipv6_setsockopt

2018-01-30 Thread Florian Westphal
#syz dup: possible deadlock in do_ip_getsockopt


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-30 Thread Florian Westphal
> From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 30 Jan 2018 14:51:07 +0100
> Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive

Acked-by: Florian Westphal 


Re: possible deadlock in do_ip_getsockopt

2018-02-01 Thread Florian Westphal
syzbot  wrote:
> Hello,
> 
> syzbot hit the following crash on upstream commit
> c4e0ca7fa24137e372d6135fe16e8df8e123f116 (Fri Jan 26 23:10:50 2018 +)
> Merge tag 'riscv-for-linus-4.15-maintainers' of
> git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
> 
> So far this crash happened 3 times on net-next, upstream.
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+c6ac05d30245e21f7...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.15.0-rc9+ #283 Not tainted
> --
> syz-executor7/5121 is trying to acquire lock:
>  (sk_lock-AF_INET){+.+.}, at: [] lock_sock
> include/net/sock.h:1461 [inline]
>  (sk_lock-AF_INET){+.+.}, at: []
> do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331
> 
> but task is already holding lock:
>  (rtnl_mutex){+.+.}, at: [<8db87953>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:72
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (rtnl_mutex){+.+.}:
>__mutex_lock_common kernel/locking/mutex.c:756 [inline]
>__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
>mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
>rtnl_lock+0x17/0x20 net/core/rtnetlink.c:72
>register_netdevice_notifier+0xad/0x860 net/core/dev.c:1590
>clusterip_config_init net/ipv4/netfilter/ipt_CLUSTERIP.c:261 [inline]
>clusterip_tg_check+0xeb9/0x1570
> net/ipv4/netfilter/ipt_CLUSTERIP.c:478
>xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845
>check_target net/ipv4/netfilter/ip_tables.c:518 [inline]
>find_check_entry.isra.8+0x8c8/0xcb0
> net/ipv4/netfilter/ip_tables.c:559
>translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:730
>do_replace net/ipv4/netfilter/ip_tables.c:1148 [inline]
>do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1682
>nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
>nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
>ip_setsockopt+0xa1/0xb0 net/ipv4/ip_sockglue.c:1256
>raw_setsockopt+0xb7/0xd0 net/ipv4/raw.c:857
>sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968
>SYSC_setsockopt net/socket.c:1831 [inline]
>SyS_setsockopt+0x189/0x360 net/socket.c:1810
>entry_SYSCALL_64_fastpath+0x29/0xa0
> 
> -> #0 (sk_lock-AF_INET){+.+.}:
>lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
>lock_sock_nested+0xc2/0x110 net/core/sock.c:2770
>lock_sock include/net/sock.h:1461 [inline]
>do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331
>ip_getsockopt+0x90/0x220 net/ipv4/ip_sockglue.c:1562
>tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:3329
>sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2927
>SYSC_getsockopt net/socket.c:1862 [inline]
>SyS_getsockopt+0x178/0x340 net/socket.c:1844
>entry_SYSCALL_64_fastpath+0x29/0xa0
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>CPU0CPU1
>
>   lock(rtnl_mutex);
>lock(sk_lock-AF_INET);
>lock(rtnl_mutex);
>   lock(sk_lock-AF_INET);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by syz-executor7/5121:
>  #0:  (rtnl_mutex){+.+.}, at: [<8db87953>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:72
> 
> stack backtrace:
> CPU: 1 PID: 5121 Comm: syz-executor7 Not tainted 4.15.0-rc9+ #283
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_circular_bug.isra.37+0x2cd/0x2dc kernel/locking/lockdep.c:1218
>  check_prev_add kernel/locking/lockdep.c:1858 [inline]
>  check_prevs_add kernel/locking/lockdep.c:1971 [inline]
>  validate_chain kernel/locking/lockdep.c:2412 [inline]
>  __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3426
>  lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914
>  lock_sock_nested+0xc2/0x110 net/core/sock.c:2770
>  lock_sock include/net/sock.h:1461 [inline]
>  do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331
>  ip_getsockopt+0x90/0x220 net/ipv4/ip_sockglue.c:1562
>  tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:3329
>  sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2927
>  SYSC_getsockopt net/socket.c:1862 [inline]
>  SyS_

Re: general protection fault in nf_tables_dump_obj_done

2018-01-04 Thread Florian Westphal
#syz fix: netfilter: nf_tables: fix potential NULL-ptr deref in 
nf_tables_dump_obj_done()


Re: possible deadlock in do_ip_getsockopt

2018-01-28 Thread Florian Westphal
syzbot  wrote:
> syzbot hit the following crash on upstream commit
> c4e0ca7fa24137e372d6135fe16e8df8e123f116 (Fri Jan 26 23:10:50 2018 +)
> Merge tag 'riscv-for-linus-4.15-maintainers' of
> git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
> 
> So far this crash happened 3 times on net-next, upstream.
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+c6ac05d30245e21f7...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.15.0-rc9+ #283 Not tainted
> --
> syz-executor7/5121 is trying to acquire lock:
>  (sk_lock-AF_INET){+.+.}, at: [] lock_sock
> include/net/sock.h:1461 [inline]
>  (sk_lock-AF_INET){+.+.}, at: []
> do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331

all of these deadlocks boil down to fact that we acquire sk lock
before calling nf set/getsockopt handlers, yet its not clear to me why
we do this in the first place.

Paolo Abeni is looking at pushing the sk locking down in those nf
handlers that need it which would avoid these deadlocks.


Re: general protection fault in ip6t_do_table

2018-01-28 Thread Florian Westphal
syzbot  wrote:
> CPU: 0 PID: 3675 Comm: syzkaller168273 Not tainted 4.15.0-rc9+ #283
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:ip6t_do_table+0x12de/0x19d0 net/ipv6/netfilter/ip6_tables.c:360
> RSP: 0018:8801db206c58 EFLAGS: 00010246
> RAX:  RBX: 8801d9aafb80 RCX: 84d30352
> RDX: 0100 RSI:  RDI: 8801d9aafcde
> RBP: 8801db206e60 R08: 11003b640d54 R09: 
> R10: 00d0 R11:  R12: 0001
> R13:  R14: dc00 R15: 8801d9aafc50
> FS:  00a5d880() GS:8801db20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 209f1000 CR3: 0001bc9f8001 CR4: 001606f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  ip6table_security_hook+0x65/0x80 net/ipv6/netfilter/ip6table_security.c:45

In case noone else can look at this I'll investigate this next friday
(currently travelling).



Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-28 Thread Florian Westphal
Tetsuo Handa  wrote:
> syzbot wrote:
> > syzbot hit the following crash on net-next commit
> > 6bb46bc57c8e9ce947cc605e555b7204b44d2b10 (Fri Jan 26 16:00:23 2018 +)
> > Merge branch 'cxgb4-fix-dump-collection-when-firmware-crashed'
> > 
> > C reproducer is attached.
> > syzkaller reproducer is attached.
> > Raw console output is attached.
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached.
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+8630e35fc7287b392...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for  
> > details.
> > If you forward the report, please keep this part and the footer.
> > 
> > [ 3685] 0  3685178211   1843200 0 sshd
> > [ 3692] 0  3692 43760327680 0  
> > syzkaller025682
> > [ 3695] 0  3695 43760368640 0  
> > syzkaller025682
> > Kernel panic - not syncing: Out of memory and no killable processes...
> > 
> 
> This sounds like too huge vmalloc() request where size is controlled by 
> userspace.

Right.

Before eacd86ca3b036e55e172b7279f101cef4a6ff3a4
this used 

  info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,

would it help to re-add that?

> vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> off when the current task is killed") but then became unkillable by commit
> b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> killed""). Therefore, we can't handle this problem from MM side.
> Please consider adding some limit from networking side.

I don't know what "some limit" would be.  I would prefer if there was
a way to supress OOM Killer in first place so we can just -ENOMEM user.

AFAIU NOWARN|NORETRY does that, so I would propose to just read-add it?


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Florian Westphal
Kirill A. Shutemov  wrote:
> On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
> > > off when the current task is killed") but then became unkillable by commit
> > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > killed""). Therefore, we can't handle this problem from MM side.
> > > Please consider adding some limit from networking side.
> > 
> > I don't know what "some limit" would be.  I would prefer if there was
> > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> 
> Just supressing OOM kill is a bad idea. We still leave a way to allocate
> arbitrary large buffer in kernel.

Isn't that what we do everywhere in network stack?

I think we should try to allocate whatever amount of memory is needed
for the given xtables ruleset, given that is what admin requested us to do.

I also would not know what limit is sane -- I've seen setups with as much
as 100k iptables rules, and that was 5 years ago.

And even if we add a "Xk rules" limit, it might be too much for
low-memory systems, or not enough for whatever other use case there
might be.



Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-01-29 Thread Florian Westphal
Kirill A. Shutemov  wrote:
> On Mon, Jan 29, 2018 at 05:57:22PM +0100, Florian Westphal wrote:
> > Kirill A. Shutemov  wrote:
> > > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote:
> > > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: 
> > > > > back
> > > > > off when the current task is killed") but then became unkillable by 
> > > > > commit
> > > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
> > > > > killed""). Therefore, we can't handle this problem from MM side.
> > > > > Please consider adding some limit from networking side.
> > > > 
> > > > I don't know what "some limit" would be.  I would prefer if there was
> > > > a way to supress OOM Killer in first place so we can just -ENOMEM user.
> > > 
> > > Just supressing OOM kill is a bad idea. We still leave a way to allocate
> > > arbitrary large buffer in kernel.
> > 
> > Isn't that what we do everywhere in network stack?
> > 
> > I think we should try to allocate whatever amount of memory is needed
> > for the given xtables ruleset, given that is what admin requested us to do.
> 
> Is it correct that "admin" in this case is root in random container?

Yes.

> I mean, can we get access to it with CLONE_NEWUSER|CLONE_NEWNET?

Yes.

> This can be fun.

Do we prevent "admin in random container" to insert 2m ipv6 routes
(alternatively: ipsec tunnels, interfaces etc etc)?

> > I also would not know what limit is sane -- I've seen setups with as much
> > as 100k iptables rules, and that was 5 years ago.
> > 
> > And even if we add a "Xk rules" limit, it might be too much for
> > low-memory systems, or not enough for whatever other use case there
> > might be.
> 
> I hate what I'm saying, but I guess we need some tunable here.
> Not sure what exactly.

Would memcg help?

(I don't buy the "run untrusted binaries on linux is safe" thing, so
 I would not know).


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Florian Westphal
Pavel Machek  wrote:
> > > ...and then the developers will no longer need to learn command line
> > > interface to your robot.
> > >
> > > #syz test: git://gcc.gnu.org/git/gcc.git master
> > > #syz dup: `date`
> > 
> > 
> > Pavel, please stop harming the useful process!
> > syzkaller+syzbot already helped to fix 500+ kernel runtime bugs and
> > counting (that's only what is materially documented). Please stop.
> 
> Well, you are also hurting kernel development by spamming the
> lists. You stop.

Bullshit.  Learn to filter your email if you're not interested in fixing
bugs.  Or unsubscribe.


Re: WARNING in __proc_create

2018-03-09 Thread Florian Westphal
Eric Dumazet  wrote:
> >>fs/proc/generic.c:354
> >
> >We need to reject empty names.
> >
> 
> I sent a patch a while back, but Pablo/Florian wanted more than that simple
> fix.
> 
> We also need to filter special characters like '/'
> 
> Or maybe I am mixing with something else.

Argh, sorry, this fell off the truck it seems :(

I'll work on this tomorrow unless someone else does it today ;)


Re: WARNING in __proc_create

2018-03-09 Thread Florian Westphal
Cong Wang  wrote:
> On Fri, Mar 9, 2018 at 2:58 PM, Eric Dumazet  wrote:
> >
> >
> > On 03/09/2018 02:56 PM, Eric Dumazet wrote:
> >
> >>
> >> I sent a patch a while back, but Pablo/Florian wanted more than that
> >> simple fix.
> >>
> >> We also need to filter special characters like '/'
> 
> proc_create_data() itself accepts '/', so it must be xt_hashlimit doesn't
> want it.

--hashimit-name / also triggers WARN for me.
. or .. "work", (no crash), but cause appearance of 2nd ./.. in
/proc/net/ipt_hashlimit , so I think its better to disallow that too.


Re: [PATCH] proc: reject "." and ".." as filenames

2018-03-09 Thread Florian Westphal
Alexey Dobriyan  wrote:
> Various subsystems can create files and directories in /proc
> with names directly controlled by userspace.
> 
> Which means "/", "." and ".." are no-no.
> 
> "/" split is already taken care of, do the other 2 prohibited names.

Acked-by: Florian Westphal 

I'll send a patch for xtables too to reject bogus names
coming from userspace (syzbot reports WARN() ).


Re: [PATCH v2] net: netfilter: Replace printk() with appropriate pr_*() macro

2018-03-11 Thread Florian Westphal
Arushi Singhal  wrote:
> On Mon, Mar 12, 2018 at 2:17 AM, Pablo Neira Ayuso 
> wrote:
> 
> > Hi Joe,
> >
> > On Sun, Mar 11, 2018 at 12:52:41PM -0700, Joe Perches wrote:
> > > On Mon, 2018-03-12 at 01:11 +0530, Arushi Singhal wrote:
> > > > Using pr_() is more concise than
> > > > printk(KERN_).
> > > > Replace printks having a log level with the appropriate
> > > > pr_*() macros.
> > > >
> > > > Signed-off-by: Arushi Singhal 
> > > > ---
> > > > changes in v2
> > > > *in v1 printk() were replaced with netdev_*()
> > >
> > > >  net/netfilter/nf_conntrack_acct.c  | 2 +-
> > > >  net/netfilter/nf_conntrack_ecache.c| 2 +-
> > > >  net/netfilter/nf_conntrack_timestamp.c | 2 +-
> > > >  net/netfilter/nf_nat_core.c| 2 +-
> > > >  net/netfilter/nfnetlink_queue.c| 4 ++--
> > > >  5 files changed, 6 insertions(+), 6 deletions(-)
> > >
> > > None of these files have a #define for pr_fmt so this
> > > should be OK.
> >
> > I think Arushi could add pr_fmt in the same go, so we skip another
> > follow up patch for this. @Arushi: I suggested this in my previous
> > email, please have a look.
> >
> > Hello Pablo
> 
> Should I send two patches, one with the conversion of printk() to pr_() and
> another for defining pr_fmt().
> 
> Or
> 
> only one patch with all the changes?

Both in one, it reduces code churn.

With pr_* + pr_fmt, the module name will be prefixed automatically, see
e.g. commit e016c5e43db51875c2b541b59bd217494d213174 as an example.

> > This would also probably allows us to save the line break in the error
> > message, which IIRC is not a good practise either, eg.
> >
> > pr_warn("nf_queue: OOM "
> > "in mangle, dropping packet\n");
> >
> > > Perhaps coalesce the formats and remove the unnecessary periods too.

The above message should be removed, its useless (on oom allocator
already warns).


Re: [PATCH AUTOSEL for 4.15 070/124] netfilter: core: only allow one nat hook per hook point

2018-03-19 Thread Florian Westphal
Sasha Levin  wrote:
> From: Florian Westphal 
> 
> [ Upstream commit f92b40a8b2645af38bd6814651c59c1e690db53d ]

This patch is broken and a fix is not in any tree yet.


Re: linux-next: ip6tables *broken* - last base chain position %u doesn't match underflow %u (hook %u

2018-03-20 Thread Florian Westphal
valdis.kletni...@vt.edu  wrote:
> (Resending because I haven't heard anything)
[ ip6tables broken ]

Sorry, did not see this email before.

I'll investigate asap, thanks for the detailed report.


Re: [RFC PATCH v2] bridge: make it possible for packets to traverse the bridge without hitting netfilter

2018-03-09 Thread Florian Westphal
David Woodhouse  wrote:
> 
> 
> On Fri, 2015-03-06 at 17:37 +0100, Florian Westphal wrote:
> > 
> > > > I did performance measurements in the following way:
> > > > 
> > > > Removed those pieces of the packet pipeline that I don't necessarily
> > > > need one-by-one.  Then measured their effect on small packet
> > > > performance.
> > > > 
> > > > This was the only part that produced considerable effect.
> > > > 
> > > > The pure speculation was about why the effect is more than 15%
> > > > increase in packet throughput, although the code path avoided
> > > > contains way less code than 15% of the packet pipeline.  It seems,
> > > > Felix Fietkau profiled similar changes, and found my guess well
> > > > founded.
> > > > 
> > > > Now could anybody explain me what else is wrong with my patch?
> > > 
> > > We have to come up with a more generic solution for this.
> > 
> > Jiri Benc suggested to allowing to attach netfilter hooks e.g. via tc
> > action, maybe that would be an option worth investigating.
> > 
> > Then you could for instance add filtering rules only to the bridge port
> > that needs it.
> > 
> > > These sysfs tweaks you're proposing look to me like an obscure way to
> > > tune this.
> > 
> > I agree, adding more tunables isn't all that helpful, in the past this
> > only helped to prolong the problem.
> 
> How feasible would it be to make it completely dynamic?
> 
> A given hook could automatically disable itself (for a given device) if
> the result of running it the first time was *tautologically* false for
> that device (i.e. regardless of the packet itself, or anything else).
> 
> The hook would need to be automatically re-enabled if the rule chain
> ever changes (and might subsequently disable itself again).
> 
> Is that something that's worth exploring for the general case?

AF_BRIDGE hooks sit in the net namespace, so its enough for
one bridge to request filtering to bring in the hook overhead for all
bridges in the same netns.

Alternatives:
- place the bridges that need filtering in different netns
- use tc ingress for filtering
- use nftables ingress hook for filtering (it sits in almost same
  location as tc ingress hook) to attach the ruleset to those bridge
  ports that need packet filtering.

(The original request came from user with tons of bridges where only
 one single bridge needed filtering).

One alternative I see is to place the bridge hooks into the
bridge device (net_bridge struct, which is in netdev private area).

But, as you already mentioned we would need to annotate the hooks
to figure out which device(s) they are for.

This sounds rather fragile to me, so i would just use nft ingress:

#!/sbin/nft -f

table netdev ingress {
  chain in_public { type filter hook ingress device eth0 priority 0;
   ip saddr 192.168.0.1 counter
 }
}


Re: INFO: rcu detected stall in pfkey_sendmsg

2018-12-19 Thread Florian Westphal
Dmitry Vyukov  wrote:
> On Wed, Dec 19, 2018 at 7:37 PM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:a26d94bff4d5 net: bridge: remove unneeded variable 'err'
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14c7a4cd40
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=d9655b05acfc97ff
> > dashboard link: https://syzkaller.appspot.com/bug?extid=e1d3a7522b4d05aeede4
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> 
> +Florian, this looks related to:
> INFO: rcu detected stall in xfrm_hash_rebuild
> https://syzkaller.appspot.com/bug?id=62ee9df6b17e143dcd22a6bc5383c1b4ba797c8c
> https://groups.google.com/forum/#!msg/syzkaller-bugs/4yD3ts-wWRA/63scKqSyDAAJ
> 
> Should we dup them?

No, not yet anyway.  First report triggers rcu stall during hash
rebuild, this looks like stall is directly on insertion.

(Could obviously still be same bug).


Re: INFO: rcu detected stall in xfrm_hash_rebuild

2018-12-14 Thread Florian Westphal
Wolfgang Walter  wrote:

[ CCing Christophe ]

> Am Montag, 10. Dezember 2018, 09:58:56 schrieb David Miller:
> > From: Florian Westphal 
> > Date: Mon, 10 Dec 2018 13:47:24 +0100
> > 
> > > After recent tree conversion, we could probably make the exact policies
> > > part of the 'inexact tree' (which would be renamed to 'policy tree' or
> > > some such).
> > > 
> > > Special-casing the exact policies made a lot of sense when we had
> > > a single list for the inexact policies (to keep its length down).
> > > 
> > > But now I think we could try to unify all of this and only maintain
> > > the existing tree-based storage.
> > > 
> > > Would also remove the need to do lookups in two different
> > > data structures (bydst-hash-then-inexact-tree).
> > > 
> > > What do you think?
> > 
> > I think this makes a lot of sense.
> 
> Sites mainly using tunnel mode this certainly makes sense.

Ok.  An alternative would be to remove the support for
policy hash table thresholds (which decide what kinds of policies
go to exact table and which ones go into inexact ones), i.e.
partially revert 880a6fab8f6ba5b5abe59ea6
("xfrm: configure policy hash table thresholds by netlink").

This would remove the need for the rehashing support that
re-sorts the policies into either exact/inexact lists) when the
those tunables are changed.

We could also easily convert the exact table to an rhashtable
then if we wanted to.

I guess we should probably wait to get some operational feedback on the
inexact storage first to see if it really improves things enough to
make threshold tuning unneccessary.

Christophe, whats your take?


Re: INFO: rcu detected stall in xfrm_hash_rebuild

2018-12-14 Thread Florian Westphal
Christophe Gouault  wrote:
> The main use cases I have encountered and tried to address with the
> hash-based lookup were network operator use cases:
> - a lot of dynamic /32 <=> /32 policies (protecting GTP tunnels)
> - or a lot of dynamic policies with the same prefix lengths (e.g. /16 <=> /24)
> and a few non-hashed policies stored in the linked list.

Thanks for the detailed information.

[..]

> Could you verify how the configuration time would behave in such use
> cases if the hash-based lookup was replaced by a tree-based lookup?

I won't send a patch to remove your work, at least not at this time.

In case I'd do this removal (thresholds, hash table, or both)
i will make these tests to see how large the impact is.


Re: INFO: rcu detected stall in xfrm_hash_rebuild

2018-12-10 Thread Florian Westphal
syzbot  wrote:
> Hello,
> 
> syzbot found the following crash on:
[..]

> Workqueue: events xfrm_hash_rebuild

Ignoring this report for a second -- I think it makes sense to see
if we can just remove the entire hash table rebuild/resize code.

After recent tree conversion, we could probably make the exact policies
part of the 'inexact tree' (which would be renamed to 'policy tree' or
some such).

Special-casing the exact policies made a lot of sense when we had
a single list for the inexact policies (to keep its length down).

But now I think we could try to unify all of this and only maintain
the existing tree-based storage.

Would also remove the need to do lookups in two different
data structures (bydst-hash-then-inexact-tree).

What do you think?


Re: WARNING in xfrm_policy_inexact_gc_tree

2018-12-10 Thread Florian Westphal
syzbot  wrote:
> 
> HEAD commit:74c4a24df7ca Add linux-next specific files for 20181207
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=17bbea7d40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6e9413388bf37bed
> dashboard link: https://syzkaller.appspot.com/bug?extid=0b2f1bc876d301440d44
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+0b2f1bc876d301440...@syzkaller.appspotmail.com
> 
> IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
> IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready
> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
> 8021q: adding VLAN 0 to HW filter on device team0
> WARNING: CPU: 0 PID: 13116 at net/xfrm/xfrm_policy.c:1066
> xfrm_policy_inexact_gc_tree+0x233/0x280 net/xfrm/xfrm_policy.c:1066

This means we're exiting from net namespace, but for some reason
policies would be left behind.

At this point the storage should only contain empty nodes,
as all policies are supposed to be gone.

I'll have a look.


Re: linux-next: Tree for Dec 21

2018-12-21 Thread Florian Westphal
Guenter Roeck  wrote:
>   mips:cavium_octeon_defconfig [4]

> git bisect bad 4165079ba328dd47262a2183049d3591f0a750b1
> # first bad commit: [4165079ba328dd47262a2183049d3591f0a750b1] net: switch 
> secpath to use skb extension infrastructure

Indeed, sorry.  staging/octeon needs a small fix.  Will send a patch asap.


[PATCH net-next] staging: octeon: fix build failure with XFRM enabled

2018-12-21 Thread Florian Westphal
skb->sp doesn't exist anymore in the next-next tree, so mips defconfig
no longer builds.  Use helper instead to reset the secpath.

Not even compile tested.

Cc: Greg Kroah-Hartman 
Reported-by: Guenter Roeck 
Fixes: 4165079ba328d ("net: switch secpath to use skb extension infrastructure")
Signed-off-by: Florian Westphal 
---
 Greg, David:

 The patch will not break build for a tree that lacks the 'Fixes'
 commit, so this can also go in via staging tree.
 OTOH, net-next build is broken for mips/octeon, so I think in
 this case net-next might make more sense?

diff --git a/drivers/staging/octeon/ethernet-tx.c 
b/drivers/staging/octeon/ethernet-tx.c
index df3441b815bb..317c9720467c 100644
--- a/drivers/staging/octeon/ethernet-tx.c
+++ b/drivers/staging/octeon/ethernet-tx.c
@@ -359,8 +359,7 @@ int cvm_oct_xmit(struct sk_buff *skb, struct net_device 
*dev)
dst_release(skb_dst(skb));
skb_dst_set(skb, NULL);
 #ifdef CONFIG_XFRM
-   secpath_put(skb->sp);
-   skb->sp = NULL;
+   secpath_reset(skb);
 #endif
nf_reset(skb);
 
-- 
2.19.2



Re: KASAN: use-after-free Write in __xfrm_policy_unlink

2018-12-26 Thread Florian Westphal
syzbot  wrote:
> syzbot has found a reproducer for the following crash on:
> 
> HEAD commit:ce28bb445388 Merge git://git.kernel.org/pub/scm/linux/kern..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1673fb1b40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=67a2081147a23142
> dashboard link: https://syzkaller.appspot.com/bug?extid=9d971dd21eb26567036b
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1134dcc740
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=126986ed40
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+9d971dd21eb265670...@syzkaller.appspotmail.com

I've fixed this one.  Chances are that at least some of the other
reports are duplicates of this one.

I will continue to look at other reports over the next few days and plan
to send out fixes and test cases next week.


Re: [PATCH] netfilter: account ebt_table_info to kmemcg

2018-12-29 Thread Florian Westphal
Michal Hocko  wrote:
> On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > memory is already accounted to kmemcg. Do the same for ebtables. The
> > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > whole system from a restricted memcg, a potential DoS.
> 
> What is the lifetime of these objects? Are they bound to any process?

No, they are not.
They are free'd only when userspace requests it or the netns is
destroyed.


Re: linux-next: build warnings after merge of the net-next tree

2015-06-20 Thread Florian Westphal
Stephen Rothwell  wrote:
> After merging the net-next tree, today's linux-next build (i386 defconfig)
> produced these warnings:
> 
> In file included from include/net/netfilter/nf_conntrack_tuple.h:13:0,
>  from include/linux/netfilter/nf_conntrack_dccp.h:28,
>  from include/net/netfilter/nf_conntrack.h:22,
>  from net/netfilter/nf_conntrack_core.c:37:
> include/linux/netfilter/x_tables.h: In function 'xt_percpu_counter_alloc':
> include/linux/netfilter/x_tables.h:373:10: warning: cast from pointer to 
> integer of different size [-Wpointer-to-int-cast]
>return (__force u64) res;
>   ^
> include/linux/netfilter/x_tables.h: In function 'xt_percpu_counter_free':
> include/linux/netfilter/x_tables.h:381:15: warning: cast to pointer from 
> integer of different size [-Wint-to-pointer-cast]
>free_percpu((void __percpu *) pcnt);
>^
> In file included from include/asm-generic/percpu.h:6:0,
>  from arch/x86/include/asm/percpu.h:551,
>  from arch/x86/include/asm/preempt.h:5,
>  from include/linux/preempt.h:64,
>  from include/linux/spinlock.h:50,
>  from include/linux/mm_types.h:8,
>  from include/linux/kmemcheck.h:4,
>  from include/linux/skbuff.h:18,
>  from include/linux/netfilter.h:5,
>  from net/netfilter/nf_conntrack_core.c:16:
> include/linux/netfilter/x_tables.h: In function 'xt_get_this_cpu_counter':
> include/linux/netfilter/x_tables.h:388:23: warning: cast to pointer from 
> integer of different size [-Wint-to-pointer-cast]
>return this_cpu_ptr((void __percpu *) cnt->pcnt);
>^
> include/linux/percpu-defs.h:206:47: note: in definition of macro 
> '__verify_pcpu_ptr'
>   const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \
>^
> include/linux/percpu-defs.h:239:27: note: in expansion of macro 'raw_cpu_ptr'
>  #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
>^
> include/linux/netfilter/x_tables.h:388:10: note: in expansion of macro 
> 'this_cpu_ptr'
>return this_cpu_ptr((void __percpu *) cnt->pcnt);
>   ^
> 
> and many more.
> 
> Introduced by commit:
> 
>   71ae0dff02d7 ("netfilter: xtables: use percpu rule counters")

Yes, sorry about this, should be fixed by dcb8f5c8139ef945cdfd
("netfilter: xtables: fix warnings on 32bit platforms").

Thanks,
Florian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
Please read the FAQ at  http://www.tux.org/lkml/


Re: linux-next: build warning after merge of the netfilter-next tree

2018-05-15 Thread Florian Westphal
Geert Uytterhoeven  wrote:
> On Tue, May 8, 2018 at 9:17 AM, Florian Westphal  wrote:
> > Stephen Rothwell  wrote:
> >> On Mon, 7 May 2018 10:55:19 +1000 Stephen Rothwell  
> >> wrote:
> >> >
> >> > After merging the netfilter-next tree, today's linux-next build (x86_64
> >> > allmodconfig) produced this warning:
> >> >
> >> > ./usr/include/linux/netfilter/nf_osf.h:25: found __[us]{8,16,32,64} type 
> >> > without #include 
> >>
> >> > Introduced by commit
> >> >
> >> >   bfb15f2a95cb ("netfilter: extract Passive OS fingerprint 
> >> > infrastructure from xt_osf")
> >
> > I'll send a fix for this, thanks for reporting.
> 
> +config NF_OSF
> +   tristate 'Passive OS fingerprint infrastructure'
> 
> "There is no help available for this option."
> 
> Is this meant to be a user-visible symbol?

No, its not.  I can send a patch in case you're too busy, let me know.


Re: about selftests/netfilter test related issue

2019-04-01 Thread Florian Westphal
Jeffrin Thalakkottoor  wrote:
> i think the script nft_nat.sh  is assuming devices eth0 and eth1

No it does not.
These are arbitrary names given to veth devices.

> Error: Unknown device type.

No Veth device support in kernel?


Re: about selftests/netfilter test related issue

2019-04-01 Thread Florian Westphal
Jeffrin Thalakkottoor  wrote:
> Error: Unknown device type.

Feel free to send a patch that makes it display a more reasonable
exit+error here.


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-11 Thread Florian Westphal
Richard Guy Briggs  wrote:
> > > I personally would notify once per transaction. This is easy and quick.
> 
> This was the goal.  iptables was atomic.  nftables appears to no longer
> be so.  If I have this wrong, please show how that works.

nftables transactions are atomic, either the entire batch takes effect or not
at all.

The audit_log_nfcfg() calls got added to the the nft monitor infra which
is designed to allow userspace to follow the entire content of the
transaction log.

So, if its just a 'something was changed' event that is needed all of
them can be removed. ATM the audit_log_nfcfg() looks like this:

/* step 3. Start new generation, rules_gen_X now in use. */
net->nft.gencursor = nft_gencursor_next(net);

list_for_each_entry_safe(trans, next, &net->nft.commit_list, list) {
switch (trans->msg_type) {
case NFT_MSG_NEWTABLE:
audit_log_nfcfg();
...
case NFT_MSG_...
audit_log_nfcfg();
..
}

which gives an audit_log for every single change in the batch.

So, if just a summary is needed a single audit_log_nfcfg()
after 'step 3' and outside of the list_for_each_entry_safe() is all
that is needed.

If a summary is wanted as well one could fe. count the number of
transaction types in the batch, e.g. table adds, chain adds, rule
adds etc. and then log a summary count instead.


Re: [v3 net-next 08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())

2021-02-10 Thread Florian Westphal
Alexander Lobakin  wrote:
> we're in such context. This includes: build_skb() (called only
> from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb()
> (called from the same place or from kernel network softirq
> functions).

build_skb is called from sleepable context in drivers/net/tun.c .
Perhaps there are other cases.


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-18 Thread Florian Westphal
Richard Guy Briggs  wrote:
> On 2021-02-11 23:09, Florian Westphal wrote:
> > So, if just a summary is needed a single audit_log_nfcfg()
> > after 'step 3' and outside of the list_for_each_entry_safe() is all
> > that is needed.
> 
> Ok, so it should not matter if it is before or after that
> list_for_each_entry_safe(), which could be used to collect that summary.

Right, it won't matter.

> > If a summary is wanted as well one could fe. count the number of
> > transaction types in the batch, e.g. table adds, chain adds, rule
> > adds etc. and then log a summary count instead.
> 
> The current fields are "table", "family", "entries", "op".
> 
> Could one batch change more than one table?  (I think it could?)

Yes.

> It appears it can change more than one family.
> "family" is currently a single integer, so that might need to be changed
> to a list, or something to indicate multi-family.

Yes, it can also affect different families.

> Listing all the ops seems a bit onerous.  Is there a hierarchy to the
> ops and if so, are they in that order in a batch or in nf_tables_commit()?

No.  There is a hierarchy, e.g. you can't add a chain without first
adding a table, BUT in case the table was already created by an earlier
transaction it can also be stand-alone.

> It seems I'd need to filter out the NFT_MSG_GET_* ops.

No need, the GET ops do not cause changes and will not trigger a
generation id change.


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-18 Thread Florian Westphal
Richard Guy Briggs  wrote:
> On 2021-02-18 09:22, Florian Westphal wrote:
> > No.  There is a hierarchy, e.g. you can't add a chain without first
> > adding a table, BUT in case the table was already created by an earlier
> > transaction it can also be stand-alone.
> 
> Ok, so there could be a stand-alone chain mod with one table, then a
> table add of a different one with a "higher ranking" op...

Yes, that can happen.

> > > It seems I'd need to filter out the NFT_MSG_GET_* ops.
> > 
> > No need, the GET ops do not cause changes and will not trigger a
> > generation id change.
> 
> Ah, so it could trigger on generation change.  Would GET ops be included
> in any other change

No, GET ops are standalone, they cannot be part of a transaction.
If you look at

static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = {

array in nf_tables_api.c, then those ops with a '.call_batch' can
appear in transaction (i.e., can cause modification).

The other ones (.call_rcu) are read-only.

If they appear in a batch tehy will be ignored, if the batch consists of
such non-modifying ops only then nf_tables_commit() returns early
because the transaction list is empty (nothing to do/change).

> such that it would be desirable to filter them out
> to reduce noise in that single log line if it is attempted to list all
> the change ops?  It almost sounds like it would be better to do one
> audit log line for each table for each family, and possibly for each op
> to avoid the need to change userspace.  This would already be a
> significant improvement picking the highest ranking op.

I think i understand what you'd like to do.  Yes, that would reduce
the log output a lot.


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-18 Thread Florian Westphal
Richard Guy Briggs  wrote:
> Ok, can I get one more clarification on this "hierarchy"?  Is it roughly
> in the order they appear in nf_tables_commit() after step 3?  It appears
> it might be mostly already.  If it isn't already, would it be reasonable
> to re-order them?  Would you suggest a different order?

For audit purposes I think enum nf_tables_msg_types is already in the
most relevant order, the lower numbers being more imporant.

So e.g. NEWTABLE would be more interesting than DELRULE, if both
are in same batch.

> > > such that it would be desirable to filter them out
> > > to reduce noise in that single log line if it is attempted to list all
> > > the change ops?  It almost sounds like it would be better to do one
> > > audit log line for each table for each family, and possibly for each op
> > > to avoid the need to change userspace.  This would already be a
> > > significant improvement picking the highest ranking op.
> > 
> > I think i understand what you'd like to do.  Yes, that would reduce
> > the log output a lot.
> 
> Would the generation change id be useful outside the kernel?

Yes, we already announce it to interested parties via nfnetlink.

> What
> exactly does it look like?

Its just a u64 counter that gets incremented whenever there is a change.

> I don't quite understand the genmask purpose.

Thats an implementation detail only.  When we process a transaction,
changes to the ruleset are being made but they should not have any
effect until the entire transaction is processed.

So there are two 'generations' at any time:
1. The active ruleset
2. The future ruleset

2) is what is being changed/modified.

When the transaction completes, then the future ruleset becomes
the active ruleset.  If the transaction has to be aborted, the
pending changes are reverted and the genid/genmasks are not changed.


Re: [PATCH ghak124 v3] audit: log nftables configuration change events

2021-02-18 Thread Florian Westphal
Richard Guy Briggs  wrote:
> > If they appear in a batch tehy will be ignored, if the batch consists of
> > such non-modifying ops only then nf_tables_commit() returns early
> > because the transaction list is empty (nothing to do/change).
> 
> Ok, one little inconvenient question: what about GETOBJ_RESET?  That
> looks like a hybrid that modifies kernel table counters and reports
> synchronously.  That could be a special case call in
> nf_tables_dump_obj() and nf_tables_getobj().  Will that cause a storm
> per commit?

No, since they can't be part of a commit (they don't implement the
'call_batch' function).

I'm not sure GETOBJ_RESET should be reported in the first place:
RESET only affects expr internal state, and that state changes all the time
anyway in response to network traffic.


Re: [PATCH] xfrm: Fix incorrect types in assignment

2021-02-19 Thread Florian Westphal
Yang Li  wrote:
> Fix the following sparse warnings:
> net/xfrm/xfrm_policy.c:1303:22: warning: incorrect type in assignment
> (different address spaces)
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  net/xfrm/xfrm_policy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index b74f28c..5c67407 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -1225,7 +1225,7 @@ static void xfrm_hash_rebuild(struct work_struct *work)
>   struct xfrm_policy *pol;
>   struct xfrm_policy *policy;
>   struct hlist_head *chain;
> - struct hlist_head *odst;
> + struct hlist_head __rcu *odst;

This doesn't look right.  Try something like

- odst = net->xfrm.policy_bydst[dir].table;
+ odst = rcu_dereference_protected(net->xfrm.policy_bydst[dir].table,
   
lockdep_is_held(&net->xfrm.xfrm_policy_lock));


[PATCH nf] netfilter: nft_compat: remove flush counter optimization

2020-08-09 Thread Florian Westphal
WARNING: CPU: 1 PID: 16059 at lib/refcount.c:31 refcount_warn_saturate+0xdf/0xf
[..]
 __nft_mt_tg_destroy+0x42/0x50 [nft_compat]
 nft_target_destroy+0x63/0x80 [nft_compat]
 nf_tables_expr_destroy+0x1b/0x30 [nf_tables]
 nf_tables_rule_destroy+0x3a/0x70 [nf_tables]
 nf_tables_exit_net+0x186/0x3d0 [nf_tables]

Happens when a compat expr is destoyed from abort path.
There is no functional impact; after this work queue is flushed
unconditionally if its pending.

This removes the waitcount optimization.  Test of repeated
iptables-restore of a ~60k kubernetes ruleset doesn't indicate
a slowdown.  In case the counter is needed after all for some workloads
we can revert this and increment the refcount for the
!= NFT_PREPARE_TRANS case to avoid the increment/decrement imbalance.

While at it, also flush for match case, this was an oversight
in the original patch.

Fixes: ffe8923f109b7e ("netfilter: nft_compat: make sure xtables destructors 
have run")
Reported-by: kernel test robot 
Signed-off-by: Florian Westphal 
---
 net/netfilter/nft_compat.c | 37 ++---
 1 file changed, 14 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/nft_compat.c b/net/netfilter/nft_compat.c
index 6428856ccbec..8e56f353ff35 100644
--- a/net/netfilter/nft_compat.c
+++ b/net/netfilter/nft_compat.c
@@ -27,8 +27,6 @@ struct nft_xt_match_priv {
void *info;
 };
 
-static refcount_t nft_compat_pending_destroy = REFCOUNT_INIT(1);
-
 static int nft_compat_chain_validate_dependency(const struct nft_ctx *ctx,
const char *tablename)
 {
@@ -215,6 +213,17 @@ static int nft_parse_compat(const struct nlattr *attr, u16 
*proto, bool *inv)
return 0;
 }
 
+static void nft_compat_wait_for_destructors(void)
+{
+   /* xtables matches or targets can have side effects, e.g.
+* creation/destruction of /proc files.
+* The xt ->destroy functions are run asynchronously from
+* work queue.  If we have pending invocations we thus
+* need to wait for those to finish.
+*/
+   nf_tables_trans_destroy_flush_work();
+}
+
 static int
 nft_target_init(const struct nft_ctx *ctx, const struct nft_expr *expr,
const struct nlattr * const tb[])
@@ -238,14 +247,7 @@ nft_target_init(const struct nft_ctx *ctx, const struct 
nft_expr *expr,
 
nft_target_set_tgchk_param(&par, ctx, target, info, &e, proto, inv);
 
-   /* xtables matches or targets can have side effects, e.g.
-* creation/destruction of /proc files.
-* The xt ->destroy functions are run asynchronously from
-* work queue.  If we have pending invocations we thus
-* need to wait for those to finish.
-*/
-   if (refcount_read(&nft_compat_pending_destroy) > 1)
-   nf_tables_trans_destroy_flush_work();
+   nft_compat_wait_for_destructors();
 
ret = xt_check_target(&par, size, proto, inv);
if (ret < 0)
@@ -260,7 +262,6 @@ nft_target_init(const struct nft_ctx *ctx, const struct 
nft_expr *expr,
 
 static void __nft_mt_tg_destroy(struct module *me, const struct nft_expr *expr)
 {
-   refcount_dec(&nft_compat_pending_destroy);
module_put(me);
kfree(expr->ops);
 }
@@ -468,6 +469,8 @@ __nft_match_init(const struct nft_ctx *ctx, const struct 
nft_expr *expr,
 
nft_match_set_mtchk_param(&par, ctx, match, info, &e, proto, inv);
 
+   nft_compat_wait_for_destructors();
+
return xt_check_match(&par, size, proto, inv);
 }
 
@@ -716,14 +719,6 @@ static const struct nfnetlink_subsystem nfnl_compat_subsys 
= {
 
 static struct nft_expr_type nft_match_type;
 
-static void nft_mt_tg_deactivate(const struct nft_ctx *ctx,
-const struct nft_expr *expr,
-enum nft_trans_phase phase)
-{
-   if (phase == NFT_TRANS_COMMIT)
-   refcount_inc(&nft_compat_pending_destroy);
-}
-
 static const struct nft_expr_ops *
 nft_match_select_ops(const struct nft_ctx *ctx,
 const struct nlattr * const tb[])
@@ -762,7 +757,6 @@ nft_match_select_ops(const struct nft_ctx *ctx,
ops->type = &nft_match_type;
ops->eval = nft_match_eval;
ops->init = nft_match_init;
-   ops->deactivate = nft_mt_tg_deactivate,
ops->destroy = nft_match_destroy;
ops->dump = nft_match_dump;
ops->validate = nft_match_validate;
@@ -853,7 +847,6 @@ nft_target_select_ops(const struct nft_ctx *ctx,
ops->size = NFT_EXPR_SIZE(XT_ALIGN(target->targetsize));
ops->init = nft_target_init;
ops->destroy = nft_target_destroy;
-   ops->deactivate = nft_mt_tg_deactivate,
ops->dump = nft_target_dump;
ops->validate = nft_target_validate;
ops->data = target;
@@ -917,8 +910,6 @@ static void __exit

Re: [PATCH] net: eliminate meaningless memcpy to data in pskb_carve_inside_nonlinear()

2020-08-10 Thread Florian Westphal
Miaohe Lin  wrote:
> The skb_shared_info part of the data is assigned in the following loop.

Where?


Re: kernel BUG at lib/string.c:LINE! (6)

2020-12-22 Thread Florian Westphal
Linus Torvalds  wrote:
> On Tue, Dec 22, 2020 at 6:44 AM syzbot
>  wrote:
> >
> > The issue was bisected to:
> >
> > commit 2f78788b55ba ("ilog2: improve ilog2 for constant arguments")
> 
> That looks unlikely, although possibly some constant folding
> improvement might make the fortify code notice something with it.
> 
> > detected buffer overflow in strlen
> > [ cut here ]
> > kernel BUG at lib/string.c:1149!
> > Call Trace:
> >  strlen include/linux/string.h:325 [inline]
> >  strlcpy include/linux/string.h:348 [inline]
> >  xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143
> 
> Honestly, this just looks like the traditional bug in "strlcpy()".

Yes, thats exactly what this is, no idea why the bisection points
at ilog2 changes.

> That BSD function is complete garbage, exactly because it doesn't
> limit the source length. People tend to _think_ it does ("what's that
> size_t argument for?") but strlcpy() only limits the *destination*
> size, and the source is always read fully.

Right, I'll send a patch shortly.


Re: [PATCH] netfilter: Fix memleak in nf_nat_init

2021-01-09 Thread Florian Westphal
Dinghao Liu  wrote:
> When register_pernet_subsys() fails, nf_nat_bysource
> should be freed just like when nf_ct_extend_register()
> fails.

Acked-by: Florian Westphal 


Re: [PATCH] selftests: xfrm: fix test return value override issue in xfrm_policy.sh

2020-12-30 Thread Florian Westphal
Po-Hsu Lin  wrote:
> When running this xfrm_policy.sh test script, even with some cases
> marked as FAIL, the overall test result will still be PASS:
> 
> $ sudo ./xfrm_policy.sh
> PASS: policy before exception matches
> FAIL: expected ping to .254 to fail (exceptions)
> PASS: direct policy matches (exceptions)
> PASS: policy matches (exceptions)
> FAIL: expected ping to .254 to fail (exceptions and block policies)
> PASS: direct policy matches (exceptions and block policies)
> PASS: policy matches (exceptions and block policies)
> FAIL: expected ping to .254 to fail (exceptions and block policies after 
> hresh changes)
> PASS: direct policy matches (exceptions and block policies after hresh 
> changes)
> PASS: policy matches (exceptions and block policies after hresh changes)
> FAIL: expected ping to .254 to fail (exceptions and block policies after 
> hthresh change in ns3)
> PASS: direct policy matches (exceptions and block policies after hthresh 
> change in ns3)
> PASS: policy matches (exceptions and block policies after hthresh change in 
> ns3)
> FAIL: expected ping to .254 to fail (exceptions and block policies after 
> htresh change to normal)
> PASS: direct policy matches (exceptions and block policies after htresh 
> change to normal)
> PASS: policy matches (exceptions and block policies after htresh change to 
> normal)
> PASS: policies with repeated htresh change
> $ echo $?
> 0
> 
> This is because the $lret in check_xfrm() is not a local variable.

Acked-by: Florian Westphal 


Re: [PATCH v3] audit: log nftables configuration change events once per table

2021-03-23 Thread Florian Westphal
Richard Guy Briggs  wrote:
>   nft_commit_notify(net, NETLINK_CB(skb).portid);
>   nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN);
>   nf_tables_commit_release(net);
>  
> + nf_tables_commit_audit_log(&adl, net->nft.base_seq);

This meeds to be before nf_tables_commit_release() call, afaics this function
dereferences data structures that might be free'd already here.


Re: [PATCH 5.10 104/157] mptcp: put subflow sock on connect error

2021-03-24 Thread Florian Westphal
Naresh Kamboju  wrote:
> On Mon, 22 Mar 2021 at 18:15, Greg Kroah-Hartman
>  wrote:
> >
> > From: Florian Westphal 
> >
> > [ Upstream commit f07157792c633b528de5fc1dbe2e4ea54f8e09d4 ]
> >
> > mptcp_add_pending_subflow() performs a sock_hold() on the subflow,
> > then adds the subflow to the join list.
> >
> > Without a sock_put the subflow sk won't be freed in case connect() fails.
> >
> > unreferenced object 0x88810c03b100 (size 3000):
> > [..]
> > sk_prot_alloc.isra.0+0x2f/0x110
> > sk_alloc+0x5d/0xc20
> > inet6_create+0x2b7/0xd30
> > __sock_create+0x17f/0x410
> > mptcp_subflow_create_socket+0xff/0x9c0
> > __mptcp_subflow_connect+0x1da/0xaf0
> > mptcp_pm_nl_work+0x6e0/0x1120
> > mptcp_worker+0x508/0x9a0
> >
> > Fixes: 5b950ff4331ddda ("mptcp: link MPC subflow into msk only after 
> > accept")

I don't see this change in 5.10, so why is this fix queued up?

> I have reported the following warnings and kernel crash on 5.10.26-rc2 [1]
> The bisect reported that issue pointing out to this commit.
> 
> commit 460916534896e6d4f80a37152e0948db33376873
> mptcp: put subflow sock on connect error
> 
> This problem is specific to 5.10.26-rc2.
> 
> Warning:
> 
> [ 1040.114695] refcount_t: addition on 0; use-after-free.
> [ 1040.119857] WARNING: CPU: 3 PID: 31925 at
> /usr/src/kernel/lib/refcount.c:25 refcount_warn_saturate+0xd7/0x100
> [ 1040.129769] Modules linked in: act_mirred cls_u32 sch_netem sch_etf
> ip6table_nat xt_nat iptable_nat nf_nat ip6table_filter xt_conntrack
> nf_conntrack nf_defrag_ipv4 libcrc32c ip6_tables nf_defrag_ipv6 sch_fq
> iptable_filter xt_mark ip_tables cls_bpf sch_ingress algif_hash
> x86_pkg_temp_thermal fuse [last unloaded: test_blackhole_dev]
> [ 1040.159030] CPU: 3 PID: 31925 Comm: mptcp_connect Tainted: G
> W K   5.10.26-rc2 #1
> [ 1040.167459] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [ 1040.174851] RIP: 0010:refcount_warn_saturate+0xd7/0x100
> 
> And
> 
> Kernel Panic:
> -
> [ 1069.557485] BUG: kernel NULL pointer dereference, address: 0010
> [ 1069.564446] #PF: supervisor read access in kernel mode
> [ 1069.569583] #PF: error_code(0x) - not-present page
> [ 1069.574714] PGD 0 P4D 0
> [ 1069.577246] Oops:  [#1] SMP PTI
> > index 16adba172fb9..591546d0953f 100644
> > --- a/net/mptcp/subflow.c
> > +++ b/net/mptcp/subflow.c
> > @@ -1133,6 +1133,7 @@ int __mptcp_subflow_connect(struct sock *sk, const 
> > struct mptcp_addr_info *loc,
> > spin_lock_bh(&msk->join_list_lock);
> > list_add_tail(&subflow->node, &msk->join_list);
> > spin_unlock_bh(&msk->join_list_lock);
> > +   sock_put(mptcp_subflow_tcp_sock(subflow));
> >
> > return err;

Crash is not surprising, the backport puts the socket in the 'success' path
(list_add_tail).

I don't see why this is in -stable, the faulty commit is not there?

The upstream patch is:
list_del(&subflow->node);
spin_unlock_bh(&msk->join_list_lock);
+   sock_put(mptcp_subflow_tcp_sock(subflow));

[ Note the 'list_del', this is in the error unwind path ]


Re: [PATCH][next] netfilter: nf_log_bridge: Fix missing assignment of ret on a call to nf_log_register

2021-03-31 Thread Florian Westphal
Colin King  wrote:
> From: Colin Ian King 
> 
> Currently the call to nf_log_register is returning an error code that
> is not being assigned to ret and yet ret is being checked. Fix this by
> adding in the missing assignment.

Thanks for catching this.

Acked-by: Florian Westphal 


Re: [PATCH 1/3] Revert "netfilter: x_tables: Update remaining dereference to RCU"

2021-03-03 Thread Florian Westphal
Mark Tomlinson  wrote:
> This reverts commit 443d6e86f821a165fae3fc3fc13086d27ac140b1.
> 
> This (and the following) patch basically re-implemented the RCU
> mechanisms of patch 784544739a25. That patch was replaced because of the
> performance problems that it created when replacing tables. Now, we have
> the same issue: the call to synchronize_rcu() makes replacing tables
> slower by as much as an order of magnitude.
> 
> See https://lore.kernel.org/patchwork/patch/151796/ for why using RCU is
> not a good idea.

Please don't add links for the rationale.


Re: [PATCH 2/3] Revert "netfilter: x_tables: Switch synchronization to RCU"

2021-03-03 Thread Florian Westphal
Mark Tomlinson  wrote:
> This reverts commit cc00bcaa589914096edef7fb87ca5cee4a166b5c.
> 
> This (and the preceding) patch basically re-implemented the RCU
> mechanisms of patch 784544739a25. That patch was replaced because of the
> performance problems that it created when replacing tables. Now, we have
> the same issue: the call to synchronize_rcu() makes replacing tables
> slower by as much as an order of magnitude.

Can you give roigh numbers?

> See https://lore.kernel.org/patchwork/patch/151796/ for why using RCU is
> not a good idea.

Please, no link.


Re: [PATCH 3/3] netfilter: x_tables: Use correct memory barriers.

2021-03-03 Thread Florian Westphal
Mark Tomlinson  wrote:
> When a new table value was assigned, it was followed by a write memory
> barrier. This ensured that all writes before this point would complete
> before any writes after this point. However, to determine whether the
> rules are unused, the sequence counter is read. To ensure that all
> writes have been done before these reads, a full memory barrier is
> needed, not just a write memory barrier. The same argument applies when
> incrementing the counter, before the rules are read.
> 
> Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic
> reported in cc00bcaa5899,

Can you reproduce the crashes without this change?

> while still maintaining the same speed of replacing tables.

How much of an impact is the MB change on the packet path?

Please also CC authors of the patches you want reverted when reposting.


Re: linux-next: build failure after merge of the net-next tree

2021-04-12 Thread Florian Westphal
Stephen Rothwell  wrote:
> net/bridge/netfilter/ebtables.c:1248:33: error: 'struct netns_xt' has no 
> member named 'tables'
>  1248 |  list_for_each_entry(t, &net->xt.tables[NFPROTO_BRIDGE], list) {
>   | ^
> include/linux/list.h:619:20: note: in definition of macro 'list_entry_is_head'
>   619 |  (&pos->member == (head))
>   |^~~~
> net/bridge/netfilter/ebtables.c:1248:2: note: in expansion of macro 
> 'list_for_each_entry'
>  1248 |  list_for_each_entry(t, &net->xt.tables[NFPROTO_BRIDGE], list) {
>   |  ^~~
> 
> Caused by commit
> 
>   5b53951cfc85 ("netfilter: ebtables: use net_generic infra")
> 
> interacting with commit
> 
>   7ee3c61dcd28 ("netfilter: bridge: add pre_exit hooks for ebtable 
> unregistration")

Right, the fixup is correct.


Re: [PATCH] netfilter: nf_conntrack: Add conntrack helper for ESP/IPsec

2021-04-14 Thread Florian Westphal
Cole Dishington  wrote:
> Introduce changes to add ESP connection tracking helper to netfilter
> conntrack. The connection tracking of ESP is based on IPsec SPIs. The
> underlying motivation for this patch was to allow multiple VPN ESP
> clients to be distinguished when using NAT.
>
> Added config flag CONFIG_NF_CT_PROTO_ESP to enable the ESP/IPsec
> conntrack helper.

Thanks for the effort to upstream out of tree code.

A couple of comments and questions below.

Preface: AFAIU this tracker aims to 'soft-splice' two independent
ESP connections, i.e.:

saddr:spi1 -> daddr
daddr:spi2 <- saddr

So that we basically get this conntrack:

saddr,daddr,spi1 (original)   daddr,saddr,spi2 (remote)

This can't be done as-is, because we don't know spi2 at the time the
first ESP packet is received.

The solution implemented here is introduction of a 'virtual esp id',
computed when first ESP packet is received, so conntrack really stores:

saddr,daddr,ID (original)   daddr,saddr,ID (remote)

Because the ID is never carried on the wire, this tracker hooks into
pkt_to_tuple() infra so that the conntrack tuple gets populated
as-needed.

If I got that right, I think it would be good to place some description
like this in the source code, this is unlike all the other trackers.

> index ..2441e031c68e
> --- /dev/null
> +++ b/include/linux/netfilter/nf_conntrack_proto_esp.h
> @@ -0,0 +1,25 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _CONNTRACK_PROTO_ESP_H
> +#define _CONNTRACK_PROTO_ESP_H
> +#include 
> +
> +/* ESP PROTOCOL HEADER */
> +
> +struct esphdr {
> + __u32 spi;
> +};
> +
> +struct nf_ct_esp {
> + unsigned int stream_timeout;
> + unsigned int timeout;
> +};
> +
> +#ifdef __KERNEL__
> +#include 
> +
> +void destroy_esp_conntrack_entry(struct nf_conn *ct);
> +
> +bool esp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff,
> +   struct net *net, struct nf_conntrack_tuple *tuple);
> +#endif /* __KERNEL__ */

No need for the __KERNEL__, this header is not exposed to userspace
(only those in include/uapi/).

>   struct {
>   __be16 key;
>   } gre;
> + struct {
> + __be16 spi;

__be32 ?

I now see that this "spi" seems to be allocated by the esp tracker.
Maybe 'esp_id' or something like that?

It doesn't appear to be related to the ESP header SPI value.

> --- a/include/net/netns/conntrack.h
> +++ b/include/net/netns/conntrack.h
> @@ -69,6 +69,27 @@ struct nf_gre_net {
>  };
>  #endif
>  
> +#ifdef CONFIG_NF_CT_PROTO_ESP
> +#define ESP_MAX_PORTS  1000
> +#define HASH_TAB_SIZE  ESP_MAX_PORTS

ESP? Ports?  Should this be 'slots'?  Maybe a comment helps, I don't
expect to see ports in an ESP tracker.

> +enum esp_conntrack {
> + ESP_CT_UNREPLIED,
> + ESP_CT_REPLIED,
> + ESP_CT_MAX
> +};
> +
> +struct nf_esp_net {
> + rwlock_t esp_table_lock;

This uses a rwlock but i only see writer locks being taken.
So this either should use a spinlock, or reader-parts should
take readlock, not wrlock.

(but also see below).

> + struct hlist_head ltable[HASH_TAB_SIZE];
> + struct hlist_head rtable[HASH_TAB_SIZE];
> + /* Initial lookup for remote end until rspi is known */
> + struct hlist_head incmpl_rtable[HASH_TAB_SIZE];
> + struct _esp_table *esp_table[ESP_MAX_PORTS];
> + unsigned int esp_timeouts[ESP_CT_MAX];
> +};

This is large structure -- >32kb.

Could this be moved to nf_conntrack_net?

It would also be good to not allocate these hash slots until after conntrack
is needed.

The esp_timeouts[] can be kept to avoid module dep problems.

(But also see below, I'm not sure homegrown hash table is the way to go).

>  struct ct_pcpu {
> diff --git a/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h 
> b/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h
> index 64390fac6f7e..9bbd76c325d2 100644
> --- a/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h
> +++ b/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h
> @@ -39,6 +39,9 @@ union nf_conntrack_man_proto {
..
> +#if 0
> +#define ESP_DEBUG 1
> +#define DEBUGP(format, args...) printk(KERN_DEBUG "%s: " format, __func__, 
> ## args)
> +#else
> +#undef ESP_DEBUG
> +#define DEBUGP(x, args...)
> +#endif

I suggest to get rid of all of DEBUGP(), either drop them, or, in cases
where they are useful, switch to pr_debug().

> +#define TEMP_SPI_START 1500
> +#define TEMP_SPI_MAX   (TEMP_SPI_START + ESP_MAX_PORTS - 1)

I think this could use an explanation.

> +struct _esp_table {
> + /* Hash table nodes for each required lookup
> +  * lnode: l_spi, l_ip, r_ip
> +  * rnode: r_spi, r_ip
> +  * incmpl_rnode: r_ip
> +  */
> + struct hlist_node lnode;
> + struct hlist_node rnode;
> + struct hlist_node incmpl_rnode;
> +
> + u32 l_spi;
> + u32 r_spi;
> + u32 l_ip;
> + u32 r_ip;

Hmm, ipv4 only.  Could this be chan

Re: [PATCH v6 0/3] net, mac80211, kernel: enable KCOV remote coverage collection for 802.11 frame handling

2020-11-25 Thread Florian Westphal
Marco Elver  wrote:
[..]

> v6:
> * Revert usage of skb extensions due to potential memory leak. Patch 2/3 is 
> now
>   idential to that in v2.
> * Patches 1/3 and 3/3 are otherwise identical to v5.

The earlier series was already applied to net-next, so you need to
rebase on top of net-next and include a revert of the patch that added
the kcov skb extension.

Also, please indicate the git tree that you want this applied to in the
subject.


Re: [PATCH net-next 3/3] net: core: Namespace-ify sysctl_rmem_max and sysctl_wmem_max

2021-01-20 Thread Florian Westphal
menglong8.d...@gmail.com  wrote:
> From: Menglong Dong 
> 
> For now, sysctl_wmem_max and sysctl_rmem_max are globally unified.
> It's not convenient in some case. For example, when we use docker
> and try to control the default udp socket receive buffer for each
> container.
> 
> For that reason, make sysctl_wmem_max and sysctl_rmem_max
> per-namespace.

I think having those values be restricted by init netns is a desirable
property.


Re: [PATCH] net: bridge: fix error return code of do_update_counters()

2021-03-09 Thread Florian Westphal
Jia-Ju Bai  wrote:
> When find_table_lock() returns NULL to t, no error return code of
> do_update_counters() is assigned.

Its -ENOENT.

>   t = find_table_lock(net, name, &ret, &ebt_mutex);
   ^

ret is passed to find_table_lock, which passes it to
find_inlist_lock_noload() which will set *ret = -ENOENT
for NULL case.


Re: [PATCH v2 3/3] netfilter: x_tables: Use correct memory barriers.

2021-03-09 Thread Florian Westphal
Mark Tomlinson  wrote:
> When a new table value was assigned, it was followed by a write memory
> barrier. This ensured that all writes before this point would complete
> before any writes after this point. However, to determine whether the
> rules are unused, the sequence counter is read. To ensure that all
> writes have been done before these reads, a full memory barrier is
> needed, not just a write memory barrier. The same argument applies when
> incrementing the counter, before the rules are read.
> 
> Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic
> reported in cc00bcaa5899 (which is still present), while still
> maintaining the same speed of replacing tables.
> 
> The smb_mb() barriers potentially slow the packet path, however testing
> has shown no measurable change in performance on a 4-core MIPS64
> platform.

Okay, thanks for testing.  I have no further feedback.


Re: [PATCH 015/141] netfilter: Fix fall-through warnings for Clang

2020-11-20 Thread Florian Westphal
Gustavo A. R. Silva  wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
> warnings by explicitly adding multiple break statements instead of just
> letting the code fall through to the next case.

Acked-by: Florian Westphal 

Feel free to carry this in next iteration of series, if any.


Re: [PATCH 108/141] netfilter: ipt_REJECT: Fix fall-through warnings for Clang

2020-11-20 Thread Florian Westphal
Gustavo A. R. Silva  wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of letting the code fall
> through to the next case.

Acked-by: Florian Westphal 


Re: [PATCH v5 2/3] net: add kcov handle to skb extensions

2020-11-21 Thread Florian Westphal
Ido Schimmel  wrote:
> On Thu, Oct 29, 2020 at 05:36:19PM +, Aleksandr Nogikh wrote:
> > From: Aleksandr Nogikh 
> > 
> > Remote KCOV coverage collection enables coverage-guided fuzzing of the
> > code that is not reachable during normal system call execution. It is
> > especially helpful for fuzzing networking subsystems, where it is
> > common to perform packet handling in separate work queues even for the
> > packets that originated directly from the user space.
> > 
> > Enable coverage-guided frame injection by adding kcov remote handle to
> > skb extensions. Default initialization in __alloc_skb and
> > __build_skb_around ensures that no socket buffer that was generated
> > during a system call will be missed.
> > 
> > Code that is of interest and that performs packet processing should be
> > annotated with kcov_remote_start()/kcov_remote_stop().
> > 
> > An alternative approach is to determine kcov_handle solely on the
> > basis of the device/interface that received the specific socket
> > buffer. However, in this case it would be impossible to distinguish
> > between packets that originated during normal background network
> > processes or were intentionally injected from the user space.
> > 
> > Signed-off-by: Aleksandr Nogikh 
> > Acked-by: Willem de Bruijn 
> 
> [...]
> 
> > @@ -249,6 +249,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> > gfp_mask,
> >  
> > fclones->skb2.fclone = SKB_FCLONE_CLONE;
> > }
> > +
> > +   skb_set_kcov_handle(skb, kcov_common_handle());
> 
> Hi,
> 
> This causes skb extensions to be allocated for the allocated skb, but
> there are instances that blindly overwrite 'skb->extensions' by invoking
> skb_copy_header() after __alloc_skb(). For example, skb_copy(),
> __pskb_copy_fclone() and skb_copy_expand(). This results in the skb
> extensions being leaked [1].

[..]
> Other suggestions?

Aleksandr, why was this made into an skb extension in the first place?

AFAIU this feature is usually always disabled at build time.
For debug builds (test farm /debug kernel etc) its always needed.

If thats the case this u64 should be an sk_buff member, not an
extension.


Re: BUG: ip6tables IPv6-REDIRECT over bridges

2014-02-20 Thread Florian Westphal
Artie Hamilton  wrote:
> Now the same thing should be done for IPv6. It should works quite similar
> (I just assume the above mentioned steps are already done):
> 
> $ sysctl -w net.ipv6.conf.br0.accept_ra=2
> $ sysctl -w net.bridge.bridge-nf-call-ip6tables=1
> $ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 -j REDIRECT 
> --to-ports 81
> 
> But here is the problem: Connections will not be started. I see for example
> connections getting started to the service like this on the client:

Yes, because bridge layer does not detect when addresses have been
rewritten.

The check is only implemented for ipv4, see dnat_took_place use in 
br_netfilter.c
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

2014-01-07 Thread Florian Westphal
Eric Dumazet  wrote:
> > diff --git a/net/netfilter/nf_conntrack_core.c 
> > b/net/netfilter/nf_conntrack_core.c
> > index 43549eb..7a34bb2 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -387,8 +387,12 @@ begin:
> >  !atomic_inc_not_zero(&ct->ct_general.use)))
> > h = NULL;
> > else {
> > +   /* A conntrack can be recreated with the equal tuple,
> > +* so we need to check that the conntrack is initialized
> > +*/
> > if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) ||
> > -nf_ct_zone(ct) != zone)) {
> > +nf_ct_zone(ct) != zone) ||
> > +!nf_ct_is_confirmed(ct)) {
> > nf_ct_put(ct);
> > goto begin;
> > }
> 
> I do not think this is the right way to fix this problem (if said
> problem is confirmed)
> 
> Remember the rule about SLAB_DESTROY_BY_RCU :
> 
> When a struct is freed, then reused, its important to set the its refcnt
> (from 0 to 1) only when the structure is fully ready for use.
> 
> If a lookup finds a structure which is not yet setup, the
> atomic_inc_not_zero() will fail.

Indeed.  But, the structure itself might be ready (or rather,
can be ready since the allocation side will set the refcount to one
after doing the initial work, such as zapping old ->status flags and
setting tuple information).

The problem is with nat extension area stored in the ct->ext area.
This extension area is preallocated but the snat/dnat action
information is only set up after the ct (or rather, the skb that grabbed
a reference to the nf_conn entry) traverses nat pre/postrouting.

This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
rule existed.

The manipulations of the skb->nfct->ext nat area are performed without
a lock.  Concurrent access is supposedly impossible as the conntrack
should not (yet) be in the hash table.

The confirmed bit is set right before we insert the conntrack into
the hash table (after we traversed rules, ct is ready to be
'published').

i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
struct when we perform the lookup, as it should still be sitting on the
'unconfirmed' list, being invisible to readers.

Does that explanation make sense to you?

Thanks for looking into this.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

2014-01-08 Thread Florian Westphal
Eric Dumazet  wrote:
> > This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE
> > rule existed.
> > 
> > The manipulations of the skb->nfct->ext nat area are performed without
> > a lock.  Concurrent access is supposedly impossible as the conntrack
> > should not (yet) be in the hash table.
> > 
> > The confirmed bit is set right before we insert the conntrack into
> > the hash table (after we traversed rules, ct is ready to be
> > 'published').
> > 
> > i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn
> > struct when we perform the lookup, as it should still be sitting on the
> > 'unconfirmed' list, being invisible to readers.
> > 
> > Does that explanation make sense to you?
> > 
> > Thanks for looking into this.
> 
> Still, this patch adds a loop. And maybe an infinite one if confirmed
> bit is set from an context that was interrupted by this one.

Hmm.  There should be at most one retry.

The confirmed bit should always be set here.
If it isn't then this conntrack shouldn't be in the hash table, i.e.
when we re-try we should find the same conntrack again with the bit set.

Asuming the other cpu git interrupted after setting confirmed
bit but before inserting it into the hash table, then our re-try
should not be able find a matching entry.

Maybe I am missing something, but I don't see how we could (upon
retry) find the very same entry again with the bit still not set.

> If you need to test the confirmed bit, then you also need to test it
> before taking the refcount.

I don't think that would make sense, because it should always be
set (inserting conntrack into hash table without confirmed set is
illegal, and it is never unset again).

[ when allocating a new conntrack, ct->status is zeroed, which also
  clears the flag.  This happens just before we set the new objects
  refcount to 1 ]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

2014-01-08 Thread Florian Westphal
Eric Dumazet  wrote:
> > The confirmed bit should always be set here.
> 
> So why are you testing it ?

To detect ct object recycling when tuple is identical.

This is my understanding of how we can end up with two
cpus thinking they have exclusive ownership of the same ct:

A cpu0: starts lookup: find ct for tuple t
B cpu1: starts lookup: find ct for tuple t
C cpu0: finds ct c for tuple t, no refcnt taken yet
  cpu1: finds ct c for tuple t, no refcnt taken yet
   cpu2: destroys ct c, removes from hash table, calls ->destroy function
D cpu0: tries to increment refcnt; fails since its 0: lookup ends
E cpu0: allocates a new ct object since no acceptable ct was found for t
F cpu0: allocator gives us just-freed ct c
G cpu0: initialises ct, sets refcnt to 1
H cpu0: adds extensions, ct object is put on unconfirmed list and
assigned to skb->nfct
I cpu0: skb continues through network stack
J cpu1: tries to increment refcnt, ok
K cpu1: checks if ct matches requested tuple t: it does
L cpu0: sets refcnt conntrack tuple, allocates extensions, etc.
  cpu1: sets skb->nfct to ct, skb continues through network stack

-> both cpu0 and cpu1 reference a ct object that was not in hash table

cpu0 and cpu1 will then race, for example in
net/ipv4/netfilter/iptable_nat.c:nf_nat_rule_find():

if (!nf_nat_initialized(ct, HOOK2MANIP(hooknum)))
  ret = alloc_null_binding(ct, hooknum);

[ Its possible that I misunderstand and that there is something that
  precents this from happening.  Usually its the 'tuple equal' test
  that is performed post-atomic-inc-not-zero that detects the recycling,
  so step K above would fail ]

The idea of the 'confirmed bit test' is that when its not set then the
conntrack was recycled and should not be used before the cpu that
currently 'owns' that entry has put it into the hash table again.

> I did this RCU conversion, so I think I know what I am talking about.

Yes, I know.  If you have any suggestions on how to fix it, I'd be very
interested to hear about them.

> The entry should not be put into hash table (or refcnt set to 1),
> if its not ready. It is that simple.

I understand what you're saying, but I don't see how we can do it.

I think the assumption that we have a refcnt on skb->nfct is everywhere.

If I understand you correctly we'd have to differentiate between
'can be used fully (e.g. nf_nat_setup_info done for both directions)'
and 'a new conntrack was created (extensions may not be ready yet)'.

But currently in both cases the ct is assigned to the skb, and in both
cases a refcnt is taken.  I am out of ideas, except perhaps using
ct->lock to serialize the nat setup (but I don't like that since
I'm not sure that the nat race is the only one).

> We need to address this problem without adding an extra test and
> possible loop for the lookups.

Agreed.  I don't like the extra test either.

Many thanks for looking into this Eric!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

2014-01-08 Thread Florian Westphal
Florian Westphal  wrote:
> Eric Dumazet  wrote:
> > > The confirmed bit should always be set here.
> > 
> > So why are you testing it ?
> 
> To detect ct object recycling when tuple is identical.
> 
> This is my understanding of how we can end up with two
> cpus thinking they have exclusive ownership of the same ct:
> 
> A cpu0: starts lookup: find ct for tuple t
> B cpu1: starts lookup: find ct for tuple t
> C cpu0: finds ct c for tuple t, no refcnt taken yet
>   cpu1: finds ct c for tuple t, no refcnt taken yet
>cpu2: destroys ct c, removes from hash table, calls ->destroy function
> D cpu0: tries to increment refcnt; fails since its 0: lookup ends
> E cpu0: allocates a new ct object since no acceptable ct was found for t
> F cpu0: allocator gives us just-freed ct c
> G cpu0: initialises ct, sets refcnt to 1
> H cpu0: adds extensions, ct object is put on unconfirmed list and
> assigned to skb->nfct
> I cpu0: skb continues through network stack
> J cpu1: tries to increment refcnt, ok
> K cpu1: checks if ct matches requested tuple t: it does
> L cpu0: sets refcnt conntrack tuple, allocates extensions, etc.

>   cpu1: sets skb->nfct to ct, skb continues through network stack

sorry, for that brain fart  This should only say
  L cpu1: sets skb->nfct to ct, skb continues...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

2014-01-09 Thread Florian Westphal
Andrew Vagin  wrote:
> Can we allocate conntrack with zero ct_general.use and increment it at
> the first time before inserting the conntrack into the hash table?
> When conntrack is allocated it is attached exclusively to one skb.
> It must be destroyed with skb, if it has not been confirmed, so we
> don't need refcnt on this stage.
> 
> I found only one place, where a reference counter of unconfirmed
> conntract can incremented. It's ctnetlink_dump_table().

What about skb_clone, etc?  They will also increment the refcnt
if a conntrack entry is attached to the skb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get

2014-01-09 Thread Florian Westphal
Andrew Vagin  wrote:
> On Thu, Jan 09, 2014 at 09:56:22PM +0100, Florian Westphal wrote:
> > Andrew Vagin  wrote:
> > > Can we allocate conntrack with zero ct_general.use and increment it at
> > > the first time before inserting the conntrack into the hash table?
> > > When conntrack is allocated it is attached exclusively to one skb.
> > > It must be destroyed with skb, if it has not been confirmed, so we
> > > don't need refcnt on this stage.
> > > 
> > > I found only one place, where a reference counter of unconfirmed
> > > conntract can incremented. It's ctnetlink_dump_table().
> > 
> > What about skb_clone, etc?  They will also increment the refcnt
> > if a conntrack entry is attached to the skb.
> 
> We can not attach an unconfirmed conntrack to a few skb, because

s/few/new/?

> nf_nat_setup_info can be executed concurrently for the same conntrack.
> 
> How do we avoid this race condition for cloned skb-s?

Simple, the assumption is that only one cpu owns the nfct, so it does
not matter if the skb is cloned in between, as there are no parallel
users.

The only possibility (that I know of) to violate this is to create
a bridge, enable call-iptables sysctl, add -j NFQUEUE rules and then
wait for packets that need to be forwarded to several recipients,
e.g. multicast traffic.

see
http://marc.info/?l=netfilter-devel&m=131471083501656&w=2

or search 'netfilter: nat: work around shared nfct struct in bridge
case'

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt

2014-01-14 Thread Florian Westphal
Andrey Vagin  wrote:
> 
> Eric and Florian, could you look at this patch. When you say,
> that it looks good, I will ask the user to validate it.
> I can't reorder these actions, because it's reproduced on a real host
> with real users. Thanks.
> 
> 
> nf_conntrack_free can't be called for a conntract with non-zero ref-counter,
> because it can race with nf_conntrack_find_get().

Indeed.

> A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
> ref-conunter says that this conntrack is used now. So when we release a
> conntrack with non-zero counter, we break this assumption.
> 
> CPU1CPU2
> nf_conntrack_find()
> nf_ct_put()
>  destroy_conntrack()
> ...
> init_conntrack
>  __nf_conntrack_alloc (set use = 1)
> atomic_inc_not_zero(&ct->use) (use = 2)
>  if (!l4proto->new(ct, skb, dataoff, 
> timeouts))
>   nf_conntrack_free(ct); (use = 2 !!!)
> ...

Yes, I think this sequence is possible; we must not use nf_conntrack_free here.

> - /* We overload first tuple to link into unconfirmed or dying list.*/
> - BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
> - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
> + if (!hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode))
> + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);

This is the only thing that I don't like about this patch.  Currently
all the conntracks in the system are always put on a list before they're
supposed to be visible/handled via refcnt system (unconfirmed, hash, or
dying list).

I think it would be nice if we could keep it that way.
If everything fails we could proably intoduce a 'larval' dummy list
similar to the one used by template conntracks?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt

2014-01-16 Thread Florian Westphal
Andrew Vagin  wrote:
> > I think it would be nice if we could keep it that way.
> > If everything fails we could proably intoduce a 'larval' dummy list
> > similar to the one used by template conntracks?
> 
> I'm not sure, that this is required. Could you elaborate when this can
> be useful?

You can dump the lists via ctnetlink.  Its meant as a debugging aid in
case one suspects refcnt leaks.

Granted, in this situation there should be no leak since we put the newly
allocated entry in the error case.

> Now I see only overhead, because we need to take the nf_conntrack_lock
> lock to add conntrack in a list.

True. I don't have any preference, I guess I'd just do the insertion into the
unconfirmed list when we know we cannot track to keep the "unhashed"
bug trap in the destroy function.

Pablo, any preference?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)

2014-05-19 Thread Florian Westphal
David Newall  wrote:
> Having received no feedback of substance from netdev, I now address
> my previous email to a wider audience for discussion and in
> preparation for submitting a patch based closely on that below.
> 
> This email is not addressed to Bandan Das ,
> who is the author of the commit I propose reverting, as his email
> address is no longer current.  I believe I have otherwise addressed
> all appropriate recipients and will circulate a formal patch to the
> same recipients if no adverse comments are received.  (That would
> surprise me.)
> 
>  Original Message 
> Subject:  Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge :
> Sanitize skb before it enters the IP stack)
> Date: Sat, 17 May 2014 00:03:16 +0930
> From: David Newall 
> To:   Lukas Tribus , Eric Dumazet
> , Netdev 
> CC:   f...@strlen.de 
> 
> 
> 
> We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge
> : Sanitize skb before it enters the IP stack) because it corrupts IP
> packets with RR or TS options set, only partially updating the IP header
> and leaving an incorrect checksum.
> 
> The argument for introducing the change is at lkml.org/lkml/2010/8/30/391:
> 
> The bridge layer can overwrite the IPCB using the
> BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit,
> if we recieve a packet greater in size than the bridge
> device MTU, we call ip_fragment which in turn will lead to
> icmp_send calling ip_options_echo if the DF flag is set.
> ip_options_echo will then incorrectly try to parse the IPCB as
> IP options resulting in a buffer overflow.
> This change refills the CB area back with IP
> options before ip_fragment calls icmp_send. If we fail parsing,
> we zero out the IPCB area to guarantee that the stack does
> not get corrupted.
>
> A bridge should not fragment packets; an *ethernet* bridge should not
> need to.  Fragmenting packets is the job of higher level protocol.

Well, did you test what happens if we try to refrag a packet
containing ip options after the revert?

can happen e.g. when using netfilter conntrack on top of a bridge.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ip_set: protocol %u message -- useful?

2014-02-13 Thread Florian Westphal
Ilia Mirkin  wrote:
> > Maybe printing "using protocol version X" will make it appear less like
> > a debugging message referring to packet contents or something similar.
> 
> With pr_info it'll still appear in dmesg, and it'll still be "random
> non-sensical message appears over and over in dmesg" type of
> situation, to the vast majority of users. Do we need a print every
> time someone creates a new tcp connection too? I'm still not totally
> clear on the cause of this message getting printed, but I was seeing
> it a whole bunch in my configuration...

Yes, because it erronously got moved into the netns init function.

And thats what causes the spew.  Moving it back into module init
function should be enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   >