[PATCH] netfilter: fix missing dependencies for NETFILTER_XT_MATCH_CONNLABEL
It was possible to set NF_CONNTRACK=n NF_CONNTRACK_LABELS=y via NETFILTER_XT_MATCH_CONNLABEL=y: warning: (NETFILTER_XT_MATCH_CONNLABEL) selects NF_CONNTRACK_LABELS which has unmet direct dependencies (NET && INET && NETFILTER && NF_CONNTRACK) Reported-by: Randy Dunlap Signed-off-by: Florian Westphal --- diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index eb2c8eb..d4dd702 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -860,6 +860,7 @@ config NETFILTER_XT_MATCH_CONNBYTES config NETFILTER_XT_MATCH_CONNLABEL tristate '"connlabel" match support' select NF_CONNTRACK_LABELS + depends on NF_CONNTRACK depends on NETFILTER_ADVANCED ---help--- This match allows you to test and assign userspace-defined labels names -- 1.7.8.6 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Regression, bisected: hyperv shutdown panics guest
With 3.7, hyperv guest shutdown no longer works. Instead, guest kernel throws a bunch of "BUG: scheduling-while-atomic" errors and then dies. reverting commit 6c0c0d4d1080840eabb3d055d2fd8191c5fd Author: hongfeng Date: Thu Oct 4 17:12:25 2012 -0700 poweroff: fix bug in orderly_poweroff() fixes this problem. Greping for users of orderly_poweroff() shows that hyperv isn't the only caller that invokes the function from irq context. In fact, kdoc for orderly_poweroff says: * This may be called from any context to trigger a system shutdown. * If the orderly shutdown fails, it will force an immediate shutdown. Any suggestions on how to properly fix this? Thanks, Florian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: xt_nat_init: BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
Fengguang Wu wrote: > Hi Patrick, > > This happens in today's linux-next tree and is pretty reproducible. > [1.834544] nf_conntrack version 0.5.0 (1786 buckets, 7144 max) > [1.835406] ctnetlink v0.93: registering with nfnetlink. > [1.836202] BUG: unable to handle kernel NULL pointer dereference at > 00e0 > [1.837539] IP: [] mutex_lock_interruptible+0x23/0x70 Should be fixed by commit 00545bec9412d130c77f72a08d6c8b6ad21d4a1e Author: Pablo Neira Ayuso Subject: netfilter: fix crash during boot if NAT has been compiled built-in But seems its not yet in net-next. Pablo? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
"WRITE SAME failed. Manually zeroing" with mptsas driver
Hi. I get repeated "WRITE SAME" failed errors with "SAS1064ET" Controller (mptsas driver). Excerpt: [ 5898.784829] Sense Key : 0x5 [current] [ 5898.784833] sd 6:1:0:0: [sda] [ 5898.784835] ASC=0x0 ASCQ=0x0 [ 5898.784837] sd 6:1:0:0: [sda] CDB: [ 5898.784838] cdb[0]=0x41: 41 00 07 4f db 12 00 00 08 00 [ 5898.784858] sda6: WRITE SAME failed. Manually zeroing. [ 5898.74] sd 6:1:0:0: [sda] [ 5898.78] Result: hostbyte=0x00 driverbyte=0x08 [ 5898.788891] sd 6:1:0:0: [sda] [ 5898.788893] Sense Key : 0x5 [current] [ 5898.788896] sd 6:1:0:0: [sda] [ 5898.788898] ASC=0x0 ASCQ=0x0 [ 5898.788900] sd 6:1:0:0: [sda] CDB: [ 5898.788902] cdb[0]=0x41: 41 00 07 4f db 1a 00 00 10 00 [ 5898.788922] sda6: WRITE SAME failed. Manually zeroing. [ 5898.792943] sd 6:1:0:0: [sda] I tested with commit 66c28f97120e8a ("[SCSI] sd: Update WRITE SAME heuristics") but that had no effect. I would be grateful for any other ideas/fixes to try. For the time being, I applied following hack which suppresses the message spew: --- linux-3.8.6.orig/block/blk-lib.c +++ linux-3.8.6/block/blk-lib.c @@ -285,7 +285,9 @@ int __blkdev_issue_zeroout(struct block_ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask) { - if (bdev_write_same(bdev)) { + static int failcnt; + + if (failcnt < 16 && bdev_write_same(bdev)) { unsigned char bdn[BDEVNAME_SIZE]; if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask, @@ -294,6 +296,7 @@ int blkdev_issue_zeroout(struct block_de bdevname(bdev, bdn); pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn); + WARN_ON(++failcnt == 16); } return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask); In case it helps, the WARN yiels following backtrace: [ 641.039233] WARNING: at block/blk-lib.c:299 blkdev_issue_zeroout+0xc4/0xe6() [ 641.039234] Hardware name: PRIMERGY RX100 S7p [ 641.039236] Modules linked in: ifb xt_TCPMSS xt_REDIRECT ipt_MASQUERADE xt_policy xt_nat xt_length2(O) xt_CLASSIFY xt_hashlimit xt_TPROXY nf_tproxy_core xt_socket xt_NFQUEUE xt_connmark xt_limit xt_mark xt_set xt_addrtype xt_tcpudp ip_set_hash_ip nfnetlink_queue nf_nat_ftp nf_conntrack_ftp af_packet iptable_mangle iptable_nat nf_nat_ipv4 nf_nat xt_NFLOG xt_condition(O) xt_logmark xt_owner ipt_REJECT xt_state ip_set ip_scheduler nfnetlink_log nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter iptable_raw xt_CT nf_conntrack_netlink nfnetlink nf_conntrack ip_tables x_tables acpi_cpufreq mperf crc32c_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 aes_generic xts gf128mul ehci_pci coretemp rtc_cmos ehci_hcd sg i2c_i801 evdev ac sr_mod cdrom microcode button acpi_power_meter e1000e(O) container sd_mod fan thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas ahci libahci libata scsi_mod edd [last unloaded: ifb] [ 641.039321] Pid: 2425, comm: flush-8:0 Tainted: G O 3.8.6-22.geaf5f75-smp64 #1 [ 641.039322] Call Trace: [ 641.039327] [] ? blkdev_issue_zeroout+0xc4/0xe6 [ 641.039331] [] ? warn_slowpath_common+0x78/0x8d [ 641.039334] [] ? blkdev_issue_zeroout+0xc4/0xe6 [ 641.039339] [] ? ext4_ext_zeroout+0x4a/0x55 [ 641.039343] [] ? ext4_ext_map_blocks+0x78a/0x1660 [ 641.039348] [] ? blk_recount_segments+0x1b/0x2c [ 641.039353] [] ? mempool_alloc+0x54/0x132 [ 641.039358] [] ? __blk_segment_map_sg+0x115/0x153 [ 641.039362] [] ? ext4_map_blocks+0x149/0x23c [ 641.039366] [] ? mpage_da_map_and_submit+0x9c/0x79d [ 641.039376] [] ? mptscsih_qcmd+0x4e4/0x513 [mptscsih] [ 641.039387] [] ? scsi_finish_command+0xb4/0xb4 [scsi_mod] [ 641.039391] [] ? blk_peek_request+0x17c/0x18e [ 641.039395] [] ? start_this_handle+0x418/0x429 [ 641.039399] [] ? find_get_pages_tag+0xf5/0x136 [ 641.039404] [] ? ext4_da_writepages+0x7f4/0x932 [ 641.039409] [] ? __writeback_single_inode+0x39/0xd1 [ 641.039412] [] ? writeback_sb_inodes+0x22e/0x3af [ 641.039416] [] ? __writeback_inodes_wb+0x65/0xa1 [ 641.039419] [] ? wb_writeback+0x108/0x18c [ 641.039423] [] ? lock_timer_base+0x26/0x4c [ 641.039426] [] ? wb_do_writeback+0x141/0x1ac [ 641.039429] [] ? del_timer+0x7c/0x7c [ 641.039433] [] ? bdi_writeback_thread+0x82/0x140 [ 641.039436] [] ? wb_do_writeback+0x1ac/0x1ac [ 641.039440] [] ? kthread+0xad/0xb7 [ 641.039444] [] ? kthread_freezable_should_stop+0x51/0x51 [ 641.039450] [] ? ret_from_fork+0x7c/0xb0 [ 641.039453] [] ? kthread_freezable_should_stop+0x51/0x51 [ 641.039456] ---[ end trace d195e97e48c008d0 ]--- Thanks, Florian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: Tree for Sep 4 (netfilter: xt_TPROXY)
Randy Dunlap wrote: > On 09/04/13 01:13, Stephen Rothwell wrote: > > Hi all, > > > > Please do not add any code for v3.13 to your linux-next included branches > > until after v3.12-rc1 is released. > > > > Changes since 20130902: > > > > on x86_64: > > when CONFIG_IPV6=m > and CONFIG_NETFILTER_XT_TARGET_TPROXY=y: > > net/built-in.o: In function `tproxy_tg6_v1': > xt_TPROXY.c:(.text+0x5dc05): undefined reference to `udp6_lib_lookup' > xt_TPROXY.c:(.text+0x5e32f): undefined reference to `udp6_lib_lookup' > xt_TPROXY.c:(.text+0x5e432): undefined reference to `udp6_lib_lookup' > net/built-in.o: In function `tproxy_tg_init': > xt_TPROXY.c:(.init.text+0x1540): undefined reference to > `nf_defrag_ipv6_enable' Thanks for reporting. I can reproduce same error with 3.10.6 stable tree, so its not a recent problem. As always, the tempting solution is to just forbid TPROXY=y with IPV6=m but it would be better to get rid of the ipv6 link time deps.. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] Fatal exception in interrupt - nf_nat_cleanup_conntrack during IPv6 tests
CAI Qian wrote: [ CC'd nf-devel ] > Just hit this very often during IPv6 tests in both the latest stable > and mainline kernel. > > [ 3597.206166] Modules linked in: [..] > nf_nat_ipv4(F-) [..] > [ 3597.804861] RIP: 0010:[] [] > nf_nat_cleanup_conntrack+0x42/0x70 [nf_nat] > [ 3597.855207] RSP: 0018:880202c63d40 EFLAGS: 00010246 > [ 3597.881350] RAX: RBX: 8801ac7bec28 RCX: > 8801d0eedbe0 > [ 3597.917226] RDX: dead00200200 RSI: 0011 RDI: > a03265b8 [..] > [ 3598.421036] > [ 3598.430467] [] __nf_ct_ext_destroy+0x44/0x60 > [nf_conntrack] > [ 3598.499191] [] nf_conntrack_free+0x2e/0x70 > [nf_conntrack] > [ 3598.534121] [] destroy_conntrack+0xbd/0x110 > [nf_conntrack] > [ 3598.569981] [] nf_conntrack_destroy+0x17/0x20 > [ 3598.599579] [] death_by_timeout+0xdc/0x1b0 > [nf_conntrack] [..] > [ 3599.241868] Code: 83 ec 08 0f b6 58 11 84 db 74 43 48 01 c3 48 83 7b 20 00 > 74 39 48 c7 c7 b8 65 32 a0 e8 98 fc 2e e1 48 8b 03 48 8b 53 08 48 85 c0 <48> > 89 02 74 04 48 89 50 08 48 ba 00 02 20 00 00 00 ad de 48 c7 > [ 3599.337037] RIP [] nf_nat_cleanup_conntrack+0x42/0x70 > [nf_nat] Looks like we tried to remove bysource hash twice (rdx is LIST_POISON_2). I wonder if this would explain it: static void nf_nat_l4proto_clean(u8 l3proto, u8 l4proto) { [..] /* Step 1 - remove from bysource hash */ clean.hash = true; for_each_net(net) nf_ct_iterate_cleanup(net, nf_nat_proto_clean, &clean); A nfct->timer fires and a conntrack is free'd before step 2 memsets the nat extension. In that case, we would try to delete nat->bysource again? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] net: frag, fix race conditions in LRU list maintenance
Konstantin Khlebnikov wrote: > This patch fixes race between inet_frag_lru_move() and inet_frag_lru_add() > which was introduced in commit 3ef0eb0db4bf92c6d2510fe5c4dc51852746f206 > ("net: frag, move LRU list maintenance outside of rwlock") > > One cpu already added new fragment queue into hash but not into LRU. > Other cpu found it in hash and tries to move it to the end of LRU. > This leads to NULL pointer dereference inside of list_move_tail(). > > Another possible race condition is between inet_frag_lru_move() and > inet_frag_lru_del(): move can happens after deletion. True, thanks for fixing this problem. > This patch initializes LRU list head before adding fragment into hash and > inet_frag_lru_move() doesn't touches it if it's empty. Acked-by: Florian Westphal -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
"WRITE SAME failed. Manually zeroing" with 3w-xxxx driver
After update to 3.8 dmesg is spammed with: kernel: [ 280.272094] 3w-: scsi8: Unknown scsi opcode: 0x41 kernel: [ 280.272107] sd 8:0:0:0: [sda] Unhandled error code kernel: [ 280.272110] sd 8:0:0:0: [sda] kernel: [ 280.272112] Result: hostbyte=0x04 driverbyte=0x00 kernel: [ 280.272114] sd 8:0:0:0: [sda] CDB: kernel: [ 280.272116] cdb[0]=0x41: 41 00 10 09 aa 22 00 00 08 00 kernel: [ 280.272123] end_request: I/O error, dev sda, sector 269068834 kernel: [ 280.272130] sda6: WRITE SAME failed. Manually zeroing. [..] This goes on and on. I am not familiar with block layer or scsi drivers, so I don't know if this is an issue with the 3w- driver or the block layer. I've hacked around this by reverting "block: Make blkdev_issue_zeroout use WRITE SAME" (579e8f3c7b2ecf7db91398d942d76457a3ddba21), as this hw doesn't support write_same anyway. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: "WRITE SAME failed. Manually zeroing" with 3w-xxxx driver
Martin K. Petersen wrote: > Florian> After update to 3.8 dmesg is spammed with: kernel: [ > Florian> 280.272094] 3w-: scsi8: Unknown scsi opcode: 0x41 kernel: [ > Florian> 280.272107] sd 8:0:0:0: [sda] Unhandled error code kernel: > > Interesting. It looks like the 3ware handles this at the driver level > instead of passing the command through to the disk and letting it > fail. That in turn means that the logic we have in place to disable WS > when the disk does not support it does not get triggered. > > The driver should really fill out the sense buffer in that case. > > Could you please test the patch below? Sure, I'll report back tomorrow. Thanks for the quick response, Florian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: "WRITE SAME failed. Manually zeroing" with 3w-xxxx driver
Martin K. Petersen wrote: > >>>>> "Florian" == Florian Westphal writes: > > Florian> After update to 3.8 dmesg is spammed with: kernel: [ > Florian> 280.272094] 3w-: scsi8: Unknown scsi opcode: 0x41 kernel: [ > Florian> 280.272107] sd 8:0:0:0: [sda] Unhandled error code kernel: > > Could you please test the patch below? Works. Only one WRITE_SAME error at boot, max_write_same_blocks in sysfs is 0, which wasn't the case before. > The second question is what it is that's issuing these zeroouts at boot? > Which filesystem are you using? What's your DM/MD config? ext4, no DM/MD is used. I guess the zeroouts are from postgres, but i'm not sure. > 3w-: Create sense buffer for unsupported commands > > Make the driver return appropriate sense data when an unsupported > operation is queued. This will cause the SCSI layer to stop issuing the > offending command. > > Reported-by: Florian Westphal > CC: adam radford > Signed-off-by: Martin K. Petersen Tested-by: Florian Westphal Thanks again. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH net-next] x86: bpf_jit_comp: secure bpf jit against spraying attacks
Eric Dumazet wrote: > From: Eric Dumazet > > hpa bringed into my attention some security related issues > with BPF JIT on x86. > > This patch makes sure the bpf generated code is marked read only, > as other kernel text sections. > > It also splits the unused space (we vmalloc() and only use a fraction of > the page) in two parts, so that the generated bpf code not starts at a > known offset in the page, but a pseudo random one. > > Refs: > http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html What about emitting additional instructions at random locations in the generated code itself? Eg., after every instruction, have random chance to insert 'xor $0xcc,%al; xor $0xcc,%al', etc? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: sk_page_frag_refill OOM killing spree
David Rientjes wrote: > > On Tue, 2013-05-21 at 14:28 +0200, Florian Westphal wrote: > > > seems like sk_page_frag_refill() can cause oom-killer invocation: > > > > > > postgres invoked oom-killer: gfp_mask=0x42d0, order=3, oom_score_adj=0 > > > Pid: 10551, comm: postgres Tainted: G O 3.8.6-5.g613ca40-smp #1 > > > Call Trace: > > > [] ? dump_header+0x60/0x191 > > > [] ? ___ratelimit+0xb2/0xc4 > > > [] ? oom_kill_process+0x61/0x2d1 > > > [] ? has_capability_noaudit+0x1c/0x23 > > > [] ? oom_badness+0x8c/0xef > > > [] ? out_of_memory+0x203/0x247 > > > [] ? __alloc_pages_nodemask+0x42b/0x4c3 > > > [] ? sk_page_frag_refill+0x6a/0xd2 > > > [] ? tcp_sendmsg+0x3e8/0x7c6 > > > [] ? inet_sendmsg+0x6b/0x75 > > > [] ? sock_sendmsg+0x8d/0xa6 > > > [] ? sys_sendto+0x105/0x130 > > > [] ? __kunmap_atomic+0x62/0x8a > > > [] ? __kunmap_atomic+0x7b/0x8a > > > [] ? __lru_cache_add+0x18/0x47 > > > [] ? handle_pte_fault+0x745/0x751 > > > [] ? kmap_atomic_prot+0xd3/0xf1 > > > [] ? handle_mm_fault+0x112/0x121 > > > [] ? sys_send+0x37/0x3b > > > > > > The system is busy, so, order-3 alloc failure doesn't strike me as odd. > > > > > > There are no allocation failures with order != 3. > > > > > > Sometimes this can happen in very short sucession, i.e. > > > and oom-killer did end up zapping 30 processes or so. > > Aside from the __GFP_NORETRY issue, could you post the full oom killer log > where it kills more than one process with /proc/sys/vm/oom_dump_tasks > enabled? That shouldn't happen unless you have a memory leak. http://strlen.de/fw/oom.txt HOWEVER, before you spend time on this: I don't think there is an issue with oom killer, I only noticed that after kernel update oom killer invocations are a) more frequent than before b) often show above backtrace (sk_page_frag_refill). I'm not even saying "sk_page_frag_refill is broken", since I don't know if adding GFP_NORETRY might add silent performance degradation, etc. Its very well possible that everything is working as intended :-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kvm: disable stealtime via reboot notifier to avoid mem corruption
else, we get memory corruption on reboot; found when tracking down initramfs unpack error on initial reboot (with qemu-kvm -smp 2, no problem with single-core). problem with doing it via kvm_shutdown() is that this file depends on CONFIG_KVM_CLOCK, also its not enough to call it for one cpu only. Signed-off-by: Florian Westphal --- not subscribed, please CC on replies. Also, I don't know much about kvm or kexec, so its possible that i missed something. In any case, this seems to fix the initramfs corruption for me. patch is against virt/kvm/kvm.git. arch/x86/kernel/kvm.c |1 + arch/x86/kernel/kvmclock.c |1 - 2 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index c1d61ee..1596cc8 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -354,6 +354,7 @@ static void kvm_pv_guest_cpu_reboot(void *unused) if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) wrmsrl(MSR_KVM_PV_EOI_EN, 0); kvm_pv_disable_apf(); + kvm_disable_steal_time(); } static int kvm_pv_reboot_notify(struct notifier_block *nb, diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index f1b42b3..5a2fa7d 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -191,7 +191,6 @@ static void kvm_crash_shutdown(struct pt_regs *regs) static void kvm_shutdown(void) { native_write_msr(msr_kvm_system_time, 0, 0); - kvm_disable_steal_time(); native_machine_shutdown(); } -- 1.7.8.6 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] kvm: disable stealtime via reboot notifier to avoid mem corruption
Marcelo Tosatti wrote: > On Fri, Aug 10, 2012 at 12:36:22PM +0200, Florian Westphal wrote: > > --- a/arch/x86/kernel/kvmclock.c > > +++ b/arch/x86/kernel/kvmclock.c > > @@ -191,7 +191,6 @@ static void kvm_crash_shutdown(struct pt_regs *regs) > > static void kvm_shutdown(void) > > { > > native_write_msr(msr_kvm_system_time, 0, 0); > > - kvm_disable_steal_time(); > > native_machine_shutdown(); > > } > This part below will introduce a bug for shutdown. Can you retest with > the addition of kvm_disable_steal_time to kvm_pv_guest_cpu_reboot only, > retest and resend please? I can, but the problem with kvm_disable_steal_time() in kvmclock.c is that with CONFIG_KVM_CLOCK=n the entire file won't be compiled. And steal time doesn't depend on CONFIG_KVM_CLOCK=y. So if removing it there is a bug leaving it in only avoids that bug for CONFIG_KVM_CLOCK=y. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2] KVM: x86: disable stealtime on reboot to avoid mem corruption
else, host continues to update stealtime after reboot, which can corrupt e.g. initramfs area. found when tracking down initramfs unpack error on initial reboot (with qemu-kvm -smp 2, no problem with single-core). Signed-off-by: Florian Westphal --- arch/x86/kernel/kvm.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) Changes since v1: - leave arch/x86/kernel/kvmclock.c as-is (requested by Marcelo) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index c1d61ee..1596cc8 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -354,6 +354,7 @@ static void kvm_pv_guest_cpu_reboot(void *unused) if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) wrmsrl(MSR_KVM_PV_EOI_EN, 0); kvm_pv_disable_apf(); + kvm_disable_steal_time(); } static int kvm_pv_reboot_notify(struct notifier_block *nb, -- 1.7.8.6 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression, bisected: openpty fails from 3.7 onwards without devpts
Alan Cox wrote: > On Thu, 10 Jan 2013 15:46:26 +0100 > Florian Westphal wrote: > > Frank Lichtenheld discovered that openpty() doesn't work anymore when > > /dev/pts is not present. > > > > We bisected this down to > > > > commit bbb63c514a3464342967237a51a21ea8f61ab951 > > Author: Wanlong Gao > > Subject: drivers:tty:fix up ENOIOCTLCMD error handling [..] > > #include > > #include > > int main(void) { > > int pty_fd, tty_fd; > > if (openpty(&pty_fd, &tty_fd, NULL, NULL, NULL) != 0) { > > perror("openpty"); > > return 1; > > } > > return 0; > > } > > > > [ compile with cc -lutil pty.c -o pty ] > > > > If devpts is available or above commit reverted openpty works again. > > The commit is fairly general - what we need to do here is to figure out > which specific thing trips up openpty so we can put the error on that > back as it was (or find a better way) so it still works. > > Can you attach an strace of the working/failing cases without /dev/pts Sure, attached. /dev/pts is not present. Both traces are from the same machine, with same kernel version (except above commit reverted). execve("./ptytest", ["./ptytest"], [/* 7 vars */]) = 0 brk(0) = 0x1d04000 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe6784d access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=15432, ...}) = 0 mmap(NULL, 15432, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe6784cc000 close(3)= 0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/lib/x86_64-linux-gnu/libutil.so.1", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\16\0\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0644, st_size=10640, ...}) = 0 mmap(NULL, 2105608, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fe6780b mprotect(0x7fe6780b2000, 2093056, PROT_NONE) = 0 mmap(0x7fe6782b1000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7fe6782b1000 close(3)= 0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\357\1\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1595408, ...}) = 0 mmap(NULL, 3709016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fe677d26000 mprotect(0x7fe677ea6000, 2097152, PROT_NONE) = 0 mmap(0x7fe6780a6000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18) = 0x7fe6780a6000 mmap(0x7fe6780ab000, 18520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe6780ab000 close(3)= 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe6784cb000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe6784ca000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe6784c9000 arch_prctl(ARCH_SET_FS, 0x7fe6784ca700) = 0 mprotect(0x7fe6780a6000, 16384, PROT_READ) = 0 mprotect(0x7fe6782b1000, 4096, PROT_READ) = 0 mprotect(0x7fe6784d2000, 4096, PROT_READ) = 0 munmap(0x7fe6784cc000, 15432) = 0 open("/dev/ptmx", O_RDWR) = -1 ENOSPC (No space left on device) open("/dev/ptyp0", O_RDWR) = 3 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0 ioctl(3, TIOCGPTN, [0]) = -1 EINVAL (Invalid argument) fstat(3, {st_mode=S_IFCHR|S_ISVTX|0666, st_rdev=makedev(2, 0), ...}) = 0 stat("/dev/ttyp0", {st_mode=S_IFCHR|S_ISVTX|0660, st_rdev=makedev(3, 0), ...}) = 0 getuid()= 0 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4 connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(4)= 0 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4 connect(4, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) close(4)= 0 brk(0) = 0x1d04000 brk(0x1d25000) = 0x1d25000 open("/etc/nsswitch.conf", O_RDONLY)= 4 fstat(4, {st_mode=S_IFREG|0644, st_size=475, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_AN
Re: regression, bisected: openpty fails from 3.7 onwards without devpts
Jiri Slaby wrote: > On 01/10/2013 11:51 PM, Jiri Slaby wrote: > > On 01/10/2013 11:45 PM, Alan Cox wrote: > >> So we should just fix TIOCGPTN on a pty with no suitable name answer to > >> return -EINVAL > > > > Yes, I agree as I'm expressed in my second mail. Sorry for the confusion. > > Does the attached patch help? Yes, it does. Thanks for fixing this. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
regression, bisected: openpty fails from 3.7 onwards without devpts
Frank Lichtenheld discovered that openpty() doesn't work anymore when /dev/pts is not present. We bisected this down to commit bbb63c514a3464342967237a51a21ea8f61ab951 Author: Wanlong Gao Subject: drivers:tty:fix up ENOIOCTLCMD error handling The original program triggering the error was pptpd, but the test program below is sufficient: #include #include int main(void) { int pty_fd, tty_fd; if (openpty(&pty_fd, &tty_fd, NULL, NULL, NULL) != 0) { perror("openpty"); return 1; } return 0; } [ compile with cc -lutil pty.c -o pty ] If devpts is available or above commit reverted openpty works again. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1 linux-next] netfilter: conntrack: fix kmemleak false positive
Fabian Frederick wrote: > Since commit f330a7fdbe16 > ("netfilter: conntrack: get rid of conntrack timer") > > closed connections remain longer in /proc/net/nf_conntrack > > Running current kernel; just after boot: > cat /proc/net/nf_conntrack | wc -l = 5 > 4 minutes required to clean up the table. We should reap the stale entries while iterating, just like we do for ctnetlink interface. Can you try this patch? diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -212,6 +212,11 @@ static int ct_seq_show(struct seq_file *s, void *v) if (unlikely(!atomic_inc_not_zero(&ct->ct_general.use))) return 0; + if (nf_ct_should_gc(ct)) { + nf_ct_kill(ct); + goto release; + } + /* we only want to print DIR_ORIGINAL */ if (NF_CT_DIRECTION(hash)) goto release; > Going back to kernel version before commit above there are > no connections after some seconds. > > Referring to the commit changelog this was an expected behaviour but > it results in temporary kmemleak reports: I don't see kmemleak complaints on my test vm, I'm reluctant to turn it off. Can you explain why we see such 'false positive'? The conntracks should still be referenced, as they are in main table. > unreferenced object 0x88003b0e6600 (size 248): > comm "rsyslogd", pid 1595, jiffies 4294741312 (age 7.343s) > ... > backtrace: > [] kmemleak_alloc+0x23/0x40 > [] kmem_cache_alloc+0xd9/0x180 > [] __nf_conntrack_alloc.isra.50+0x48/0x170 > [] nf_conntrack_in+0x3a2/0x5f0 > [] ipv4_conntrack_local+0x40/0x50 > [] nf_iterate+0x5d/0x70 > [] nf_hook_slow+0x5f/0xb0 > [] __ip_local_out+0xad/0xe0 > [] ip_local_out+0x17/0x40 > [] ip_send_skb+0x14/0x40 > [] udp_send_skb+0x91/0x260 > [] udp_sendmsg+0x2f5/0x950 > [] inet_sendmsg+0x60/0x90 > [] sock_sendmsg+0x33/0x40 > [] SYSC_sendto+0xee/0x160 > [] SyS_sendto+0x9/0x10 > > (248 bytes being an nf_conn structure) > > Those structures being cleared in gc_worker() later on we can't talk > about unreferenced object so this patch uses kmemleak_not_leak() to > prevent those warnings. If thats the case, why is kmemleak complaining? Are you sure this is a false positive?
Re: [PATCH 1/1 linux-next] netfilter: conntrack: fix kmemleak false positive
Fabian Frederick wrote: > Hello Florian, > > First problem is solved: table gets cleared 3 minutes earlier > but I still have kmemleak before running the following: > > echo scan > /sys/kernel/debug/kmemleak > cat /sys/kernel/debug/kmemleak > Nothing > echo scan > /sys/kernel/debug/kmemleak > cat /sys/kernel/debug/kmemleak > -> rsyslogd > > I talked about false positive because everything is cleared later. Hmm, I fear this is a real bug and not false positive. Should be possible to confirm this via slabinfo: grep nf_conntrack /proc/slabinfo The active objects should match the conntrack count. (conntrack -C, or wc -l < /proc/). > > > unreferenced object 0x88003b0e6600 (size 248): > > > comm "rsyslogd", pid 1595, jiffies 4294741312 (age 7.343s) > > > ... > > > backtrace: > > > [] kmemleak_alloc+0x23/0x40 > > > [] kmem_cache_alloc+0xd9/0x180 > > > [] __nf_conntrack_alloc.isra.50+0x48/0x170 > > > [] nf_conntrack_in+0x3a2/0x5f0 > > > [] ipv4_conntrack_local+0x40/0x50 > > > [] nf_iterate+0x5d/0x70 > > > [] nf_hook_slow+0x5f/0xb0 > > > [] __ip_local_out+0xad/0xe0 > > > [] ip_local_out+0x17/0x40 > > > [] ip_send_skb+0x14/0x40 > > > [] udp_send_skb+0x91/0x260 > > > [] udp_sendmsg+0x2f5/0x950 > > > [] inet_sendmsg+0x60/0x90 > > > [] sock_sendmsg+0x33/0x40 > > > [] SYSC_sendto+0xee/0x160 > > > [] SyS_sendto+0x9/0x10 Hmm, so we leak when allocating conntrack for outgoing packet. Do you do any filtering (DROP) in output/postrouting? > > > (248 bytes being an nf_conn structure) > > > > > > Those structures being cleared in gc_worker() later on we can't talk > > > about unreferenced object so this patch uses kmemleak_not_leak() to > > > prevent those warnings. > > > > If thats the case, why is kmemleak complaining? Are you sure this > > is a false positive? Looks like a real bug to me, but I don't see anything obvious so far. I'll look at this again tomorrow.
Re: [RFC PATCH 1/3] net: Remove unnecessary memset in __snmp6_fill_stats64
Jia He wrote: > buff[] will be assigned later, so memset is not necessary. > > Signed-off-by: Jia He > Cc: "David S. Miller" > Cc: Alexey Kuznetsov > Cc: James Morris > Cc: Hideaki YOSHIFUJI > Cc: Patrick McHardy > --- > net/ipv6/addrconf.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index ab3e796..43fa8d0 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -4967,7 +4967,6 @@ static inline void __snmp6_fill_stats64(u64 *stats, > void __percpu *mib, > > BUG_ON(pad < 0); > > - memset(buff, 0, sizeof(buff)); > buff[0] = IPSTATS_MIB_MAX; > > for_each_possible_cpu(c) { for (i = 1; i < IPSTATS_MIB_MAX; i++) buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); Without memset result of buff[i] += ... is undefined.
Re: KASAN: stack-out-of-bounds Read in xfrm_state_find (2)
syzbot wrote: [ cc Thomas Egerer ] > syzkaller hit the following crash on > 36ef71cae353f88fd6e095e2aaa3e5953af1685d > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x303d/0x3170 > net/xfrm/xfrm_state.c:1051 > Read of size 4 at addr 88003adb7760 by task syzkaller429801/2969 Seems this was added in commit 8444cf712c5f71845cba9dc30d8f530ff0d5ff83 ("xfrm: Allow different selector family in temporary state"). No idea how to fix this: struct xfrm_state * xfrm_state_find(const xfrm_address_t *daddr, const xfrm_address_t *saddr, const struct flowi *fl, struct xfrm_tmpl *tmpl, struct xfrm_policy *pol, int *err, unsigned short family) // AF_INET { [..] unsigned short encap_family = tmpl->encap_family; // AF_INET6 [..] h = xfrm_dst_hash(net, daddr, saddr, tmpl->reqid, encap_family); saddr, daddr point to ipv4 addresses inside an on-stack flowi4 struct, i.e. they get hashed as ipv6 addresses which then results in invalid stack access. What is this supposed to do if family != encap_family? I also don't understand how address comparision is supposed to work in this case, it seems that if saddr/daddr are v4 and template v6 we compare full ipv6 addresses (how would that succeed...?) and, if saddr/daddr is v6 add template is v4 we just compare the first 32bit of the ipv6 addresses...? This fix silences the reproducer, but I am not sure about it, it looks like it papers over the real problem... diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1359,16 +1359,19 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl, struct xfrm_state **xfrm, unsigned short family) { struct net *net = xp_net(policy); + xfrm_address_t tmp, daddr, saddr; int nx; int i, error; - xfrm_address_t *daddr = xfrm_flowi_daddr(fl, family); - xfrm_address_t *saddr = xfrm_flowi_saddr(fl, family); - xfrm_address_t tmp; + + memset(&saddr, 0, sizeof(saddr)); + memset(&daddr, 0, sizeof(daddr)); + + xfrm_flowi_addr_get(fl, &saddr, &daddr, family); for (nx = 0, i = 0; i < policy->xfrm_nr; i++) { struct xfrm_state *x; - xfrm_address_t *remote = daddr; - xfrm_address_t *local = saddr; + xfrm_address_t *remote = &daddr; + xfrm_address_t *local = &saddr; struct xfrm_tmpl *tmpl = &policy->xfrm_vec[i]; if (tmpl->mode == XFRM_MODE_TUNNEL || @@ -1389,8 +1392,8 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl, if (x && x->km.state == XFRM_STATE_VALID) { xfrm[nx++] = x; - daddr = remote; - saddr = local; + daddr = *remote; + saddr = *local; continue; } if (x) {
Re: KASAN: stack-out-of-bounds Read in xfrm_state_find (2)
Steffen Klassert wrote: > On Wed, Nov 01, 2017 at 11:06:08PM +0100, Florian Westphal wrote: > > I also don't understand how address comparision is supposed to work in this > > case, > > it seems that if saddr/daddr are v4 and template v6 we compare full ipv6 > > addresses > > (how would that succeed...?) and, if saddr/daddr is v6 add template is v4 > > we just > > compare the first 32bit of the ipv6 addresses...? > > When we do tunnel or beet mode, we pass saddr and daddr from the > template to xfrm_state_find(), this should be ok. On transport > mode, we pass the addresses from the flowi, assuming that the > IP addresses (and address family) don't change during transformation. > This assumption is wrong in the IPv4 mapped IPv6 case, packet > is IPv4 and template is IPv6. Right, sendto() uses ipv4 address on ipv6 socket. > I'd propose to use the addresses from the template unconditionally, > like the (untested) patch below does. > > Unfortunalely the reproducer does not work with my config, > sendto returns EAGAIN. Could anybody try this patch? The reproducer no longer causes KASAN spew with your patch, but i don't have a test case that actually creates/uses a tunnel.
Re: suspicious RCU usage at ./include/linux/inetdevice.h:LINE
Cong Wang wrote: > > CPU: 0 PID: 23859 Comm: syz-executor2 Not tainted 4.14.0-rc5+ #140 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > Call Trace: > > __dump_stack lib/dump_stack.c:16 [inline] > > dump_stack+0x194/0x257 lib/dump_stack.c:52 > > lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4665 > > __in_dev_get_rtnl include/linux/inetdevice.h:230 [inline] > > fib_dump_info+0x1136/0x13d0 net/ipv4/fib_semantics.c:1377 > > inet_rtm_getroute+0xf97/0x2d70 net/ipv4/route.c:2785 > > This is introduced by: > > commit 394f51abb3d04f33fb798f04b16ae6b0491ea4ec > Author: Florian Westphal > Date: Tue Aug 15 16:34:44 2017 +0200 > > ipv4: route: set ipv4 RTM_GETROUTE to not use rtnl > > Signed-off-by: Florian Westphal > Signed-off-by: David S. Miller > > Looks like we need a wrapper for rcu_dereference_protected(dev->ip_ptr). Yes, thats the alternative to https://patchwork.ozlabs.org/patch/833401/ which switches to _rcu version.
Re: [PATCH] Net: netfilter: Moved vmalloc call to kmalloc call
Charlie Sale wrote: > Fixed FIXME comment in code my changing a vmalloc call > to a kmalloc call. Thought it would be a good place to > start for a first patch. Please at least compile test your patches. > - /* FIXME: don't use vmalloc() here or anywhere else -HW */ > - hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) + > - sizeof(struct hlist_head) * size); > + > + hinfo = kmalloc(sizeof(*hinfo) + > + sizeof(struct hlist_head) * size, GPT_KERNEL); If anything this should be switched to kvmalloc, not kmalloc. Also, hinfo cannot be free'd via vfree after this change, so you need to adjust all free operations too.
Re: [PATCH] Net: netfilter: vmalloc/vfree to kvmalloc/kvfree
Charlie Sale wrote: > + hinfo = kvmalloc(sizeof(*hinfo) + sizeof(struct hlist_head) * size, > + GPT_KERNEL); Looks like you did not even compile test this. Again. :-(
[PATCH] fault-inject: fix inverted interval/probability values in printk
interval displays the probability and vice versa. Fixes: 6adc4a22f20bb ("fault-inject: add ratelimit option") Signed-off-by: Florian Westphal --- lib/fault-inject.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/fault-inject.c b/lib/fault-inject.c index f1cdeb0..6a823a5 100644 --- a/lib/fault-inject.c +++ b/lib/fault-inject.c @@ -44,7 +44,7 @@ static void fail_dump(struct fault_attr *attr) printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n" "name %pd, interval %lu, probability %lu, " "space %d, times %d\n", attr->dname, - attr->probability, attr->interval, + attr->interval, attr->probability, atomic_read(&attr->space), atomic_read(&attr->times)); if (attr->verbose > 1) -- 2.0.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netlink: fix memory leak of dump
Pablo Neira Ayuso wrote: > > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c > > --- a/net/netfilter/nf_tables_api.c > > +++ b/net/netfilter/nf_tables_api.c > > @@ -5010,6 +5013,22 @@ nft_obj_filter_alloc(const struct nlattr * const > > nla[]) > > return filter; > > } > > > > +static int nf_tables_dump_obj_start(struct netlink_callback *cb) > > +{ > > + const struct nlattr * const *nla = cb->data; On-Stack input. I can't see how its wrong, ->start() happens from same context as netlink_dump_start so its valid. > > + struct nft_obj_filter *filter = NULL; > > + > > + if (nla[NFTA_OBJ_TABLE] || > > + nla[NFTA_OBJ_TYPE]) { > > + filter = nft_obj_filter_alloc(nla); > > + if (IS_ERR(filter)) > > + return -ENOMEM; > > + } > > + > > + cb->data = filter; And this replaced the on-stack input with dynamically allocated one, which will be free'd via ->done(). > > /* called with rcu_read_lock held */ > > static int nf_tables_getobj(struct net *net, struct sock *nlsk, > > struct sk_buff *skb, const struct nlmsghdr *nlh, > > @@ -5028,21 +5047,13 @@ static int nf_tables_getobj(struct net *net, struct > > sock *nlsk, > > > > if (nlh->nlmsg_flags & NLM_F_DUMP) { > > struct netlink_dump_control c = { > > + .start = nf_tables_dump_obj_start, > > .dump = nf_tables_dump_obj, > > .done = nf_tables_dump_obj_done, > > .module = THIS_MODULE, > > + .data = (void *)nla, > > You cannot do this. > > nla is allocated in this stack. Yes. > the nla will not be available in the second recv(), it won't be there. Its replaced in ->start(). As David pointed out, once ->start() returns 0 we set cb_running, i.e. only after successful ->start() netlink core will call ->dump() again. So I see no problem setting ->data to onstack cookie and then duplicating it to heap via kmemdup in ->start(). As far as I can see netlink core offers all functionality already, so we only need to switch netfilter to make use of it. If you disagree please let me know, otherwise I will cook up a patch along this pattern for net/netfilter/*. Thanks.
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
Michal Hocko wrote: > On Mon 29-01-18 23:35:22, Florian Westphal wrote: > > Kirill A. Shutemov wrote: > [...] > > > I hate what I'm saying, but I guess we need some tunable here. > > > Not sure what exactly. > > > > Would memcg help? > > That really depends. I would have to check whether vmalloc path obeys > __GFP_ACCOUNT (I suspect it does except for page tables allocations but > that shouldn't be a big deal). But then the other potential problem is > the life time of the xt_table_info (or other potentially large) data > structures. Are they bound to any process life time. No. > Because if they are > not then the OOM killer will not help. The OOM panic earlier in this > thread suggests it doesn't because the test case managed to eat all the > available memory and killed all the eligible tasks which didn't help. Yes, which is why we do not want any OOM killer invocation in first place... > So in some sense the memcg would help to stop the excessive allocation, > but it wouldn't resolve it other than kill all tasks in the affected > memcg/container. Whether this is sufficient or not, I dunno. It sounds > quite suboptimal to me. But it is true this would be less tricky then > adding a global knob... Global knob doesn't really help at all, I can add multiple large iptables rulesets (so we would have to account), and we have same issue in virtually all of networking, so we need limits for interface count, tunnel count, ipsec policies/SAs, nftables, tc, etc etc.
Re: possible deadlock in xt_find_table_lock
#syz dup: possible deadlock in do_ip_getsockopt
Re: possible deadlock in xt_find_revision
#syz dup: possible deadlock in do_ip_getsockopt
Re: possible deadlock in do_ip_setsockopt
#syz dup: possible deadlock in do_ip_getsockopt
Re: possible deadlock in do_ipv6_setsockopt
#syz dup: possible deadlock in do_ip_getsockopt
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
> From d48e950f1b04f234b57b9e34c363bdcfec10aeee Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Tue, 30 Jan 2018 14:51:07 +0100 > Subject: [PATCH] net/netfilter/x_tables.c: make allocation less aggressive Acked-by: Florian Westphal
Re: possible deadlock in do_ip_getsockopt
syzbot wrote: > Hello, > > syzbot hit the following crash on upstream commit > c4e0ca7fa24137e372d6135fe16e8df8e123f116 (Fri Jan 26 23:10:50 2018 +) > Merge tag 'riscv-for-linus-4.15-maintainers' of > git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux > > So far this crash happened 3 times on net-next, upstream. > Unfortunately, I don't have any reproducer for this crash yet. > Raw console output is attached. > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+c6ac05d30245e21f7...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > > == > WARNING: possible circular locking dependency detected > 4.15.0-rc9+ #283 Not tainted > -- > syz-executor7/5121 is trying to acquire lock: > (sk_lock-AF_INET){+.+.}, at: [] lock_sock > include/net/sock.h:1461 [inline] > (sk_lock-AF_INET){+.+.}, at: [ ] > do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331 > > but task is already holding lock: > (rtnl_mutex){+.+.}, at: [<8db87953>] rtnl_lock+0x17/0x20 > net/core/rtnetlink.c:72 > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #1 (rtnl_mutex){+.+.}: >__mutex_lock_common kernel/locking/mutex.c:756 [inline] >__mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 >mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 >rtnl_lock+0x17/0x20 net/core/rtnetlink.c:72 >register_netdevice_notifier+0xad/0x860 net/core/dev.c:1590 >clusterip_config_init net/ipv4/netfilter/ipt_CLUSTERIP.c:261 [inline] >clusterip_tg_check+0xeb9/0x1570 > net/ipv4/netfilter/ipt_CLUSTERIP.c:478 >xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:845 >check_target net/ipv4/netfilter/ip_tables.c:518 [inline] >find_check_entry.isra.8+0x8c8/0xcb0 > net/ipv4/netfilter/ip_tables.c:559 >translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:730 >do_replace net/ipv4/netfilter/ip_tables.c:1148 [inline] >do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1682 >nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] >nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 >ip_setsockopt+0xa1/0xb0 net/ipv4/ip_sockglue.c:1256 >raw_setsockopt+0xb7/0xd0 net/ipv4/raw.c:857 >sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2968 >SYSC_setsockopt net/socket.c:1831 [inline] >SyS_setsockopt+0x189/0x360 net/socket.c:1810 >entry_SYSCALL_64_fastpath+0x29/0xa0 > > -> #0 (sk_lock-AF_INET){+.+.}: >lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914 >lock_sock_nested+0xc2/0x110 net/core/sock.c:2770 >lock_sock include/net/sock.h:1461 [inline] >do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331 >ip_getsockopt+0x90/0x220 net/ipv4/ip_sockglue.c:1562 >tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:3329 >sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2927 >SYSC_getsockopt net/socket.c:1862 [inline] >SyS_getsockopt+0x178/0x340 net/socket.c:1844 >entry_SYSCALL_64_fastpath+0x29/0xa0 > > other info that might help us debug this: > > Possible unsafe locking scenario: > >CPU0CPU1 > > lock(rtnl_mutex); >lock(sk_lock-AF_INET); >lock(rtnl_mutex); > lock(sk_lock-AF_INET); > > *** DEADLOCK *** > > 1 lock held by syz-executor7/5121: > #0: (rtnl_mutex){+.+.}, at: [<8db87953>] rtnl_lock+0x17/0x20 > net/core/rtnetlink.c:72 > > stack backtrace: > CPU: 1 PID: 5121 Comm: syz-executor7 Not tainted 4.15.0-rc9+ #283 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:53 > print_circular_bug.isra.37+0x2cd/0x2dc kernel/locking/lockdep.c:1218 > check_prev_add kernel/locking/lockdep.c:1858 [inline] > check_prevs_add kernel/locking/lockdep.c:1971 [inline] > validate_chain kernel/locking/lockdep.c:2412 [inline] > __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3426 > lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3914 > lock_sock_nested+0xc2/0x110 net/core/sock.c:2770 > lock_sock include/net/sock.h:1461 [inline] > do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331 > ip_getsockopt+0x90/0x220 net/ipv4/ip_sockglue.c:1562 > tcp_getsockopt+0x82/0xd0 net/ipv4/tcp.c:3329 > sock_common_getsockopt+0x95/0xd0 net/core/sock.c:2927 > SYSC_getsockopt net/socket.c:1862 [inline] > SyS_
Re: general protection fault in nf_tables_dump_obj_done
#syz fix: netfilter: nf_tables: fix potential NULL-ptr deref in nf_tables_dump_obj_done()
Re: possible deadlock in do_ip_getsockopt
syzbot wrote: > syzbot hit the following crash on upstream commit > c4e0ca7fa24137e372d6135fe16e8df8e123f116 (Fri Jan 26 23:10:50 2018 +) > Merge tag 'riscv-for-linus-4.15-maintainers' of > git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux > > So far this crash happened 3 times on net-next, upstream. > Unfortunately, I don't have any reproducer for this crash yet. > Raw console output is attached. > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+c6ac05d30245e21f7...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > > == > WARNING: possible circular locking dependency detected > 4.15.0-rc9+ #283 Not tainted > -- > syz-executor7/5121 is trying to acquire lock: > (sk_lock-AF_INET){+.+.}, at: [] lock_sock > include/net/sock.h:1461 [inline] > (sk_lock-AF_INET){+.+.}, at: [ ] > do_ip_getsockopt+0x1b3/0x2170 net/ipv4/ip_sockglue.c:1331 all of these deadlocks boil down to fact that we acquire sk lock before calling nf set/getsockopt handlers, yet its not clear to me why we do this in the first place. Paolo Abeni is looking at pushing the sk locking down in those nf handlers that need it which would avoid these deadlocks.
Re: general protection fault in ip6t_do_table
syzbot wrote: > CPU: 0 PID: 3675 Comm: syzkaller168273 Not tainted 4.15.0-rc9+ #283 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > RIP: 0010:ip6t_do_table+0x12de/0x19d0 net/ipv6/netfilter/ip6_tables.c:360 > RSP: 0018:8801db206c58 EFLAGS: 00010246 > RAX: RBX: 8801d9aafb80 RCX: 84d30352 > RDX: 0100 RSI: RDI: 8801d9aafcde > RBP: 8801db206e60 R08: 11003b640d54 R09: > R10: 00d0 R11: R12: 0001 > R13: R14: dc00 R15: 8801d9aafc50 > FS: 00a5d880() GS:8801db20() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 209f1000 CR3: 0001bc9f8001 CR4: 001606f0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > > ip6table_security_hook+0x65/0x80 net/ipv6/netfilter/ip6table_security.c:45 In case noone else can look at this I'll investigate this next friday (currently travelling).
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
Tetsuo Handa wrote: > syzbot wrote: > > syzbot hit the following crash on net-next commit > > 6bb46bc57c8e9ce947cc605e555b7204b44d2b10 (Fri Jan 26 16:00:23 2018 +) > > Merge branch 'cxgb4-fix-dump-collection-when-firmware-crashed' > > > > C reproducer is attached. > > syzkaller reproducer is attached. > > Raw console output is attached. > > compiler: gcc (GCC) 7.1.1 20170620 > > .config is attached. > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+8630e35fc7287b392...@syzkaller.appspotmail.com > > It will help syzbot understand when the bug is fixed. See footer for > > details. > > If you forward the report, please keep this part and the footer. > > > > [ 3685] 0 3685178211 1843200 0 sshd > > [ 3692] 0 3692 43760327680 0 > > syzkaller025682 > > [ 3695] 0 3695 43760368640 0 > > syzkaller025682 > > Kernel panic - not syncing: Out of memory and no killable processes... > > > > This sounds like too huge vmalloc() request where size is controlled by > userspace. Right. Before eacd86ca3b036e55e172b7279f101cef4a6ff3a4 this used info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, would it help to re-add that? > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back > off when the current task is killed") but then became unkillable by commit > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is > killed""). Therefore, we can't handle this problem from MM side. > Please consider adding some limit from networking side. I don't know what "some limit" would be. I would prefer if there was a way to supress OOM Killer in first place so we can just -ENOMEM user. AFAIU NOWARN|NORETRY does that, so I would propose to just read-add it?
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
Kirill A. Shutemov wrote: > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote: > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back > > > off when the current task is killed") but then became unkillable by commit > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is > > > killed""). Therefore, we can't handle this problem from MM side. > > > Please consider adding some limit from networking side. > > > > I don't know what "some limit" would be. I would prefer if there was > > a way to supress OOM Killer in first place so we can just -ENOMEM user. > > Just supressing OOM kill is a bad idea. We still leave a way to allocate > arbitrary large buffer in kernel. Isn't that what we do everywhere in network stack? I think we should try to allocate whatever amount of memory is needed for the given xtables ruleset, given that is what admin requested us to do. I also would not know what limit is sane -- I've seen setups with as much as 100k iptables rules, and that was 5 years ago. And even if we add a "Xk rules" limit, it might be too much for low-memory systems, or not enough for whatever other use case there might be.
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
Kirill A. Shutemov wrote: > On Mon, Jan 29, 2018 at 05:57:22PM +0100, Florian Westphal wrote: > > Kirill A. Shutemov wrote: > > > On Mon, Jan 29, 2018 at 08:23:57AM +0100, Florian Westphal wrote: > > > > > vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: > > > > > back > > > > > off when the current task is killed") but then became unkillable by > > > > > commit > > > > > b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is > > > > > killed""). Therefore, we can't handle this problem from MM side. > > > > > Please consider adding some limit from networking side. > > > > > > > > I don't know what "some limit" would be. I would prefer if there was > > > > a way to supress OOM Killer in first place so we can just -ENOMEM user. > > > > > > Just supressing OOM kill is a bad idea. We still leave a way to allocate > > > arbitrary large buffer in kernel. > > > > Isn't that what we do everywhere in network stack? > > > > I think we should try to allocate whatever amount of memory is needed > > for the given xtables ruleset, given that is what admin requested us to do. > > Is it correct that "admin" in this case is root in random container? Yes. > I mean, can we get access to it with CLONE_NEWUSER|CLONE_NEWNET? Yes. > This can be fun. Do we prevent "admin in random container" to insert 2m ipv6 routes (alternatively: ipsec tunnels, interfaces etc etc)? > > I also would not know what limit is sane -- I've seen setups with as much > > as 100k iptables rules, and that was 5 years ago. > > > > And even if we add a "Xk rules" limit, it might be too much for > > low-memory systems, or not enough for whatever other use case there > > might be. > > I hate what I'm saying, but I guess we need some tunable here. > Not sure what exactly. Would memcg help? (I don't buy the "run untrusted binaries on linux is safe" thing, so I would not know).
Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run
Pavel Machek wrote: > > > ...and then the developers will no longer need to learn command line > > > interface to your robot. > > > > > > #syz test: git://gcc.gnu.org/git/gcc.git master > > > #syz dup: `date` > > > > > > Pavel, please stop harming the useful process! > > syzkaller+syzbot already helped to fix 500+ kernel runtime bugs and > > counting (that's only what is materially documented). Please stop. > > Well, you are also hurting kernel development by spamming the > lists. You stop. Bullshit. Learn to filter your email if you're not interested in fixing bugs. Or unsubscribe.
Re: WARNING in __proc_create
Eric Dumazet wrote: > >>fs/proc/generic.c:354 > > > >We need to reject empty names. > > > > I sent a patch a while back, but Pablo/Florian wanted more than that simple > fix. > > We also need to filter special characters like '/' > > Or maybe I am mixing with something else. Argh, sorry, this fell off the truck it seems :( I'll work on this tomorrow unless someone else does it today ;)
Re: WARNING in __proc_create
Cong Wang wrote: > On Fri, Mar 9, 2018 at 2:58 PM, Eric Dumazet wrote: > > > > > > On 03/09/2018 02:56 PM, Eric Dumazet wrote: > > > >> > >> I sent a patch a while back, but Pablo/Florian wanted more than that > >> simple fix. > >> > >> We also need to filter special characters like '/' > > proc_create_data() itself accepts '/', so it must be xt_hashlimit doesn't > want it. --hashimit-name / also triggers WARN for me. . or .. "work", (no crash), but cause appearance of 2nd ./.. in /proc/net/ipt_hashlimit , so I think its better to disallow that too.
Re: [PATCH] proc: reject "." and ".." as filenames
Alexey Dobriyan wrote: > Various subsystems can create files and directories in /proc > with names directly controlled by userspace. > > Which means "/", "." and ".." are no-no. > > "/" split is already taken care of, do the other 2 prohibited names. Acked-by: Florian Westphal I'll send a patch for xtables too to reject bogus names coming from userspace (syzbot reports WARN() ).
Re: [PATCH v2] net: netfilter: Replace printk() with appropriate pr_*() macro
Arushi Singhal wrote: > On Mon, Mar 12, 2018 at 2:17 AM, Pablo Neira Ayuso > wrote: > > > Hi Joe, > > > > On Sun, Mar 11, 2018 at 12:52:41PM -0700, Joe Perches wrote: > > > On Mon, 2018-03-12 at 01:11 +0530, Arushi Singhal wrote: > > > > Using pr_() is more concise than > > > > printk(KERN_). > > > > Replace printks having a log level with the appropriate > > > > pr_*() macros. > > > > > > > > Signed-off-by: Arushi Singhal > > > > --- > > > > changes in v2 > > > > *in v1 printk() were replaced with netdev_*() > > > > > > > net/netfilter/nf_conntrack_acct.c | 2 +- > > > > net/netfilter/nf_conntrack_ecache.c| 2 +- > > > > net/netfilter/nf_conntrack_timestamp.c | 2 +- > > > > net/netfilter/nf_nat_core.c| 2 +- > > > > net/netfilter/nfnetlink_queue.c| 4 ++-- > > > > 5 files changed, 6 insertions(+), 6 deletions(-) > > > > > > None of these files have a #define for pr_fmt so this > > > should be OK. > > > > I think Arushi could add pr_fmt in the same go, so we skip another > > follow up patch for this. @Arushi: I suggested this in my previous > > email, please have a look. > > > > Hello Pablo > > Should I send two patches, one with the conversion of printk() to pr_() and > another for defining pr_fmt(). > > Or > > only one patch with all the changes? Both in one, it reduces code churn. With pr_* + pr_fmt, the module name will be prefixed automatically, see e.g. commit e016c5e43db51875c2b541b59bd217494d213174 as an example. > > This would also probably allows us to save the line break in the error > > message, which IIRC is not a good practise either, eg. > > > > pr_warn("nf_queue: OOM " > > "in mangle, dropping packet\n"); > > > > > Perhaps coalesce the formats and remove the unnecessary periods too. The above message should be removed, its useless (on oom allocator already warns).
Re: [PATCH AUTOSEL for 4.15 070/124] netfilter: core: only allow one nat hook per hook point
Sasha Levin wrote: > From: Florian Westphal > > [ Upstream commit f92b40a8b2645af38bd6814651c59c1e690db53d ] This patch is broken and a fix is not in any tree yet.
Re: linux-next: ip6tables *broken* - last base chain position %u doesn't match underflow %u (hook %u
valdis.kletni...@vt.edu wrote: > (Resending because I haven't heard anything) [ ip6tables broken ] Sorry, did not see this email before. I'll investigate asap, thanks for the detailed report.
Re: [RFC PATCH v2] bridge: make it possible for packets to traverse the bridge without hitting netfilter
David Woodhouse wrote: > > > On Fri, 2015-03-06 at 17:37 +0100, Florian Westphal wrote: > > > > > > I did performance measurements in the following way: > > > > > > > > Removed those pieces of the packet pipeline that I don't necessarily > > > > need one-by-one. Then measured their effect on small packet > > > > performance. > > > > > > > > This was the only part that produced considerable effect. > > > > > > > > The pure speculation was about why the effect is more than 15% > > > > increase in packet throughput, although the code path avoided > > > > contains way less code than 15% of the packet pipeline. It seems, > > > > Felix Fietkau profiled similar changes, and found my guess well > > > > founded. > > > > > > > > Now could anybody explain me what else is wrong with my patch? > > > > > > We have to come up with a more generic solution for this. > > > > Jiri Benc suggested to allowing to attach netfilter hooks e.g. via tc > > action, maybe that would be an option worth investigating. > > > > Then you could for instance add filtering rules only to the bridge port > > that needs it. > > > > > These sysfs tweaks you're proposing look to me like an obscure way to > > > tune this. > > > > I agree, adding more tunables isn't all that helpful, in the past this > > only helped to prolong the problem. > > How feasible would it be to make it completely dynamic? > > A given hook could automatically disable itself (for a given device) if > the result of running it the first time was *tautologically* false for > that device (i.e. regardless of the packet itself, or anything else). > > The hook would need to be automatically re-enabled if the rule chain > ever changes (and might subsequently disable itself again). > > Is that something that's worth exploring for the general case? AF_BRIDGE hooks sit in the net namespace, so its enough for one bridge to request filtering to bring in the hook overhead for all bridges in the same netns. Alternatives: - place the bridges that need filtering in different netns - use tc ingress for filtering - use nftables ingress hook for filtering (it sits in almost same location as tc ingress hook) to attach the ruleset to those bridge ports that need packet filtering. (The original request came from user with tons of bridges where only one single bridge needed filtering). One alternative I see is to place the bridge hooks into the bridge device (net_bridge struct, which is in netdev private area). But, as you already mentioned we would need to annotate the hooks to figure out which device(s) they are for. This sounds rather fragile to me, so i would just use nft ingress: #!/sbin/nft -f table netdev ingress { chain in_public { type filter hook ingress device eth0 priority 0; ip saddr 192.168.0.1 counter } }
Re: INFO: rcu detected stall in pfkey_sendmsg
Dmitry Vyukov wrote: > On Wed, Dec 19, 2018 at 7:37 PM syzbot > wrote: > > > > Hello, > > > > syzbot found the following crash on: > > > > HEAD commit:a26d94bff4d5 net: bridge: remove unneeded variable 'err' > > git tree: net-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=14c7a4cd40 > > kernel config: https://syzkaller.appspot.com/x/.config?x=d9655b05acfc97ff > > dashboard link: https://syzkaller.appspot.com/bug?extid=e1d3a7522b4d05aeede4 > > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > > > Unfortunately, I don't have any reproducer for this crash yet. > > +Florian, this looks related to: > INFO: rcu detected stall in xfrm_hash_rebuild > https://syzkaller.appspot.com/bug?id=62ee9df6b17e143dcd22a6bc5383c1b4ba797c8c > https://groups.google.com/forum/#!msg/syzkaller-bugs/4yD3ts-wWRA/63scKqSyDAAJ > > Should we dup them? No, not yet anyway. First report triggers rcu stall during hash rebuild, this looks like stall is directly on insertion. (Could obviously still be same bug).
Re: INFO: rcu detected stall in xfrm_hash_rebuild
Wolfgang Walter wrote: [ CCing Christophe ] > Am Montag, 10. Dezember 2018, 09:58:56 schrieb David Miller: > > From: Florian Westphal > > Date: Mon, 10 Dec 2018 13:47:24 +0100 > > > > > After recent tree conversion, we could probably make the exact policies > > > part of the 'inexact tree' (which would be renamed to 'policy tree' or > > > some such). > > > > > > Special-casing the exact policies made a lot of sense when we had > > > a single list for the inexact policies (to keep its length down). > > > > > > But now I think we could try to unify all of this and only maintain > > > the existing tree-based storage. > > > > > > Would also remove the need to do lookups in two different > > > data structures (bydst-hash-then-inexact-tree). > > > > > > What do you think? > > > > I think this makes a lot of sense. > > Sites mainly using tunnel mode this certainly makes sense. Ok. An alternative would be to remove the support for policy hash table thresholds (which decide what kinds of policies go to exact table and which ones go into inexact ones), i.e. partially revert 880a6fab8f6ba5b5abe59ea6 ("xfrm: configure policy hash table thresholds by netlink"). This would remove the need for the rehashing support that re-sorts the policies into either exact/inexact lists) when the those tunables are changed. We could also easily convert the exact table to an rhashtable then if we wanted to. I guess we should probably wait to get some operational feedback on the inexact storage first to see if it really improves things enough to make threshold tuning unneccessary. Christophe, whats your take?
Re: INFO: rcu detected stall in xfrm_hash_rebuild
Christophe Gouault wrote: > The main use cases I have encountered and tried to address with the > hash-based lookup were network operator use cases: > - a lot of dynamic /32 <=> /32 policies (protecting GTP tunnels) > - or a lot of dynamic policies with the same prefix lengths (e.g. /16 <=> /24) > and a few non-hashed policies stored in the linked list. Thanks for the detailed information. [..] > Could you verify how the configuration time would behave in such use > cases if the hash-based lookup was replaced by a tree-based lookup? I won't send a patch to remove your work, at least not at this time. In case I'd do this removal (thresholds, hash table, or both) i will make these tests to see how large the impact is.
Re: INFO: rcu detected stall in xfrm_hash_rebuild
syzbot wrote: > Hello, > > syzbot found the following crash on: [..] > Workqueue: events xfrm_hash_rebuild Ignoring this report for a second -- I think it makes sense to see if we can just remove the entire hash table rebuild/resize code. After recent tree conversion, we could probably make the exact policies part of the 'inexact tree' (which would be renamed to 'policy tree' or some such). Special-casing the exact policies made a lot of sense when we had a single list for the inexact policies (to keep its length down). But now I think we could try to unify all of this and only maintain the existing tree-based storage. Would also remove the need to do lookups in two different data structures (bydst-hash-then-inexact-tree). What do you think?
Re: WARNING in xfrm_policy_inexact_gc_tree
syzbot wrote: > > HEAD commit:74c4a24df7ca Add linux-next specific files for 20181207 > git tree: linux-next > console output: https://syzkaller.appspot.com/x/log.txt?x=17bbea7d40 > kernel config: https://syzkaller.appspot.com/x/.config?x=6e9413388bf37bed > dashboard link: https://syzkaller.appspot.com/bug?extid=0b2f1bc876d301440d44 > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > Unfortunately, I don't have any reproducer for this crash yet. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+0b2f1bc876d301440...@syzkaller.appspotmail.com > > IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready > IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready > IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready > IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready > 8021q: adding VLAN 0 to HW filter on device team0 > WARNING: CPU: 0 PID: 13116 at net/xfrm/xfrm_policy.c:1066 > xfrm_policy_inexact_gc_tree+0x233/0x280 net/xfrm/xfrm_policy.c:1066 This means we're exiting from net namespace, but for some reason policies would be left behind. At this point the storage should only contain empty nodes, as all policies are supposed to be gone. I'll have a look.
Re: linux-next: Tree for Dec 21
Guenter Roeck wrote: > mips:cavium_octeon_defconfig [4] > git bisect bad 4165079ba328dd47262a2183049d3591f0a750b1 > # first bad commit: [4165079ba328dd47262a2183049d3591f0a750b1] net: switch > secpath to use skb extension infrastructure Indeed, sorry. staging/octeon needs a small fix. Will send a patch asap.
[PATCH net-next] staging: octeon: fix build failure with XFRM enabled
skb->sp doesn't exist anymore in the next-next tree, so mips defconfig no longer builds. Use helper instead to reset the secpath. Not even compile tested. Cc: Greg Kroah-Hartman Reported-by: Guenter Roeck Fixes: 4165079ba328d ("net: switch secpath to use skb extension infrastructure") Signed-off-by: Florian Westphal --- Greg, David: The patch will not break build for a tree that lacks the 'Fixes' commit, so this can also go in via staging tree. OTOH, net-next build is broken for mips/octeon, so I think in this case net-next might make more sense? diff --git a/drivers/staging/octeon/ethernet-tx.c b/drivers/staging/octeon/ethernet-tx.c index df3441b815bb..317c9720467c 100644 --- a/drivers/staging/octeon/ethernet-tx.c +++ b/drivers/staging/octeon/ethernet-tx.c @@ -359,8 +359,7 @@ int cvm_oct_xmit(struct sk_buff *skb, struct net_device *dev) dst_release(skb_dst(skb)); skb_dst_set(skb, NULL); #ifdef CONFIG_XFRM - secpath_put(skb->sp); - skb->sp = NULL; + secpath_reset(skb); #endif nf_reset(skb); -- 2.19.2
Re: KASAN: use-after-free Write in __xfrm_policy_unlink
syzbot wrote: > syzbot has found a reproducer for the following crash on: > > HEAD commit:ce28bb445388 Merge git://git.kernel.org/pub/scm/linux/kern.. > git tree: net-next > console output: https://syzkaller.appspot.com/x/log.txt?x=1673fb1b40 > kernel config: https://syzkaller.appspot.com/x/.config?x=67a2081147a23142 > dashboard link: https://syzkaller.appspot.com/bug?extid=9d971dd21eb26567036b > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1134dcc740 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=126986ed40 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+9d971dd21eb265670...@syzkaller.appspotmail.com I've fixed this one. Chances are that at least some of the other reports are duplicates of this one. I will continue to look at other reports over the next few days and plan to send out fixes and test cases next week.
Re: [PATCH] netfilter: account ebt_table_info to kmemcg
Michal Hocko wrote: > On Fri 28-12-18 17:55:24, Shakeel Butt wrote: > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying > > memory is already accounted to kmemcg. Do the same for ebtables. The > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the > > whole system from a restricted memcg, a potential DoS. > > What is the lifetime of these objects? Are they bound to any process? No, they are not. They are free'd only when userspace requests it or the netns is destroyed.
Re: linux-next: build warnings after merge of the net-next tree
Stephen Rothwell wrote: > After merging the net-next tree, today's linux-next build (i386 defconfig) > produced these warnings: > > In file included from include/net/netfilter/nf_conntrack_tuple.h:13:0, > from include/linux/netfilter/nf_conntrack_dccp.h:28, > from include/net/netfilter/nf_conntrack.h:22, > from net/netfilter/nf_conntrack_core.c:37: > include/linux/netfilter/x_tables.h: In function 'xt_percpu_counter_alloc': > include/linux/netfilter/x_tables.h:373:10: warning: cast from pointer to > integer of different size [-Wpointer-to-int-cast] >return (__force u64) res; > ^ > include/linux/netfilter/x_tables.h: In function 'xt_percpu_counter_free': > include/linux/netfilter/x_tables.h:381:15: warning: cast to pointer from > integer of different size [-Wint-to-pointer-cast] >free_percpu((void __percpu *) pcnt); >^ > In file included from include/asm-generic/percpu.h:6:0, > from arch/x86/include/asm/percpu.h:551, > from arch/x86/include/asm/preempt.h:5, > from include/linux/preempt.h:64, > from include/linux/spinlock.h:50, > from include/linux/mm_types.h:8, > from include/linux/kmemcheck.h:4, > from include/linux/skbuff.h:18, > from include/linux/netfilter.h:5, > from net/netfilter/nf_conntrack_core.c:16: > include/linux/netfilter/x_tables.h: In function 'xt_get_this_cpu_counter': > include/linux/netfilter/x_tables.h:388:23: warning: cast to pointer from > integer of different size [-Wint-to-pointer-cast] >return this_cpu_ptr((void __percpu *) cnt->pcnt); >^ > include/linux/percpu-defs.h:206:47: note: in definition of macro > '__verify_pcpu_ptr' > const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL; \ >^ > include/linux/percpu-defs.h:239:27: note: in expansion of macro 'raw_cpu_ptr' > #define this_cpu_ptr(ptr) raw_cpu_ptr(ptr) >^ > include/linux/netfilter/x_tables.h:388:10: note: in expansion of macro > 'this_cpu_ptr' >return this_cpu_ptr((void __percpu *) cnt->pcnt); > ^ > > and many more. > > Introduced by commit: > > 71ae0dff02d7 ("netfilter: xtables: use percpu rule counters") Yes, sorry about this, should be fixed by dcb8f5c8139ef945cdfd ("netfilter: xtables: fix warnings on 32bit platforms"). Thanks, Florian -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: build warning after merge of the netfilter-next tree
Geert Uytterhoeven wrote: > On Tue, May 8, 2018 at 9:17 AM, Florian Westphal wrote: > > Stephen Rothwell wrote: > >> On Mon, 7 May 2018 10:55:19 +1000 Stephen Rothwell > >> wrote: > >> > > >> > After merging the netfilter-next tree, today's linux-next build (x86_64 > >> > allmodconfig) produced this warning: > >> > > >> > ./usr/include/linux/netfilter/nf_osf.h:25: found __[us]{8,16,32,64} type > >> > without #include > >> > >> > Introduced by commit > >> > > >> > bfb15f2a95cb ("netfilter: extract Passive OS fingerprint > >> > infrastructure from xt_osf") > > > > I'll send a fix for this, thanks for reporting. > > +config NF_OSF > + tristate 'Passive OS fingerprint infrastructure' > > "There is no help available for this option." > > Is this meant to be a user-visible symbol? No, its not. I can send a patch in case you're too busy, let me know.
Re: about selftests/netfilter test related issue
Jeffrin Thalakkottoor wrote: > i think the script nft_nat.sh is assuming devices eth0 and eth1 No it does not. These are arbitrary names given to veth devices. > Error: Unknown device type. No Veth device support in kernel?
Re: about selftests/netfilter test related issue
Jeffrin Thalakkottoor wrote: > Error: Unknown device type. Feel free to send a patch that makes it display a more reasonable exit+error here.
Re: [PATCH ghak124 v3] audit: log nftables configuration change events
Richard Guy Briggs wrote: > > > I personally would notify once per transaction. This is easy and quick. > > This was the goal. iptables was atomic. nftables appears to no longer > be so. If I have this wrong, please show how that works. nftables transactions are atomic, either the entire batch takes effect or not at all. The audit_log_nfcfg() calls got added to the the nft monitor infra which is designed to allow userspace to follow the entire content of the transaction log. So, if its just a 'something was changed' event that is needed all of them can be removed. ATM the audit_log_nfcfg() looks like this: /* step 3. Start new generation, rules_gen_X now in use. */ net->nft.gencursor = nft_gencursor_next(net); list_for_each_entry_safe(trans, next, &net->nft.commit_list, list) { switch (trans->msg_type) { case NFT_MSG_NEWTABLE: audit_log_nfcfg(); ... case NFT_MSG_... audit_log_nfcfg(); .. } which gives an audit_log for every single change in the batch. So, if just a summary is needed a single audit_log_nfcfg() after 'step 3' and outside of the list_for_each_entry_safe() is all that is needed. If a summary is wanted as well one could fe. count the number of transaction types in the batch, e.g. table adds, chain adds, rule adds etc. and then log a summary count instead.
Re: [v3 net-next 08/10] skbuff: reuse NAPI skb cache on allocation path (__build_skb())
Alexander Lobakin wrote: > we're in such context. This includes: build_skb() (called only > from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb() > (called from the same place or from kernel network softirq > functions). build_skb is called from sleepable context in drivers/net/tun.c . Perhaps there are other cases.
Re: [PATCH ghak124 v3] audit: log nftables configuration change events
Richard Guy Briggs wrote: > On 2021-02-11 23:09, Florian Westphal wrote: > > So, if just a summary is needed a single audit_log_nfcfg() > > after 'step 3' and outside of the list_for_each_entry_safe() is all > > that is needed. > > Ok, so it should not matter if it is before or after that > list_for_each_entry_safe(), which could be used to collect that summary. Right, it won't matter. > > If a summary is wanted as well one could fe. count the number of > > transaction types in the batch, e.g. table adds, chain adds, rule > > adds etc. and then log a summary count instead. > > The current fields are "table", "family", "entries", "op". > > Could one batch change more than one table? (I think it could?) Yes. > It appears it can change more than one family. > "family" is currently a single integer, so that might need to be changed > to a list, or something to indicate multi-family. Yes, it can also affect different families. > Listing all the ops seems a bit onerous. Is there a hierarchy to the > ops and if so, are they in that order in a batch or in nf_tables_commit()? No. There is a hierarchy, e.g. you can't add a chain without first adding a table, BUT in case the table was already created by an earlier transaction it can also be stand-alone. > It seems I'd need to filter out the NFT_MSG_GET_* ops. No need, the GET ops do not cause changes and will not trigger a generation id change.
Re: [PATCH ghak124 v3] audit: log nftables configuration change events
Richard Guy Briggs wrote: > On 2021-02-18 09:22, Florian Westphal wrote: > > No. There is a hierarchy, e.g. you can't add a chain without first > > adding a table, BUT in case the table was already created by an earlier > > transaction it can also be stand-alone. > > Ok, so there could be a stand-alone chain mod with one table, then a > table add of a different one with a "higher ranking" op... Yes, that can happen. > > > It seems I'd need to filter out the NFT_MSG_GET_* ops. > > > > No need, the GET ops do not cause changes and will not trigger a > > generation id change. > > Ah, so it could trigger on generation change. Would GET ops be included > in any other change No, GET ops are standalone, they cannot be part of a transaction. If you look at static const struct nfnl_callback nf_tables_cb[NFT_MSG_MAX] = { array in nf_tables_api.c, then those ops with a '.call_batch' can appear in transaction (i.e., can cause modification). The other ones (.call_rcu) are read-only. If they appear in a batch tehy will be ignored, if the batch consists of such non-modifying ops only then nf_tables_commit() returns early because the transaction list is empty (nothing to do/change). > such that it would be desirable to filter them out > to reduce noise in that single log line if it is attempted to list all > the change ops? It almost sounds like it would be better to do one > audit log line for each table for each family, and possibly for each op > to avoid the need to change userspace. This would already be a > significant improvement picking the highest ranking op. I think i understand what you'd like to do. Yes, that would reduce the log output a lot.
Re: [PATCH ghak124 v3] audit: log nftables configuration change events
Richard Guy Briggs wrote: > Ok, can I get one more clarification on this "hierarchy"? Is it roughly > in the order they appear in nf_tables_commit() after step 3? It appears > it might be mostly already. If it isn't already, would it be reasonable > to re-order them? Would you suggest a different order? For audit purposes I think enum nf_tables_msg_types is already in the most relevant order, the lower numbers being more imporant. So e.g. NEWTABLE would be more interesting than DELRULE, if both are in same batch. > > > such that it would be desirable to filter them out > > > to reduce noise in that single log line if it is attempted to list all > > > the change ops? It almost sounds like it would be better to do one > > > audit log line for each table for each family, and possibly for each op > > > to avoid the need to change userspace. This would already be a > > > significant improvement picking the highest ranking op. > > > > I think i understand what you'd like to do. Yes, that would reduce > > the log output a lot. > > Would the generation change id be useful outside the kernel? Yes, we already announce it to interested parties via nfnetlink. > What > exactly does it look like? Its just a u64 counter that gets incremented whenever there is a change. > I don't quite understand the genmask purpose. Thats an implementation detail only. When we process a transaction, changes to the ruleset are being made but they should not have any effect until the entire transaction is processed. So there are two 'generations' at any time: 1. The active ruleset 2. The future ruleset 2) is what is being changed/modified. When the transaction completes, then the future ruleset becomes the active ruleset. If the transaction has to be aborted, the pending changes are reverted and the genid/genmasks are not changed.
Re: [PATCH ghak124 v3] audit: log nftables configuration change events
Richard Guy Briggs wrote: > > If they appear in a batch tehy will be ignored, if the batch consists of > > such non-modifying ops only then nf_tables_commit() returns early > > because the transaction list is empty (nothing to do/change). > > Ok, one little inconvenient question: what about GETOBJ_RESET? That > looks like a hybrid that modifies kernel table counters and reports > synchronously. That could be a special case call in > nf_tables_dump_obj() and nf_tables_getobj(). Will that cause a storm > per commit? No, since they can't be part of a commit (they don't implement the 'call_batch' function). I'm not sure GETOBJ_RESET should be reported in the first place: RESET only affects expr internal state, and that state changes all the time anyway in response to network traffic.
Re: [PATCH] xfrm: Fix incorrect types in assignment
Yang Li wrote: > Fix the following sparse warnings: > net/xfrm/xfrm_policy.c:1303:22: warning: incorrect type in assignment > (different address spaces) > Reported-by: Abaci Robot > Signed-off-by: Yang Li > --- > net/xfrm/xfrm_policy.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c > index b74f28c..5c67407 100644 > --- a/net/xfrm/xfrm_policy.c > +++ b/net/xfrm/xfrm_policy.c > @@ -1225,7 +1225,7 @@ static void xfrm_hash_rebuild(struct work_struct *work) > struct xfrm_policy *pol; > struct xfrm_policy *policy; > struct hlist_head *chain; > - struct hlist_head *odst; > + struct hlist_head __rcu *odst; This doesn't look right. Try something like - odst = net->xfrm.policy_bydst[dir].table; + odst = rcu_dereference_protected(net->xfrm.policy_bydst[dir].table, lockdep_is_held(&net->xfrm.xfrm_policy_lock));
[PATCH nf] netfilter: nft_compat: remove flush counter optimization
WARNING: CPU: 1 PID: 16059 at lib/refcount.c:31 refcount_warn_saturate+0xdf/0xf [..] __nft_mt_tg_destroy+0x42/0x50 [nft_compat] nft_target_destroy+0x63/0x80 [nft_compat] nf_tables_expr_destroy+0x1b/0x30 [nf_tables] nf_tables_rule_destroy+0x3a/0x70 [nf_tables] nf_tables_exit_net+0x186/0x3d0 [nf_tables] Happens when a compat expr is destoyed from abort path. There is no functional impact; after this work queue is flushed unconditionally if its pending. This removes the waitcount optimization. Test of repeated iptables-restore of a ~60k kubernetes ruleset doesn't indicate a slowdown. In case the counter is needed after all for some workloads we can revert this and increment the refcount for the != NFT_PREPARE_TRANS case to avoid the increment/decrement imbalance. While at it, also flush for match case, this was an oversight in the original patch. Fixes: ffe8923f109b7e ("netfilter: nft_compat: make sure xtables destructors have run") Reported-by: kernel test robot Signed-off-by: Florian Westphal --- net/netfilter/nft_compat.c | 37 ++--- 1 file changed, 14 insertions(+), 23 deletions(-) diff --git a/net/netfilter/nft_compat.c b/net/netfilter/nft_compat.c index 6428856ccbec..8e56f353ff35 100644 --- a/net/netfilter/nft_compat.c +++ b/net/netfilter/nft_compat.c @@ -27,8 +27,6 @@ struct nft_xt_match_priv { void *info; }; -static refcount_t nft_compat_pending_destroy = REFCOUNT_INIT(1); - static int nft_compat_chain_validate_dependency(const struct nft_ctx *ctx, const char *tablename) { @@ -215,6 +213,17 @@ static int nft_parse_compat(const struct nlattr *attr, u16 *proto, bool *inv) return 0; } +static void nft_compat_wait_for_destructors(void) +{ + /* xtables matches or targets can have side effects, e.g. +* creation/destruction of /proc files. +* The xt ->destroy functions are run asynchronously from +* work queue. If we have pending invocations we thus +* need to wait for those to finish. +*/ + nf_tables_trans_destroy_flush_work(); +} + static int nft_target_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) @@ -238,14 +247,7 @@ nft_target_init(const struct nft_ctx *ctx, const struct nft_expr *expr, nft_target_set_tgchk_param(&par, ctx, target, info, &e, proto, inv); - /* xtables matches or targets can have side effects, e.g. -* creation/destruction of /proc files. -* The xt ->destroy functions are run asynchronously from -* work queue. If we have pending invocations we thus -* need to wait for those to finish. -*/ - if (refcount_read(&nft_compat_pending_destroy) > 1) - nf_tables_trans_destroy_flush_work(); + nft_compat_wait_for_destructors(); ret = xt_check_target(&par, size, proto, inv); if (ret < 0) @@ -260,7 +262,6 @@ nft_target_init(const struct nft_ctx *ctx, const struct nft_expr *expr, static void __nft_mt_tg_destroy(struct module *me, const struct nft_expr *expr) { - refcount_dec(&nft_compat_pending_destroy); module_put(me); kfree(expr->ops); } @@ -468,6 +469,8 @@ __nft_match_init(const struct nft_ctx *ctx, const struct nft_expr *expr, nft_match_set_mtchk_param(&par, ctx, match, info, &e, proto, inv); + nft_compat_wait_for_destructors(); + return xt_check_match(&par, size, proto, inv); } @@ -716,14 +719,6 @@ static const struct nfnetlink_subsystem nfnl_compat_subsys = { static struct nft_expr_type nft_match_type; -static void nft_mt_tg_deactivate(const struct nft_ctx *ctx, -const struct nft_expr *expr, -enum nft_trans_phase phase) -{ - if (phase == NFT_TRANS_COMMIT) - refcount_inc(&nft_compat_pending_destroy); -} - static const struct nft_expr_ops * nft_match_select_ops(const struct nft_ctx *ctx, const struct nlattr * const tb[]) @@ -762,7 +757,6 @@ nft_match_select_ops(const struct nft_ctx *ctx, ops->type = &nft_match_type; ops->eval = nft_match_eval; ops->init = nft_match_init; - ops->deactivate = nft_mt_tg_deactivate, ops->destroy = nft_match_destroy; ops->dump = nft_match_dump; ops->validate = nft_match_validate; @@ -853,7 +847,6 @@ nft_target_select_ops(const struct nft_ctx *ctx, ops->size = NFT_EXPR_SIZE(XT_ALIGN(target->targetsize)); ops->init = nft_target_init; ops->destroy = nft_target_destroy; - ops->deactivate = nft_mt_tg_deactivate, ops->dump = nft_target_dump; ops->validate = nft_target_validate; ops->data = target; @@ -917,8 +910,6 @@ static void __exit
Re: [PATCH] net: eliminate meaningless memcpy to data in pskb_carve_inside_nonlinear()
Miaohe Lin wrote: > The skb_shared_info part of the data is assigned in the following loop. Where?
Re: kernel BUG at lib/string.c:LINE! (6)
Linus Torvalds wrote: > On Tue, Dec 22, 2020 at 6:44 AM syzbot > wrote: > > > > The issue was bisected to: > > > > commit 2f78788b55ba ("ilog2: improve ilog2 for constant arguments") > > That looks unlikely, although possibly some constant folding > improvement might make the fortify code notice something with it. > > > detected buffer overflow in strlen > > [ cut here ] > > kernel BUG at lib/string.c:1149! > > Call Trace: > > strlen include/linux/string.h:325 [inline] > > strlcpy include/linux/string.h:348 [inline] > > xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143 > > Honestly, this just looks like the traditional bug in "strlcpy()". Yes, thats exactly what this is, no idea why the bisection points at ilog2 changes. > That BSD function is complete garbage, exactly because it doesn't > limit the source length. People tend to _think_ it does ("what's that > size_t argument for?") but strlcpy() only limits the *destination* > size, and the source is always read fully. Right, I'll send a patch shortly.
Re: [PATCH] netfilter: Fix memleak in nf_nat_init
Dinghao Liu wrote: > When register_pernet_subsys() fails, nf_nat_bysource > should be freed just like when nf_ct_extend_register() > fails. Acked-by: Florian Westphal
Re: [PATCH] selftests: xfrm: fix test return value override issue in xfrm_policy.sh
Po-Hsu Lin wrote: > When running this xfrm_policy.sh test script, even with some cases > marked as FAIL, the overall test result will still be PASS: > > $ sudo ./xfrm_policy.sh > PASS: policy before exception matches > FAIL: expected ping to .254 to fail (exceptions) > PASS: direct policy matches (exceptions) > PASS: policy matches (exceptions) > FAIL: expected ping to .254 to fail (exceptions and block policies) > PASS: direct policy matches (exceptions and block policies) > PASS: policy matches (exceptions and block policies) > FAIL: expected ping to .254 to fail (exceptions and block policies after > hresh changes) > PASS: direct policy matches (exceptions and block policies after hresh > changes) > PASS: policy matches (exceptions and block policies after hresh changes) > FAIL: expected ping to .254 to fail (exceptions and block policies after > hthresh change in ns3) > PASS: direct policy matches (exceptions and block policies after hthresh > change in ns3) > PASS: policy matches (exceptions and block policies after hthresh change in > ns3) > FAIL: expected ping to .254 to fail (exceptions and block policies after > htresh change to normal) > PASS: direct policy matches (exceptions and block policies after htresh > change to normal) > PASS: policy matches (exceptions and block policies after htresh change to > normal) > PASS: policies with repeated htresh change > $ echo $? > 0 > > This is because the $lret in check_xfrm() is not a local variable. Acked-by: Florian Westphal
Re: [PATCH v3] audit: log nftables configuration change events once per table
Richard Guy Briggs wrote: > nft_commit_notify(net, NETLINK_CB(skb).portid); > nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN); > nf_tables_commit_release(net); > > + nf_tables_commit_audit_log(&adl, net->nft.base_seq); This meeds to be before nf_tables_commit_release() call, afaics this function dereferences data structures that might be free'd already here.
Re: [PATCH 5.10 104/157] mptcp: put subflow sock on connect error
Naresh Kamboju wrote: > On Mon, 22 Mar 2021 at 18:15, Greg Kroah-Hartman > wrote: > > > > From: Florian Westphal > > > > [ Upstream commit f07157792c633b528de5fc1dbe2e4ea54f8e09d4 ] > > > > mptcp_add_pending_subflow() performs a sock_hold() on the subflow, > > then adds the subflow to the join list. > > > > Without a sock_put the subflow sk won't be freed in case connect() fails. > > > > unreferenced object 0x88810c03b100 (size 3000): > > [..] > > sk_prot_alloc.isra.0+0x2f/0x110 > > sk_alloc+0x5d/0xc20 > > inet6_create+0x2b7/0xd30 > > __sock_create+0x17f/0x410 > > mptcp_subflow_create_socket+0xff/0x9c0 > > __mptcp_subflow_connect+0x1da/0xaf0 > > mptcp_pm_nl_work+0x6e0/0x1120 > > mptcp_worker+0x508/0x9a0 > > > > Fixes: 5b950ff4331ddda ("mptcp: link MPC subflow into msk only after > > accept") I don't see this change in 5.10, so why is this fix queued up? > I have reported the following warnings and kernel crash on 5.10.26-rc2 [1] > The bisect reported that issue pointing out to this commit. > > commit 460916534896e6d4f80a37152e0948db33376873 > mptcp: put subflow sock on connect error > > This problem is specific to 5.10.26-rc2. > > Warning: > > [ 1040.114695] refcount_t: addition on 0; use-after-free. > [ 1040.119857] WARNING: CPU: 3 PID: 31925 at > /usr/src/kernel/lib/refcount.c:25 refcount_warn_saturate+0xd7/0x100 > [ 1040.129769] Modules linked in: act_mirred cls_u32 sch_netem sch_etf > ip6table_nat xt_nat iptable_nat nf_nat ip6table_filter xt_conntrack > nf_conntrack nf_defrag_ipv4 libcrc32c ip6_tables nf_defrag_ipv6 sch_fq > iptable_filter xt_mark ip_tables cls_bpf sch_ingress algif_hash > x86_pkg_temp_thermal fuse [last unloaded: test_blackhole_dev] > [ 1040.159030] CPU: 3 PID: 31925 Comm: mptcp_connect Tainted: G > W K 5.10.26-rc2 #1 > [ 1040.167459] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS > 2.2 05/23/2018 > [ 1040.174851] RIP: 0010:refcount_warn_saturate+0xd7/0x100 > > And > > Kernel Panic: > - > [ 1069.557485] BUG: kernel NULL pointer dereference, address: 0010 > [ 1069.564446] #PF: supervisor read access in kernel mode > [ 1069.569583] #PF: error_code(0x) - not-present page > [ 1069.574714] PGD 0 P4D 0 > [ 1069.577246] Oops: [#1] SMP PTI > > index 16adba172fb9..591546d0953f 100644 > > --- a/net/mptcp/subflow.c > > +++ b/net/mptcp/subflow.c > > @@ -1133,6 +1133,7 @@ int __mptcp_subflow_connect(struct sock *sk, const > > struct mptcp_addr_info *loc, > > spin_lock_bh(&msk->join_list_lock); > > list_add_tail(&subflow->node, &msk->join_list); > > spin_unlock_bh(&msk->join_list_lock); > > + sock_put(mptcp_subflow_tcp_sock(subflow)); > > > > return err; Crash is not surprising, the backport puts the socket in the 'success' path (list_add_tail). I don't see why this is in -stable, the faulty commit is not there? The upstream patch is: list_del(&subflow->node); spin_unlock_bh(&msk->join_list_lock); + sock_put(mptcp_subflow_tcp_sock(subflow)); [ Note the 'list_del', this is in the error unwind path ]
Re: [PATCH][next] netfilter: nf_log_bridge: Fix missing assignment of ret on a call to nf_log_register
Colin King wrote: > From: Colin Ian King > > Currently the call to nf_log_register is returning an error code that > is not being assigned to ret and yet ret is being checked. Fix this by > adding in the missing assignment. Thanks for catching this. Acked-by: Florian Westphal
Re: [PATCH 1/3] Revert "netfilter: x_tables: Update remaining dereference to RCU"
Mark Tomlinson wrote: > This reverts commit 443d6e86f821a165fae3fc3fc13086d27ac140b1. > > This (and the following) patch basically re-implemented the RCU > mechanisms of patch 784544739a25. That patch was replaced because of the > performance problems that it created when replacing tables. Now, we have > the same issue: the call to synchronize_rcu() makes replacing tables > slower by as much as an order of magnitude. > > See https://lore.kernel.org/patchwork/patch/151796/ for why using RCU is > not a good idea. Please don't add links for the rationale.
Re: [PATCH 2/3] Revert "netfilter: x_tables: Switch synchronization to RCU"
Mark Tomlinson wrote: > This reverts commit cc00bcaa589914096edef7fb87ca5cee4a166b5c. > > This (and the preceding) patch basically re-implemented the RCU > mechanisms of patch 784544739a25. That patch was replaced because of the > performance problems that it created when replacing tables. Now, we have > the same issue: the call to synchronize_rcu() makes replacing tables > slower by as much as an order of magnitude. Can you give roigh numbers? > See https://lore.kernel.org/patchwork/patch/151796/ for why using RCU is > not a good idea. Please, no link.
Re: [PATCH 3/3] netfilter: x_tables: Use correct memory barriers.
Mark Tomlinson wrote: > When a new table value was assigned, it was followed by a write memory > barrier. This ensured that all writes before this point would complete > before any writes after this point. However, to determine whether the > rules are unused, the sequence counter is read. To ensure that all > writes have been done before these reads, a full memory barrier is > needed, not just a write memory barrier. The same argument applies when > incrementing the counter, before the rules are read. > > Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic > reported in cc00bcaa5899, Can you reproduce the crashes without this change? > while still maintaining the same speed of replacing tables. How much of an impact is the MB change on the packet path? Please also CC authors of the patches you want reverted when reposting.
Re: linux-next: build failure after merge of the net-next tree
Stephen Rothwell wrote: > net/bridge/netfilter/ebtables.c:1248:33: error: 'struct netns_xt' has no > member named 'tables' > 1248 | list_for_each_entry(t, &net->xt.tables[NFPROTO_BRIDGE], list) { > | ^ > include/linux/list.h:619:20: note: in definition of macro 'list_entry_is_head' > 619 | (&pos->member == (head)) > |^~~~ > net/bridge/netfilter/ebtables.c:1248:2: note: in expansion of macro > 'list_for_each_entry' > 1248 | list_for_each_entry(t, &net->xt.tables[NFPROTO_BRIDGE], list) { > | ^~~ > > Caused by commit > > 5b53951cfc85 ("netfilter: ebtables: use net_generic infra") > > interacting with commit > > 7ee3c61dcd28 ("netfilter: bridge: add pre_exit hooks for ebtable > unregistration") Right, the fixup is correct.
Re: [PATCH] netfilter: nf_conntrack: Add conntrack helper for ESP/IPsec
Cole Dishington wrote: > Introduce changes to add ESP connection tracking helper to netfilter > conntrack. The connection tracking of ESP is based on IPsec SPIs. The > underlying motivation for this patch was to allow multiple VPN ESP > clients to be distinguished when using NAT. > > Added config flag CONFIG_NF_CT_PROTO_ESP to enable the ESP/IPsec > conntrack helper. Thanks for the effort to upstream out of tree code. A couple of comments and questions below. Preface: AFAIU this tracker aims to 'soft-splice' two independent ESP connections, i.e.: saddr:spi1 -> daddr daddr:spi2 <- saddr So that we basically get this conntrack: saddr,daddr,spi1 (original) daddr,saddr,spi2 (remote) This can't be done as-is, because we don't know spi2 at the time the first ESP packet is received. The solution implemented here is introduction of a 'virtual esp id', computed when first ESP packet is received, so conntrack really stores: saddr,daddr,ID (original) daddr,saddr,ID (remote) Because the ID is never carried on the wire, this tracker hooks into pkt_to_tuple() infra so that the conntrack tuple gets populated as-needed. If I got that right, I think it would be good to place some description like this in the source code, this is unlike all the other trackers. > index ..2441e031c68e > --- /dev/null > +++ b/include/linux/netfilter/nf_conntrack_proto_esp.h > @@ -0,0 +1,25 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _CONNTRACK_PROTO_ESP_H > +#define _CONNTRACK_PROTO_ESP_H > +#include > + > +/* ESP PROTOCOL HEADER */ > + > +struct esphdr { > + __u32 spi; > +}; > + > +struct nf_ct_esp { > + unsigned int stream_timeout; > + unsigned int timeout; > +}; > + > +#ifdef __KERNEL__ > +#include > + > +void destroy_esp_conntrack_entry(struct nf_conn *ct); > + > +bool esp_pkt_to_tuple(const struct sk_buff *skb, unsigned int dataoff, > + struct net *net, struct nf_conntrack_tuple *tuple); > +#endif /* __KERNEL__ */ No need for the __KERNEL__, this header is not exposed to userspace (only those in include/uapi/). > struct { > __be16 key; > } gre; > + struct { > + __be16 spi; __be32 ? I now see that this "spi" seems to be allocated by the esp tracker. Maybe 'esp_id' or something like that? It doesn't appear to be related to the ESP header SPI value. > --- a/include/net/netns/conntrack.h > +++ b/include/net/netns/conntrack.h > @@ -69,6 +69,27 @@ struct nf_gre_net { > }; > #endif > > +#ifdef CONFIG_NF_CT_PROTO_ESP > +#define ESP_MAX_PORTS 1000 > +#define HASH_TAB_SIZE ESP_MAX_PORTS ESP? Ports? Should this be 'slots'? Maybe a comment helps, I don't expect to see ports in an ESP tracker. > +enum esp_conntrack { > + ESP_CT_UNREPLIED, > + ESP_CT_REPLIED, > + ESP_CT_MAX > +}; > + > +struct nf_esp_net { > + rwlock_t esp_table_lock; This uses a rwlock but i only see writer locks being taken. So this either should use a spinlock, or reader-parts should take readlock, not wrlock. (but also see below). > + struct hlist_head ltable[HASH_TAB_SIZE]; > + struct hlist_head rtable[HASH_TAB_SIZE]; > + /* Initial lookup for remote end until rspi is known */ > + struct hlist_head incmpl_rtable[HASH_TAB_SIZE]; > + struct _esp_table *esp_table[ESP_MAX_PORTS]; > + unsigned int esp_timeouts[ESP_CT_MAX]; > +}; This is large structure -- >32kb. Could this be moved to nf_conntrack_net? It would also be good to not allocate these hash slots until after conntrack is needed. The esp_timeouts[] can be kept to avoid module dep problems. (But also see below, I'm not sure homegrown hash table is the way to go). > struct ct_pcpu { > diff --git a/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h > b/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h > index 64390fac6f7e..9bbd76c325d2 100644 > --- a/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h > +++ b/include/uapi/linux/netfilter/nf_conntrack_tuple_common.h > @@ -39,6 +39,9 @@ union nf_conntrack_man_proto { .. > +#if 0 > +#define ESP_DEBUG 1 > +#define DEBUGP(format, args...) printk(KERN_DEBUG "%s: " format, __func__, > ## args) > +#else > +#undef ESP_DEBUG > +#define DEBUGP(x, args...) > +#endif I suggest to get rid of all of DEBUGP(), either drop them, or, in cases where they are useful, switch to pr_debug(). > +#define TEMP_SPI_START 1500 > +#define TEMP_SPI_MAX (TEMP_SPI_START + ESP_MAX_PORTS - 1) I think this could use an explanation. > +struct _esp_table { > + /* Hash table nodes for each required lookup > + * lnode: l_spi, l_ip, r_ip > + * rnode: r_spi, r_ip > + * incmpl_rnode: r_ip > + */ > + struct hlist_node lnode; > + struct hlist_node rnode; > + struct hlist_node incmpl_rnode; > + > + u32 l_spi; > + u32 r_spi; > + u32 l_ip; > + u32 r_ip; Hmm, ipv4 only. Could this be chan
Re: [PATCH v6 0/3] net, mac80211, kernel: enable KCOV remote coverage collection for 802.11 frame handling
Marco Elver wrote: [..] > v6: > * Revert usage of skb extensions due to potential memory leak. Patch 2/3 is > now > idential to that in v2. > * Patches 1/3 and 3/3 are otherwise identical to v5. The earlier series was already applied to net-next, so you need to rebase on top of net-next and include a revert of the patch that added the kcov skb extension. Also, please indicate the git tree that you want this applied to in the subject.
Re: [PATCH net-next 3/3] net: core: Namespace-ify sysctl_rmem_max and sysctl_wmem_max
menglong8.d...@gmail.com wrote: > From: Menglong Dong > > For now, sysctl_wmem_max and sysctl_rmem_max are globally unified. > It's not convenient in some case. For example, when we use docker > and try to control the default udp socket receive buffer for each > container. > > For that reason, make sysctl_wmem_max and sysctl_rmem_max > per-namespace. I think having those values be restricted by init netns is a desirable property.
Re: [PATCH] net: bridge: fix error return code of do_update_counters()
Jia-Ju Bai wrote: > When find_table_lock() returns NULL to t, no error return code of > do_update_counters() is assigned. Its -ENOENT. > t = find_table_lock(net, name, &ret, &ebt_mutex); ^ ret is passed to find_table_lock, which passes it to find_inlist_lock_noload() which will set *ret = -ENOENT for NULL case.
Re: [PATCH v2 3/3] netfilter: x_tables: Use correct memory barriers.
Mark Tomlinson wrote: > When a new table value was assigned, it was followed by a write memory > barrier. This ensured that all writes before this point would complete > before any writes after this point. However, to determine whether the > rules are unused, the sequence counter is read. To ensure that all > writes have been done before these reads, a full memory barrier is > needed, not just a write memory barrier. The same argument applies when > incrementing the counter, before the rules are read. > > Changing to using smp_mb() instead of smp_wmb() fixes the kernel panic > reported in cc00bcaa5899 (which is still present), while still > maintaining the same speed of replacing tables. > > The smb_mb() barriers potentially slow the packet path, however testing > has shown no measurable change in performance on a 4-core MIPS64 > platform. Okay, thanks for testing. I have no further feedback.
Re: [PATCH 015/141] netfilter: Fix fall-through warnings for Clang
Gustavo A. R. Silva wrote: > In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple > warnings by explicitly adding multiple break statements instead of just > letting the code fall through to the next case. Acked-by: Florian Westphal Feel free to carry this in next iteration of series, if any.
Re: [PATCH 108/141] netfilter: ipt_REJECT: Fix fall-through warnings for Clang
Gustavo A. R. Silva wrote: > In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning > by explicitly adding a break statement instead of letting the code fall > through to the next case. Acked-by: Florian Westphal
Re: [PATCH v5 2/3] net: add kcov handle to skb extensions
Ido Schimmel wrote: > On Thu, Oct 29, 2020 at 05:36:19PM +, Aleksandr Nogikh wrote: > > From: Aleksandr Nogikh > > > > Remote KCOV coverage collection enables coverage-guided fuzzing of the > > code that is not reachable during normal system call execution. It is > > especially helpful for fuzzing networking subsystems, where it is > > common to perform packet handling in separate work queues even for the > > packets that originated directly from the user space. > > > > Enable coverage-guided frame injection by adding kcov remote handle to > > skb extensions. Default initialization in __alloc_skb and > > __build_skb_around ensures that no socket buffer that was generated > > during a system call will be missed. > > > > Code that is of interest and that performs packet processing should be > > annotated with kcov_remote_start()/kcov_remote_stop(). > > > > An alternative approach is to determine kcov_handle solely on the > > basis of the device/interface that received the specific socket > > buffer. However, in this case it would be impossible to distinguish > > between packets that originated during normal background network > > processes or were intentionally injected from the user space. > > > > Signed-off-by: Aleksandr Nogikh > > Acked-by: Willem de Bruijn > > [...] > > > @@ -249,6 +249,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t > > gfp_mask, > > > > fclones->skb2.fclone = SKB_FCLONE_CLONE; > > } > > + > > + skb_set_kcov_handle(skb, kcov_common_handle()); > > Hi, > > This causes skb extensions to be allocated for the allocated skb, but > there are instances that blindly overwrite 'skb->extensions' by invoking > skb_copy_header() after __alloc_skb(). For example, skb_copy(), > __pskb_copy_fclone() and skb_copy_expand(). This results in the skb > extensions being leaked [1]. [..] > Other suggestions? Aleksandr, why was this made into an skb extension in the first place? AFAIU this feature is usually always disabled at build time. For debug builds (test farm /debug kernel etc) its always needed. If thats the case this u64 should be an sk_buff member, not an extension.
Re: BUG: ip6tables IPv6-REDIRECT over bridges
Artie Hamilton wrote: > Now the same thing should be done for IPv6. It should works quite similar > (I just assume the above mentioned steps are already done): > > $ sysctl -w net.ipv6.conf.br0.accept_ra=2 > $ sysctl -w net.bridge.bridge-nf-call-ip6tables=1 > $ ip6tables -t nat -A PREROUTING -p tcp -m tcp --dport 8080 -j REDIRECT > --to-ports 81 > > But here is the problem: Connections will not be started. I see for example > connections getting started to the service like this on the client: Yes, because bridge layer does not detect when addresses have been rewritten. The check is only implemented for ipv4, see dnat_took_place use in br_netfilter.c -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Eric Dumazet wrote: > > diff --git a/net/netfilter/nf_conntrack_core.c > > b/net/netfilter/nf_conntrack_core.c > > index 43549eb..7a34bb2 100644 > > --- a/net/netfilter/nf_conntrack_core.c > > +++ b/net/netfilter/nf_conntrack_core.c > > @@ -387,8 +387,12 @@ begin: > > !atomic_inc_not_zero(&ct->ct_general.use))) > > h = NULL; > > else { > > + /* A conntrack can be recreated with the equal tuple, > > +* so we need to check that the conntrack is initialized > > +*/ > > if (unlikely(!nf_ct_tuple_equal(tuple, &h->tuple) || > > -nf_ct_zone(ct) != zone)) { > > +nf_ct_zone(ct) != zone) || > > +!nf_ct_is_confirmed(ct)) { > > nf_ct_put(ct); > > goto begin; > > } > > I do not think this is the right way to fix this problem (if said > problem is confirmed) > > Remember the rule about SLAB_DESTROY_BY_RCU : > > When a struct is freed, then reused, its important to set the its refcnt > (from 0 to 1) only when the structure is fully ready for use. > > If a lookup finds a structure which is not yet setup, the > atomic_inc_not_zero() will fail. Indeed. But, the structure itself might be ready (or rather, can be ready since the allocation side will set the refcount to one after doing the initial work, such as zapping old ->status flags and setting tuple information). The problem is with nat extension area stored in the ct->ext area. This extension area is preallocated but the snat/dnat action information is only set up after the ct (or rather, the skb that grabbed a reference to the nf_conn entry) traverses nat pre/postrouting. This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE rule existed. The manipulations of the skb->nfct->ext nat area are performed without a lock. Concurrent access is supposedly impossible as the conntrack should not (yet) be in the hash table. The confirmed bit is set right before we insert the conntrack into the hash table (after we traversed rules, ct is ready to be 'published'). i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn struct when we perform the lookup, as it should still be sitting on the 'unconfirmed' list, being invisible to readers. Does that explanation make sense to you? Thanks for looking into this. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Eric Dumazet wrote: > > This will also set up a null-binding when no matching SNAT/DNAT/MASQERUADE > > rule existed. > > > > The manipulations of the skb->nfct->ext nat area are performed without > > a lock. Concurrent access is supposedly impossible as the conntrack > > should not (yet) be in the hash table. > > > > The confirmed bit is set right before we insert the conntrack into > > the hash table (after we traversed rules, ct is ready to be > > 'published'). > > > > i.e. when the confirmed bit is NOT set we should not be 'seeing' the nf_conn > > struct when we perform the lookup, as it should still be sitting on the > > 'unconfirmed' list, being invisible to readers. > > > > Does that explanation make sense to you? > > > > Thanks for looking into this. > > Still, this patch adds a loop. And maybe an infinite one if confirmed > bit is set from an context that was interrupted by this one. Hmm. There should be at most one retry. The confirmed bit should always be set here. If it isn't then this conntrack shouldn't be in the hash table, i.e. when we re-try we should find the same conntrack again with the bit set. Asuming the other cpu git interrupted after setting confirmed bit but before inserting it into the hash table, then our re-try should not be able find a matching entry. Maybe I am missing something, but I don't see how we could (upon retry) find the very same entry again with the bit still not set. > If you need to test the confirmed bit, then you also need to test it > before taking the refcount. I don't think that would make sense, because it should always be set (inserting conntrack into hash table without confirmed set is illegal, and it is never unset again). [ when allocating a new conntrack, ct->status is zeroed, which also clears the flag. This happens just before we set the new objects refcount to 1 ] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Eric Dumazet wrote: > > The confirmed bit should always be set here. > > So why are you testing it ? To detect ct object recycling when tuple is identical. This is my understanding of how we can end up with two cpus thinking they have exclusive ownership of the same ct: A cpu0: starts lookup: find ct for tuple t B cpu1: starts lookup: find ct for tuple t C cpu0: finds ct c for tuple t, no refcnt taken yet cpu1: finds ct c for tuple t, no refcnt taken yet cpu2: destroys ct c, removes from hash table, calls ->destroy function D cpu0: tries to increment refcnt; fails since its 0: lookup ends E cpu0: allocates a new ct object since no acceptable ct was found for t F cpu0: allocator gives us just-freed ct c G cpu0: initialises ct, sets refcnt to 1 H cpu0: adds extensions, ct object is put on unconfirmed list and assigned to skb->nfct I cpu0: skb continues through network stack J cpu1: tries to increment refcnt, ok K cpu1: checks if ct matches requested tuple t: it does L cpu0: sets refcnt conntrack tuple, allocates extensions, etc. cpu1: sets skb->nfct to ct, skb continues through network stack -> both cpu0 and cpu1 reference a ct object that was not in hash table cpu0 and cpu1 will then race, for example in net/ipv4/netfilter/iptable_nat.c:nf_nat_rule_find(): if (!nf_nat_initialized(ct, HOOK2MANIP(hooknum))) ret = alloc_null_binding(ct, hooknum); [ Its possible that I misunderstand and that there is something that precents this from happening. Usually its the 'tuple equal' test that is performed post-atomic-inc-not-zero that detects the recycling, so step K above would fail ] The idea of the 'confirmed bit test' is that when its not set then the conntrack was recycled and should not be used before the cpu that currently 'owns' that entry has put it into the hash table again. > I did this RCU conversion, so I think I know what I am talking about. Yes, I know. If you have any suggestions on how to fix it, I'd be very interested to hear about them. > The entry should not be put into hash table (or refcnt set to 1), > if its not ready. It is that simple. I understand what you're saying, but I don't see how we can do it. I think the assumption that we have a refcnt on skb->nfct is everywhere. If I understand you correctly we'd have to differentiate between 'can be used fully (e.g. nf_nat_setup_info done for both directions)' and 'a new conntrack was created (extensions may not be ready yet)'. But currently in both cases the ct is assigned to the skb, and in both cases a refcnt is taken. I am out of ideas, except perhaps using ct->lock to serialize the nat setup (but I don't like that since I'm not sure that the nat race is the only one). > We need to address this problem without adding an extra test and > possible loop for the lookups. Agreed. I don't like the extra test either. Many thanks for looking into this Eric! -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Florian Westphal wrote: > Eric Dumazet wrote: > > > The confirmed bit should always be set here. > > > > So why are you testing it ? > > To detect ct object recycling when tuple is identical. > > This is my understanding of how we can end up with two > cpus thinking they have exclusive ownership of the same ct: > > A cpu0: starts lookup: find ct for tuple t > B cpu1: starts lookup: find ct for tuple t > C cpu0: finds ct c for tuple t, no refcnt taken yet > cpu1: finds ct c for tuple t, no refcnt taken yet >cpu2: destroys ct c, removes from hash table, calls ->destroy function > D cpu0: tries to increment refcnt; fails since its 0: lookup ends > E cpu0: allocates a new ct object since no acceptable ct was found for t > F cpu0: allocator gives us just-freed ct c > G cpu0: initialises ct, sets refcnt to 1 > H cpu0: adds extensions, ct object is put on unconfirmed list and > assigned to skb->nfct > I cpu0: skb continues through network stack > J cpu1: tries to increment refcnt, ok > K cpu1: checks if ct matches requested tuple t: it does > L cpu0: sets refcnt conntrack tuple, allocates extensions, etc. > cpu1: sets skb->nfct to ct, skb continues through network stack sorry, for that brain fart This should only say L cpu1: sets skb->nfct to ct, skb continues... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Andrew Vagin wrote: > Can we allocate conntrack with zero ct_general.use and increment it at > the first time before inserting the conntrack into the hash table? > When conntrack is allocated it is attached exclusively to one skb. > It must be destroyed with skb, if it has not been confirmed, so we > don't need refcnt on this stage. > > I found only one place, where a reference counter of unconfirmed > conntract can incremented. It's ctnetlink_dump_table(). What about skb_clone, etc? They will also increment the refcnt if a conntrack entry is attached to the skb. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
Andrew Vagin wrote: > On Thu, Jan 09, 2014 at 09:56:22PM +0100, Florian Westphal wrote: > > Andrew Vagin wrote: > > > Can we allocate conntrack with zero ct_general.use and increment it at > > > the first time before inserting the conntrack into the hash table? > > > When conntrack is allocated it is attached exclusively to one skb. > > > It must be destroyed with skb, if it has not been confirmed, so we > > > don't need refcnt on this stage. > > > > > > I found only one place, where a reference counter of unconfirmed > > > conntract can incremented. It's ctnetlink_dump_table(). > > > > What about skb_clone, etc? They will also increment the refcnt > > if a conntrack entry is attached to the skb. > > We can not attach an unconfirmed conntrack to a few skb, because s/few/new/? > nf_nat_setup_info can be executed concurrently for the same conntrack. > > How do we avoid this race condition for cloned skb-s? Simple, the assumption is that only one cpu owns the nfct, so it does not matter if the skb is cloned in between, as there are no parallel users. The only possibility (that I know of) to violate this is to create a bridge, enable call-iptables sysctl, add -j NFQUEUE rules and then wait for packets that need to be forwarded to several recipients, e.g. multicast traffic. see http://marc.info/?l=netfilter-devel&m=131471083501656&w=2 or search 'netfilter: nat: work around shared nfct struct in bridge case' -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt
Andrey Vagin wrote: > > Eric and Florian, could you look at this patch. When you say, > that it looks good, I will ask the user to validate it. > I can't reorder these actions, because it's reproduced on a real host > with real users. Thanks. > > > nf_conntrack_free can't be called for a conntract with non-zero ref-counter, > because it can race with nf_conntrack_find_get(). Indeed. > A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero > ref-conunter says that this conntrack is used now. So when we release a > conntrack with non-zero counter, we break this assumption. > > CPU1CPU2 > nf_conntrack_find() > nf_ct_put() > destroy_conntrack() > ... > init_conntrack > __nf_conntrack_alloc (set use = 1) > atomic_inc_not_zero(&ct->use) (use = 2) > if (!l4proto->new(ct, skb, dataoff, > timeouts)) > nf_conntrack_free(ct); (use = 2 !!!) > ... Yes, I think this sequence is possible; we must not use nf_conntrack_free here. > - /* We overload first tuple to link into unconfirmed or dying list.*/ > - BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)); > - hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); > + if (!hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode)) > + hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode); This is the only thing that I don't like about this patch. Currently all the conntracks in the system are always put on a list before they're supposed to be visible/handled via refcnt system (unconfirmed, hash, or dying list). I think it would be nice if we could keep it that way. If everything fails we could proably intoduce a 'larval' dummy list similar to the one used by template conntracks? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] [RFC] netfilter: nf_conntrack: don't relase a conntrack with non-zero refcnt
Andrew Vagin wrote: > > I think it would be nice if we could keep it that way. > > If everything fails we could proably intoduce a 'larval' dummy list > > similar to the one used by template conntracks? > > I'm not sure, that this is required. Could you elaborate when this can > be useful? You can dump the lists via ctnetlink. Its meant as a debugging aid in case one suspects refcnt leaks. Granted, in this situation there should be no leak since we put the newly allocated entry in the error case. > Now I see only overhead, because we need to take the nf_conntrack_lock > lock to add conntrack in a list. True. I don't have any preference, I guess I'd just do the insertion into the unconfirmed list when we know we cannot track to keep the "unhashed" bug trap in the destroy function. Pablo, any preference? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : Sanitize skb before it enters the IP stack)
David Newall wrote: > Having received no feedback of substance from netdev, I now address > my previous email to a wider audience for discussion and in > preparation for submitting a patch based closely on that below. > > This email is not addressed to Bandan Das , > who is the author of the commit I propose reverting, as his email > address is no longer current. I believe I have otherwise addressed > all appropriate recipients and will circulate a formal patch to the > same recipients if no adverse comments are received. (That would > surprise me.) > > Original Message > Subject: Revert 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge : > Sanitize skb before it enters the IP stack) > Date: Sat, 17 May 2014 00:03:16 +0930 > From: David Newall > To: Lukas Tribus , Eric Dumazet > , Netdev > CC: f...@strlen.de > > > > We should revert commit 462fb2af9788a82a534f8184abfde31574e1cfa0 (bridge > : Sanitize skb before it enters the IP stack) because it corrupts IP > packets with RR or TS options set, only partially updating the IP header > and leaving an incorrect checksum. > > The argument for introducing the change is at lkml.org/lkml/2010/8/30/391: > > The bridge layer can overwrite the IPCB using the > BR_INPUT_SKB_CB macro. In br_nf_dev_queue_xmit, > if we recieve a packet greater in size than the bridge > device MTU, we call ip_fragment which in turn will lead to > icmp_send calling ip_options_echo if the DF flag is set. > ip_options_echo will then incorrectly try to parse the IPCB as > IP options resulting in a buffer overflow. > This change refills the CB area back with IP > options before ip_fragment calls icmp_send. If we fail parsing, > we zero out the IPCB area to guarantee that the stack does > not get corrupted. > > A bridge should not fragment packets; an *ethernet* bridge should not > need to. Fragmenting packets is the job of higher level protocol. Well, did you test what happens if we try to refrag a packet containing ip options after the revert? can happen e.g. when using netfilter conntrack on top of a bridge. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ip_set: protocol %u message -- useful?
Ilia Mirkin wrote: > > Maybe printing "using protocol version X" will make it appear less like > > a debugging message referring to packet contents or something similar. > > With pr_info it'll still appear in dmesg, and it'll still be "random > non-sensical message appears over and over in dmesg" type of > situation, to the vast majority of users. Do we need a print every > time someone creates a new tcp connection too? I'm still not totally > clear on the cause of this message getting printed, but I was seeing > it a whole bunch in my configuration... Yes, because it erronously got moved into the netns init function. And thats what causes the spew. Moving it back into module init function should be enough. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/