Apologies for my bug reporting style having turned into something like personal blog postings... I'm distressed about this bug. I'm worried that a dozen production machines currently running Debian stable with similar IPv6 + IPsec configurations will be affected once stretch is released. Therefore I'm trying my best to learn the tools and diagnose the bug. Any tips would be greatly appreciated.

On Wed, Nov 25 2015, Gerald Turner wrote:
> On Wed, Nov 25 2015, Gerald Turner wrote:
>> I suppose I'll restart bisection at last 'bad' and let the kernels
>> run for a day before issueing 'git bisect good'.
>
> I'm in the process of doing this, may take a week.

I took a week to re-perform bisection, this time booting twice and waiting for a day of uptime before issuing 'git bisect good'. Nevertheless the result was the exact same replay I copied two emails back. Nothing gained.

I then scrutinized the backtrace disassembly (three emails back). The panic occurs at the return from the inline function rt6_get_cookie declared in ip6_fib.h. This function was introduced during 4.2 with merge c1a34035:

  commit c1a34035506d3a7ad62403125d59c86b763c477d
  Merge: 01b6961 d52d399
  Author: David S. Miller <da...@davemloft.net>
  Date:   Mon May 25 13:25:35 2015 -0400

      Merge branch 'ipv6_route_sharing'

  commit d52d3997f843ffefaa8d8462790ffcaca6c74192
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:06 2015 -0700

      ipv6: Create percpu rt6_info

  commit 83a09abd1a8badbbb715f928d07c65ac47709c47
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:05 2015 -0700

      ipv6: Break up ip6_rt_copy()

  commit 8d0b94afdca84598912347e61defa846a0988d04
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:04 2015 -0700

      ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister

  commit 3da59bd94583d1239e4fbdee452265a160b9cd71
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:03 2015 -0700

      ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set

  commit 48e8aa6e3137692d38f20e8bfff100e408c6bc53
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:02 2015 -0700

      ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags

  commit b197df4f0f3782782e9ea8996e91b65ae33e8dd9
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:01 2015 -0700

      ipv6: Add rt6_get_cookie() function

  commit 45e4fd26683c9a5f88600d91b08a484f7f09226a
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:56:00 2015 -0700

      ipv6: Only create RTF_CACHE routes after encountering pmtu exception

  commit 8b9df2657704dd313333a79497dde429f9190caa
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:59 2015 -0700

      ipv6: Combine rt6_alloc_cow and rt6_alloc_clone

  commit 2647a9b07032c5a95ddee1fcb65d95bddbc6b7f9
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:58 2015 -0700

      ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST

  commit fd0273d7939f2ce3247f6aac5f6b9a0135d4cd39
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:57 2015 -0700

      ipv6: Remove external dependency on rt6i_dst and rt6i_src

  commit 286c2349f6665c3e67f464a5faa14a0e28be4842
  Author: Martin KaFai Lau <ka...@fb.com>
  Date:   Fri May 22 20:55:56 2015 -0700

      ipv6: Clean up ipv6_select_ident() and ip6_fragment()

The following is all conjecture, but evidently with this merge the IPv6 routing cache gained some optimization, is now using per-CPU structures, and has relegated PMTU updates to a slower path. My IPv6 + IPsec environments have had their share of PMTU problems in the past (two of the three sites are behind 6in4 tunnels, all three sites have differing MTUs, and I used to get stalls, even on interactive SSH traffic, due to PMTU cache eviction/re-discovery). Also, the crash occurs immediately after boot (or login for the desktop system), and I'm using systemd, which is highly concurrent, so maybe there is a race with the per-CPU change?

Also the "Merge: 01b6961 d52d399" line is vaguely interesting (to me anyway, because I'm a git newbie) because commit 01b6961 happens to be the same exotic driver as the _first bad commit_ from my bisect runs.

Therefore I think I'm onto something...

I spent some time trying to build 4.2.6 with these commits reverted. Unfortunately a few commits that came later modify lines from this merge, so simply running 'git revert -m 1 c1a340355' is not possible.

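For anyone else new to git, here is a minimal sketch of what 'git revert -m 1' on a merge does and how to list the commits a merge brought in. It runs against a throwaway repository, not the kernel tree; the file name file.txt and the branch name topic are made up for illustration. In this toy case the revert applies cleanly, whereas on 4.2.6 it conflicts because later commits touch the same lines.

```shell
#!/bin/sh
# Toy demonstration only: builds a scratch repository with one merge,
# lists the commits that merge brought in, then reverts the merge.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"   # placeholder identity
git config user.name "Example"

echo base > file.txt
git add file.txt
git commit -q -m "base"
main=$(git rev-parse --abbrev-ref HEAD)   # "master" or "main", depending on git config

# Side branch with two commits, standing in for a topic branch.
git checkout -q -b topic
echo one >> file.txt
git commit -q -am "topic: one"
echo two >> file.txt
git commit -q -am "topic: two"

# Merge it back with an explicit merge commit.
git checkout -q "$main"
git merge -q --no-ff -m "Merge branch 'topic'" topic
merge=$(git rev-parse HEAD)

# Commits reachable from the merge's second parent but not its first,
# i.e. what the merge brought in.
git log --oneline --no-merges "$merge^1..$merge^2"
side_count=$(git rev-list --count --no-merges "$merge^1..$merge^2")

# "-m 1" says: treat parent 1 as the mainline and undo everything the
# merge added relative to it.
git revert --no-edit -m 1 "$merge"
cat file.txt
```

Against the kernel tree, the same range syntax applied to the merge above would be 'git log --oneline c1a34035^1..c1a34035^2'. When the merge revert conflicts, the fallback is to revert individual commits, newest first.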
I eventually built a 4.2.6 kernel with the following commits reverted:

  git revert 9c7370a1 # ipv6: Fix a potential deadlock when creating pcpu rt
  git revert a73e4195 # ipv6: Add rt6_make_pcpu_route
  git revert ad706862 # ipv6: Remove un-used argument from ip6_dst_alloc
  git revert 87775312 # net-ipv6: Delete an unnecessary check before the function call "free_percpu"
  git revert d52d3997 # ipv6: Create percpu rt6_info

Sadly this too crashed, however at least it was a different crash!

[ 45.751104] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 45.751127] IP: [<ffffffff815526a7>] _raw_spin_lock_bh+0x17/0x30
[ 45.751144] PGD 0
[ 45.751151] Oops: 0002 [#1] SMP
[ 45.751159] Modules linked in: xfrm4_mode_transport ccm xfrm6_mode_tunnel xfrm4_mode_tunnel deflate ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables tun sit ip_tunnel rfcomm twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common seqiv crypto_null ctr ecb des_generic cbc camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 xts xcbc sha512_ssse3 sha512_generic md4 algif_hash xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bnep binfmt_misc nls_utf8 nls_cp437 vfat fat ext4 mbcache jbd2 x86_pkg_temp_thermal intel_powerclamp intel_rapl
[ 45.751357] eeepc_wmi iosf_mbi snd_hda_codec_realtek asus_wmi iTCO_wdt sparse_keymap snd_hda_codec_hdmi iTCO_vendor_support snd_hda_codec_generic coretemp kvm_intel snd_hda_intel kvm btusb psmouse btrtl snd_hda_codec btbcm btintel snd_hda_core bluetooth mei_me serio_raw lpc_ich efivars pcspkr snd_hwdep sg mei mfd_core dw_dmac rfkill 8250_fintek crc16 i2c_i801 dw_dmac_core snd_soc_rt5640 snd_soc_rl6231 snd_soc_core snd_compress acpi_pad snd_pcm snd_timer snd soundcore regmap_i2c tpm_infineon tpm_tis i2c_designware_platform shpchp battery i2c_designware_core evdev tpm snd_soc_sst_acpi processor cuse fuse parport_pc ppdev lp parport efivarfs autofs4 btrfs xor raid6_pq algif_skcipher af_alg hid_generic usbhid dm_crypt dm_mod sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel jitterentropy_rng sha256_ssse3
[ 45.751558] sha256_generic hmac ahci libahci drbg libata ansi_cprng i915 i2c_algo_bit xhci_pci ehci_pci mxm_wmi xhci_hcd ehci_hcd drm_kms_helper e1000e ptp aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd usbcore scsi_mod pps_core drm usb_common fan thermal sdhci_acpi sdhci video mmc_core thermal_sys wmi i2c_hid hid button
[ 45.751648] CPU: 2 PID: 564 Comm: kworker/2:2 Not tainted 4.2.6-gt+ #1
[ 45.751662] Hardware name: ASUS All Series/Z97-AR, BIOS 1304 07/11/2014
[ 45.751679] Workqueue: events dst_gc_task
[ 45.751688] task: ffff8808171dee80 ti: ffff88080ea48000 task.ti: ffff88080ea48000
[ 45.751705] RIP: 0010:[<ffffffff815526a7>] [<ffffffff815526a7>] _raw_spin_lock_bh+0x17/0x30
[ 45.751724] RSP: 0018:ffff88080ea4bcf0 EFLAGS: 00010246
[ 45.751736] RAX: 0000000000000000 RBX: ffff8807ec1397c0 RCX: 0000000000000020
[ 45.751751] RDX: 0000000000000001 RSI: ffffffff81672c21 RDI: 0000000000000000
[ 45.751766] RBP: ffff8807ec139900 R08: ffffffff81acb9c8 R09: ffff88083fa9254c
[ 45.751781] R10: 0000000000000653 R11: 00000000000003ed R12: 0000000000000000
[ 45.751796] R13: 0000000000000000 R14: ffff880035fdee40 R15: 0000000000000080
[ 45.751811] FS: 0000000000000000(0000) GS:ffff88083fa80000(0000) knlGS:0000000000000000
[ 45.751828] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.751840] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000001406e0
[ 45.751855] Stack:
[ 45.751860] ffffffff8150b43f 0000000200000000 ffff8807ec1397c0 0000000000000000
[ 45.751878] 0000000000000001 0000000000000001 ffffffff81466bea ffff88080ea4bd28
[ 45.751896] ffff8807ec1397c0 ffff88081732ac40 ffffffff81466d25 000000000000016b
[ 45.751915] Call Trace:
[ 45.751921] [<ffffffff8150b43f>] ? ip6_dst_destroy+0x3f/0xa0
[ 45.751935] [<ffffffff81466bea>] ? dst_destroy+0x2a/0xc0
[ 45.751948] [<ffffffff81466d25>] ? dst_gc_task+0xa5/0x210
[ 45.751962] [<ffffffff8101c633>] ? native_sched_clock+0x23/0x80
[ 45.751975] [<ffffffff8101c695>] ? sched_clock+0x5/0x10
[ 45.751988] [<ffffffff810a4134>] ? pick_next_task_fair+0x594/0x8d0
[ 45.752003] [<ffffffff8101263b>] ? __switch_to+0x1cb/0x560
[ 45.752016] [<ffffffff81084c0f>] ? process_one_work+0x19f/0x3d0
[ 45.752029] [<ffffffff81084e8d>] ? worker_thread+0x4d/0x450
[ 45.752042] [<ffffffff8154ec4d>] ? __schedule+0x2bd/0x8c0
[ 45.752054] [<ffffffff81084e40>] ? process_one_work+0x3d0/0x3d0
[ 45.752068] [<ffffffff8108ac81>] ? kthread+0xc1/0xe0
[ 45.752080] [<ffffffff8108abc0>] ? kthread_create_on_node+0x170/0x170
[ 45.752094] [<ffffffff8155301f>] ? ret_from_fork+0x3f/0x70
[ 45.752106] [<ffffffff8108abc0>] ? kthread_create_on_node+0x170/0x170
[ 45.752120] Code: 01 00 00 00 c3 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 81 05 b0 92 ab 7e 00 02 00 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 02 f3 c3 89 c6 e8 c8 c4 b5 ff 66 90 c3 0f
[ 45.752199] RIP [<ffffffff815526a7>] _raw_spin_lock_bh+0x17/0x30
[ 45.752214] RSP <ffff88080ea4bcf0>
[ 45.752221] CR2: 0000000000000000

I'm lost. If cleanly reverting a few commits (no conflicts) can bust a kernel locking mechanism, then I'm afraid this endeavor is futile.

-- 
Gerald Turner <gtur...@unzane.com>
Encrypted mail preferred!
OpenPGP: 4096R / CA89 B27A 30FA 66C5 1B80 3858 EC94 2276 FDB8 716D