Thank you, I will try to figure out what is happening.

Pavel

On 05/04/2018 12:01 PM, Andrei Vagin wrote:
> On Fri, May 04, 2018 at 12:47:53PM +0000, Pavel Tatashin wrote:
>> Hi Andrei,
>>
>> Could you please provide me with scripts to reproduce this issue?
>
> I boot this kernel in a kvm virtual machine. The kernel is built without
> modules. A config file is attached.
>
> Here is the qemu command line that I use to reproduce the problem:
>
> qemu-kvm -kernel /home/avagin/git/linux-next/arch/x86/boot/bzImage \
>   -append 'root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0 earlyprintk=serial,ttyS0,115200' \
>   -boot c \
>   -smp 2,sockets=2,cores=1,threads=1 \
>   -drive file=/home/vms/fc22.img,format=raw,if=none,id=drive-virtio-disk0 \
>   --display none \
>   -serial telnet:127.0.0.1:4444,server,nowait \
>   -cpu Skylake-Client-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,xsaves=on,pdpe1gb=on,ibpb=on \
>   -m 4096 \
>   -realtime mlock=off \
>   -machine pc-i440fx-2.3,accel=kvm,usb=off,dump-guest-core=off \
>   -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 \
>   -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 \
>   -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 \
>   -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 \
>   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>   -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
>   -msg timestamp=on
>
>
> [avagin@laptop linux-next]$ cat /proc/cpuinfo
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 78
> model name : Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
> stepping : 3
> microcode : 0xc2
> cpu MHz : 1213.986
> cache size : 3072 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
> ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2
> x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
> 3dnowprefetch cpuid_fault epb invpcid_single pti tpr_shadow vnmi flexpriority
> ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx
> rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves ibpb ibrs
> stibp dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
> bugs : cpu_meltdown spectre_v1 spectre_v2
> bogomips : 4992.00
> clflush size : 64
> cache_alignment : 64
> address sizes : 39 bits physical, 48 bits virtual
> power management:
>
>>
>> Thank you,
>> Pavel
>> On Fri, May 4, 2018 at 4:27 AM Andrei Vagin <ava...@virtuozzo.com> wrote:
>>
>>> Hello,
>>
>>> We have a robot which runs criu tests on linux-next kernels.
>>
>>> All tests passed on 4.17.0-rc3-next-20180502.
>>
>>> But the 4.17.0-rc3-next-20180504 kernel didn't boot.
>>
>>> git bisect points to this patch.
>>
>>> On Thu, Apr 26, 2018 at 04:26:19PM -0400, Pavel Tatashin wrote:
>>>> The following two bugs were reported by Fengguang Wu:
>>>>
>>>> kernel reboot-without-warning in early-boot stage, last printk:
>>>> early console in setup code
>>>>
>>>>
>> http://lkml.kernel.org/r/20180418135300.inazvpxjxowog...@wfg-t540p.sh.intel.com
>>
>>> The problem looks similar to this one.
>>
>>> [ 5.596975] devtmpfs: mounted
>>> [ 5.855754] Freeing unused kernel memory: 1704K
>>> [ 5.858162] Write protecting the kernel read-only data: 18432k
>>> [ 5.860772] Freeing unused kernel memory: 2012K
>>> [ 5.861838] Freeing unused kernel memory: 160K
>>> [ 5.862572] rodata_test: all tests were successful
>>> [ 5.866857] random: fast init done
>>> early console in setup code
>>> [ 0.000000] Linux version 4.17.0-rc3-00023-g7c4cc2d022a1
>>> (avagin@laptop) (gcc version 8.0.1 20180324 (Red Hat 8.0.1-0.20) (GCC))
>>> #13 SMP Fri May 4 01:10:51 PDT 2018
>>> [ 0.000000] Command line: root=/dev/vda2 ro debug
>>> console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect
>>> selinux=0 earlyprintk=serial,ttyS0,115200
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating
>>> point registers'
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
>>> [ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds
>>> registers'
>>
>>> $ git describe HEAD
>>> v4.17-rc3-23-g7c4cc2d022a1
>>
>>> [avagin@laptop linux-next]$ git log --pretty=oneline | head -n 1
>>> 7c4cc2d022a1fd56eb2ee555533b8666bc780f1e mm: access to uninitialized struct page
>>
>>
>>>>
>>>> And, also:
>>>> [per_cpu_ptr_to_phys] PANIC: early exception 0x0d
>>>> IP 10:ffffffffa892f15f error 0 cr2 0xffff88001fbff000
>>>>
>>>>
>> http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnp...@wfg-t540p.sh.intel.com
>>>>
>>>> Both of the problems are due to accessing uninitialized struct page from
>>>> trap_init(). We must first do mm_init() in order to initialize allocated
>>>> struct pages, and then we can access fields of any struct page that
>>>> belongs to memory that's been allocated.
>>>>
>>>> Below is an explanation of the root cause.
>>>>
>>>> The issue arises in this stack:
>>>>
>>>> start_kernel()
>>>>  trap_init()
>>>>   setup_cpu_entry_areas()
>>>>    setup_cpu_entry_area(cpu)
>>>>     get_cpu_gdt_paddr(cpu)
>>>>      per_cpu_ptr_to_phys(addr)
>>>>       pcpu_addr_to_page(addr)
>>>>        virt_to_page(addr)
>>>>         pfn_to_page(__pa(addr) >> PAGE_SHIFT)
>>>> The returned "struct page" is sometimes uninitialized, and thus
>>>> fails later when used. It turns out that "sometimes" is because it
>>>> depends on KASLR.
>>>>
>>>> When boot is failing we have this when pfn_to_page() is called:
>>>> kaslr: 0x000000000d600000
>>>> addr: ffffffff83e0d000
>>>> pa: 1040d000
>>>> pfn: 1040d
>>>> page: ffff88001f113340
>>>> page->flags ffffffffffffffff <- Uninitialized!
>>>>
>>>> When boot is successful:
>>>> kaslr: 0x000000000a800000
>>>> addr: ffffffff83e0d000
>>>> pa: d60d000
>>>> pfn: d60d
>>>> page: ffff88001f05b340
>>>> page->flags 280000000000 <- Initialized!
>>>>
>>>> Here are physical addresses that BIOS provided to us:
>>>> e820: BIOS-provided physical RAM map:
>>>> BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
>>>> BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
>>>> BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
>>>> BIOS-e820: [mem 0x0000000000100000-0x000000001ffdffff] usable
>>>> BIOS-e820: [mem 0x000000001ffe0000-0x000000001fffffff] reserved
>>>> BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
>>>> BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
>>>>
>>>> In both cases, working and non-working, the real physical address is
>>>> the same:
>>>>
>>>> pa - kaslr = 0x2E0D000
>>>>
>>>> The only thing that is different is PFN.
>>>>
>>>> We initialize struct pages in four places:
>>>>
>>>> 1. Early in boot a small set of struct pages is initialized to fill
>>>> the first section, and lower zones.
>>>> 2. During mm_init() we initialize "struct pages" for all the memory
>>>> that is allocated, i.e. reserved in memblock.
>>>> 3. Using on-demand logic when pages are allocated after mm_init call
>>>> 4. After smp_init(), when the rest of the free deferred pages are
>>>> initialized.
>>>>
>>>> The above path happens before deferred memory is initialized, and thus
>>>> it must be covered either by 1, 2 or 3.
>>>>
>>>> So, let's check what PFNs are initialized after (1).
>>>>
>>>> memmap_init_zone() is called for pfn ranges:
>>>> 1 - 1000, and 1000 - 1ffe0, but it quits after reaching pfn 0x10000,
>>>> as it leaves the rest to be initialized as deferred pages.
>>>>
>>>> In the working scenario pfn ended up being below 0x10000, but in the
>>>> failing scenario it is above. Hence, we must initialize this page in
>>>> (2). But trap_init() is called before mm_init().
>>>>
>>>> The bug was introduced by "mm: initialize pages on demand during boot"
>>>> because we lowered the number of pages that is initialized in step
>>>> (1). But it could still happen, because the number of initialized
>>>> pages was a guess.
>>>>
>>>> The current fix moves trap_init() to be called after mm_init, but as
>>>> an alternative, we could increase pgdat->static_init_pgcnt:
>>>> In free_area_init_node we can increase:
>>>>    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
>>>>                                     pgdat->node_spanned_pages);
>>>> Instead of one PAGES_PER_SECTION, set several, so the text is
>>>> covered for all KASLR offsets. But, this would still be guessing.
>>>> Therefore, I prefer the current fix.
>>>>
>>>> Fixes: c9e97a1997fb ("mm: initialize pages on demand during boot")
>>>>
>>>> Signed-off-by: Pavel Tatashin <pasha.tatas...@oracle.com>
>>>> Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org>
>>>> ---
>>>> init/main.c | 2 +-
>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/init/main.c b/init/main.c
>>>> index b795aa341a3a..870f75581cea 100644
>>>> --- a/init/main.c
>>>> +++ b/init/main.c
>>>> @@ -585,8 +585,8 @@ asmlinkage __visible void __init start_kernel(void)
>>>>  	setup_log_buf(0);
>>>>  	vfs_caches_init_early();
>>>>  	sort_main_extable();
>>>> -	trap_init();
>>>>  	mm_init();
>>>> +	trap_init();
>>>>
>>>>  	ftrace_init();
>>>>
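
For anyone who wants to double-check the PFN arithmetic in the quoted commit message, here is a minimal user-space sketch. It is not kernel code. It assumes 4 KiB pages (PAGE_SHIFT of 12) and takes the 0x10000 cut-off from the message's statement that memmap_init_zone() "quits after reaching pfn 0x10000". The names EARLY_INIT_END_PFN, pa_to_pfn(), pfn_is_early_initialized() and check_commit_numbers() are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86_64. */
#define PAGE_SHIFT 12

/*
 * Assumption: PFN at which the early struct page initialization stops,
 * taken from the commit message. Hypothetical name.
 */
#define EARLY_INIT_END_PFN 0x10000ULL

/* Page frame number of a physical address. */
uint64_t pa_to_pfn(uint64_t pa)
{
	return pa >> PAGE_SHIFT;
}

/* True if the PFN falls inside the range covered by the early pass. */
bool pfn_is_early_initialized(uint64_t pfn)
{
	return pfn < EARLY_INIT_END_PFN;
}

/* Re-check the numbers quoted in the commit message. */
void check_commit_numbers(void)
{
	const uint64_t fail_kaslr = 0x0d600000ULL, fail_pa = 0x1040d000ULL;
	const uint64_t ok_kaslr = 0x0a800000ULL, ok_pa = 0x0d60d000ULL;

	/* pa - kaslr is identical in both boots; only the KASLR value differs. */
	assert(fail_pa - fail_kaslr == 0x2e0d000ULL);
	assert(ok_pa - ok_kaslr == 0x2e0d000ULL);

	/* The PFNs match the values printed in the message. */
	assert(pa_to_pfn(fail_pa) == 0x1040dULL);
	assert(pa_to_pfn(ok_pa) == 0xd60dULL);

	/* Only the working boot lands below the early-init cut-off. */
	assert(!pfn_is_early_initialized(pa_to_pfn(fail_pa)));
	assert(pfn_is_early_initialized(pa_to_pfn(ok_pa)));
}
```

Calling check_commit_numbers() from a small main() and building with any C99 compiler should run through without tripping an assertion, which is consistent with the explanation that the failing boot's PFN sits just past the early-initialized range.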