On 04/14/2012 03:44 PM, 陳韋任 wrote:
>> I've made a test from the grub multiboot sample, you may find it here:
>> http://jcmvbkbc.spb.ru/git/?p=dumb/qemu-test-kernel.git;a=summary
>>
>> With it I see that an attempt to execute a TB that spans two pages causes
>> an exception when the second page is unmapped. It happens because both
>> tlb_flush and tlb_flush_page invalidate relevant tb_jmp_cache entries:
>> the former flushes all of them, the latter flushes them for two adjacent pages
>> around the given address. Later tb_find_fast fails to find a TB in the
>> tb_jmp_cache and has to call tb_find_slow which retranslates TB, triggering
>> a pagefault.
>
> Thanks for the example, Max. But..., I want to repeat the experiment you did
> and cannot figure out how to do that. Would you mind to give me some hints? For
> example, how did you locate the TB spanning pages whose second page happened to
> be unmapped?

The first two patches in the mentioned repository are the GRUB multiboot
kernel sample; the third patch is my test. It can be built and run like
this (you'll need autotools):

$ git clone git://jcmvbkbc.spb.ru/dumb/qemu-test-kernel.git
$ cd qemu-test-kernel
$ git checkout HEAD~1 # to see how the original kernel works
$ ./autogen.sh
$ ./configure
$ make
$ qemu-system-x86_64 -kernel docs/kernel

According to the multiboot specification [1], a multiboot kernel starts
execution in protected mode with paging disabled. The following fragment
allocates a properly aligned page directory and one page table, makes a
1:1 virtual-to-physical mapping for the first 4MB of memory, loads the
page directory address into CR3 and enables paging (bit 31 in CR0):

uint32_t page_directory[1024] __attribute__((aligned(4096)));
uint32_t page_table[1024] __attribute__((aligned(4096)));

static void start_paging(void)
{
    unsigned i;

    /* identity-map 4MB; low bits 3 = present | writable */
    for (i = 0; i < ARRAY_SIZE(page_table); ++i)
        page_table[i] = (i << 12) | 3;
    page_directory[0] = ((uint32_t)page_table) | 3;

    asm __volatile__ (
            "movl %0, %%cr3\n"
            "movl %%cr0, %0\n"
            "orl $0x80000000, %0\n"
            "movl %0, %%cr0\n"
            : : "r"(page_directory) : "memory");
}

The following fragment allocates two adjacent pages and puts the test
code around the page boundary between them: 20 'nop' instructions
(opcode 0x90), 10 in the first page and 10 in the second page, followed
by a 'ret' instruction (opcode 0xc3):

uint8_t code_buf[8192] __attribute__((aligned(4096)));

static void make_test_code(void)
{
    unsigned i;

    for (i = 0; i < 20; ++i)
        code_buf[4096 - 10 + i] = 0x90;
    code_buf[4096 + 10] = 0xc3;
}

The following fragment makes a function pointer f that points to the
beginning of the 'nop' series and calls it once to create a TB (and to
check that the code works at all). If a return is put right after the
first 'f();', the sample kernel should print a few lines describing the
memory map and halt. 'code_pfn' is the page frame number of the second
page of the test code.
'page_table[code_pfn] = 0;' marks that page as non-present, and the
following invlpg instruction invalidates its TLB entry. The commented-out
code that reloads the CR3 register may be used instead to invalidate the
whole TLB. The second 'f();' invocation then fails, resulting in a
machine reset (because the IDT is not initialized).

static void test_code(void)
{
    void (*f)(void) = (void*)(code_buf + 4096 - 10);
    uint32_t code_pfn = (uint32_t)(code_buf + 4096) >> 12;

    f();                        /* first call: TB gets translated */
    page_table[code_pfn] = 0;   /* unmap the second page */
    //asm __volatile__ (
    //        "movl %%cr3, %%eax\n"
    //        "movl %%eax, %%cr3\n"
    //        ::: "memory");
    asm __volatile__ (
            "invlpg (%0)\n"
            : : "r"(code_buf + 4096) :"memory");
    f();                        /* second call: page fault */
}

When the kernel is run with '-d in_asm,cpu,exec,int' I see the following
in the log:

IN: cmain
0x0000000000100272: movb $0xc3,0x10900a
0x0000000000100279: mov $0x108ff6,%ebx
0x000000000010027e: call *%ebx

Trace 0x4191ad90 [0000000000100272] cmain
EAX=00000014 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc4
EIP=00108ff6 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000cca10 00000027
IDT= 00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000014 CCD=00000000 CCO=SUBL
EFER=0000000000000000
----------------
IN:
0x0000000000108ff6: nop
0x0000000000108ff7: nop
0x0000000000108ff8: nop
0x0000000000108ff9: nop
0x0000000000108ffa: nop
0x0000000000108ffb: nop
0x0000000000108ffc: nop
0x0000000000108ffd: nop
0x0000000000108ffe: nop
0x0000000000108fff: nop
0x0000000000109000: nop
0x0000000000109001: nop
0x0000000000109002: nop
0x0000000000109003: nop
0x0000000000109004: nop
0x0000000000109005: nop
0x0000000000109006: nop
0x0000000000109007: nop
0x0000000000109008: nop
0x0000000000109009: nop
0x000000000010900a: ret

Trace 0x4191ae60 [0000000000108ff6]
EAX=00000014 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc8
EIP=00100280 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000cca10 00000027
IDT= 00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000014 CCD=00000000 CCO=SUBL
EFER=0000000000000000
----------------
IN: cmain
0x0000000000100280: mov $0x109000,%eax
0x0000000000100285: shr $0xc,%eax
0x0000000000100288: movl $0x0,0x106000(,%eax,4)
0x0000000000100293: mov $0x109000,%eax
0x0000000000100298: invlpg (%eax)

Trace 0x4191aed0 [0000000000100280] cmain
EAX=00109000 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc8
EIP=0010029b EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000cca10 00000027
IDT= 00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000212 CCD=00000109 CCO=SARL
EFER=0000000000000000
----------------
IN: cmain
0x000000000010029b: call *%ebx

Trace 0x4191afa0 [000000000010029b] cmain
EAX=00109000 EBX=00108ff6 ECX=00100000 EDX=003ff003
ESI=00009500 EDI=2badb002 EBP=00000000 ESP=00104fc4
EIP=00108ff6 EFL=00000006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
CS =0008 00000000 ffffffff 00cf9a00 DPL=0 CS32 [-R-]
SS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
DS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
FS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
GS =0010 00000000 ffffffff 00cf9300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
GDT= 000cca10 00000027
IDT= 00000000 000003ff
CR0=80000011 CR2=00000000 CR3=00107000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
CCS=00000212 CCD=00000109 CCO=SARL
EFER=0000000000000000
check_exception old: 0xffffffff new 0xe
     0: v=0e e=0000 i=0 cpl=0 IP=0008:0000000000108ff6 pc=0000000000108ff6 SP=0010:0000000000104fc4 CR2=0000000000109000

That's it (: The second call raises a page fault (v=0e) at pc=0x108ff6
with CR2=0x109000, i.e. on the unmapped second page, and no new 'IN:'
block or Trace line is logged for the nops.

> Also, I found something interesting in function cpu_exec (cpu-exec.c). The
> code snip below will do block linking only when the target tb does NOT span
> guest pages. Is it necessary? According to your observation, it seems QEMU
> handle tb spanning pages appropriately, why it still needs to check if the
> target tb spanning guest pages?

Because QEMU handles TBs that span pages in tb_find_fast/tb_find_slow,
and those functions are not called when TBs are directly linked. A
linked jump into a spanning TB would therefore skip that handling.
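
As a rough illustration of why the second page matters for linking, here
is a toy model. It is not QEMU code: 'toy_tb', 'toy_tb_make' and
'toy_can_chain' are made-up names, and the only thing borrowed from QEMU
is the "page_addr[1] == -1 means single page" convention visible in the
cpu_exec snippet quoted further down. A block records the page of its
first byte and, if it crosses a boundary, the page of its last byte; only
blocks confined to one page are considered eligible for direct linking:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_SIZE (1u << TOY_PAGE_BITS)
#define TOY_PAGE_MASK (~(uint32_t)(TOY_PAGE_SIZE - 1))
#define TOY_NO_PAGE   UINT32_MAX   /* stands in for "-1": no second page */

/* Hypothetical, heavily simplified stand-in for a translated block. */
struct toy_tb {
    uint32_t pc;            /* guest address of the first byte */
    uint32_t size;          /* guest code size in bytes */
    uint32_t page_addr[2];  /* page of first byte, page of last byte */
};

/* Build a toy block and record which guest pages its code occupies. */
static struct toy_tb toy_tb_make(uint32_t pc, uint32_t size)
{
    /* Assumption of this sketch: a block covers at most two pages. */
    assert(size > 0 && size <= TOY_PAGE_SIZE);

    struct toy_tb tb = { pc, size, { pc & TOY_PAGE_MASK, TOY_NO_PAGE } };
    uint32_t last_page = (pc + size - 1) & TOY_PAGE_MASK;

    if (last_page != tb.page_addr[0])
        tb.page_addr[1] = last_page;
    return tb;
}

/* A block may be chained to only if it does not touch a second page. */
static bool toy_can_chain(const struct toy_tb *tb)
{
    return tb->page_addr[1] == TOY_NO_PAGE;
}
```

With the addresses from the log above, toy_tb_make(0x108ff6, 21) (20 nops
plus the ret) gets page_addr[1] == 0x109000 and is not chainable, whereas
the 14-byte cmain block at 0x100272 stays within page 0x100000 and is.
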
The need for that check can be easily verified with the test kernel by
adding a direct short jump (opcode 0xeb, jump target offset +8 bytes) to
the test code, so that the jump lands on the first 'nop':

static void make_test_code(void)
{
    unsigned i;

    code_buf[4096 - 20] = 0xeb;   /* jmp rel8 */
    code_buf[4096 - 19] = 8;      /* target: code_buf + 4096 - 10 */
    for (i = 0; i < 20; ++i)
        code_buf[4096 - 10 + i] = 0x90;
    code_buf[4096 + 10] = 0xc3;
}

static void test_code(void)
{
    void (*f)(void) = (void*)(code_buf + 4096 - 20);
    ...

> ---
>   if (next_tb != 0 && tb->page_addr[1] == -1) {
>                       ^^^^^^^^^^^^^^^^^^^^^^
>       tb_add_jump((TranslationBlock *)(next_tb & ~3), next_tb & 3, tb);
>   }
> ---
>
> Finally, does the comment on gen_goto_tb (target-i386/translate.c) still
> hold? Maybe we should change it to something like "we handle the case where
> the block linking spans two pages here"?

I'd say that it does: the check is that pc is in the same page as either
the beginning or the end of the TB, and those two pages differ only when
the TB spans two pages.

> ---
>     /* NOTE: we handle the case where the TB spans two pages here */
>     if ((pc & TARGET_PAGE_MASK) == (tb->pc & TARGET_PAGE_MASK) ||
>         (pc & TARGET_PAGE_MASK) == ((s->pc - 1) & TARGET_PAGE_MASK))  {
>     }
> ---

[1] http://www.gnu.org/software/grub/manual/multiboot/multiboot.html#Machine-state

-- 
Thanks.
-- Max