Jon Wilson <jonwilson030...@googlemail.com> writes: (Adding Richard, the TCG maintainer to CC)
> I am attempting to optimize some TCG code which I have previously written for > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the > component used by the LibAFL project when fuzzing binaries using QEMU to > provide runtime instrumentation. The code is used to write additional TCG > into > basic blocks whenever a load or store operation is performed in order to > provide > address sanitizer functionality. I would like to figure out if we can push any of this instrumentation into TCG plugins so you can avoid patching QEMU itself. I guess you need something that allows you to hook a memory address into some sort of gadget? > Address sanitizer is quite simple and works by mapping each 8-byte region of > address space to single byte within a region called the shadow map. The > address > (on a 64-bit platform) of the shadow map for a given address is: > > Shadow = (Mem >> 3) + 0x7fff8000; > > The value in the shadow map encodes the accessibility of an address: > > 0 - The whole 8 byte region is accessible. > 1 .. 7 - The first n bytes are accessible. > negative - The whole 8 byte region is inaccessible. > > The following pseudo code shows the algorithm: > > //////////////////////////////////////////////////////////////////////////////// > > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm > > byte *shadow_address = MemToShadow(address); > byte shadow_value = *shadow_address; > if (shadow_value) { > if (SlowPathCheck(shadow_value, address, kAccessSize)) { > ReportError(address, kAccessSize, kIsWrite); > } > } > > // Check the cases where we access first k bytes of the qword > // and these k bytes are unpoisoned. > bool SlowPathCheck(shadow_value, address, kAccessSize) { > last_accessed_byte = (address & 7) + kAccessSize - 1; > return (last_accessed_byte >= shadow_value); > } > > //////////////////////////////////////////////////////////////////////////////// > > My current implementation makes use of conditional move instructions to > trigger > a segfault by way of null dereference in the event that the shadow map > indicates > that a memory access is invalid. > > //////////////////////////////////////////////////////////////////////////////// > > #if TARGET_LONG_BITS == 32 > #define SHADOW_BASE (0x20000000) > #elif TARGET_LONG_BITS == 64 > #define SHADOW_BASE (0x7fff8000) > #else > #error Unhandled TARGET_LONG_BITS value > #endif > > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size) > { > if (size == 0) > return; > > TCGv addr_val = temp_tcgv_tl(addr); > TCGv k = tcg_temp_new(); > TCGv shadow_addr = tcg_temp_new(); > TCGv_ptr shadow_ptr = tcg_temp_new_ptr(); > TCGv shadow_val = tcg_temp_new(); > TCGv test_addr = tcg_temp_new(); > TCGv_ptr test_ptr = tcg_temp_new_ptr(); > > tcg_gen_andi_tl(k, addr_val, 7); > tcg_gen_addi_tl(k, k, size - 1); > > tcg_gen_shri_tl(shadow_addr, addr_val, 3); > tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE); > tcg_gen_tl_ptr(shadow_ptr, shadow_addr); > tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0); > > /* > * Making conditional branches here appears to cause QEMU issues with dead > * temporaries so we will instead avoid branches. This sounds like a TCG bug that may have been fixed. > We will cause the guest > * to perform a NULL dereference in the event of an ASAN fault. Note that > * we will do this by using a store rather than a load, since the TCG may > * otherwise determine that the result of the load is unused and simply > * discard the operation. In the event that the shadow memory doesn't > * detect a fault, we will simply write the value read from the shadow > * memory back to it's original location. If, however, the shadow memory > * detects an invalid access, we will instead attempt to write the value > * at 0x0. > */ Why not conditionally call a helper here? Forcing the guest to actually fault seems a bit hammer like. > tcg_gen_movcond_tl(TCG_COND_EQ, test_addr, > shadow_val, tcg_constant_tl(0), > shadow_addr, tcg_constant_tl(0)); > > if (size < 8) > { > tcg_gen_movcond_tl(TCG_COND_GE, test_addr, > k, shadow_val, > test_addr, shadow_addr); > } > > tcg_gen_tl_ptr(test_ptr, test_addr); > tcg_gen_st8_tl(shadow_val, test_ptr, 0); > } > > //////////////////////////////////////////////////////////////////////////////// > > However, I would like test an implementation more like the following to see > how > the performance compares. Whilst this introduces branches, the fast path is > much > more likely to be executed than the slow path and hence bypassing the > additional > checks and unnecessary memory writes I am hopeful it will improve performance. > > //////////////////////////////////////////////////////////////////////////////// > > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size) > { > if (size == 0) { > return; > } > > if (size > 8) { > size = 8; > } > > TCGLabel *done = gen_new_label(); > > TCGv addr_val = temp_tcgv_tl(addr); > TCGv shadow_addr = tcg_temp_new(); > TCGv_ptr shadow_ptr = tcg_temp_new_ptr(); > TCGv shadow_val = tcg_temp_new(); > TCGv k = tcg_temp_new(); > TCGv zero = tcg_constant_tl(0); > TCGv_ptr null_ptr = tcg_temp_new_ptr(); > > tcg_gen_shri_tl(shadow_addr, addr_val, 3); > tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE); > tcg_gen_tl_ptr(shadow_ptr, shadow_addr); > tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0); > > tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done); > > tcg_gen_andi_tl(k, addr_val, 7); > tcg_gen_addi_tl(k, k, size - 1); > > tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done); > > tcg_gen_tl_ptr(null_ptr, zero); > tcg_gen_st8_tl(zero, null_ptr, 0); > > gen_set_label(done); > } > > //////////////////////////////////////////////////////////////////////////////// > > However, when I change to using this implementation, I get the following > error. > I have tested it with a trivial hello world implementation for x86_64 running > in > qemu-user. It doesn't occur the first time the block is executed, therefore I > think the issue is caused by the surrounding TCG in the block it is injected > into? > > //////////////////////////////////////////////////////////////////////////////// > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion `ts->val_type > == TEMP_VAL_REG' failed. > Aborted (core dumped) > //////////////////////////////////////////////////////////////////////////////// > > I would be very grateful for any advice of how to resolve this issue, or any > alternative approaches I could use to optimize my original implementation. The > code is obviously a very hot path and so even a tiny performance improvement > could result in a large performance gain overall. -- Alex Bennée Virtualisation Tech Lead @ Linaro