I am attempting to optimize some TCG code I previously wrote for qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the component the LibAFL project uses to provide runtime instrumentation when fuzzing binaries with QEMU. The code injects additional TCG ops into basic blocks wherever a load or store operation is performed, in order to provide AddressSanitizer functionality.

AddressSanitizer is quite simple: it maps each 8-byte region of address space to a single byte within a region called the shadow map. On a 64-bit platform, the shadow byte for a given address is located at:

    Shadow = (Mem >> 3) + 0x7fff8000;

The value of the shadow byte encodes the accessibility of the 8-byte region:

    0        - the whole 8-byte region is accessible
    1 .. 7   - only the first n bytes are accessible (n being the shadow value)
    negative - the whole 8-byte region is inaccessible

The following pseudo code shows the algorithm:

////////////////////////////////////////////////////////////////////////////////
// https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm

byte *shadow_address = MemToShadow(address);
byte shadow_value = *shadow_address;
if (shadow_value) {
    if (SlowPathCheck(shadow_value, address, kAccessSize)) {
        ReportError(address, kAccessSize, kIsWrite);
    }
}

// Check the cases where we access first k bytes of the qword
// and these k bytes are unpoisoned.
bool SlowPathCheck(shadow_value, address, kAccessSize) {
    last_accessed_byte = (address & 7) + kAccessSize - 1;
    return (last_accessed_byte >= shadow_value);
}
////////////////////////////////////////////////////////////////////////////////
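For concreteness, the pseudo code above translates almost line-for-line into compilable C. This is a minimal sketch assuming a 64-bit target; the report_error stub and the helper names are my own, not part of the ASan runtime or the bridge:

////////////////////////////////////////////////////////////////////////////////
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define SHADOW_BASE 0x7fff8000ull

/* Stub standing in for ASan's error reporting; the name is mine. */
static void report_error(uintptr_t addr, size_t size, bool is_write)
{
    fprintf(stderr, "invalid %s of %zu bytes at %p\n",
            is_write ? "write" : "read", size, (void *)addr);
    abort();
}

static inline int8_t *mem_to_shadow(uintptr_t addr)
{
    return (int8_t *)((addr >> 3) + SHADOW_BASE);
}

/* True if the access runs past the accessible prefix of the qword.
 * The comparison must be signed so that negative (fully poisoned)
 * shadow values always report an error. */
static inline bool slow_path_check(int8_t shadow_value, uintptr_t addr,
                                   size_t access_size)
{
    int last_accessed_byte = (int)(addr & 7) + (int)access_size - 1;
    return last_accessed_byte >= shadow_value;
}

static inline void check_access(uintptr_t addr, size_t access_size,
                                bool is_write)
{
    int8_t shadow_value = *mem_to_shadow(addr);
    if (shadow_value != 0 &&
        slow_path_check(shadow_value, addr, access_size)) {
        report_error(addr, access_size, is_write);
    }
}
////////////////////////////////////////////////////////////////////////////////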
My current implementation uses conditional-move instructions to trigger a segfault, by way of a NULL dereference, whenever the shadow map indicates that a memory access is invalid:

////////////////////////////////////////////////////////////////////////////////
#if TARGET_LONG_BITS == 32
#define SHADOW_BASE (0x20000000)
#elif TARGET_LONG_BITS == 64
#define SHADOW_BASE (0x7fff8000)
#else
#error Unhandled TARGET_LONG_BITS value
#endif

void libafl_tcg_gen_asan(TCGTemp *addr, size_t size)
{
    if (size == 0) {
        return;
    }

    TCGv addr_val = temp_tcgv_tl(addr);
    TCGv k = tcg_temp_new();
    TCGv shadow_addr = tcg_temp_new();
    TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
    TCGv shadow_val = tcg_temp_new();
    TCGv test_addr = tcg_temp_new();
    TCGv_ptr test_ptr = tcg_temp_new_ptr();

    /* k is the index of the last byte touched by the access. */
    tcg_gen_andi_tl(k, addr_val, 7);
    tcg_gen_addi_tl(k, k, size - 1);

    /* Load the shadow byte for this address, sign-extended. */
    tcg_gen_shri_tl(shadow_addr, addr_val, 3);
    tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
    tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
    tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);

    /*
     * Making conditional branches here appears to cause QEMU issues with
     * dead temporaries, so we avoid branches and instead cause the guest to
     * perform a NULL dereference in the event of an ASAN fault. Note that
     * we do this with a store rather than a load, since TCG may otherwise
     * determine that the result of the load is unused and simply discard
     * the operation. If the shadow memory doesn't detect a fault, we simply
     * write the value read from the shadow memory back to its original
     * location. If, however, the shadow memory detects an invalid access,
     * we instead attempt the write at address 0x0.
     */
    tcg_gen_movcond_tl(TCG_COND_EQ, test_addr, shadow_val, tcg_constant_tl(0),
                       shadow_addr, tcg_constant_tl(0));

    if (size < 8) {
        tcg_gen_movcond_tl(TCG_COND_GE, test_addr, k, shadow_val,
                           test_addr, shadow_addr);
    }

    tcg_gen_tl_ptr(test_ptr, test_addr);
    tcg_gen_st8_tl(shadow_val, test_ptr, 0);
}
////////////////////////////////////////////////////////////////////////////////
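To make the conditional-move logic easier to follow, here is a plain-C model of what the generated ops compute for a single guest access. The function and variable names are mine, and this is only a sketch of the semantics, not code from the bridge:

////////////////////////////////////////////////////////////////////////////////
#include <stddef.h>
#include <stdint.h>

#define SHADOW_BASE 0x7fff8000ull

/* Models the TCG emitted above for one guest access of 1..8 bytes. */
static void asan_check_model(uintptr_t addr, size_t size)
{
    int8_t *shadow_ptr = (int8_t *)((addr >> 3) + SHADOW_BASE);
    int8_t shadow_val = *shadow_ptr;   /* ld8s: loaded sign-extended */
    intptr_t k = (intptr_t)(addr & 7) + (intptr_t)size - 1;

    /* movcond EQ: write back harmlessly if the whole qword is
     * accessible, otherwise aim the store at NULL. */
    int8_t *test_ptr = (shadow_val == 0) ? shadow_ptr : NULL;

    if (size < 8) {
        /* movcond GE (signed): a partially accessible qword is still
         * fine if the last byte touched lies below the shadow value. */
        test_ptr = (k >= shadow_val) ? test_ptr : shadow_ptr;
    }

    /* The store always happens; only its target differs. */
    *test_ptr = shadow_val;
}
////////////////////////////////////////////////////////////////////////////////

The key point of the trick is that the store is unconditional, so it cannot be discarded as dead: on the good path it is an idempotent write-back to the shadow byte, and on the bad path it is a NULL dereference that raises SIGSEGV.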
However, I would like to test an implementation more like the following, to see how the performance compares. Whilst this introduces branches, the fast path is much more likely to be executed than the slow path, so by bypassing the additional checks and the unnecessary memory write I am hopeful it will improve performance.

////////////////////////////////////////////////////////////////////////////////
void libafl_tcg_gen_asan(TCGTemp *addr, size_t size)
{
    if (size == 0) {
        return;
    }

    if (size > 8) {
        size = 8;
    }

    TCGLabel *done = gen_new_label();

    TCGv addr_val = temp_tcgv_tl(addr);
    TCGv shadow_addr = tcg_temp_new();
    TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
    TCGv shadow_val = tcg_temp_new();
    TCGv k = tcg_temp_new();
    TCGv zero = tcg_constant_tl(0);
    TCGv_ptr null_ptr = tcg_temp_new_ptr();

    tcg_gen_shri_tl(shadow_addr, addr_val, 3);
    tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
    tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
    tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);

    /* Fast path: the whole qword is accessible. */
    tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);

    /* Slow path: the access is still OK if the last byte touched
     * lies below the shadow value (k < shadow_val). */
    tcg_gen_andi_tl(k, addr_val, 7);
    tcg_gen_addi_tl(k, k, size - 1);
    tcg_gen_brcond_tl(TCG_COND_GT, shadow_val, k, done);

    /* Fault: trigger a segfault via a NULL store. */
    tcg_gen_tl_ptr(null_ptr, zero);
    tcg_gen_st8_tl(zero, null_ptr, 0);

    gen_set_label(done);
}
////////////////////////////////////////////////////////////////////////////////

However, when I switch to this implementation, I get the following error. I have tested it with a trivial hello-world binary for x86_64 running under qemu-user. The assertion does not fire the first time the block is executed, so I suspect the issue is caused by interaction with the surrounding TCG in the block into which this code is injected?

////////////////////////////////////////////////////////////////////////////////
runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion `ts->val_type == TEMP_VAL_REG' failed.
Aborted (core dumped)
////////////////////////////////////////////////////////////////////////////////

I would be very grateful for any advice on how to resolve this issue, or for any alternative approaches I could use to optimize my original implementation. The code is obviously a very hot path, so even a tiny per-access improvement could result in a large overall performance gain.