Jon Wilson <jonwilson030...@googlemail.com> writes: > It would be good if we could have QEMU provide clean APIs to allow the sort > of additional instrumentation that fuzzing > requires. I guess the qemu-libafl-bridge project show the sort of > modification which has been required so far... > https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl > > I would like to conditionally call a helper, or even just insert a breakpoint > instruction, but like I say I don't seem to be able to > make use of any sort of branches.
For TCG plugins we have: /** * qemu_plugin_register_vcpu_tb_exec_cond_cb() - register conditional callback * @tb: the opaque qemu_plugin_tb handle for the translation * @cb: callback function * @cond: condition to enable callback * @entry: first operand for condition * @imm: second operand for condition * @flags: does the plugin read or write the CPU's registers? * @userdata: any plugin data to pass to the @cb? * * The @cb function is called when a translated unit executes if * entry @cond imm is true. * If condition is QEMU_PLUGIN_COND_ALWAYS, condition is never interpreted and * this function is equivalent to qemu_plugin_register_vcpu_tb_exec_cb. * If condition QEMU_PLUGIN_COND_NEVER, condition is never interpreted and * callback is never installed. */ QEMU_PLUGIN_API void qemu_plugin_register_vcpu_tb_exec_cond_cb(struct qemu_plugin_tb *tb, qemu_plugin_vcpu_udata_cb_t cb, enum qemu_plugin_cb_flags flags, enum qemu_plugin_cond cond, qemu_plugin_u64 entry, uint64_t imm, void *userdata); Along with qemu_plugin_register_vcpu_insn_exec_cond_cb() for individual instructions. They are designed work with per-cpu indexed scoreboards using inline operations (e.g. qemu_plugin_register_vcpu_tb_exec_inline_per_vcpu) which can either add or store values. Currently we inline memory operations are a bit limited: /** * qemu_plugin_register_vcpu_mem_inline_per_vcpu() - inline op for mem access * @insn: handle for instruction to instrument * @rw: apply to reads, writes or both * @op: the op, of type qemu_plugin_op * @entry: entry to run op * @imm: immediate data for @op * * This registers a inline op every memory access generated by the * instruction. */ QEMU_PLUGIN_API void qemu_plugin_register_vcpu_mem_inline_per_vcpu( struct qemu_plugin_insn *insn, enum qemu_plugin_mem_rw rw, enum qemu_plugin_op op, qemu_plugin_u64 entry, uint64_t imm); Although you can get access to the memory information through helpers: /** * qemu_plugin_register_vcpu_mem_cb() - register memory access callback * @insn: handle for instruction to instrument * @cb: callback of type qemu_plugin_vcpu_mem_cb_t * @flags: (currently unused) callback flags * @rw: monitor reads, writes or both * @userdata: opaque pointer for userdata * * This registers a full callback for every memory access generated by * an instruction. If the instruction doesn't access memory no * callback will be made. * * The callback reports the vCPU the access took place on, the virtual * address of the access and a handle for further queries. The user * can attach some userdata to the callback for additional purposes. * * Other execution threads will continue to execute during the * callback so the plugin is responsible for ensuring it doesn't get * confused by making appropriate use of locking if required. */ QEMU_PLUGIN_API void qemu_plugin_register_vcpu_mem_cb(struct qemu_plugin_insn *insn, qemu_plugin_vcpu_mem_cb_t cb, enum qemu_plugin_cb_flags flags, enum qemu_plugin_mem_rw rw, void *userdata); where you can then process qemu_plugin_meminfo_t info in the callback to get the value or address of a memory operation. I guess you want some sort of operation to do an inline transform of the address to use a lookup to compare and branch to. The trick is coming up with an interface which is general and flexible enough and not just "what asan needs for a specific use case". > Even if I add a benign instrumentation that simply conditionally branches at > a label and > nothing else (e.g. no actual functionality), I still have the same problem. > e.g. > > //////////////////////////////////////////////////////////////////////////////// > > > TCGLabel *done = gen_new_label(); > TCGv addr_val = temp_tcgv_tl(addr); > TCGv zero = tcg_constant_tl(0); > tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done); > gen_set_label(done); > > //////////////////////////////////////////////////////////////////////////////// > > > Hence the current implementation is a little clumsy! > > Thanks for your advice. > > Jon > > On Mon, Jun 2, 2025 at 4:09 PM Alex Bennée <alex.ben...@linaro.org> wrote: > > Jon Wilson <jonwilson030...@googlemail.com> writes: > > (Adding Richard, the TCG maintainer to CC) > > > I am attempting to optimize some TCG code which I have previously written > for > > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), > the > > component used by the LibAFL project when fuzzing binaries using QEMU to > > provide runtime instrumentation. The code is used to write additional TCG > into > > basic blocks whenever a load or store operation is performed in order to > provide > > address sanitizer functionality. > > I would like to figure out if we can push any of this instrumentation > into TCG plugins so you can avoid patching QEMU itself. I guess you need > something that allows you to hook a memory address into some sort of > gadget? > > > Address sanitizer is quite simple and works by mapping each 8-byte region > of > > address space to single byte within a region called the shadow map. The > address > > (on a 64-bit platform) of the shadow map for a given address is: > > > > Shadow = (Mem >> 3) + 0x7fff8000; > > > > The value in the shadow map encodes the accessibility of an address: > > > > 0 - The whole 8 byte region is accessible. > > 1 .. 7 - The first n bytes are accessible. > > negative - The whole 8 byte region is inaccessible. > > > > The following pseudo code shows the algorithm: > > > > > //////////////////////////////////////////////////////////////////////////////// > > > > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm > > > > byte *shadow_address = MemToShadow(address); > > byte shadow_value = *shadow_address; > > if (shadow_value) { > > if (SlowPathCheck(shadow_value, address, kAccessSize)) { > > ReportError(address, kAccessSize, kIsWrite); > > } > > } > > > > // Check the cases where we access first k bytes of the qword > > // and these k bytes are unpoisoned. > > bool SlowPathCheck(shadow_value, address, kAccessSize) { > > last_accessed_byte = (address & 7) + kAccessSize - 1; > > return (last_accessed_byte >= shadow_value); > > } > > > > > //////////////////////////////////////////////////////////////////////////////// > > > > My current implementation makes use of conditional move instructions to > trigger > > a segfault by way of null dereference in the event that the shadow map > indicates > > that a memory access is invalid. > > > > > //////////////////////////////////////////////////////////////////////////////// > > > > #if TARGET_LONG_BITS == 32 > > #define SHADOW_BASE (0x20000000) > > #elif TARGET_LONG_BITS == 64 > > #define SHADOW_BASE (0x7fff8000) > > #else > > #error Unhandled TARGET_LONG_BITS value > > #endif > > > > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size) > > { > > if (size == 0) > > return; > > > > TCGv addr_val = temp_tcgv_tl(addr); > > TCGv k = tcg_temp_new(); > > TCGv shadow_addr = tcg_temp_new(); > > TCGv_ptr shadow_ptr = tcg_temp_new_ptr(); > > TCGv shadow_val = tcg_temp_new(); > > TCGv test_addr = tcg_temp_new(); > > TCGv_ptr test_ptr = tcg_temp_new_ptr(); > > > > tcg_gen_andi_tl(k, addr_val, 7); > > tcg_gen_addi_tl(k, k, size - 1); > > > > tcg_gen_shri_tl(shadow_addr, addr_val, 3); > > tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE); > > tcg_gen_tl_ptr(shadow_ptr, shadow_addr); > > tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0); > > > > /* > > * Making conditional branches here appears to cause QEMU issues with > dead > > * temporaries so we will instead avoid branches. > > This sounds like a TCG bug that may have been fixed. > > > We will cause the guest > > * to perform a NULL dereference in the event of an ASAN fault. Note > that > > * we will do this by using a store rather than a load, since the TCG > may > > * otherwise determine that the result of the load is unused and simply > > * discard the operation. In the event that the shadow memory doesn't > > * detect a fault, we will simply write the value read from the shadow > > * memory back to it's original location. If, however, the shadow > memory > > * detects an invalid access, we will instead attempt to write the > value > > * at 0x0. > > */ > > Why not conditionally call a helper here? Forcing the guest to actually > fault seems a bit hammer like. > > > tcg_gen_movcond_tl(TCG_COND_EQ, test_addr, > > shadow_val, tcg_constant_tl(0), > > shadow_addr, tcg_constant_tl(0)); > > > > if (size < 8) > > { > > tcg_gen_movcond_tl(TCG_COND_GE, test_addr, > > k, shadow_val, > > test_addr, shadow_addr); > > } > > > > tcg_gen_tl_ptr(test_ptr, test_addr); > > tcg_gen_st8_tl(shadow_val, test_ptr, 0); > > } > > > > > //////////////////////////////////////////////////////////////////////////////// > > > > However, I would like test an implementation more like the following to > see how > > the performance compares. Whilst this introduces branches, the fast path > is much > > more likely to be executed than the slow path and hence bypassing the > additional > > checks and unnecessary memory writes I am hopeful it will improve > performance. > > > > > //////////////////////////////////////////////////////////////////////////////// > > > > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size) > > { > > if (size == 0) { > > return; > > } > > > > if (size > 8) { > > size = 8; > > } > > > > TCGLabel *done = gen_new_label(); > > > > TCGv addr_val = temp_tcgv_tl(addr); > > TCGv shadow_addr = tcg_temp_new(); > > TCGv_ptr shadow_ptr = tcg_temp_new_ptr(); > > TCGv shadow_val = tcg_temp_new(); > > TCGv k = tcg_temp_new(); > > TCGv zero = tcg_constant_tl(0); > > TCGv_ptr null_ptr = tcg_temp_new_ptr(); > > > > tcg_gen_shri_tl(shadow_addr, addr_val, 3); > > tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE); > > tcg_gen_tl_ptr(shadow_ptr, shadow_addr); > > tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0); > > > > tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done); > > > > tcg_gen_andi_tl(k, addr_val, 7); > > tcg_gen_addi_tl(k, k, size - 1); > > > > tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done); > > > > tcg_gen_tl_ptr(null_ptr, zero); > > tcg_gen_st8_tl(zero, null_ptr, 0); > > > > gen_set_label(done); > > } > > > > > //////////////////////////////////////////////////////////////////////////////// > > > > However, when I change to using this implementation, I get the following > error. > > I have tested it with a trivial hello world implementation for x86_64 > running in > > qemu-user. It doesn't occur the first time the block is executed, > therefore I > > think the issue is caused by the surrounding TCG in the block it is > injected > > into? > > > > > //////////////////////////////////////////////////////////////////////////////// > > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion > `ts->val_type == TEMP_VAL_REG' failed. > > Aborted (core dumped) > > > //////////////////////////////////////////////////////////////////////////////// > > > > I would be very grateful for any advice of how to resolve this issue, or > any > > alternative approaches I could use to optimize my original implementation. > The > > code is obviously a very hot path and so even a tiny performance > improvement > > could result in a large performance gain overall. > > -- > Alex Bennée > Virtualisation Tech Lead @ Linaro -- Alex Bennée Virtualisation Tech Lead @ Linaro