It would be good if we could have QEMU provide clean APIs to allow the sort
of additional instrumentation that fuzzing requires. I guess the
qemu-libafl-bridge project show the sort of modification which has been
required so far...
https://github.com/AFLplusplus/qemu-libafl-bridge/tree/main/libafl

I would like to conditionally call a helper, or even just insert a
breakpoint instruction, but like I say I don't seem to be able to make use
of any sort of branches. Even if I add a benign instrumentation that simply
conditionally branches at a label and nothing else (e.g. no actual
functionality), I still have the same problem.
e.g.

////////////////////////////////////////////////////////////////////////////////


TCGLabel *done = gen_new_label();
TCGv addr_val = temp_tcgv_tl(addr);
TCGv zero = tcg_constant_tl(0);
tcg_gen_brcond_tl(TCG_COND_EQ, addr_val, zero, done);
gen_set_label(done);

////////////////////////////////////////////////////////////////////////////////


Hence the current implementation is a little clumsy!

Thanks for your advice.

Jon



On Mon, Jun 2, 2025 at 4:09 PM Alex Bennée <alex.ben...@linaro.org> wrote:

> Jon Wilson <jonwilson030...@googlemail.com> writes:
>
> (Adding Richard, the TCG maintainer to CC)
>
> > I am attempting to optimize some TCG code which I have previously
> written for
> > qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge),
> the
> > component used by the LibAFL project when fuzzing binaries using QEMU to
> > provide runtime instrumentation. The code is used to write additional
> TCG into
> > basic blocks whenever a load or store operation is performed in order to
> provide
> > address sanitizer functionality.
>
> I would like to figure out if we can push any of this instrumentation
> into TCG plugins so you can avoid patching QEMU itself. I guess you need
> something that allows you to hook a memory address into some sort of
> gadget?
>
> > Address sanitizer is quite simple and works by mapping each 8-byte
> region of
> > address space to single byte within a region called the shadow map. The
> address
> > (on a 64-bit platform) of the shadow map for a given address is:
> >
> >     Shadow = (Mem >> 3) + 0x7fff8000;
> >
> > The value in the shadow map encodes the accessibility of an address:
> >
> >     0  - The whole 8 byte region is accessible.
> >     1 .. 7 - The first n bytes are accessible.
> >     negative - The whole 8 byte region is inaccessible.
> >
> > The following pseudo code shows the algorithm:
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
> >
> > byte *shadow_address = MemToShadow(address);
> > byte shadow_value = *shadow_address;
> > if (shadow_value) {
> >   if (SlowPathCheck(shadow_value, address, kAccessSize)) {
> >     ReportError(address, kAccessSize, kIsWrite);
> >   }
> > }
> >
> > // Check the cases where we access first k bytes of the qword
> > // and these k bytes are unpoisoned.
> > bool SlowPathCheck(shadow_value, address, kAccessSize) {
> >   last_accessed_byte = (address & 7) + kAccessSize - 1;
> >   return (last_accessed_byte >= shadow_value);
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > My current implementation makes use of conditional move instructions to
> trigger
> > a segfault by way of null dereference in the event that the shadow map
> indicates
> > that a memory access is invalid.
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > #if TARGET_LONG_BITS == 32
> > #define SHADOW_BASE (0x20000000)
> > #elif TARGET_LONG_BITS == 64
> > #define SHADOW_BASE (0x7fff8000)
> > #else
> > #error Unhandled TARGET_LONG_BITS value
> > #endif
> >
> > void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
> > {
> >     if (size == 0)
> >         return;
> >
> >     TCGv addr_val = temp_tcgv_tl(addr);
> >     TCGv k = tcg_temp_new();
> >     TCGv shadow_addr = tcg_temp_new();
> >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> >     TCGv shadow_val = tcg_temp_new();
> >     TCGv test_addr = tcg_temp_new();
> >     TCGv_ptr test_ptr = tcg_temp_new_ptr();
> >
> >     tcg_gen_andi_tl(k, addr_val, 7);
> >     tcg_gen_addi_tl(k, k, size - 1);
> >
> >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >
> >     /*
> >      * Making conditional branches here appears to cause QEMU issues
> with dead
> >      * temporaries so we will instead avoid branches.
>
> This sounds like a TCG bug that may have been fixed.
>
> >      We will cause the guest
> >      * to perform a NULL dereference in the event of an ASAN fault. Note
> that
> >      * we will do this by using a store rather than a load, since the
> TCG may
> >      * otherwise determine that the result of the load is unused and
> simply
> >      * discard the operation. In the event that the shadow memory doesn't
> >      * detect a fault, we will simply write the value read from the
> shadow
> >      * memory back to it's original location. If, however, the shadow
> memory
> >      * detects an invalid access, we will instead attempt to write the
> value
> >      * at 0x0.
> >      */
>
> Why not conditionally call a helper here? Forcing the guest to actually
> fault seems a bit hammer like.
>
> >     tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
> >         shadow_val, tcg_constant_tl(0),
> >         shadow_addr, tcg_constant_tl(0));
> >
> >     if (size < 8)
> >     {
> >         tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
> >             k, shadow_val,
> >             test_addr, shadow_addr);
> >     }
> >
> >     tcg_gen_tl_ptr(test_ptr, test_addr);
> >     tcg_gen_st8_tl(shadow_val, test_ptr, 0);
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > However, I would like test an implementation more like the following to
> see how
> > the performance compares. Whilst this introduces branches, the fast path
> is much
> > more likely to be executed than the slow path and hence bypassing the
> additional
> > checks and unnecessary memory writes I am hopeful it will improve
> performance.
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
> > {
> >     if (size == 0) {
> >         return;
> >     }
> >
> >     if (size > 8) {
> >         size = 8;
> >     }
> >
> >     TCGLabel *done = gen_new_label();
> >
> >     TCGv addr_val = temp_tcgv_tl(addr);
> >     TCGv shadow_addr = tcg_temp_new();
> >     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
> >     TCGv shadow_val = tcg_temp_new();
> >     TCGv k = tcg_temp_new();
> >     TCGv zero = tcg_constant_tl(0);
> >     TCGv_ptr null_ptr = tcg_temp_new_ptr();
> >
> >     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
> >     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
> >     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
> >     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
> >
> >     tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
> >
> >     tcg_gen_andi_tl(k, addr_val, 7);
> >     tcg_gen_addi_tl(k, k, size - 1);
> >
> >     tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
> >
> >     tcg_gen_tl_ptr(null_ptr, zero);
> >     tcg_gen_st8_tl(zero, null_ptr, 0);
> >
> >     gen_set_label(done);
> > }
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > However, when I change to using this implementation, I get the following
> error.
> > I have tested it with a trivial hello world implementation for x86_64
> running in
> > qemu-user. It doesn't occur the first time the block is executed,
> therefore I
> > think the issue is caused by the surrounding TCG in the block it is
> injected
> > into?
> >
> >
> ////////////////////////////////////////////////////////////////////////////////
> > runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion
> `ts->val_type == TEMP_VAL_REG' failed.
> > Aborted (core dumped)
> >
> ////////////////////////////////////////////////////////////////////////////////
> >
> > I would be very grateful for any advice of how to resolve this issue, or
> any
> > alternative approaches I could use to optimize my original
> implementation. The
> > code is obviously a very hot path and so even a tiny performance
> improvement
> > could result in a large performance gain overall.
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
>

Reply via email to