Jon Wilson <jonwilson030...@googlemail.com> writes:

(Adding Richard, the TCG maintainer to CC)

> I am attempting to optimize some TCG code which I have previously written for
> qemu-libafl-bridge (https://github.com/AFLplusplus/qemu-libafl-bridge), the 
> component used by the LibAFL project when fuzzing binaries using QEMU to 
> provide runtime instrumentation. The code is used to write additional TCG 
> into 
> basic blocks whenever a load or store operation is performed in order to 
> provide
> address sanitizer functionality.

I would like to figure out if we can push any of this instrumentation
into TCG plugins so you can avoid patching QEMU itself. I guess you need
something that allows you to hook a memory address into some sort of
gadget?

> Address sanitizer is quite simple and works by mapping each 8-byte region of
> address space to single byte within a region called the shadow map. The 
> address
> (on a 64-bit platform) of the shadow map for a given address is:
>
>     Shadow = (Mem >> 3) + 0x7fff8000;
>
> The value in the shadow map encodes the accessibility of an address:
>
>     0  - The whole 8 byte region is accessible.
>     1 .. 7 - The first n bytes are accessible.
>     negative - The whole 8 byte region is inaccessible.
>
> The following pseudo code shows the algorithm:
>
> ////////////////////////////////////////////////////////////////////////////////
>
> https://github.com/google/sanitizers/wiki/addresssanitizeralgorithm
>
> byte *shadow_address = MemToShadow(address);
> byte shadow_value = *shadow_address;
> if (shadow_value) {
>   if (SlowPathCheck(shadow_value, address, kAccessSize)) {
>     ReportError(address, kAccessSize, kIsWrite);
>   }
> }
>
> // Check the cases where we access first k bytes of the qword
> // and these k bytes are unpoisoned.
> bool SlowPathCheck(shadow_value, address, kAccessSize) {
>   last_accessed_byte = (address & 7) + kAccessSize - 1;
>   return (last_accessed_byte >= shadow_value);
> }
>
> ////////////////////////////////////////////////////////////////////////////////
>
> My current implementation makes use of conditional move instructions to 
> trigger
> a segfault by way of null dereference in the event that the shadow map 
> indicates
> that a memory access is invalid.
>
> ////////////////////////////////////////////////////////////////////////////////
>
> #if TARGET_LONG_BITS == 32
> #define SHADOW_BASE (0x20000000)
> #elif TARGET_LONG_BITS == 64
> #define SHADOW_BASE (0x7fff8000)
> #else
> #error Unhandled TARGET_LONG_BITS value
> #endif
>
> void libafl_tcg_gen_asan(TCGTemp * addr, size_t size)
> {
>     if (size == 0)
>         return;
>
>     TCGv addr_val = temp_tcgv_tl(addr);
>     TCGv k = tcg_temp_new();
>     TCGv shadow_addr = tcg_temp_new();
>     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
>     TCGv shadow_val = tcg_temp_new();
>     TCGv test_addr = tcg_temp_new();
>     TCGv_ptr test_ptr = tcg_temp_new_ptr();
>
>     tcg_gen_andi_tl(k, addr_val, 7);
>     tcg_gen_addi_tl(k, k, size - 1);
>
>     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
>     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
>     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
>     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
>
>     /*
>      * Making conditional branches here appears to cause QEMU issues with dead
>      * temporaries so we will instead avoid branches.

This sounds like a TCG bug that may have been fixed.

>      We will cause the guest
>      * to perform a NULL dereference in the event of an ASAN fault. Note that
>      * we will do this by using a store rather than a load, since the TCG may
>      * otherwise determine that the result of the load is unused and simply
>      * discard the operation. In the event that the shadow memory doesn't
>      * detect a fault, we will simply write the value read from the shadow
>      * memory back to it's original location. If, however, the shadow memory
>      * detects an invalid access, we will instead attempt to write the value
>      * at 0x0.
>      */

Why not conditionally call a helper here? Forcing the guest to actually
fault seems a bit hammer like.

>     tcg_gen_movcond_tl(TCG_COND_EQ, test_addr,
>         shadow_val, tcg_constant_tl(0),
>         shadow_addr, tcg_constant_tl(0));
>
>     if (size < 8)
>     {
>         tcg_gen_movcond_tl(TCG_COND_GE, test_addr,
>             k, shadow_val,
>             test_addr, shadow_addr);
>     }
>
>     tcg_gen_tl_ptr(test_ptr, test_addr);
>     tcg_gen_st8_tl(shadow_val, test_ptr, 0);
> }
>
> ////////////////////////////////////////////////////////////////////////////////
>
> However, I would like test an implementation more like the following to see 
> how
> the performance compares. Whilst this introduces branches, the fast path is 
> much
> more likely to be executed than the slow path and hence bypassing the 
> additional
> checks and unnecessary memory writes I am hopeful it will improve performance.
>
> ////////////////////////////////////////////////////////////////////////////////
>
> void libafl_tcg_gen_asan(TCGTemp* addr, size_t size)
> {
>     if (size == 0) {
>         return;
>     }
>
>     if (size > 8) {
>         size = 8;
>     }
>
>     TCGLabel *done = gen_new_label();
>
>     TCGv addr_val = temp_tcgv_tl(addr);
>     TCGv shadow_addr = tcg_temp_new();
>     TCGv_ptr shadow_ptr = tcg_temp_new_ptr();
>     TCGv shadow_val = tcg_temp_new();
>     TCGv k = tcg_temp_new();
>     TCGv zero = tcg_constant_tl(0);
>     TCGv_ptr null_ptr = tcg_temp_new_ptr();
>
>     tcg_gen_shri_tl(shadow_addr, addr_val, 3);
>     tcg_gen_addi_tl(shadow_addr, shadow_addr, SHADOW_BASE);
>     tcg_gen_tl_ptr(shadow_ptr, shadow_addr);
>     tcg_gen_ld8s_tl(shadow_val, shadow_ptr, 0);
>
>     tcg_gen_brcond_tl(TCG_COND_EQ, shadow_val, zero, done);
>
>     tcg_gen_andi_tl(k, addr_val, 7);
>     tcg_gen_addi_tl(k, k, size - 1);
>
>     tcg_gen_brcond_tl(TCG_COND_LT, shadow_val, k, done);
>
>     tcg_gen_tl_ptr(null_ptr, zero);
>     tcg_gen_st8_tl(zero, null_ptr, 0);
>
>     gen_set_label(done);
> }
>
> ////////////////////////////////////////////////////////////////////////////////
>
> However, when I change to using this implementation, I get the following 
> error.
> I have tested it with a trivial hello world implementation for x86_64 running 
> in
> qemu-user. It doesn't occur the first time the block is executed, therefore I
> think the issue is caused by the surrounding TCG in the block it is injected
> into?
>
> ////////////////////////////////////////////////////////////////////////////////
> runner-x86_64: ../tcg/tcg.c:4852: tcg_reg_alloc_mov: Assertion `ts->val_type 
> == TEMP_VAL_REG' failed.
> Aborted (core dumped)
> ////////////////////////////////////////////////////////////////////////////////
>
> I would be very grateful for any advice of how to resolve this issue, or any
> alternative approaches I could use to optimize my original implementation. The
> code is obviously a very hot path and so even a tiny performance improvement
> could result in a large performance gain overall.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro

Reply via email to