This change improves performance by moving the hot path of
trace_vhost_commit() (and every other generated trace function) into the
header file.

Previously, even when the trace event was disabled, the call chain

        trace_vhost_commit() → _nocheck__trace_vhost_commit() → _simple_trace_vhost_commit()

incurred significant function-prologue overhead before the trace state was
checked.
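
For reference, the pre-change shape of the generated code looked roughly
like this (an illustrative sketch only; argument names and the surrounding
generated boilerplate are assumed, and output from other enabled backends
is omitted):

    /* trace header (.h): the simple-backend path calls straight into the .c file */
    static inline void _nocheck__trace_vhost_commit(bool started, bool changed)
    {
        _simple_trace_vhost_commit(started, changed);   /* unconditional call */
    }

    /* trace .c file: the enabled check only runs here, after the prologue */
    void _simple_trace_vhost_commit(bool started, bool changed)
    {
        if (!trace_event_get_state(TRACE_VHOST_COMMIT)) {
            return;     /* tracing disabled, but the prologue cost is already paid */
        }
        /* ... trace_record_start() and the rest of the slow path ... */
    }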

Disassembly of _simple_trace_vhost_commit() (from the .c file) showed that 11 
out of the first 14 instructions were prologue-related, including:
0x10  stp  x29, x30, [sp, #-64]!           Prologue: allocates a 64-byte frame, saves old FP (x29) and LR (x30)
0x14  adrp x3, trace_events_enabled_count  Prologue: computes the page base of the trace-enable counter
0x18  adrp x2, __stack_chk_guard           Prologue (stack protector): computes the page base of the stack-canary guard
0x1c  mov  x29, sp                         Prologue: sets the new frame pointer
0x20  ldr  x3, [x3]                        Prologue: loads the address of the trace-enabled counter
0x24  stp  x19, x20, [sp, #16]             Prologue: spills the callee-saved registers used by this function (x19, x20)
0x28  and  w20, w0, #0xff                  Tracepoint setup: extracts the low 8 bits of arg0 as the "event boolean"
0x2c  ldr  x2, [x2]                        Prologue (cont'd): finishes loading the stack-canary guard address
0x30  and  w19, w1, #0xff                  Tracepoint setup: extracts the low 8 bits of arg1
0x34  ldr  w0, [x3]                        Important: loads the current trace-enabled count from memory
0x38  ldr  x1, [x2]                        Prologue (cont'd): reads the canary value
0x3c  str  x1, [sp, #56]                   Prologue (cont'd): writes the canary into the new frame
0x40  mov  x1, #0                          Prologue (cont'd): zeroes x1 after the canary store
0x44  cbnz w0, 0x88                        Important: the trace-enabled check; when tracing is disabled (w0 == 0) the heavy path is skipped entirely

The trace-enabled check therefore happens only after the prologue has
already run, which is wasteful when tracing is disabled, the common case in
production.

To optimize this, the hot path of _nocheck__trace_vhost_commit() is now
fully inlined in the .h file: it checks trace_event_get_state() before
calling into _simple_trace_vhost_commit(), which remains in the .c file.
When the tracepoint is disabled, the call into the .c function is avoided
altogether, skipping the unnecessary prologue instructions.
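
With this patch, the generated header code takes roughly the following
shape (again an illustrative sketch; argument names are assumed, not copied
from the real generated output):

    /* trace header (.h): the enabled check is now inlined at the call site */
    static inline void _nocheck__trace_vhost_commit(bool started, bool changed)
    {
        if (trace_event_get_state(TRACE_VHOST_COMMIT)) {
            _simple_trace_vhost_commit(started, changed);  /* slow path stays in the .c file */
        }
    }

When the event is disabled, execution falls through the inline check without
ever entering _simple_trace_vhost_commit(), so none of the prologue
instructions shown above are executed.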

This results in better performance by removing the redundant prologue work
from the tracing fast path.

Signed-off-by: Tanish Desai <tanishdesa...@gmail.com>
---
 scripts/tracetool/backend/simple.py | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/scripts/tracetool/backend/simple.py b/scripts/tracetool/backend/simple.py
index a74d61fcd6..2688d4b64b 100644
--- a/scripts/tracetool/backend/simple.py
+++ b/scripts/tracetool/backend/simple.py
@@ -36,8 +36,17 @@ def generate_h_begin(events, group):
 
 
 def generate_h(event, group):
-    out('    _simple_%(api)s(%(args)s);',
+    event_id = 'TRACE_' + event.name.upper()
+    if "vcpu" in event.properties:
+        # already checked on the generic format code
+        cond = "true"
+    else:
+        cond = "trace_event_get_state(%s)" % event_id
+    out('    if (%(cond)s) {',
+        '        _simple_%(api)s(%(args)s);',
+        '    }',
         api=event.api(),
+        cond=cond,
         args=", ".join(event.args.names()))
 
 
@@ -72,22 +81,10 @@ def generate_c(event, group):
     if len(event.args) == 0:
         sizestr = '0'
 
-    event_id = 'TRACE_' + event.name.upper()
-    if "vcpu" in event.properties:
-        # already checked on the generic format code
-        cond = "true"
-    else:
-        cond = "trace_event_get_state(%s)" % event_id
-
     out('',
-        '    if (!%(cond)s) {',
-        '        return;',
-        '    }',
-        '',
         '    if (trace_record_start(&rec, %(event_obj)s.id, %(size_str)s)) {',
         '        return; /* Trace Buffer Full, Event Dropped ! */',
         '    }',
-        cond=cond,
         event_obj=event.api(event.QEMU_EVENT),
         size_str=sizestr)
 
-- 
2.34.1

