To further illustrate what I meant by the impact of compiler optimizations, I
ran the following quick experiment:

#include <tvm/runtime/c_runtime_api.h>                                          
// implement the function using PackedCFunc calling convention                  
inline int PackedCFunc(TVMValue* args, int* type_codes, int num_args,           
                       TVMValue* out_ret_value, int* out_ret_tcode,             
                       void* resource_handle) {                                 
  int v0 = args[0].v_int64;                                                     
  void* ptr = args[1].v_handle;                                                 
  out_ret_tcode[0] = kTVMArgInt;                                                
  out_ret_value[0].v_int64 = v0 + ((int*)ptr)[0];                               
  return 0;
}

// return x + ptr[0];
extern "C" int AddViaPackedCFunc(int x, int* ptr) {                             
  TVMValue args[2];                                                             
  int type_codes[2];                                                            
  TVMValue out_ret_value;                                                       
  int out_ret_tcode;                                                            
  args[0].v_int64 = x;                                                          
  args[1].v_handle = ptr;                                                       
  type_codes[0] = kTVMArgInt;                                                   
  type_codes[1] = kTVMOpaqueHandle;                                             
  PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr);    
  return out_ret_value.v_int64;
}

### Result of Clang
Run command
clang-10 -O2 -S -emit-llvm -I /path/to/tvm/3rdparty/dlpack/include -I /path/to/tvm/include -o test.ll
cat test.ll

Gives the following code (metadata removed):
; Function Attrs: nounwind readonly uwtable
define dso_local i32 @AddViaPackedCFunc(i32 %0, i32* %1) local_unnamed_addr #0 {
  %3 = load i32, i32* %1, align 4, !tbaa !2
  %4 = add nsw i32 %3, %0
  ret i32 %4
}

### Result of GCC
gcc -O2 -S  -I /path/to/tvm/3rdparty/dlpack/include -I /path/to/tvm/include -o 
cat test.s
        .file   ""
        .p2align 4,,15
        .globl  AddViaPackedCFunc
        .type   AddViaPackedCFunc, @function
AddViaPackedCFunc:
        movl    (%rsi), %eax
        addl    %edi, %eax
        ret
        .size   AddViaPackedCFunc, .-AddViaPackedCFunc
        .ident  "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
        .section        .note.GNU-stack,"",@progbits

### Discussions
As we can see, this is essentially equivalent to the direct C function:
int Add(int x, int *ptr) {
  return x + ptr[0];
}

To understand what is happening under the hood, the following optimizations are involved:
- Inlining, which inlines the call to `PackedCFunc`
- Mem2reg, which promotes the stack stores/loads of the argument arrays to register operations
- Dead code elimination, which removes the unused type codes
- Reasoning about an int32 passed via int64: `cast<int32>(cast<int64>(x)) = x` when x is i32
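
To make the steps above concrete, here is a rough hand-written sketch of what the function looks like after each stage. It is only an illustration, not actual compiler output: `Value` is a hypothetical stand-in for `TVMValue` so the snippet builds without the TVM headers, the type code numbers are placeholders, and the function names `AfterInlining`, `AfterMem2Reg`, `AfterCleanup`, `RoundTrip` and `CheckStepsAgree` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for TVMValue so the sketch builds without TVM headers.
union Value {
  int64_t v_int64;
  void* v_handle;
};

// After inlining: the callee body sits inside the caller, but every value
// still travels through stack-allocated arrays.
int AfterInlining(int x, int* ptr) {
  Value args[2];
  int type_codes[2];
  Value ret;
  int ret_tcode;
  args[0].v_int64 = x;
  args[1].v_handle = ptr;
  type_codes[0] = 0;  // placeholder type codes; written but never read
  type_codes[1] = 3;
  int v0 = static_cast<int>(args[0].v_int64);
  ret_tcode = 0;
  ret.v_int64 = v0 + static_cast<int*>(args[1].v_handle)[0];
  (void)type_codes;
  (void)ret_tcode;
  return static_cast<int>(ret.v_int64);
}

// After mem2reg: the stack slots become plain scalar values.
int AfterMem2Reg(int x, int* ptr) {
  int64_t arg0 = x;    // was args[0].v_int64
  int* arg1 = ptr;     // was args[1].v_handle
  int tcode0 = 0;      // the type codes are now visibly unused
  int tcode1 = 3;
  int ret_tcode = 0;
  int v0 = static_cast<int>(arg0);
  int64_t ret = v0 + arg1[0];
  (void)tcode0;
  (void)tcode1;
  (void)ret_tcode;
  return static_cast<int>(ret);
}

// The widen-then-narrow round trip is the identity on 32-bit integers.
constexpr int32_t RoundTrip(int32_t x) {
  return static_cast<int32_t>(static_cast<int64_t>(x));
}
static_assert(RoundTrip(-1) == -1, "round trip keeps negative values");
static_assert(RoundTrip(INT32_MAX) == INT32_MAX, "round trip keeps the max value");

// After dead code elimination and cast folding: only the add remains.
int AfterCleanup(int x, int* ptr) { return x + ptr[0]; }

// All three stages should compute the same result.
void CheckStepsAgree() {
  int data[1] = {5};
  const int inputs[] = {-7, 0, 3, 1000};
  for (int x : inputs) {
    assert(AfterInlining(x, data) == AfterCleanup(x, data));
    assert(AfterMem2Reg(x, data) == AfterCleanup(x, data));
  }
}
```

Calling `CheckStepsAgree()` from a `main` is enough to confirm that the three versions agree on a few sample inputs. The real passes operate on the compiler's IR rather than on source code, so the intermediate forms here are only an approximation.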
