@varunnaw Good point! In my project we use the approach below to retrieve 
kernel attributes, including the dynamic shared memory size and the block/grid 
dimensions, which might be helpful to you.

https://github.com/microsoft/BitBLAS/blob/main/bitblas/builder/wrapper/tir.py#L64-L80

## Why is this important?

When users integrate the tvm runtime with third-party frameworks like torch, 
going through dlpack can introduce significant runtime overhead on small 
workloads, such as gemv and small batched gemv, on data-center GPUs. In our 
benchmarks, we observed per-call overheads of around 10 to 50 us. For more 
details, please refer to this discussion: [Strange overhead of 
tvm.runtime.ndarray.from_dlpack - Apache TVM 
Discuss](https://discuss.tvm.apache.org/t/strange-overhead-of-tvm-runtime-ndarray-from-dlpack/16516).

These overheads arise not only from the ctypes cost of initializing a TVMValue 
from dlpack, but also from occasional calls to `CUDASetDevice` during the 
conversion process, which are also costly.
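To put numbers on claims like the 10-50 us above, a minimal stdlib-only timing harness is enough; the sketch below is a generic micro-benchmark, and the tvm/torch calls mentioned in the comment are the real conversion path from the linked discussion (swap them in as `fn` when both libraries are installed).

```python
import time

def per_call_overhead_us(fn, *args, warmup=10, iters=1000):
    """Return the mean wall-clock cost of one fn(*args) call, in microseconds."""
    # Warm up so one-time costs (lazy init, caching) don't skew the mean.
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6

# Stand-in workload; in practice fn would be something like
#   lambda t: tvm.runtime.ndarray.from_dlpack(torch.utils.dlpack.to_dlpack(t))
# to isolate the dlpack conversion cost discussed above.
overhead = per_call_overhead_us(lambda x: x + 1, 1)
```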

Moreover, when we want to extract the generated code for other uses, tvm 
doesn't provide a tool to automatically extract the BlockDim, GridDim, and 
unified shared memory usage (which would help us initialize the dynamic 
shared memory). Maybe the link I posted above offers a possible solution. :)
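As a flavor of what such extraction looks like: the linked BitBLAS wrapper walks the scheduled TIR module for these attributes, but even a simplified, purely illustrative sketch over the generated CUDA source can recover the per-block thread count from `__launch_bounds__`. The kernel string below is hypothetical; real code would obtain the source from a built tvm module (e.g. via `get_source()` on its imported CUDA module).

```python
import re

# Hypothetical generated CUDA source for illustration only.
cuda_src = r'''
extern "C" __global__ void __launch_bounds__(128) gemv_kernel(
    float* __restrict__ A, float* __restrict__ x, float* __restrict__ y);
'''

def launch_bounds(src):
    """Extract the per-block thread count from a __launch_bounds__(N) marker."""
    m = re.search(r"__launch_bounds__\((\d+)\)", src)
    return int(m.group(1)) if m else None

threads_per_block = launch_bounds(cuda_src)
print(threads_per_block)  # prints 128
```

Grid dimensions and the dynamic shared memory size are not present in the source text itself (they are supplied at launch time), which is exactly why walking the TIR module's attributes, as in the linked code, is the more complete approach.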





---
[Visit 
Topic](https://discuss.tvm.apache.org/t/phasing-out-legacy-components/17703/8) 
to respond.
