@varunnaw Good point. In my project we use the approach below to retrieve kernel attributes, including the dynamic shared memory size and block/grid information, which might be helpful to you.
https://github.com/microsoft/BitBLAS/blob/main/bitblas/builder/wrapper/tir.py#L64-L80

## Why is this important?

When users integrate the TVM runtime with third-party frameworks like torch, going through dlpack can introduce significant runtime overhead for small data shapes, such as gemv and small batched gemv on data-center GPUs. In our benchmarks, we observed delays of around 10 to 50 us. For more details, please refer to this discussion: [Strange overhead of tvm.runtime.ndarray.from_dlpack - Apache TVM Discuss](https://discuss.tvm.apache.org/t/strange-overhead-of-tvm-runtime-ndarray-from-dlpack/16516). These overheads arise not only from the ctypes calls required to initialize a TVMValue from dlpack, but also from occasional calls to `CUDASetDevice` during the conversion, which are also costly.

Moreover, when we want to reuse the generated code elsewhere, TVM doesn't provide a tool to automatically extract the BlockDim, GridDim, and unified shared memory usage (which we need in order to size the dynamic shared memory at launch time). The link above may offer a possible solution. :)

---

[Visit Topic](https://discuss.tvm.apache.org/t/phasing-out-legacy-components/17703/8) to respond.
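To illustrate the idea of recovering launch configuration from generated code, here is a toy sketch that scans CUDA source text for `__launch_bounds__` annotations and dynamic shared memory declarations. The sample kernel string and the helper names are invented for illustration; real TVM-generated code may not follow these exact patterns, and the BitBLAS wrapper linked above is the more complete reference.

```python
import re

# Invented sample of generated CUDA source (for illustration only).
cuda_source = r'''
extern "C" __global__ void __launch_bounds__(128) main_kernel(
    float* __restrict__ A, float* __restrict__ B) {
  extern __shared__ unsigned char buf_dyn_shmem[];
}
'''

def extract_launch_bounds(source):
    """Map each annotated kernel name to its __launch_bounds__ thread count."""
    pattern = re.compile(r"__launch_bounds__\((\d+)\)\s+(\w+)\s*\(")
    return {name: int(n) for n, name in pattern.findall(source)}

def uses_dynamic_smem(source):
    """True if the source declares dynamically sized shared memory."""
    return re.search(r"extern\s+__shared__", source) is not None

print(extract_launch_bounds(cuda_source))  # {'main_kernel': 128}
print(uses_dynamic_smem(cuda_source))      # True
```

A text-level scan like this only recovers the per-block thread bound; grid dimensions and the exact dynamic shared memory size still have to come from the schedule's `thread_extent` attributes, which is what the linked wrapper extracts.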