# Introduction

The TVM community has worked since the v0.15.0 release to deliver the following exciting improvements. Highlights of this release:

- **First support of Relax**, with dynamic shape and pipeline
- Dlight module for optimizing LLM TIR workloads on GPU
- Disco module for initial SPMD multi-GPU support

The main tags are listed below (**bold** marks areas with significant progress):

- Community, RFCs
- Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
- **Relax**, **Dlight**, **Disco**
- Arith, **TIR**, TVMScript
- Docs, CI, **Misc**, **BugFix**

Please visit the full listing of commits for a complete view: [v0.16.dev0...v0.16.0.rc0](https://github.com/apache/tvm/compare/v0.16.dev0...v0.16.0.rc0).

### Community

* [#16695](https://github.com/apache/tvm/pull/16695) - Add new key for release signing
* [#16419](https://github.com/apache/tvm/pull/16419) - Add new key for release signing

### RFCs

This new RFC explores how TVM can generate code for the Scalable Matrix Extension (SME) ISA to improve inference performance on Arm®-based hardware that implements SME.

* [#107](https://github.com/apache/tvm-rfcs/pull/107) - [RFC] Scalable Matrix Extension enablement

----

### Arith

* [#16735](https://github.com/apache/tvm/pull/16735) - [Fixup] Require feature flag for tighter inequality bounds
* [#16588](https://github.com/apache/tvm/pull/16588) - Provide tighter ConstIntBounds for special cases
* [#16704](https://github.com/apache/tvm/pull/16704) - [Fix]Fix canonical simplification of LE

### BYOC

* [#16567](https://github.com/apache/tvm/pull/16567) - Skip processed functions in FuseOpsByPattern and RunCodegen

### BugFix

* [#16766](https://github.com/apache/tvm/pull/16766) - [Target] Added null check to fix segfault at ->defined() in cpu.cc DetectSystemTriple()
* [#16739](https://github.com/apache/tvm/pull/16739) - [Ansor] Fixing Ansor Gradient Bug
* [#16820](https://github.com/apache/tvm/pull/16820) - [Fix] PAPI docs
* [#16793](https://github.com/apache/tvm/pull/16793) - [Fix] fix for numpy 2.0 compatibility
* [#16790](https://github.com/apache/tvm/pull/16790) - [Fix] Fix build
errors with VS2022
* [#16780](https://github.com/apache/tvm/pull/16780) - [Fix] Fix numpy dtype map
* [#16773](https://github.com/apache/tvm/pull/16773) - [Fix] Fix the purity flag of "vm.call_tir_dyn" and "kill" ops
* [#16770](https://github.com/apache/tvm/pull/16770) - [Hotfix] Revert driver API pass ordering that breaks MLC, mark failing test
* [#16771](https://github.com/apache/tvm/pull/16771) - [Fix] Remove redundant "remove_all_unused" in IPC memory lowering
* [#16746](https://github.com/apache/tvm/pull/16746) - [Fix][Builtin] Fix "GetQueryPosition" of PagedKVCache
* [#16728](https://github.com/apache/tvm/pull/16728) - [Fix] Introduce TVM_DEBUG_WITH_ABI_CHANGE to warn ABI changes in debug mode
* [#16714](https://github.com/apache/tvm/pull/16714) - [Fix] PagedKVCache fetching compute stream when copy stream is needed
* [#16684](https://github.com/apache/tvm/pull/16684) - [SLM] Produce well-formed Relax for nn.modules.KVCache
* [#16659](https://github.com/apache/tvm/pull/16659) - add the default value for DFT in ONNX frontend
* [#16637](https://github.com/apache/tvm/pull/16637) - [Transform] Preserve symbolic variables in FuseOps
* [#16649](https://github.com/apache/tvm/pull/16649) - [FFI] Add a missing default for datatype lanes
* [#16492](https://github.com/apache/tvm/pull/16492) - [Executor] fix debug_executor function debug_get_output
* [#16598](https://github.com/apache/tvm/pull/16598) - [Transform]Handle non-composite lambda functions in FuseOps
* [#16565](https://github.com/apache/tvm/pull/16565) - [Transform] Keep private non-primitive functions in FuseTIR
* [#16518](https://github.com/apache/tvm/pull/16518) - Use `x*x*x` instead of `pow(x,3)`
* [#16436](https://github.com/apache/tvm/pull/16436) - Ensure that bf16 arrays are created as expected
* [#16361](https://github.com/apache/tvm/pull/16361) - Disable SingleEnvThreadVerifier
* [#16289](https://github.com/apache/tvm/pull/16289) - [AUTOTVM][FIX] Typo fixes and add a warning in the Droplet Search

### CI

* [#16837](https://github.com/apache/tvm/pull/16837) - Disable flaky unit test
* [#16765](https://github.com/apache/tvm/pull/16765) - [AOT][Testing] Improve output mismatch information on test failure
* [#16661](https://github.com/apache/tvm/pull/16661) - add merge_with_main in unity
* [#16611](https://github.com/apache/tvm/pull/16611) - [AOT][Testing] Print output values on test failure
* [#16546](https://github.com/apache/tvm/pull/16546) - Disable testing that downloads from mxnet
* [#16521](https://github.com/apache/tvm/pull/16521) - Fix CI Script and Broken Tests
* [#16502](https://github.com/apache/tvm/pull/16502) - Support tvm-bot rerun for tvm-unity task
* [#16435](https://github.com/apache/tvm/pull/16435) - Update image tag to 20240126-070121-8ade9c30e
* [#16420](https://github.com/apache/tvm/pull/16420) - [WASM] Update emsdk and nodejs version
* [#16384](https://github.com/apache/tvm/pull/16384) - Remove NVIDIA_DISABLE_REQUIRE
* [#16382](https://github.com/apache/tvm/pull/16382) - In jenkins.cmd_utils.Sh.tee, check for failing subprocess
* [#16366](https://github.com/apache/tvm/pull/16366) - Upgrade sccache version to 0.7.*
* [#16369](https://github.com/apache/tvm/pull/16369) - Upgrade Unity ci images
* [#16344](https://github.com/apache/tvm/pull/16344) - Update docker images tag to 20240105-165030-51bdaec6
* [#16340](https://github.com/apache/tvm/pull/16340) - [Unity][UnitTest] Increase atol to resolve flaky CI failure
* [#16337](https://github.com/apache/tvm/pull/16337) - [Hexagon][UnitTest] Disable flaky quantization test
* [#16336](https://github.com/apache/tvm/pull/16336) - Upgrade cmake version to 3.24.0

### Docker

* [#16755](https://github.com/apache/tvm/pull/16755) - [SME]Add Fixed Virtual Platform (FVP) and toolchain install
* [#16348](https://github.com/apache/tvm/pull/16348) - Upgrade pip in i386 container

### Dlight

* [#16775](https://github.com/apache/tvm/pull/16775) - [Fix][Dlight] (Low-batched-)GeMV on small spatial loops
* [#16429](https://github.com/apache/tvm/pull/16429) - [Unity][Dlight][Fix] Reduction rule support dyn-shape epilogue
* [#16351](https://github.com/apache/tvm/pull/16351) - [Unity] Add dlight.gpu.Fallback in DispatchSortScan, add argsort, topk, and cumprod
* [#16338](https://github.com/apache/tvm/pull/16338) - [Unity][DLight] Introduce Specific Rule for RMSNorm
* [#16251](https://github.com/apache/tvm/pull/16251) - [Unity][Dlight] Support dlight gemv rule on nested inner block
* [#16878](https://github.com/apache/tvm/pull/16878) - [Dlight] Enhance vectorization loading weight for gemv
* [#16848](https://github.com/apache/tvm/pull/16848) - [DLight] Fix a corner case for reduction rule
* [#16701](https://github.com/apache/tvm/pull/16701) - [Dlight] Add fallback for low batch gemv with outer reduction
* [#16678](https://github.com/apache/tvm/pull/16678) - [Dlight] LowBatchGemv rule only apply to function with spatial symbolic var
* [#16665](https://github.com/apache/tvm/pull/16665) - [Dlight] Skip GeMV when normalization fails
* [#16579](https://github.com/apache/tvm/pull/16579) - [Dlight] Scheduling Low batch GEMM using GEMV-like rule
* [#16321](https://github.com/apache/tvm/pull/16321) - [DLight] Skip rule if target is not suitable
* [#16731](https://github.com/apache/tvm/pull/16731) - [Dlight] Fix GeMV shared memory estimation

### Docs

* [#16792](https://github.com/apache/tvm/pull/16792) - [Doc] Fix set_axis_separator example
* [#16610](https://github.com/apache/tvm/pull/16610) - [Doc] Fixed Docstring usage example in `tvm.ir.make_node`
* [#16572](https://github.com/apache/tvm/pull/16572) - [Doc] Remove MxNet related tutorials
* [#16514](https://github.com/apache/tvm/pull/16514) - [Unity][Doc] Document passes that depend on `DataflowBlock`s and encourage using `ConvertToDataflow`
* [#16482](https://github.com/apache/tvm/pull/16482) - [Doc] Fix Docstring in
`extern.py` for Sphinx
* [#16346](https://github.com/apache/tvm/pull/16346) - [Doc] Fix minor error in "Expressions in Relay"

### Frontend

* [#16001](https://github.com/apache/tvm/pull/16001) - [ONNX] Fix interpreting auto_pad parameters in ConvTranspose operator
* [#16651](https://github.com/apache/tvm/pull/16651) - [PaddlePaddle] PaddlePaddle model with NCHW data format that supports quantization
* [#16616](https://github.com/apache/tvm/pull/16616) - [PaddlePaddle] Support conv2d when data_format is NHWC
* [#16526](https://github.com/apache/tvm/pull/16526) - [Keras] Enable Dense operator for any input dims
* [#16478](https://github.com/apache/tvm/pull/16478) - [PaddlePaddle] Fixed the bug that prevented the model from being successfully converted to microTVM on MacOS

### Hexagon

* [#16762](https://github.com/apache/tvm/pull/16762) - [VM]Cache operations when bypass mode is enabled
* [#16706](https://github.com/apache/tvm/pull/16706) - [VM] Add buffers to `dma_wait` builtin
* [#16448](https://github.com/apache/tvm/pull/16448) - [VM]Implement dma_copy and dma_wait builtin for hexagon

### LLVM

* [#16782](https://github.com/apache/tvm/pull/16782) - [SVE] Support scalable vectors in LoopVectorizer
* [#16812](https://github.com/apache/tvm/pull/16812) - Fix compilation failure due to minor change
* [#16808](https://github.com/apache/tvm/pull/16808) - [Runtime]Fix errors during loading of target tags
* [#16748](https://github.com/apache/tvm/pull/16748) - Lack of DWARF type is not an error
* [#16696](https://github.com/apache/tvm/pull/16696) - [SVE] Add codegen support for scalable buffer accesses
* [#15964](https://github.com/apache/tvm/pull/15964) - [RUNTIME] Add optional LLVM ORCJIT runtime executor
* [#16612](https://github.com/apache/tvm/pull/16612) - [SVE] Add support for scalable data type strings
* [#16523](https://github.com/apache/tvm/pull/16523) - [SVE] Change the dtype of Ramp and Broadcast lanes to PrimExpr
* [#16484](https://github.com/apache/tvm/pull/16484) - [SVE] Add vscale builtin
* [#16373](https://github.com/apache/tvm/pull/16373) - Update Host.h path

### MetaSchedule

* [#16725](https://github.com/apache/tvm/pull/16725) - Make the `opt_level` of `tune_relay()` adjustable

### Metal

* [#16713](https://github.com/apache/tvm/pull/16713) - [RUNTIME]Provide richer runtime when error happens
* [#16605](https://github.com/apache/tvm/pull/16605) - [RUNTIME]Fix multithreading access of metal runtime
* [#16438](https://github.com/apache/tvm/pull/16438) - Dispatch numerically stable tanh for metal

### OpenCL & CLML

* [#16854](https://github.com/apache/tvm/pull/16854) - [OpenCL] Add OpenCL device for automatic target detection
* [#16846](https://github.com/apache/tvm/pull/16846) - [Meta-Schedule][OpenCL] Enable MS tuning for Android OpenCL
* [#16768](https://github.com/apache/tvm/pull/16768) - [RUNTIME][OPENCL] Bugfix for ciImage create with host ptr
* [#16672](https://github.com/apache/tvm/pull/16672) - [CLML] Fix build TVM with CLML on MacOS
* [#16328](https://github.com/apache/tvm/pull/16328) - [RUNTIME][CLML] Fix for Softmax op for 4D tensors
* [#16394](https://github.com/apache/tvm/pull/16394) - [OpenCL][CMake] Fix OpenCL tests compilation

### ROCm

* [#16441](https://github.com/apache/tvm/pull/16441) - [WebGPU] Intrin Dispatch: `tanh`, `erf`, `log`
* [#16404](https://github.com/apache/tvm/pull/16404) - Some fixes of ROCm codegen

### Relax

* [#16872](https://github.com/apache/tvm/pull/16872) - Enhance symbolic expr estimation in memory planning
* [#16867](https://github.com/apache/tvm/pull/16867) - Dispatch sort/scan for non-cuda gpu backends
* [#16852](https://github.com/apache/tvm/pull/16852) - Fix EliminiateCommonSubexpr removing alloc tensor
* [#16851](https://github.com/apache/tvm/pull/16851) - [Relax,Topi] Allow passing workspace to thrust to avoid allocations
* [#16841](https://github.com/apache/tvm/pull/16841) - Provide well-formed output in `transform.LazyGetInput`
* [#16798](https://github.com/apache/tvm/pull/16798) - [Transform] Provide callback versions of LazyTransformParams
* [#16801](https://github.com/apache/tvm/pull/16801) - Allow DeadCodeElimination within ApplyPassToFunction
* [#16834](https://github.com/apache/tvm/pull/16834) - Capture symbolic vars in struct info of weights
* [#16830](https://github.com/apache/tvm/pull/16830) - Share storage allocs among functions after cuda graph rewriting
* [#16823](https://github.com/apache/tvm/pull/16823) - [VM] Refactor CUDA graph builtins as VM extension
* [#16828](https://github.com/apache/tvm/pull/16828) - [Bugfix] Provide the full Expr to pattern-match rewriter
* [#16805](https://github.com/apache/tvm/pull/16805) - [Bugfix]BlockBuilder may not assume unique input functions
* [#16815](https://github.com/apache/tvm/pull/16815) - Enable capturing symbolic shapes in cuda graph
* [#16642](https://github.com/apache/tvm/pull/16642) - Allow R.Prim('bool') in relax::If and assert_op
* [#16796](https://github.com/apache/tvm/pull/16796) - Unit-test for structural equal of recursive function
* [#16732](https://github.com/apache/tvm/pull/16732) - Allow composition of DFPattern replacements
* [#16783](https://github.com/apache/tvm/pull/16783) - Improve CanonicalizeBindings in DataflowVar edge case
* [#16721](https://github.com/apache/tvm/pull/16721) - Implement operators to inspect DLTensor::strides and offset
* [#16730](https://github.com/apache/tvm/pull/16730) - Refactor PatternRewriter into separate Block/Expr mutators
* [#16756](https://github.com/apache/tvm/pull/16756) - [IR]Improve highlighting in assert_structural_equal
* [#16779](https://github.com/apache/tvm/pull/16779) - Improve malform error msg
* [#16569](https://github.com/apache/tvm/pull/16569) - [Unity][Parser] Check well-formedness in the parser
* [#16759](https://github.com/apache/tvm/pull/16759) - [Pass] Lowering passes for GPU IPC memory and allreduce
* [#16697](https://github.com/apache/tvm/pull/16697) - Implement
relax.transform.TopologicalSort
* [#16658](https://github.com/apache/tvm/pull/16658) - Normalize use of void-type variable to inline R.tuple()
* [#16711](https://github.com/apache/tvm/pull/16711) - [Frontend] Add op `tanh`, `exp`, `negative`, and `permute`
* [#16703](https://github.com/apache/tvm/pull/16703) - [Fix]Fix top-p/top-k sampling kernel
* [#16669](https://github.com/apache/tvm/pull/16669) - [Frontend][Onnx] add sum and globalavgpool 1d/3d op
* [#16691](https://github.com/apache/tvm/pull/16691) - CUDA graph rewrite treating StringImm as static
* [#16685](https://github.com/apache/tvm/pull/16685) - Implement StructInfoPattern for dataflow pattern matching
* [#16681](https://github.com/apache/tvm/pull/16681) - [Frontend][Onnx] support MaxPool1/2/3D and AveragePool1/2/3D
* [#16584](https://github.com/apache/tvm/pull/16584) - [Unity][TIR] Clear struct info when specializing PrimFunc
* [#16676](https://github.com/apache/tvm/pull/16676) - Remove the legalization of cumsum/cumprob
* [#16654](https://github.com/apache/tvm/pull/16654) - [Frontend][NN] Add support for Conv3D
* [#16674](https://github.com/apache/tvm/pull/16674) - Eager free original weights in transform_params
* [#16675](https://github.com/apache/tvm/pull/16675) - add sample_indices in sampling
* [#16648](https://github.com/apache/tvm/pull/16648) - [Runtime] Support Unpack API for NDArrayCache
* [#16591](https://github.com/apache/tvm/pull/16591) - [Unity][Transform] Handle dynamic shapes in CombineParallelMatmul
* [#16594](https://github.com/apache/tvm/pull/16594) - [Transform] Preserve param names in LiftTransformParams
* [#16575](https://github.com/apache/tvm/pull/16575) - [Unity] GPU sampling
* [#16574](https://github.com/apache/tvm/pull/16574) - Additional unit tests for RemoveUnusedParameters
* [#16585](https://github.com/apache/tvm/pull/16585) - [Unity][Analysis] Include impure call in VerifyWellFormed errors
* [#16421](https://github.com/apache/tvm/pull/16421) - [Unity][Transform] Raise error
in FuseOpsByPattern for SSA violation
* [#16629](https://github.com/apache/tvm/pull/16629) - Fix error message in BlockBuilder
* [#16592](https://github.com/apache/tvm/pull/16592) - Handle dynamic arguments in legalization of nn.attention
* [#16590](https://github.com/apache/tvm/pull/16590) - [Unity][Transform] Check for permute_dims in ExpandMatmulOfSum
* [#16604](https://github.com/apache/tvm/pull/16604) - [Frontend][Onnx] fix clip unsqueeze opset implement
* [#16568](https://github.com/apache/tvm/pull/16568) - [Runtime] RNNState for Space State Models
* [#16563](https://github.com/apache/tvm/pull/16563) - Implement operators to read runtime `DLTensor*` information
* [#16581](https://github.com/apache/tvm/pull/16581) - [Unity][MSC][M4.2][Step2] Enable plugin with manager, test plugins in compile pipeline
* [#16600](https://github.com/apache/tvm/pull/16600) - Expose name_hint field for BlockBuilder.match_cast
* [#16601](https://github.com/apache/tvm/pull/16601) - [Transform] Canonicalize `let var = R.const` bindings
* [#16583](https://github.com/apache/tvm/pull/16583) - [Unity][VM] Recursively visit match bindings in VMShapeLowerMutator
* [#16586](https://github.com/apache/tvm/pull/16586) - Ignore non-relax functions in relax.transform.RunCodegen
* [#16573](https://github.com/apache/tvm/pull/16573) - [VM] Re-implementation of callback functions
* [#16561](https://github.com/apache/tvm/pull/16561) - [Bugfix]Remove call to tvm.build for empty TIR module
* [#16564](https://github.com/apache/tvm/pull/16564) - [Unity] Check for symbolic vars in PrimValue when lowering to TIR
* [#16558](https://github.com/apache/tvm/pull/16558) - Minor updates for NN frontend
* [#16542](https://github.com/apache/tvm/pull/16542) - Support callback as argument
* [#16487](https://github.com/apache/tvm/pull/16487) - [Unity][Transform] Handle `call_tir_inplace` in `FuseTIR` and `FuseOps`
* [#16355](https://github.com/apache/tvm/pull/16355) - [Unity] Infer struct info for relax.op.split on
dynamic-sized index
* [#16465](https://github.com/apache/tvm/pull/16465) - [Redo][Unity] Split DecomposeOpsForTraining into two steps
* [#16495](https://github.com/apache/tvm/pull/16495) - [Unity][MSC][M4.2][Step1] Enable plugin with manager, test plugins in compile pipeline
* [#16498](https://github.com/apache/tvm/pull/16498) - [Frontend] "tensor_ir_inplace" op
* [#16500](https://github.com/apache/tvm/pull/16500) - [Unity] Support storage reuse for dynamic shapes
* [#16493](https://github.com/apache/tvm/pull/16493) - [Pass] Skip data type node for CSE pass
* [#16467](https://github.com/apache/tvm/pull/16467) - [Unity][MSC][Refactor] Reconstruct BYOC and runner
* [#16422](https://github.com/apache/tvm/pull/16422) - [Unity][CodeGen] RunCodegen based on externally-exposed functions
* [#16483](https://github.com/apache/tvm/pull/16483) - [Unity][Frontend] Add Sigmoid and Square Op
* [#16472](https://github.com/apache/tvm/pull/16472) - [Unity] Improved error message in tvm::relax::UpdateStructInfo
* [#16473](https://github.com/apache/tvm/pull/16473) - [Unity] Improve error message in tensor_to_shape struct inference
* [#16466](https://github.com/apache/tvm/pull/16466) - Memory planning for "partially dynamic" shapes
* [#16464](https://github.com/apache/tvm/pull/16464) - NDArray Cache Update with DLTensor Support
* [#16315](https://github.com/apache/tvm/pull/16315) - [Unity][Transform] Implement relax.transform.ReorderTakeAfterMatmul
* [#16313](https://github.com/apache/tvm/pull/16313) - [Unity][Transform] Implement relax.transform.ExpandMatmulOfSum
* [#16411](https://github.com/apache/tvm/pull/16411) - [Unity][Transform] Handle symbolic variables in LambdaLift
* [#16443](https://github.com/apache/tvm/pull/16443) - [Unity][FIX] fix thread dtype mismatch
* [#16442](https://github.com/apache/tvm/pull/16442) - Revert "[Unity] Split DecomposeOpsForTraining into two steps"
* [#16437](https://github.com/apache/tvm/pull/16437) - [Unity] Improve buffer allocation for handling
duplicated buffer names.
* [#16439](https://github.com/apache/tvm/pull/16439) - [Unity] Support cumsum with pure int32
* [#16432](https://github.com/apache/tvm/pull/16432) - [Unity] downgrade cmake version requirement
* [#16427](https://github.com/apache/tvm/pull/16427) - [Unity][Frontend][NN] Better support for dynamic convolutions
* [#16418](https://github.com/apache/tvm/pull/16418) - [Unity][Fix] Fix mismatched intrinsic name
* [#16129](https://github.com/apache/tvm/pull/16129) - [Unity][Transform] Replace eligible operators with in-place versions in dataflow blocks
* [#16414](https://github.com/apache/tvm/pull/16414) - [Bugfix][Unity] Recover MSVC/NVCC/ROCm/Vulkan
* [#15954](https://github.com/apache/tvm/pull/15954) - [Unity] Split DecomposeOpsForTraining into two steps
* [#16111](https://github.com/apache/tvm/pull/16111) - [Unity][Transform] Memory planning for dynamic-shape func return
* [#16396](https://github.com/apache/tvm/pull/16396) - [Unity] PagedKVCache supporting on-the-fly RoPE calculation
* [#16395](https://github.com/apache/tvm/pull/16395) - [Frontend][ONNX]fix onnx frontend parse
* [#16385](https://github.com/apache/tvm/pull/16385) - [Unity][Op] Add Conv3D Operator
* [#16284](https://github.com/apache/tvm/pull/16284) - [Unity][nnModule] Dynamic shape support in nn Module
* [#16378](https://github.com/apache/tvm/pull/16378) - [Unity][BlockBuilder] Restore bb.get()
* [#16374](https://github.com/apache/tvm/pull/16374) - [Unity] Support TIR kernel for PagedKVCache
* [#16314](https://github.com/apache/tvm/pull/16314) - [Unity][Transform] Implement relax.transform.AdjustMatmulOrder
* [#16349](https://github.com/apache/tvm/pull/16349) - [Unity][MSC] Avoid depending on trivial bindings in Relax intermediate
* [#16376](https://github.com/apache/tvm/pull/16376) - [Unity][Contrib] Fix a bug due to typo in vllm `reconstruct_from_cache` kernel and add test
* [#16388](https://github.com/apache/tvm/pull/16388) - [Unity] Update dispatch test cases following the
merge from main
* [#16335](https://github.com/apache/tvm/pull/16335) - [Unity] Set CMAKE_CUDA_ARCHITECTURES default to native
* [#16306](https://github.com/apache/tvm/pull/16306) - [Unity][Transform] Update LambdaLift to use name of lifted lambda
* [#16310](https://github.com/apache/tvm/pull/16310) - [Unity][Analysis] Show objects instead of names in WellFormedChecker
* [#16362](https://github.com/apache/tvm/pull/16362) - [Unity][Fix] Memory planning check value type of 'tir_var_upper_bound'
* [#16367](https://github.com/apache/tvm/pull/16367) - [Unity][Transform] Handle replacement at both var binding and usage
* [#16309](https://github.com/apache/tvm/pull/16309) - [Unity][Transform] Use parameter name in BundleModelParams
* [#16307](https://github.com/apache/tvm/pull/16307) - [Unity] Improved error message in ExprMutator::ReEmitBinding
* [#16308](https://github.com/apache/tvm/pull/16308) - [Unity] Improved error message for matmul shape mismatch
* [#16360](https://github.com/apache/tvm/pull/16360) - [Unity] Enhance Torch-consistency in reshape
* [#16350](https://github.com/apache/tvm/pull/16350) - [Unity][Contrib] Add vLLM paged attention kernel
* [#16303](https://github.com/apache/tvm/pull/16303) - [Unity][NN] Use Linear name for nn.op.permute_dims
* [#16325](https://github.com/apache/tvm/pull/16325) - [Unity][MSC][Legalize] legalize codes and mute logging
* [#16312](https://github.com/apache/tvm/pull/16312) - [Unity][Analysis] Add utility for collecting compile-time bindings
* [#16330](https://github.com/apache/tvm/pull/16330) - [Unity][WEBGPU] Enable wasm exception propagation
* [#16304](https://github.com/apache/tvm/pull/16304) - [Unity][Analysis] Handle PrimStructInfo in EraseToWellDefined
* [#16305](https://github.com/apache/tvm/pull/16305) - [Unity][Transform] Implement UpdateParamStructInfo
* [#16331](https://github.com/apache/tvm/pull/16331) - [Unity] Alter op impl handling empty transform for output
* [#16254](https://github.com/apache/tvm/pull/16254) -
[Unity] Dispatch cumsum and sort
* [#16120](https://github.com/apache/tvm/pull/16120) - [Unity][Transform] Extract partial-tuple-usage from FuseTIR
* [#16311](https://github.com/apache/tvm/pull/16311) - [Unity] Validate struct info in relax::Call constructor
* [#16333](https://github.com/apache/tvm/pull/16333) - [Unity] Fix nn.op.tensor_ir_op signature
* [#16302](https://github.com/apache/tvm/pull/16302) - [Unity] Cutlass kernel compatibility with cmake 3.18+

### Relay

* [#16622](https://github.com/apache/tvm/pull/16622) - [ONNX] Fix the attribute mode parse of operator Upsample
* [#16626](https://github.com/apache/tvm/pull/16626) - [ONNX] Fix the Resize operator in ONNX frontend
* [#16624](https://github.com/apache/tvm/pull/16624) - [ONNX] fix the wrong default value about dtype in Multinomial converter
* [#16417](https://github.com/apache/tvm/pull/16417) - [Frontend][Torch] fix pytorch frontend linspace op
* [#16400](https://github.com/apache/tvm/pull/16400) - [Frontend][Torch] fix pytorch frontend not support logical or
* [#16390](https://github.com/apache/tvm/pull/16390) - [Frontend][Torch] fix a typo mistake in nonzero_numpy
* [#16324](https://github.com/apache/tvm/pull/16324) - make "ToScalar" support directly obtaining "int64_t"

### Runtime

* [#16804](https://github.com/apache/tvm/pull/16804) - Introduce MSCCLPP with NCCL equivalent interface
* [#16809](https://github.com/apache/tvm/pull/16809) - Add "TVM_DLL" to NVTX header
* [#16750](https://github.com/apache/tvm/pull/16750) - CUDA IPC Memory support and custom allreduce kernels
* [#16738](https://github.com/apache/tvm/pull/16738) - [Refactor]Always specify device in allocator interface
* [#16716](https://github.com/apache/tvm/pull/16716) - Ensure NDArray.CopyTo(Device) always sync
* [#16705](https://github.com/apache/tvm/pull/16705) - Add TVM_DLL to memory manager functions
* [#16692](https://github.com/apache/tvm/pull/16692) - PagedKVCache execute data copy on a separate stream
* [#16647](https://github.com/apache/tvm/pull/16647) - [RPC] Fix FreeObject in minrpc server
* [#16667](https://github.com/apache/tvm/pull/16667) - [Builtin] Using float32 accumulation in attention kernel
* [#16635](https://github.com/apache/tvm/pull/16635) - [RPC] Enable RPCObjectRef over multi-hop RPC
* [#16630](https://github.com/apache/tvm/pull/16630) - Add TVM_DLL to threading backend funcs
* [#16541](https://github.com/apache/tvm/pull/16541) - Add "TVM_DLL" to NDArray cache load func
* [#16550](https://github.com/apache/tvm/pull/16550) - [ROCM] Properly align rocm parameter buffer
* [#16545](https://github.com/apache/tvm/pull/16545) - Fix dtype conversion for bf16 and fp8
* [#16508](https://github.com/apache/tvm/pull/16508) - ParallelFor skipping thread backend for unit extent
* [#16486](https://github.com/apache/tvm/pull/16486) - KV cache providing workspace for attn kernel
* [#16456](https://github.com/apache/tvm/pull/16456) - [KVCache] AttentionWithFusedQKV and RoPE mode
* [#16415](https://github.com/apache/tvm/pull/16415) - [Memory] Implement support for non-zero offset within a storage object in AllocNDArr…
* [#16387](https://github.com/apache/tvm/pull/16387) - [RPC] Enable RPCObjectRef return in RPC
* [#16377](https://github.com/apache/tvm/pull/16377) - Use cudaGetDeviceCount to check if device exists

### TIR

* [#16832](https://github.com/apache/tvm/pull/16832) - Use constructor for new PrimFunc in TransformLayout
* [#16543](https://github.com/apache/tvm/pull/16543) - Fix segfaults from ordering of Let/Assert in MakePackedAPI
* [#16795](https://github.com/apache/tvm/pull/16795) - Ramp and Broadcast lanes fixed to int32 dtype
* [#16767](https://github.com/apache/tvm/pull/16767) - [Driver] Use `BindTarget` to specify target for FP8 legalization
* [#16742](https://github.com/apache/tvm/pull/16742) - [Bugfix]Fix cache_read update buffer region
* [#16726](https://github.com/apache/tvm/pull/16726) - [Bugfix]Avoid overwrite of unmanaged buffer allocations
* [#16548](https://github.com/apache/tvm/pull/16548) - [CUDA] Add native FP8 support to codegen
* [#16723](https://github.com/apache/tvm/pull/16723) - Implement max/min_value for fp8 data types
* [#16655](https://github.com/apache/tvm/pull/16655) - Improve well-formed check's handling of match buffer
* [#16673](https://github.com/apache/tvm/pull/16673) - Support Vector Reinterpret Calls
* [#16682](https://github.com/apache/tvm/pull/16682) - [Bugfix]Handle AttrStmt of upcoming tir.Var in ConvertSSA
* [#16560](https://github.com/apache/tvm/pull/16560) - Enhance and fix tensorize schedule for some case
* [#16660](https://github.com/apache/tvm/pull/16660) - [Bugfix]Fix duplicate AllocateConst in CacheReadWrite schedule primitive
* [#16544](https://github.com/apache/tvm/pull/16544) - Expand debug symbol output for CodeGenLLVM
* [#16553](https://github.com/apache/tvm/pull/16553) - Fix get_block_access_region for let bindings
* [#16515](https://github.com/apache/tvm/pull/16515) - Require exactly same-dtype matching for Vulkan smem reuse
* [#16406](https://github.com/apache/tvm/pull/16406) - Fix of inter thread reduction with shared memory prefetch
* [#16293](https://github.com/apache/tvm/pull/16293) - Extend DP4A tensor intrin
* [#16345](https://github.com/apache/tvm/pull/16345) - Allow sync threads inside condition
* [#16250](https://github.com/apache/tvm/pull/16250) - In SplitHostDevice, check for variables in thread extents
* [#16184](https://github.com/apache/tvm/pull/16184) - [Transform] Implement InlinePrivateFunctions

### TOPI

* [#16652](https://github.com/apache/tvm/pull/16652) - improve inclusive_scan for thrust
* [#16383](https://github.com/apache/tvm/pull/16383) - [Target] Add fp16 SIMD support for conv2d on `arm_cpu` targets

### TVMC

* [#16261](https://github.com/apache/tvm/pull/16261) - Add tvmc flag to print ir before and print ir after named pass

### TVMScript

* [#16864](https://github.com/apache/tvm/pull/16864) - Add parser and printer support for e4m3/e5m2
fp8
* [#16844](https://github.com/apache/tvm/pull/16844) - Produce empty DictAttrs when R.func_attrs is absent
* [#16811](https://github.com/apache/tvm/pull/16811) - Do not throw error for duplicate definitions
* [#16641](https://github.com/apache/tvm/pull/16641) - Allow use of relax.Expr with void type as a statement
* [#16663](https://github.com/apache/tvm/pull/16663) - Infer T.reads() for DeclBuffer nodes
* [#16640](https://github.com/apache/tvm/pull/16640) - Represent tir::builtin::ret() using python "return"
* [#16562](https://github.com/apache/tvm/pull/16562) - [Bugfix]Handle R.match_cast as last binding in if/else
* [#16593](https://github.com/apache/tvm/pull/16593) - [Unity]Parse R.Object return type from call_pure_packed
* [#16356](https://github.com/apache/tvm/pull/16356) - [Unity]Optionally hide StructInfo that can be inferred
* [#16379](https://github.com/apache/tvm/pull/16379) - [Unity]Update `call_packed` semantics to support empty sinfo_args

### Vulkan

* [#16858](https://github.com/apache/tvm/pull/16858) - Fix CLZ support for Vulkan

### cuda & cutlass & tensorrt

* [#16865](https://github.com/apache/tvm/pull/16865) - [Codegen, CUDA] Add handling of fp8 broadcast / const
* [#16818](https://github.com/apache/tvm/pull/16818) - [Cutlass] Fix usage of cuda stream for group gemm
* [#16788](https://github.com/apache/tvm/pull/16788) - [Cutlass] Add check for group gemm param shapes
* [#16789](https://github.com/apache/tvm/pull/16789) - [Bugfix][Cutlass] Remove a typo in cutlass build
* [#16787](https://github.com/apache/tvm/pull/16787) - [Codegen, Cuda] Add overload for fp8x4 e5m2 <-> half4 conversion
* [#16751](https://github.com/apache/tvm/pull/16751) - [Cutlass] Add group gemm kernels
* [#16736](https://github.com/apache/tvm/pull/16736) - [Target][CUDA] Allow non-numeric arch as needed for latest gpu
* [#16619](https://github.com/apache/tvm/pull/16619) - [Bugfix][Cutlass] Check if function attributes is None
* [#16342](https://github.com/apache/tvm/pull/16342) - [CUDA] Simple extend to optimize reuse for static shared memory.

### microNPU

* [#16266](https://github.com/apache/tvm/pull/16266) - [microNPU][ETHOSU] Add fixed point for tanh
* [#16680](https://github.com/apache/tvm/pull/16680) - [microNPU][ETHOSU] Fix LUT size for int16 activations
* [#16401](https://github.com/apache/tvm/pull/16401) - [microNPU][ETHOSU] Add fixed point for matmul

### web

* [#16733](https://github.com/apache/tvm/pull/16733) - Support web indexDB cache for larger model storage
* [#16810](https://github.com/apache/tvm/pull/16810) - Support building tvm/web on Windows
* [#16825](https://github.com/apache/tvm/pull/16825) - Allow custom bc files in emcc making
* [#16791](https://github.com/apache/tvm/pull/16791) - Add `kv_state` and `rnn_state` to wasm_runtime
* [#16722](https://github.com/apache/tvm/pull/16722) - Implement linear congruential generator, make runtime seedable
* [#16650](https://github.com/apache/tvm/pull/16650) - Separate parallel shard download and iterative shard loading
* [#16694](https://github.com/apache/tvm/pull/16694) - Initial support for asyncify
* [#16631](https://github.com/apache/tvm/pull/16631) - Fix NDArrayCache loading report callback
* [#16525](https://github.com/apache/tvm/pull/16525) - Move ArtifactCache to Interface, Support Cache delete and Batch Delete, Remove typo
* [#16554](https://github.com/apache/tvm/pull/16554) - Compatibility with PagedKVCache in WebGPU
* [#16527](https://github.com/apache/tvm/pull/16527) - Revert "[Unity]Temp disable wasm exception (#16444)"
* [#16504](https://github.com/apache/tvm/pull/16504) - [Relax]Add ApplyPresenceAndRequencyPenalty
* [#16485](https://github.com/apache/tvm/pull/16485) - [wasm] Enlarge initial memory for emcc
* [#16444](https://github.com/apache/tvm/pull/16444) - [Unity]Temp disable wasm exception

### Misc

* [#16873](https://github.com/apache/tvm/pull/16873) - [Thrust] Fix thrust workspace allocation
* [#16868](https://github.com/apache/tvm/pull/16868) - [3rdparty] Bump flashinfer
* [#16871](https://github.com/apache/tvm/pull/16871) - [PageKV] allow PopN to pop all the tokens in last block
* [#16866](https://github.com/apache/tvm/pull/16866) - [3rdparty] Bump FlashInfer
* [#16863](https://github.com/apache/tvm/pull/16863) - [Picojson] Let the key of objects in json be ordered by default
* [#16856](https://github.com/apache/tvm/pull/16856) - [Thrust] Use pointer to tls pool to prevent creating new pool
* [#16850](https://github.com/apache/tvm/pull/16850) - Fixing probability comment
* [#16849](https://github.com/apache/tvm/pull/16849) - [KVCache] Initialize one extra page than specified
* [#16843](https://github.com/apache/tvm/pull/16843) - [IR] Provide well-formed intermediate in ApplyPassToFunction
* [#16772](https://github.com/apache/tvm/pull/16772) - [MSC][M5.3] Support torch.dynamo for dynamic models
* [#16839](https://github.com/apache/tvm/pull/16839) - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/cmsisnn
* [#16838](https://github.com/apache/tvm/pull/16838) - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/ethosu
* [#16831](https://github.com/apache/tvm/pull/16831) - [KVCache] Reducing CacheAuxDataManager copy size
* [#16794](https://github.com/apache/tvm/pull/16794) - [SME] Target parser support for SME
* [#16824](https://github.com/apache/tvm/pull/16824) - [KVCache] Introducing auxiliary data manager
* [#16800](https://github.com/apache/tvm/pull/16800) -
[BugTIR]fix error merging shared memory for ptx_cp_async * [#16822](https://github.com/apache/tvm/pull/16822) - [VM] Recycle VMFrame * [#16813](https://github.com/apache/tvm/pull/16813) - [KVCache] Support forking sequence at specific posotion * [#16786](https://github.com/apache/tvm/pull/16786) - [Codegen] Add check to disable invalid reinterpret * [#16816](https://github.com/apache/tvm/pull/16816) - [Cmake] Allow using custom CCCL path for thrust * [#16784](https://github.com/apache/tvm/pull/16784) - [SLM] Add unit tests for SLM to Relax exporter * [#16814](https://github.com/apache/tvm/pull/16814) - Fix includes of custom allreduce kernel * [#16806](https://github.com/apache/tvm/pull/16806) - [Debug] Improve error message in VMShapeLower * [#16802](https://github.com/apache/tvm/pull/16802) - [Debug] Improve error messages in LiftTransformParams * [#16425](https://github.com/apache/tvm/pull/16425) - [Target] Use LLVM target parser for determining Arm(R) A-Profile Architecture features * [#16797](https://github.com/apache/tvm/pull/16797) - [3rdparty] AUTO mode for custom all-reduce strategy * [#16761](https://github.com/apache/tvm/pull/16761) - [SME] Add support for inserting processor state annotations * [#16778](https://github.com/apache/tvm/pull/16778) - [Analysis] Allow calls to GlobalVar in @R.function * [#16745](https://github.com/apache/tvm/pull/16745) - [IR] Default to empty attributes, instead of NULL * [#16777](https://github.com/apache/tvm/pull/16777) - Revert "[SLM] Allow modules to define pre-processing of weights" * [#16776](https://github.com/apache/tvm/pull/16776) - [Contrib] Remove thrust "built but not used" warning * [#16757](https://github.com/apache/tvm/pull/16757) - [SLM] Allow modules to define pre-processing of weights * [#16763](https://github.com/apache/tvm/pull/16763) - [CONTRIB] Add nm symbol dump * [#16717](https://github.com/apache/tvm/pull/16717) - Enable Shared Function in LiftTransformParam Pass * 
[#16729](https://github.com/apache/tvm/pull/16729) - [Builtin] Sliding window and sink support for PagedKVCache * [#16724](https://github.com/apache/tvm/pull/16724) - Fix cpp_rtvm cmake build on Windows * [#16513](https://github.com/apache/tvm/pull/16513) - [Target] Automatically detect system triple when not specified by the user * [#16710](https://github.com/apache/tvm/pull/16710) - [CMake] Add "USE_FLASHINFER" to libinfo * [#16702](https://github.com/apache/tvm/pull/16702) - [MSC][M5.2] Enable quantize && prune with gym by wrapper * [#16699](https://github.com/apache/tvm/pull/16699) - [Transform] Remove R.Object parameters after LazyTransformParams * [#16668](https://github.com/apache/tvm/pull/16668) - [MSC][M5.1] Build wrapper to support compression * [#16693](https://github.com/apache/tvm/pull/16693) - [Contrib] Support NDArray cache taking generator * [#16412](https://github.com/apache/tvm/pull/16412) - [Lint] Add check to prevent usage of #include <regex> * [#16689](https://github.com/apache/tvm/pull/16689) - [DeviceAPI] Support "GetCurrentStream" * [#16690](https://github.com/apache/tvm/pull/16690) - Use target name instead of node name as function name * [#16683](https://github.com/apache/tvm/pull/16683) - [skip ci] Fix wasm exception flag * [#16609](https://github.com/apache/tvm/pull/16609) - Minor update docs instructions * [#16656](https://github.com/apache/tvm/pull/16656) - Simplify Windows CMake Command * [#16666](https://github.com/apache/tvm/pull/16666) - [KVCache] Fix the reference counter in sequence fork * [#16662](https://github.com/apache/tvm/pull/16662) - Fixing workload comment * [#16595](https://github.com/apache/tvm/pull/16595) - [Transform] Check for zero-param operators in LiftTransformParams * [#16599](https://github.com/apache/tvm/pull/16599) - [Transform] De-duplicate MatchCast nodes in EliminateCommonSubexpr * [#16596](https://github.com/apache/tvm/pull/16596) - [Transform] Implement relax.transform.ReorderPermuteDimsAfterConcat * 
[#16597](https://github.com/apache/tvm/pull/16597) - [Transform] Allow explicit name of bundled model parameters * [#16602](https://github.com/apache/tvm/pull/16602) - [Transform] Improvements to LazyTransformParams * [#16606](https://github.com/apache/tvm/pull/16606) - [KVCache] Support passing in attn_score_scaling_factor into KV cache * [#16608](https://github.com/apache/tvm/pull/16608) - Extend gpu memory bandwidth test to work through RPC * [#16587](https://github.com/apache/tvm/pull/16587) - [Debug] Improve error message for codegen pattern mismatches * [#16570](https://github.com/apache/tvm/pull/16570) - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 1 * [#16576](https://github.com/apache/tvm/pull/16576) - Update the 3rdparty/libflash_attn submodule * [#16580](https://github.com/apache/tvm/pull/16580) - [KVCache] Support mode "None" for Rotary Embebdding * [#16578](https://github.com/apache/tvm/pull/16578) - [KVCache] Support returning query positions * [#16571](https://github.com/apache/tvm/pull/16571) - Fix compile warnings * [#16540](https://github.com/apache/tvm/pull/16540) - [Upd] Enable lld search to include /opt/rocm/llvm/bin for rocm * [#16539](https://github.com/apache/tvm/pull/16539) - Improve error message in NDArray::CopyFromTo * [#16524](https://github.com/apache/tvm/pull/16524) - [Build] Improving debug and build-dir options * [#16551](https://github.com/apache/tvm/pull/16551) - [KVCache] Fix attention kernel for ROCm * [#16512](https://github.com/apache/tvm/pull/16512) - Cut pytest-lazy-fixture * [#16506](https://github.com/apache/tvm/pull/16506) - Bump 3rdparty/cutlass_fpA_intB_gemm version * [#16511](https://github.com/apache/tvm/pull/16511) - [Minor] Fix Clang compilation warning in fuse_tir.cc and codegen_c_host.cc * [#16516](https://github.com/apache/tvm/pull/16516) - Add Relax, Unity Tags in make_notes.py * [#16497](https://github.com/apache/tvm/pull/16497) - [Instrument] Add default instrument to print all passes * 
[#16494](https://github.com/apache/tvm/pull/16494) - [DPL] Support tir_vars field in is_call_tir pattern * [#16453](https://github.com/apache/tvm/pull/16453) - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm * [#16454](https://github.com/apache/tvm/pull/16454) - [BugTIR] fix thread_sync occurs in letstmt * [#16468](https://github.com/apache/tvm/pull/16468) - [LINT] Fix pylint issues in test_dma_builtin.py * [#16413](https://github.com/apache/tvm/pull/16413) - [Contrib] Workspace for cuBLAS backend * [#16460](https://github.com/apache/tvm/pull/16460) - [Cherry-pick][MSC][M4.1] Add plugin && plugin_builder, enable build and test in different frameworks (#16397) * [#16461](https://github.com/apache/tvm/pull/16461) - [Minor] Fix Docstring for sphinx-build * [#16431](https://github.com/apache/tvm/pull/16431) - [Schedule] Loop-Partition Scheduling Primitive * [#16451](https://github.com/apache/tvm/pull/16451) - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/ethosu * [#16452](https://github.com/apache/tvm/pull/16452) - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/cmsisnn * [#16445](https://github.com/apache/tvm/pull/16445) - [skip ci] update branch rule to prepare for unity transition * [#16426](https://github.com/apache/tvm/pull/16426) - [CMake] Enable cuda lang if USE_CUDA is on * [#16407](https://github.com/apache/tvm/pull/16407) - Add NVIDIA Hopper H100 target tag * [#16398](https://github.com/apache/tvm/pull/16398) - [DeviceAPI] Support querying total global memory * [#16357](https://github.com/apache/tvm/pull/16357) - [RPC] Fix tuning on macOS and Windows (#15771) * [#16386](https://github.com/apache/tvm/pull/16386) - [Thrust] Use no sync exec policy and caching allocator * [#16343](https://github.com/apache/tvm/pull/16343) - [CMake][MSVC] Disable permissive mode for MSVC builds * [#16242](https://github.com/apache/tvm/pull/16242) - [Codegen] Fix if_then_else codegen * [#16341](https://github.com/apache/tvm/pull/16341) - [CMake] Use ccache as 
CMAKE_CUDA_COMPILER_LAUNCHER * [#16332](https://github.com/apache/tvm/pull/16332) - Change metal dtype of ceil_log2 to fp32