# Introduction

The TVM community has worked since the v0.16.0 release to deliver the following new exciting improvements! This release version is v0.17.0.rc0.
The main tags are below (**bold text marks areas with lots of progress**):

- Community, RFCs
- AOT, Hexagon, OpenCL & CLML, Web, Metal
- **Relax**, **Dlight**, **Disco**
- TIR, TVMScript
- Docs, CI, **Misc**, **BugFix**

Please visit the full listing of commits for a complete view: [v0.17.dev0...v0.17.0.rc0](https://github.com/apache/tvm/compare/v0.17.dev0...v0.17.0.rc0).

### Community

* [#17018](https://github.com/apache/tvm/pull/17018) - New committer: Balint Cristian

### RFCs

This new RFC adds a frontend for NNEF, an open, standardized format for neural network exchange developed by the Khronos Group since 2018 (https://www.khronos.org/nnef). NNEF is aimed at deploying trained neural networks from deep learning frameworks to the proprietary inference engines of neural network hardware vendors.

* [#108](https://github.com/apache/tvm-rfcs/pull/108) - [RFC] Add NNEF frontend
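As a rough illustration of where this RFC is headed, importing an NNEF model could look like the sketch below. This is a hypothetical sketch only: the `from_nnef` entry point and its signature are assumptions modeled on TVM's other frontends, and the actual API lands with the frontend implementation rather than with this RFC.

```python
# Hypothetical sketch: the entry point name `from_nnef` and its signature
# are assumed by analogy with other TVM frontends (e.g. from_onnx).
import tvm
from tvm import relay

# Load a trained NNEF model and convert it to a Relay module (assumed API).
mod, params = relay.frontend.from_nnef("model.nnef")

# Compile for a local CPU target, as with any other imported model.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```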
----

### AOT

* [#17077](https://github.com/apache/tvm/pull/17077) - Correctly calculate workspace for vector types

### Adreno

* [#16927](https://github.com/apache/tvm/pull/16927) - [SCRIPT] Fix in build config for adreno

### BYOC

* [#16895](https://github.com/apache/tvm/pull/16895) - Add layout check and update shape check for cublas FP8 BYOC

### BugFix

* [#17138](https://github.com/apache/tvm/pull/17138) - [Fix][TIR] Fix outdated call to create extern buffer in make_extern
* [#17132](https://github.com/apache/tvm/pull/17132) - Restrict CopyOnWrite to _type_final
* [#17096](https://github.com/apache/tvm/pull/17096) - Update FAttrsGetter to return Map<String, ObjectRef>
* [#17078](https://github.com/apache/tvm/pull/17078) - [NCCL] Release NCCL thread_local resources in destructor
* [#17044](https://github.com/apache/tvm/pull/17044) - [Support] Fix copy constructor for support::OrderedSet
* [#17000](https://github.com/apache/tvm/pull/17000) - [MSC] Split name_string with index by colon from the right
* [#16923](https://github.com/apache/tvm/pull/16923) - [Fix][Dlight] Fix GeneralReduction for log-sum-exp
* [#16924](https://github.com/apache/tvm/pull/16924) - [Fix] Fix SSA conversion for SizeVar retention
* [#16903](https://github.com/apache/tvm/pull/16903) - CudaDeviceAPI::GetAttr may check kExist when GPUs absent
* [#16901](https://github.com/apache/tvm/pull/16901) - Fix rocm shared memory issue on MI250

### CI

* [#17055](https://github.com/apache/tvm/pull/17055) - [SME][Test] Add additional conv2d tests for asymmetric parameters
* [#17007](https://github.com/apache/tvm/pull/17007) - [TOPI][Testing] Enable conv2d NHWC fp16 topi testing for `arm_cpu`
* [#16930](https://github.com/apache/tvm/pull/16930) - [UnitTest] Use pytest's scope='session' for tvm.testing.parameter
* [#16948](https://github.com/apache/tvm/pull/16948) - Update image tag to 20240428-060115-0b09ed018
* [#16931](https://github.com/apache/tvm/pull/16931) - Use LLVM17 for tests on `ci_cpu`
* [#16942](https://github.com/apache/tvm/pull/16942) - Enable Conda setup v3
* [#16939](https://github.com/apache/tvm/pull/16939) - Upgrade CUDA to 12.4

### CRT

* [#17097](https://github.com/apache/tvm/pull/17097) - [Bugfix] Return error code on error from ModuleGetFunction

### Disco

* [#17035](https://github.com/apache/tvm/pull/17035) - [QoL] Implement broadcast/scatter methods for Session (see the sketch after this list)
* [#16992](https://github.com/apache/tvm/pull/16992) - [Bugfix] Handle NDArray larger than OS buffer for pipe
* [#16978](https://github.com/apache/tvm/pull/16978) - Implement `num_workers` property for `disco.Session`
* [#16989](https://github.com/apache/tvm/pull/16989) - Treat hangup of disco worker process as kShutdown
* [#16993](https://github.com/apache/tvm/pull/16993) - Allow allocation that only exists on worker0
* [#16979](https://github.com/apache/tvm/pull/16979) - Expose disco.Session.shutdown through the python API
* [#16919](https://github.com/apache/tvm/pull/16919) - Improve error message for CallPacked
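A minimal sketch of the session-level API touched above, assuming a CPU-only `ProcessSession` with two workers; the exact constructor arguments may vary by build, and the broadcast/scatter signatures from #17035 are best checked in the `disco.Session` docs.

```python
# Hedged sketch: exercising the disco session API from the PRs above.
# Assumes tvm.runtime.disco.ProcessSession is available in this build.
from tvm.runtime import disco

sess = disco.ProcessSession(num_workers=2)

# num_workers property (#16978)
assert sess.num_workers == 2

# Explicit shutdown through the Python API (#16979)
sess.shutdown()
```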
### Dlight

* [#17082](https://github.com/apache/tvm/pull/17082) - Use 16x32 spatial x reduction thread extents in GEMV scheduling
* [#17052](https://github.com/apache/tvm/pull/17052) - Skip GEMV rules when more than one vector
* [#17026](https://github.com/apache/tvm/pull/17026) - Perf improvement for low_batch_gemv on Metal
* [#17016](https://github.com/apache/tvm/pull/17016) - Update Adreno GEMV Rules
* [#16972](https://github.com/apache/tvm/pull/16972) - [GPU] Enhance opencl thread limit for schedules
* [#16973](https://github.com/apache/tvm/pull/16973) - [GPU] Improved gemv outer fallback schedule
* [#16958](https://github.com/apache/tvm/pull/16958) - Check for target in function attributes
* [#16894](https://github.com/apache/tvm/pull/16894) - Enhance vectorization for gpu matmul
* [#16884](https://github.com/apache/tvm/pull/16884) - Add check for matmul dtype and fix reduction rule

### Docs

* [#17146](https://github.com/apache/tvm/pull/17146) - [DOC] Fix typo for the "We utilize the intermediate representation of nn.Graph to convert the OneFlow model to Reley."
* [#17015](https://github.com/apache/tvm/pull/17015) - [DOC] Update Model Links to Include Commit

### Frontend

* [#17014](https://github.com/apache/tvm/pull/17014) - [ArgParse] Pass default values to target compiler (#13264)
* [#16961](https://github.com/apache/tvm/pull/16961) - [Bugfix][ONNX] Improve broadcast and batch_matmul conversion
* [#16936](https://github.com/apache/tvm/pull/16936) - [TFLite] Add support for GELU conversion

### Hexagon

* [#17123](https://github.com/apache/tvm/pull/17123) - Add support for v75

### LLVM

* [#17046](https://github.com/apache/tvm/pull/17046) - [Arith][SVE] Add rewrite rules for indices split by scalable expressions
* [#16966](https://github.com/apache/tvm/pull/16966) - [SVE] Add support for representing and creating buffer-level predicates
* [#17001](https://github.com/apache/tvm/pull/17001) - [SVE] Use only powers of two as possible vscale values
* [#16962](https://github.com/apache/tvm/pull/16962) - [SVE] Add codegen support for `vscale_range()` function attribute
* [#16968](https://github.com/apache/tvm/pull/16968) - Stringref API deprecation fixes
* [#16965](https://github.com/apache/tvm/pull/16965) - [SVE] Add get_active_lane_mask builtin
* [#16899](https://github.com/apache/tvm/pull/16899) - [SVE][TOPI] Add conv2d NHWC hybrid SVE schedule for `arm_cpu`
* [#16893](https://github.com/apache/tvm/pull/16893) - [SVE] Check for SVE target in VectorizeLoop
* [#16862](https://github.com/apache/tvm/pull/16862) - [SVE] Support splitting by vscale in `tir::split` and `te::split`

### MetaSchedule

* [#17012](https://github.com/apache/tvm/pull/17012) - [BugFix] MultiLevelTilingTensorCore generates inconsistent thread-binding sketch for batched matmul
* [#17066](https://github.com/apache/tvm/pull/17066) - [BugFix] Fix TensorIntrin `dot_4x4_i8i8s32_sdot` is not registered

### Metal

* [#17059](https://github.com/apache/tvm/pull/17059) - Enable Debug Label
* [#17025](https://github.com/apache/tvm/pull/17025) - Support metal device profiling

### OpenCL & CLML

* [#16933](https://github.com/apache/tvm/pull/16933) - [CLML] Fix in clml pattern check condition
* [#16929](https://github.com/apache/tvm/pull/16929) - [VM][OPENCL] Take advantage of OpenCL host ptr for improved copy

### ROCm

* [#17141](https://github.com/apache/tvm/pull/17141) - [Backend] Fix error when building TVM with LLVM 19

### Relax

* [#17139](https://github.com/apache/tvm/pull/17139) - Fix cublas dispatch for corner cases
* [#17127](https://github.com/apache/tvm/pull/17127) - [KVCache] Support fork in sliding window sink part
* [#17115](https://github.com/apache/tvm/pull/17115) - Support `input_axis_separator` to allow 2D to 1D conversion
* [#17119](https://github.com/apache/tvm/pull/17119) - [Bugfix] Set purity=false for LazySetOutput
* [#17118](https://github.com/apache/tvm/pull/17118) - [VM] Improved error messages for mismatched parameter count
* [#17110](https://github.com/apache/tvm/pull/17110) - Alloc BYOC workspace with R.builtin.alloc_tensor
* [#17089](https://github.com/apache/tvm/pull/17089) - [ONNX] Add support for HardSigmoid
* [#17100](https://github.com/apache/tvm/pull/17100) - [KVCache] Unlimited depth blocks
* [#17075](https://github.com/apache/tvm/pull/17075) - [Transform] Modify FuseTIR pass to propagate buffer attributes
* [#17088](https://github.com/apache/tvm/pull/17088) - [ONNX] Add support for HardSwish
* [#17085](https://github.com/apache/tvm/pull/17085) - [PyTorch] Add support for torch.nn.Hardsigmoid
* [#17083](https://github.com/apache/tvm/pull/17083) - [TVMScript] Preserve tir.SizeVar through TVMScript round-trip
* [#17086](https://github.com/apache/tvm/pull/17086) - Ignore dynamic parameters in RewriteDataflowReshape
* [#17084](https://github.com/apache/tvm/pull/17084) - [PyTorch] Add support for torch.nn.Hardswish
* [#17074](https://github.com/apache/tvm/pull/17074) - [KVCache][Test] Fix TIR attn kernels for uncommon group size
* [#17067](https://github.com/apache/tvm/pull/17067) - Add missing white spaces in error messages
* [#17061](https://github.com/apache/tvm/pull/17061) - [Frontend][Onnx] Cast Op special handling for ShapeExpr input
* [#17033](https://github.com/apache/tvm/pull/17033) - [Bugfix] Apply FuseOps to nested DataflowBlock
* [#17032](https://github.com/apache/tvm/pull/17032) - [Bugfix] Annotate ComputePrimValue output as host function
* [#17034](https://github.com/apache/tvm/pull/17034) - [Bugfix] Bind symbolic variables in R.match_cast
* [#16960](https://github.com/apache/tvm/pull/16960) - [UnitTest] Validate IRModule with multiple targets
* [#16995](https://github.com/apache/tvm/pull/16995) - [KVCache] Support KVCache decode from forked sequence and pop more tokens
* [#16959](https://github.com/apache/tvm/pull/16959) - [Transform] Handle identical PrimFunc with distinct VDevice
* [#16589](https://github.com/apache/tvm/pull/16589) - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
* [#16988](https://github.com/apache/tvm/pull/16988) - [KVCache] Fix the aux data syncing order of paged KV cache
* [#16922](https://github.com/apache/tvm/pull/16922) - [BugFix] Change FuseOpsByPattern strategy to pattern-match maximal subgraph
* [#16982](https://github.com/apache/tvm/pull/16982) - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
* [#16955](https://github.com/apache/tvm/pull/16955) - Implement relax.op.view
* [#16971](https://github.com/apache/tvm/pull/16971) - Support nested ModuleList in nn.Module
* [#16826](https://github.com/apache/tvm/pull/16826) - Express dynamic arguments of strided_slice as arguments
* [#16476](https://github.com/apache/tvm/pull/16476) - [Unity][Cutlass] Fix C source generation of dense operation
* [#16940](https://github.com/apache/tvm/pull/16940) - Allow PrimValue as index in relax.op.take (see the sketch after this list)
* [#16934](https://github.com/apache/tvm/pull/16934) - [TIR] Introduce new `cumsum` op for gpu
* [#16859](https://github.com/apache/tvm/pull/16859) - [QoL] Use SeqExpr in IR types when SeqExpr is required
* [#16904](https://github.com/apache/tvm/pull/16904) - Prevent generating duplicate funcs in dispatch_sort_scan
* [#16905](https://github.com/apache/tvm/pull/16905) - [Bugfix] Raise exception for OOM allocation
* [#16827](https://github.com/apache/tvm/pull/16827) - Handle binary operations between Tensor and PrimValue
* [#16902](https://github.com/apache/tvm/pull/16902) - Allow specifying entry_funcs for BYOC
* [#16860](https://github.com/apache/tvm/pull/16860) - [QoL] Infer StructInfo for relax::Tuple on construction
* [#16861](https://github.com/apache/tvm/pull/16861) - [QoL] Return well-formed IR from relax::Function::CreateEmpty
* [#16886](https://github.com/apache/tvm/pull/16886) - [Frontend] Fix sort, argsort and topk in nn module
* [#16883](https://github.com/apache/tvm/pull/16883) - Stabilize relax pass mutation order
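A minimal sketch of the Tensor/PrimValue interplay from #16940 and #16827; it constructs Relax IR directly, and the struct-info details are assumptions based on the PR titles rather than a verified excerpt from the test suite.

```python
# Hedged sketch: a scalar PrimValue used as the index of relax.op.take (#16940).
from tvm import relax

x = relax.Var("x", relax.TensorStructInfo((4, 8), "float32"))
idx = relax.PrimValue(0)  # a scalar PrimValue instead of a tensor of indices

# Builds a call node that takes row 0 of x along axis 0.
y = relax.op.take(x, idx, axis=0)
```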
### Relay

* [#16983](https://github.com/apache/tvm/pull/16983) - [BugFix] Skip leaf args when matching 'path' part for dominator pattern
* [#16996](https://github.com/apache/tvm/pull/16996) - Fix to make TupleGetItem inherit the previous span

### Runtime

* [#17057](https://github.com/apache/tvm/pull/17057) - Stateless interface of PagedKVCache leaf node commit
* [#17049](https://github.com/apache/tvm/pull/17049) - Support PagedKVCache with tree attention
* [#17045](https://github.com/apache/tvm/pull/17045) - Fix PagedKVCache for PopN and enhance tests
* [#16998](https://github.com/apache/tvm/pull/16998) - Compatibility with dmlc::Stream API changes
* [#17037](https://github.com/apache/tvm/pull/17037) - [ROCm] Enable ROCm host memory support
* [#17036](https://github.com/apache/tvm/pull/17036) - Use preferred host memory (pinned memory) in KV cache
* [#16994](https://github.com/apache/tvm/pull/16994) - Allow query of available device memory through DeviceAPI
* [#16997](https://github.com/apache/tvm/pull/16997) - [Disco] Restore checks for hangup of disco pipe
* [#16938](https://github.com/apache/tvm/pull/16938) - Allow offset to be specified in NDArray::CreateView
* [#16890](https://github.com/apache/tvm/pull/16890) - [VULKAN] Support total_global_memory
* [#16880](https://github.com/apache/tvm/pull/16880) - Implemented Datatype.itemsize() (see the sketch after this list)
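A quick sketch of the new `itemsize` accessor from #16880. The call form follows the PR title; the vector-dtype line is an assumption about how lanes factor into the result.

```python
# Hedged sketch: per-element byte size of a runtime DataType (#16880).
from tvm.runtime import DataType

print(DataType("float32").itemsize())    # 4 bytes
print(DataType("float16x4").itemsize())  # 8 bytes: 2 bytes/lane * 4 lanes (assumed)
```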
### TIR

* [#17134](https://github.com/apache/tvm/pull/17134) - [Schedule] Remove `@type_check` for `set_axis_separator`
* [#17112](https://github.com/apache/tvm/pull/17112) - [DLight] Enable SimdGroup op for Metal
* [#17098](https://github.com/apache/tvm/pull/17098) - [RPC] Allow RPC calls to compiled PrimFuncs with no arguments
* [#17039](https://github.com/apache/tvm/pull/17039) - Fix Bug in VectorizeLoop
* [#17030](https://github.com/apache/tvm/pull/17030) - Fix Shuffle rewrite
* [#16947](https://github.com/apache/tvm/pull/16947) - Support narrow dtype for let binding
* [#16952](https://github.com/apache/tvm/pull/16952) - Enhance CLZ intrinsic support
* [#16945](https://github.com/apache/tvm/pull/16945) - [Compute-at] Make compute-ated block simple when the predicate could be merged
* [#16879](https://github.com/apache/tvm/pull/16879) - Make T.reinterpret a no-op when dtype is the same

### TOPI

* [#17091](https://github.com/apache/tvm/pull/17091) - Add dense schedule for fp16 and fp32 using gemm
* [#17048](https://github.com/apache/tvm/pull/17048) - [SME] Add conv2d NHWC SME fp16->fp32 schedule
* [#17040](https://github.com/apache/tvm/pull/17040) - Fix SME conv2d schedule import and intrin argument
* [#17003](https://github.com/apache/tvm/pull/17003) - [SME] Add conv2d NHWC SME fp32 schedule
* [#16977](https://github.com/apache/tvm/pull/16977) - Remove `blockIdx.z` in topi sort
* [#16951](https://github.com/apache/tvm/pull/16951) - Revert unification of conv2d NHWC hybrid scheduling for `arm_cpu` targets

### TVMScript

* [#17107](https://github.com/apache/tvm/pull/17107) - Better Type Annotation for TIR OP
* [#16967](https://github.com/apache/tvm/pull/16967) - Fix error reporting inside Macro func
* [#16916](https://github.com/apache/tvm/pull/16916) - Support `T.launch_thread` with i64 dtype
* [#16876](https://github.com/apache/tvm/pull/16876) - Optionally use `ruff format` instead of `black`
* [#16877](https://github.com/apache/tvm/pull/16877) - [Bug] Add test case for missing symbolic bounds

### cuda & cutlass & tensorrt

* [#16980](https://github.com/apache/tvm/pull/16980) - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

### web

* [#17031](https://github.com/apache/tvm/pull/17031) - Fix string to uint8 array for special characters
* [#17028](https://github.com/apache/tvm/pull/17028) - Add dtype and offset for CreateView in runtime
* [#16910](https://github.com/apache/tvm/pull/16910) - Support string[] in setPackedFunc() and exceptionally long arrays

### Misc

* [#17135](https://github.com/apache/tvm/pull/17135) - [QoL][IR] Provide default constructor for NameSupply/GlobalVarSupply
* [#17125](https://github.com/apache/tvm/pull/17125) - [Utils] Define line-length for "ruff format"
* [#17152](https://github.com/apache/tvm/pull/17152) - GraphExecutor: Fix wild pointer assign when input and output are reshape
* [#17150](https://github.com/apache/tvm/pull/17150) - [WebGPU] Fall back to 256MB for maxBufferSize if needed
* [#17128](https://github.com/apache/tvm/pull/17128) - [Compute-inline] Prefer T.where for reverse compute-inlined block with predicate
* [#16976](https://github.com/apache/tvm/pull/16976) - [WebGPU] Implement `tir.dp4a` with WGSL built-in function `dot4I8Packed`
* [#17124](https://github.com/apache/tvm/pull/17124) - [WebGPU] Add `tir.dp4a`
* [#17113](https://github.com/apache/tvm/pull/17113) - [CudaGraph] Handle exceptions thrown while capturing cuda graph
* [#17094](https://github.com/apache/tvm/pull/17094) - [Utility][Container] Support non-nullable types in Array::Map
* [#17101](https://github.com/apache/tvm/pull/17101) - [RPC] Raise error if server process terminated
* [#17092](https://github.com/apache/tvm/pull/17092) - [UnitTests] Use tvm.ir.assert_structural_equal whenever possible (see the sketch after this list)
* [#17054](https://github.com/apache/tvm/pull/17054) - [SME] Utilize predication in fp32 matmul and conv2d schedules
* [#17079](https://github.com/apache/tvm/pull/17079) - [CMake] Show NVCC include directories in compile_commands.json
* [#17076](https://github.com/apache/tvm/pull/17076) - [SME] Extract gemm block correctly when fused with bias
* [#17071](https://github.com/apache/tvm/pull/17071) - [WebGPU] Translate `int8x4` into `u32`
* [#17065](https://github.com/apache/tvm/pull/17065) - [FP8][Codegen] Add make_fp8 vector constructors
* [#17064](https://github.com/apache/tvm/pull/17064) - Add docs of v0.15.0 and v0.16.0
* [#16985](https://github.com/apache/tvm/pull/16985) - [CODEGEN] Vector-Codegen support for llvm-pure-intrin
* [#17058](https://github.com/apache/tvm/pull/17058) - Introduce outer reduction for metal
* [#17051](https://github.com/apache/tvm/pull/17051) - Use adapter.info when available instead of requestAdapterInfo
* [#16981](https://github.com/apache/tvm/pull/16981) - [SME] Add scalable fp16->fp32 dense schedule
* [#17029](https://github.com/apache/tvm/pull/17029) - [Contrib] Implement NDArray cache update
* [#17027](https://github.com/apache/tvm/pull/17027) - [picojson] Let objects be ordered when serializing
* [#17021](https://github.com/apache/tvm/pull/17021) - [WebGPU] Update error messages to be more user-friendly
* [#17010](https://github.com/apache/tvm/pull/17010) - Support multinomial_from_uniform dispatch
* [#16999](https://github.com/apache/tvm/pull/16999) - [USMP] Add missing const specifier for global_const_workspace
* [#17005](https://github.com/apache/tvm/pull/17005) - [WebGPU] Handle device OOM in createBuffer
* [#16921](https://github.com/apache/tvm/pull/16921) - [SME] Introduce scalable fp32 dense schedule
* [#16957](https://github.com/apache/tvm/pull/16957) - chore: remove repetitive words
* [#16909](https://github.com/apache/tvm/pull/16909) - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
* [#16987](https://github.com/apache/tvm/pull/16987) - [JVM] Automatic Compatibility of JVM AttachCurrentThread
* [#16974](https://github.com/apache/tvm/pull/16974) - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
* [#16896](https://github.com/apache/tvm/pull/16896) - [CUBLAS] Enable offloading of R.matmul + R.dequantize
* [#16956](https://github.com/apache/tvm/pull/16956) - Add script for testing release package
* [#16908](https://github.com/apache/tvm/pull/16908) - Override StructuralEqual() for easy usage
* [#16932](https://github.com/apache/tvm/pull/16932) - Enable gemv schedule for adreno
* [#16935](https://github.com/apache/tvm/pull/16935) - [3rdparty] Bump FlashInfer for sampling functions
* [#16937](https://github.com/apache/tvm/pull/16937) - [Thrust] Increase static workspace size
* [#16915](https://github.com/apache/tvm/pull/16915) - [Marvell BYOC] Marvell AI Accelerator Integration - Phase 2
* [#16741](https://github.com/apache/tvm/pull/16741) - Restore "pytest.mark.gpu" for RELAX tests
* [#16914](https://github.com/apache/tvm/pull/16914) - [CMAKE] Make LOG_BEFORE_THROW explicit
* [#16913](https://github.com/apache/tvm/pull/16913) - Enhance Release Note Script and Remove Useless File
* [#16907](https://github.com/apache/tvm/pull/16907) - [Upd] Fixed lld search in rocm
* [#16900](https://github.com/apache/tvm/pull/16900) - [CMAKE] Misc improvement of Util
* [#16897](https://github.com/apache/tvm/pull/16897) - [Target] Don't register AArch64 target tags without LLVM compiler support
* [#16892](https://github.com/apache/tvm/pull/16892) - [CUBLAS] Set fp32 compute and scale dtypes in fp16 matmul
* [#16888](https://github.com/apache/tvm/pull/16888) - [CUBLAS][FP8] Support e4m3 gemm in cuBLAS BYOC
* [#16887](https://github.com/apache/tvm/pull/16887) - [Contrib] Enable fp16 for thrust sort
* [#16881](https://github.com/apache/tvm/pull/16881) - [release][Dont Squash] Update version to 0.16.0 and 0.17.0.dev on main branch
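To close with the testing utility highlighted in #17092, here is a minimal, self-contained sketch of `tvm.ir.assert_structural_equal`. The two functions below differ only in the names of bound variables, which structural equality maps positionally.

```python
# Minimal sketch of tvm.ir.assert_structural_equal (#17092).
import tvm
from tvm import tir

x = tir.Var("x", "int32")
y = tir.Var("y", "int32")

# Two PrimFuncs that differ only in the names of their bound parameters.
f1 = tir.PrimFunc([x], tir.Evaluate(x))
f2 = tir.PrimFunc([y], tir.Evaluate(y))

# Passes: bound variables are matched positionally, so names don't matter.
tvm.ir.assert_structural_equal(f1, f2)
```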