This is an automated email from the ASF dual-hosted git repository.

spectrometerHBH pushed a commit to branch tir-bench
in repository https://gitbox.apache.org/repos/asf/tvm.git

commit a394fd58b073399eb75be81fe31201cf5b9247af
Author: Bohan Hou <[email protected]>
AuthorDate: Sun May 24 10:43:24 2026 -0700

    docs: update tir bench baseline results (#642)
---
 .claude/commands/tir-bench.md | 317 ++++++++++++++++++++++--------------------
 1 file changed, 167 insertions(+), 150 deletions(-)

diff --git a/.claude/commands/tir-bench.md b/.claude/commands/tir-bench.md
index 515863829b..06dbb6f680 100644
--- a/.claude/commands/tir-bench.md
+++ b/.claude/commands/tir-bench.md
@@ -5,14 +5,14 @@ Run kernel performance benchmarks to verify codegen changes.
 All commands use `--warmup 100 --repeat 30` for ~3-minute total runtime with reliable medians. Drop to defaults only when chasing a sub-2% regression.

 - **GEMM**: square GEMM at M=N=K in {1024, 2048, 4096, 8192, 16384} for three variants:
-  - fp16: `python -m tirx_kernels.bench --kernel fp16_bf16_gemm --warmup 100 --repeat 30`
-  - fp8: `python -m tirx_kernels.bench --kernel fp8_blockwise_gemm --warmup 100 --repeat 30`
-  - nvfp4: `python -m tirx_kernels.bench --kernel nvfp4_gemm --warmup 100 --repeat 30`
+  - fp16: `python -m tirx_kernels.bench --kernel fp16_bf16_gemm --warmup 100 --repeat 30 --timer proton`
+  - fp8: `python -m tirx_kernels.bench --kernel fp8_blockwise_gemm --warmup 100 --repeat 30 --timer proton`
+  - nvfp4: `python -m tirx_kernels.bench --kernel nvfp4_gemm --warmup 100 --repeat 30 --timer proton`
 - **FA4** (flash_attention4): all registered configs
-  - `python -m tirx_kernels.bench --kernel flash_attention4 --warmup 100 --repeat 30`
+  - `python -m tirx_kernels.bench --kernel flash_attention4 --warmup 100 --repeat 30 --timer proton`
 - **MQA logits** (fp8 / fp4): all registered configs
-  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp8_mqa_logits --warmup 100 --repeat 30`
-  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp4_mqa_logits --warmup 100 --repeat 30`
+  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp8_mqa_logits --warmup 100 --repeat 30 --timer proton`
+  - `python -m tirx_kernels.bench --kernel deepgemm_sm100_fp4_mqa_logits --warmup 100 --repeat 30 --timer proton`

 ## Steps

@@ -21,9 +21,15 @@ All commands use `--warmup 100 --repeat 30` for ~3-minute total runtime with rel
    export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | sort -t',' -k2 -n | head -1 | cut -d',' -f1 | tr -d ' ')
    ```

-2. Run benchmarks for each kernel using the commands above.
+2. Record the exact provenance for every implementation in the result table:
+   - `tir`: `git rev-parse HEAD` from the TIR checkout being tested.
+   - `tirx-kernels`: `git rev-parse HEAD` from the kernel checkout used by `python -m tirx_kernels.bench`.
+   - Each git-backed baseline implementation repo, such as DeepGEMM, FlashInfer, or FlashAttention: repo path and full commit SHA.
+   - Package/system baselines without a local git repo, such as `torch-cublas`: package version, package git version when available, CUDA version, and any library version that is easy to query.

-3. Present results in a table: kernel x config, with times in ms.
+3. Run benchmarks for each kernel using the commands above.
+
+4. Present results in a table: kernel x config, with times in us, followed by the provenance block from step 2.

 ## When to use

@@ -31,165 +37,176 @@ When modifying anything that affects code generation: kernels, op dispatches, lo

 ## Reference baseline

-Captured 2026-05-17 on B200 (sm_100a), GPU 7, `warmup=100 repeat=30`, `timer=proton`.
+Captured 2026-05-24 on B200 (sm_100a), physical GPU 2, `warmup=100 repeat=30`, `timer=proton`.

-- `tir` @ `587f439c4c` (branch `scope-id`, with `feat(exec-scope): infer scope_id extent from sibling defs when omitted` on top of upstream tirx `c9ee147baf`)
-- `tirx-kernels` @ `fdab8ac5` (branch `scope-id`, with `perf(kernel): hoist mqa_fp8 warpgroup index` on top of upstream `ae8673c9`)
+Rows that were below `0.95x` in the full run were rerun twice; those rows use the median over the full-run measurement plus the two reruns (`runs=3`). Other rows use the full-run measurement (`runs=1`). All times are in us. `baseline/tirx > 1` means TIRX is faster.

-All times in us. `baseline/tirx` > 1 means TIRX faster.
+- raw full-run results: `/home/bohanhou/tirx-kernels/.porting/tir_bench_full/20260524T165849Z_gpu2_w100_r30`
+- low-ratio reruns: `/home/bohanhou/tirx-kernels/.porting/tir_bench_full/20260524T165849Z_gpu2_w100_r30/reruns_lt095`

-### `fp16_bf16_gemm` (baseline=`torch-cublas`)
+Implementation provenance:
+- `tir` : `/home/bohanhou/tir` @ `2a3241a267003249bdf45555a5c6bbcc2a03c90b` (commit date `2026-05-17T00:46:03-04:00`, `feat(op): add bounded mbarrier wait (#627)`, dirty local worktree)
+- `tirx-kernels` : `/home/bohanhou/tirx-kernels` @ `127cd12de4e4962d4499a39d37beb8a6d4105306` (commit date `2026-05-17T00:44:53-04:00`, `feat(nymph): add experimental lowering stack and acceptance gates (#290)`, dirty local worktree)

-| config | torch-cublas | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `fp16_1024x1024x1024` | 5.73us | 16.54us | 0.347 |
-| `fp16_2048x2048x2048` | 16.40us | 27.91us | 0.588 |
-| `fp16_4096x4096x4096` | 95.19us | 94.34us | 1.009 |
-| `fp16_8192x8192x8192` | 823.15us | 843.04us | 0.976 |
-| `fp16_16384x16384x16384` | 6093.33us | 6128.95us | 0.994 |
-| `bf16_1024x1024x1024` | 5.72us | 16.51us | 0.347 |
-| `bf16_2048x2048x2048` | 16.13us | 27.77us | 0.581 |
-| `bf16_4096x4096x4096` | 92.25us | 91.35us | 1.010 |
-| `bf16_8192x8192x8192` | 756.17us | 781.91us | 0.967 |
-| `bf16_16384x16384x16384` | 5823.27us | 5809.98us | 1.002 |
+Baseline implementation provenance:

-### `fp8_blockwise_gemm` (baseline=`deepgemm`)
+- `torch-cublas` : PyTorch `2.11.0+cu130`, `torch.version.git_version=70d99e998b4955e0049d13a98d77ae1b14db1f45`, `torch.version.cuda=13.0`
+- `deepgemm` : `/home/bohanhou/DeepGEMM` @ `714dd1a4a980f7937a74343d19a8eba4fe321480` (commit date `2026-05-11T19:20:18+08:00`, `Update test_mega_moe.py`)
+- `flashinfer` : `/home/bohanhou/flashinfer` @ `bff85f3459707d5d2f1426d1ded4a320ab142078` (commit date `2026-05-22T14:36:34-07:00`, `feat: integrate cute-dsl Blackwell GQA decode into BatchDecodeWithPagedKVCacheWrapper (#3360)`, `flashinfer.__version__=0.6.11.post1`)
+- `flashattn_sm100` : `/home/bohanhou/flash-attention` @ `3da76cdb8aedd842c46511e5194f5f20cdd4cf6f` (commit date `2026-05-22T16:00:00-07:00`, `Build Fix: Update abi3 tag to cp310 and minimum python version to 3.10 (#2532)`)
+Notes:

-| config | deepgemm | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `smoke_1024x1024x1024` | 6.07us | 5.91us | 1.026 |
-| `deepgemm_m4096_n2112_k7168` | 49.86us | 48.96us | 1.018 |
-| `deepgemm_m4096_n576_k7168` | 19.12us | 18.84us | 1.015 |
-| `deepgemm_m4096_n24576_k1536` | 116.18us | 115.68us | 1.004 |
-| `deepgemm_m4096_n32768_k512` | 75.54us | 71.28us | 1.060 |
-| `deepgemm_m4096_n7168_k16384` | 320.22us | 329.80us | 0.971 |
-| `deepgemm_m4096_n4096_k7168` | 83.19us | 82.69us | 1.006 |
-| `deepgemm_m4096_n7168_k2048` | 44.04us | 43.59us | 1.010 |
-| `stress_m8192_n7168_k4096` | 159.30us | 159.99us | 0.996 |
+- `fp16_bf16_gemm` still reports `BASELINE_ERROR: triton: No module named 'tirx_kernels.gemm._triton_matmul'`; the recorded baseline is `torch-cublas`.
+- `nvfp4_gemm` FlashInfer baseline uses `backend="cutlass"`, `use_nvfp4=True`, and `flashinfer.autotune(True)`. `backend="auto"` was not recorded because it fails in this environment with mixed CUDA runtime libraries (`libcudart.so.12` and `libcudart.so.13`).

-### `nvfp4_gemm` (baseline=`flashinfer`)
+### `fp16_bf16_gemm` (baseline=`torch-cublas`)
+| config | torch-cublas | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `fp16_1024x1024x1024` | 5.47us | 16.46us | 0.332 | 3 |
+| `fp16_2048x2048x2048` | 18.57us | 27.22us | 0.682 | 3 |
+| `fp16_4096x4096x4096` | 156.08us | 95.04us | 1.642 | 1 |
+| `fp16_8192x8192x8192` | 894.20us | 855.20us | 1.046 | 3 |
+| `fp16_16384x16384x16384` | 6451.67us | 6740.25us | 0.957 | 3 |
+| `bf16_1024x1024x1024` | 5.48us | 16.45us | 0.333 | 3 |
+| `bf16_2048x2048x2048` | 18.36us | 27.08us | 0.678 | 3 |
+| `bf16_4096x4096x4096` | 92.75us | 91.47us | 1.014 | 1 |
+| `bf16_8192x8192x8192` | 871.44us | 780.62us | 1.116 | 1 |
+| `bf16_16384x16384x16384` | 6043.31us | 6270.93us | 0.964 | 1 |

-| config | flashinfer | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `1024x1024x1024` | 5.13us | 6.59us | 0.778 |
-| `2048x2048x2048` | 8.39us | 8.84us | 0.950 |
-| `4096x4096x4096` | 32.50us | 30.56us | 1.064 |
-| `8192x8192x8192` | 199.24us | 186.39us | 1.069 |
-| `16384x16384x16384` | 2128.05us | 1511.81us | 1.408 |
+### `fp8_blockwise_gemm` (baseline=`deepgemm`)

-### `flash_attention4` (baseline=`flashattn_sm100`)
+| config | deepgemm | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `smoke_1024x1024x1024` | 6.84us | 6.36us | 1.077 | 1 |
+| `deepgemm_m4096_n2112_k7168` | 49.78us | 48.67us | 1.023 | 1 |
+| `deepgemm_m4096_n576_k7168` | 19.44us | 18.87us | 1.030 | 1 |
+| `deepgemm_m4096_n24576_k1536` | 117.35us | 115.60us | 1.015 | 1 |
+| `deepgemm_m4096_n32768_k512` | 75.29us | 71.92us | 1.047 | 1 |
+| `deepgemm_m4096_n7168_k16384` | 328.45us | 314.40us | 1.045 | 1 |
+| `deepgemm_m4096_n4096_k7168` | 83.71us | 83.71us | 1.000 | 1 |
+| `deepgemm_m4096_n7168_k2048` | 44.64us | 43.97us | 1.015 | 1 |
+| `stress_m8192_n7168_k4096` | 161.47us | 161.74us | 0.998 | 1 |
+### `nvfp4_gemm` (baseline=`flashinfer`)

-| config | flashattn_sm100 | tir | baseline/tirx |
-|---|---:|---:|---:|
-| `s1024_h32kv4` | 20.34us | 20.80us | 0.978 |
-| `s1024_h32kv4_causal` | 19.85us | 19.66us | 1.009 |
-| `s1024_h32kv8` | 20.50us | 20.91us | 0.980 |
-| `s1024_h32kv8_causal` | 19.85us | 19.75us | 1.005 |
-| `s1024_h32kv16` | 20.51us | 21.05us | 0.974 |
-| `s1024_h32kv16_causal` | 20.24us | 20.68us | 0.979 |
-| `s1024_h32kv32` | 20.75us | 21.18us | 0.980 |
-| `s1024_h32kv32_causal` | 21.07us | 22.24us | 0.947 |
-| `s2048_h32kv4` | 59.47us | 60.85us | 0.977 |
-| `s2048_h32kv4_causal` | 39.40us | 37.51us | 1.050 |
-| `s2048_h32kv8` | 60.23us | 61.84us | 0.974 |
-| `s2048_h32kv8_causal` | 39.49us | 37.76us | 1.046 |
-| `s2048_h32kv16` | 60.60us | 62.83us | 0.965 |
-| `s2048_h32kv16_causal` | 39.94us | 38.57us | 1.036 |
-| `s2048_h32kv32` | 61.59us | 63.62us | 0.968 |
-| `s2048_h32kv32_causal` | 40.29us | 42.38us | 0.951 |
-| `s4096_h32kv4` | 203.59us | 204.89us | 0.994 |
-| `s4096_h32kv4_causal` | 114.98us | 111.69us | 1.029 |
-| `s4096_h32kv8` | 204.46us | 207.67us | 0.985 |
-| `s4096_h32kv8_causal` | 116.24us | 112.45us | 1.034 |
-| `s4096_h32kv16` | 208.31us | 211.63us | 0.984 |
-| `s4096_h32kv16_causal` | 117.59us | 113.66us | 1.035 |
-| `s4096_h32kv32` | 211.75us | 216.02us | 0.980 |
-| `s4096_h32kv32_causal` | 118.98us | 122.09us | 0.975 |
-| `s8192_h32kv4` | 816.39us | 818.33us | 0.998 |
-| `s8192_h32kv4_causal` | 429.56us | 420.64us | 1.021 |
-| `s8192_h32kv8` | 795.55us | 852.89us | 0.933 |
-| `s8192_h32kv8_causal` | 411.97us | 440.47us | 0.935 |
-| `s8192_h32kv16` | 779.83us | 841.29us | 0.927 |
-| `s8192_h32kv16_causal` | 412.70us | 399.01us | 1.034 |
-| `s8192_h32kv32` | 784.06us | 821.54us | 0.954 |
-| `s8192_h32kv32_causal` | 459.55us | 420.57us | 1.093 |
+| config | flashinfer | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `1024x1024x1024` | 5.19us | 6.74us | 0.770 | 3 |
+| `2048x2048x2048` | 8.51us | 8.87us | 0.960 | 3 |
+| `4096x4096x4096` | 30.96us | 30.22us | 1.025 | 1 |
+| `8192x8192x8192` | 176.76us | 187.81us | 0.941 | 3 |
+| `16384x16384x16384` | 1673.05us | 1546.78us | 1.082 | 1 |

-### `deepgemm_sm100_fp8_mqa_logits` (baseline=`deepgemm`)
+### `flash_attention4` (baseline=`flashattn_sm100`)
+| config | flashattn_sm100 | tir | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `s1024_h32kv4` | 20.26us | 20.59us | 0.984 | 1 |
+| `s1024_h32kv4_causal` | 19.30us | 19.16us | 1.007 | 1 |
+| `s1024_h32kv8` | 20.06us | 20.74us | 0.967 | 1 |
+| `s1024_h32kv8_causal` | 19.43us | 19.35us | 1.004 | 1 |
+| `s1024_h32kv16` | 20.34us | 20.91us | 0.973 | 1 |
+| `s1024_h32kv16_causal` | 19.82us | 20.16us | 0.983 | 1 |
+| `s1024_h32kv32` | 20.59us | 21.26us | 0.968 | 1 |
+| `s1024_h32kv32_causal` | 20.50us | 22.11us | 0.927 | 3 |
+| `s2048_h32kv4` | 59.33us | 60.65us | 0.978 | 1 |
+| `s2048_h32kv4_causal` | 38.53us | 36.74us | 1.049 | 1 |
+| `s2048_h32kv8` | 59.55us | 60.89us | 0.978 | 1 |
+| `s2048_h32kv8_causal` | 38.74us | 37.10us | 1.044 | 1 |
+| `s2048_h32kv16` | 60.52us | 62.69us | 0.966 | 1 |
+| `s2048_h32kv16_causal` | 39.33us | 37.86us | 1.039 | 1 |
+| `s2048_h32kv32` | 61.18us | 63.07us | 0.970 | 1 |
+| `s2048_h32kv32_causal` | 40.00us | 41.94us | 0.954 | 1 |
+| `s4096_h32kv4` | 203.30us | 203.98us | 0.997 | 1 |
+| `s4096_h32kv4_causal` | 114.28us | 110.97us | 1.030 | 1 |
+| `s4096_h32kv8` | 204.64us | 212.61us | 0.963 | 1 |
+| `s4096_h32kv8_causal` | 115.19us | 111.56us | 1.032 | 1 |
+| `s4096_h32kv16` | 208.55us | 215.10us | 0.970 | 1 |
+| `s4096_h32kv16_causal` | 116.25us | 113.04us | 1.028 | 1 |
+| `s4096_h32kv32` | 213.89us | 217.60us | 0.983 | 1 |
+| `s4096_h32kv32_causal` | 118.53us | 123.54us | 0.959 | 1 |
+| `s8192_h32kv4` | 850.39us | 837.22us | 1.016 | 1 |
+| `s8192_h32kv4_causal` | 462.08us | 445.77us | 1.037 | 1 |
+| `s8192_h32kv8` | 863.43us | 850.11us | 1.016 | 1 |
+| `s8192_h32kv8_causal` | 427.58us | 399.58us | 1.070 | 1 |
+| `s8192_h32kv16` | 859.06us | 763.47us | 1.125 | 1 |
+| `s8192_h32kv16_causal` | 415.83us | 405.42us | 1.026 | 1 |
+| `s8192_h32kv32` | 833.86us | 873.28us | 0.955 | 1 |
+| `s8192_h32kv32_causal` | 441.40us | 455.03us | 0.970 | 1 |

-| config | deepgemm | tirx | baseline/tirx |
-|---|---:|---:|---:|
-| `s2048_skv4096_h64_d128_f32_dense_cp` | 43.80us | 44.49us | 0.984 |
-| `s2048_skv4096_h64_d128_f32_dense_nocp` | 58.50us | 58.59us | 0.999 |
-| `s2048_skv8192_h64_d128_f32_dense_cp` | 77.25us | 78.07us | 0.990 |
-| `s2048_skv8192_h64_d128_f32_dense_nocp` | 118.40us | 118.97us | 0.995 |
-| `s4096_skv4096_h64_d128_f32_dense_cp` | 78.02us | 77.94us | 1.001 |
-| `s4096_skv4096_h64_d128_f32_dense_nocp` | 77.89us | 78.37us | 0.994 |
-| `s4096_skv8192_h64_d128_f32_dense_cp` | 136.98us | 136.12us | 1.006 |
-| `s4096_skv8192_h64_d128_f32_dense_nocp` | 196.36us | 202.57us | 0.969 |
-| `s2048_skv4096_h64_d128_f32_compressed_cp` | 46.60us | 44.88us | 1.038 |
-| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 61.46us | 59.54us | 1.032 |
-| `s2048_skv8192_h64_d128_f32_compressed_cp` | 81.83us | 78.99us | 1.036 |
-| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 125.40us | 120.15us | 1.044 |
-| `s4096_skv4096_h64_d128_f32_compressed_cp` | 83.89us | 78.42us | 1.070 |
-| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 83.94us | 78.89us | 1.064 |
-| `s4096_skv8192_h64_d128_f32_compressed_cp` | 147.25us | 137.97us | 1.067 |
-| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 209.79us | 196.89us | 1.066 |
-| `s2048_skv4096_h64_d128_bf16_dense_cp` | 44.73us | 44.81us | 0.998 |
-| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 58.90us | 59.29us | 0.993 |
-| `s2048_skv8192_h64_d128_bf16_dense_cp` | 79.48us | 79.03us | 1.006 |
-| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 121.27us | 121.16us | 1.001 |
-| `s4096_skv4096_h64_d128_bf16_dense_cp` | 78.87us | 78.84us | 1.000 |
-| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 79.02us | 78.66us | 1.005 |
-| `s4096_skv8192_h64_d128_bf16_dense_cp` | 139.18us | 138.40us | 1.006 |
-| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 199.50us | 197.53us | 1.010 |
-| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 46.91us | 46.09us | 1.018 |
-| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 61.15us | 60.29us | 1.014 |
-| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 82.17us | 80.09us | 1.026 |
-| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 126.02us | 123.97us | 1.017 |
-| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 84.10us | 82.16us | 1.024 |
-| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 83.94us | 82.05us | 1.023 |
-| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 147.98us | 144.28us | 1.026 |
-| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 209.74us | 204.18us | 1.027 |
+### `deepgemm_sm100_fp8_mqa_logits` (baseline=`deepgemm`)

-### `deepgemm_sm100_fp4_mqa_logits` (baseline=`deepgemm`)
+| config | deepgemm | tirx | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `s2048_skv4096_h64_d128_f32_dense_cp` | 44.00us | 44.49us | 0.989 | 1 |
+| `s2048_skv4096_h64_d128_f32_dense_nocp` | 57.69us | 58.43us | 0.987 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_cp` | 77.33us | 77.38us | 0.999 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_nocp` | 117.73us | 118.44us | 0.994 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_cp` | 75.52us | 75.77us | 0.997 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_nocp` | 75.57us | 76.16us | 0.992 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_cp` | 133.81us | 133.16us | 1.005 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_nocp` | 192.41us | 201.40us | 0.955 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_cp` | 46.44us | 44.28us | 1.049 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 60.38us | 59.15us | 1.021 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_cp` | 80.81us | 78.34us | 1.032 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 124.10us | 119.19us | 1.041 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_cp` | 82.16us | 77.14us | 1.065 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 82.26us | 76.78us | 1.071 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_cp` | 145.29us | 134.59us | 1.080 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 206.64us | 193.31us | 1.069 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_cp` | 44.26us | 44.85us | 0.987 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 58.90us | 59.06us | 0.997 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_cp` | 78.41us | 78.09us | 1.004 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 120.82us | 119.50us | 1.011 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_cp` | 77.42us | 76.89us | 1.007 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 77.03us | 77.42us | 0.995 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_cp` | 136.75us | 135.73us | 1.008 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 196.20us | 193.79us | 1.012 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 46.34us | 45.61us | 1.016 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 60.94us | 59.65us | 1.022 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 80.72us | 79.53us | 1.015 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 124.13us | 121.22us | 1.024 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 82.07us | 80.63us | 1.018 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 82.30us | 80.60us | 1.021 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 145.44us | 141.84us | 1.025 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 206.94us | 202.45us | 1.022 | 1 |
+### `deepgemm_sm100_fp4_mqa_logits` (baseline=`deepgemm`)

-| config | deepgemm | tirx | baseline/tirx |
-|---|---:|---:|---:|
-| `s2048_skv4096_h64_d128_f32_dense_cp` | 41.25us | 41.52us | 0.994 |
-| `s2048_skv4096_h64_d128_f32_dense_nocp` | 53.67us | 54.10us | 0.992 |
-| `s2048_skv8192_h64_d128_f32_dense_cp` | 71.99us | 72.44us | 0.994 |
-| `s2048_skv8192_h64_d128_f32_dense_nocp` | 111.41us | 111.13us | 1.003 |
-| `s4096_skv4096_h64_d128_f32_dense_cp` | 73.25us | 73.47us | 0.997 |
-| `s4096_skv4096_h64_d128_f32_dense_nocp` | 73.21us | 73.52us | 0.996 |
-| `s4096_skv8192_h64_d128_f32_dense_cp` | 130.21us | 129.54us | 1.005 |
-| `s4096_skv8192_h64_d128_f32_dense_nocp` | 186.20us | 184.96us | 1.007 |
-| `s2048_skv4096_h64_d128_f32_compressed_cp` | 45.14us | 42.37us | 1.066 |
-| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 59.05us | 54.82us | 1.077 |
-| `s2048_skv8192_h64_d128_f32_compressed_cp` | 79.09us | 73.69us | 1.073 |
-| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 122.95us | 113.08us | 1.087 |
-| `s4096_skv4096_h64_d128_f32_compressed_cp` | 80.41us | 73.88us | 1.088 |
-| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 80.32us | 73.81us | 1.088 |
-| `s4096_skv8192_h64_d128_f32_compressed_cp` | 144.14us | 131.25us | 1.098 |
-| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 206.26us | 187.68us | 1.099 |
-| `s2048_skv4096_h64_d128_bf16_dense_cp` | 42.24us | 42.51us | 0.994 |
-| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 55.24us | 55.44us | 0.996 |
-| `s2048_skv8192_h64_d128_bf16_dense_cp` | 74.32us | 74.16us | 1.002 |
-| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 114.28us | 113.84us | 1.004 |
-| `s4096_skv4096_h64_d128_bf16_dense_cp` | 74.91us | 74.90us | 1.000 |
-| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 74.90us | 74.84us | 1.001 |
-| `s4096_skv8192_h64_d128_bf16_dense_cp` | 133.11us | 132.55us | 1.004 |
-| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 190.79us | 189.49us | 1.007 |
-| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 44.99us | 45.73us | 0.984 |
-| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 59.06us | 60.01us | 0.984 |
-| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 79.27us | 80.35us | 0.987 |
-| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 122.57us | 123.86us | 0.990 |
-| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 79.93us | 81.00us | 0.987 |
-| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 79.78us | 80.97us | 0.985 |
-| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 142.89us | 144.28us | 0.990 |
-| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 204.95us | 206.88us | 0.991 |
+| config | deepgemm | tirx | baseline/tirx | runs |
+|---|---:|---:|---:|---:|
+| `s2048_skv4096_h64_d128_f32_dense_cp` | 39.78us | 40.41us | 0.984 | 1 |
+| `s2048_skv4096_h64_d128_f32_dense_nocp` | 51.63us | 52.15us | 0.990 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_cp` | 68.57us | 69.27us | 0.990 | 1 |
+| `s2048_skv8192_h64_d128_f32_dense_nocp` | 105.31us | 105.61us | 0.997 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_cp` | 69.83us | 69.95us | 0.998 | 1 |
+| `s4096_skv4096_h64_d128_f32_dense_nocp` | 69.74us | 69.72us | 1.000 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_cp` | 123.12us | 122.59us | 1.004 | 1 |
+| `s4096_skv8192_h64_d128_f32_dense_nocp` | 175.80us | 174.99us | 1.005 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_cp` | 44.43us | 41.09us | 1.081 | 1 |
+| `s2048_skv4096_h64_d128_f32_compressed_nocp` | 58.08us | 53.10us | 1.094 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_cp` | 77.15us | 70.58us | 1.093 | 1 |
+| `s2048_skv8192_h64_d128_f32_compressed_nocp` | 119.08us | 107.61us | 1.107 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_cp` | 77.49us | 70.97us | 1.092 | 1 |
+| `s4096_skv4096_h64_d128_f32_compressed_nocp` | 77.47us | 70.86us | 1.093 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_cp` | 138.42us | 124.58us | 1.111 | 1 |
+| `s4096_skv8192_h64_d128_f32_compressed_nocp` | 198.99us | 177.86us | 1.119 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_cp` | 42.04us | 41.30us | 1.018 | 1 |
+| `s2048_skv4096_h64_d128_bf16_dense_nocp` | 54.48us | 53.58us | 1.017 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_cp` | 72.25us | 71.16us | 1.015 | 1 |
+| `s2048_skv8192_h64_d128_bf16_dense_nocp` | 110.92us | 108.73us | 1.020 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_cp` | 73.97us | 71.40us | 1.036 | 1 |
+| `s4096_skv4096_h64_d128_bf16_dense_nocp` | 73.97us | 71.40us | 1.036 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_cp` | 129.30us | 126.49us | 1.022 | 1 |
+| `s4096_skv8192_h64_d128_bf16_dense_nocp` | 185.61us | 180.39us | 1.029 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_cp` | 42.89us | 41.88us | 1.024 | 1 |
+| `s2048_skv4096_h64_d128_bf16_compressed_nocp` | 55.43us | 54.18us | 1.023 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_cp` | 73.95us | 72.32us | 1.023 | 1 |
+| `s2048_skv8192_h64_d128_bf16_compressed_nocp` | 113.15us | 110.48us | 1.024 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_cp` | 75.31us | 72.50us | 1.039 | 1 |
+| `s4096_skv4096_h64_d128_bf16_compressed_nocp` | 75.28us | 72.66us | 1.036 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_cp` | 132.59us | 128.38us | 1.033 | 1 |
+| `s4096_skv8192_h64_d128_bf16_compressed_nocp` | 189.25us | 183.46us | 1.032 | 1 |
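
The rerun rule described in the new baseline text (rows below `0.95x` are rerun twice and reported as a median over three runs) could be sketched in Python roughly as follows. This is an illustration only, not code from `tirx_kernels.bench`: the function names are hypothetical, and taking the median separately for the baseline and TIRX times (rather than the median of the ratios) is an assumption, since the commit does not say which is used.

```python
from statistics import median

# Threshold from the commit text: rows below 0.95x are rerun twice.
RERUN_THRESHOLD = 0.95


def ratio(baseline_us, tirx_us):
    """baseline/tirx; a value above 1 means TIRX is faster."""
    return baseline_us / tirx_us


def needs_rerun(baseline_us, tirx_us, threshold=RERUN_THRESHOLD):
    """True when the full-run ratio falls below the rerun threshold."""
    return ratio(baseline_us, tirx_us) < threshold


def summarize(full_run, reruns=()):
    """Combine one full-run sample with optional reruns.

    Each sample is a (baseline_us, tirx_us) pair. Assumption: the median
    is taken per column, then the ratio is computed from the two medians.
    """
    samples = [full_run, *reruns]
    baseline_us = median(b for b, _ in samples)
    tirx_us = median(t for _, t in samples)
    return {
        "baseline_us": baseline_us,
        "tirx_us": tirx_us,
        "ratio": round(ratio(baseline_us, tirx_us), 3),
        "runs": len(samples),
    }


if __name__ == "__main__":
    # Times taken from the new fp16_4096x4096x4096 row (runs=1).
    print(summarize((156.08, 95.04)))
    # Hypothetical rerun samples, not measurements from this commit.
    print(summarize((10.0, 12.0), [(11.0, 12.5), (9.0, 11.5)]))
```

With `runs=1` the summary simply echoes the full-run measurement, which matches how the unflagged rows in the tables are reported.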
