gnguralnick opened a new issue, #18885:
URL: https://github.com/apache/tvm/issues/18885

   ## Bug Description
   
   After PR #18865 (`[REFACTOR][TIR] Introduce AllocBuffer and phase out 
Allocate+DeclBuffer`), compiling models that use `cumsum` (e.g., via 
`gpu_2d_continuous_cumsum`) fails with:
   
   ```
   tvm.error.InternalError: Check failed: undefined.size() == 0 (1 vs. 0) :
   In PrimFunc gpu_2d_continuous_cumsum1 variables [ceil_log2] are used,
   but are not passed in as API arguments
   ```
   
   The error occurs in `make_packed_api.cc` during the final lowering stage.
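
As a rough illustration of what the failed check means, here is a toy model in plain Python. It is not TVM's implementation: `ToyFunc` and `undefined_vars` are hypothetical names, and the sketch assumes only that the check reports every variable that is used in the body but is neither a parameter nor a definition the checker can see.

```python
from dataclasses import dataclass


@dataclass
class ToyFunc:
    """Hypothetical stand-in for a PrimFunc (not a TVM class)."""
    name: str
    params: set   # variables passed in as API arguments
    defined: set  # variables whose definition the checker can see
    used: set     # variables referenced anywhere in the body


def undefined_vars(func: ToyFunc) -> list:
    """Return variables that are used but neither parameters nor visible definitions."""
    return sorted(func.used - func.params - func.defined)


# Scenario from this report: ceil_log2 is used, but its binding is not
# visible to the checker, so it is reported as undefined.
func = ToyFunc(
    name="gpu_2d_continuous_cumsum1",
    params={"var_a", "var_out"},
    defined={"m", "n"},
    used={"var_a", "var_out", "m", "n", "ceil_log2"},
)
missing = undefined_vars(func)
print(f"In PrimFunc {func.name} variables {missing} are used, "
      "but are not passed in as API arguments")
```

In this model the list is non-empty, which corresponds to the `undefined.size() == 0 (1 vs. 0)` failure above.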
   
   ## Root Cause
   
   In `python/tvm/relax/backend/gpu_generic/cumsum.py`, the `cumsum` PrimFunc 
(line ~156) has:
   
   ```python
   @T.prim_func(private=True)
   def cumsum(var_a: T.handle, var_out: T.handle):
       T.func_attr({"tir.is_scheduled": True})
       m, n = T.int64(), T.int64()
       A = T.match_buffer(var_a, [m, n], dtype=in_dtype)
       Out = T.match_buffer(var_out, [m, n], dtype=out_dtype)
       Tmp = T.sblock_alloc_buffer([m, n], dtype=out_dtype)      # changed by #18865
       ceil_log2 = T.Cast("int64", T.ceil(T.log2(T.Cast("float32", n))))
       total_rounds = ceil_log2 // LOG_BLOCK_N
       ...
   ```
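
For reference, the two derived values can be reproduced outside TVM with ordinary Python arithmetic. This is only a sketch: `LOG_BLOCK_N = 8` is a placeholder chosen for illustration (the real constant is defined elsewhere in `cumsum.py`), `rounds_for` is a hypothetical helper, and Python floats are 64-bit whereas the PrimFunc casts `n` to `float32`.

```python
import math

# Placeholder value for illustration only; not taken from cumsum.py.
LOG_BLOCK_N = 8


def rounds_for(n: int, log_block_n: int = LOG_BLOCK_N):
    """Mirror the two expressions from the PrimFunc in plain Python."""
    # ceil_log2 = T.Cast("int64", T.ceil(T.log2(T.Cast("float32", n))))
    ceil_log2 = int(math.ceil(math.log2(float(n))))
    # total_rounds = ceil_log2 // LOG_BLOCK_N
    total_rounds = ceil_log2 // log_block_n
    return ceil_log2, total_rounds


print(rounds_for(1000))   # log2(1000) is about 9.97, so ceil_log2 is 10
print(rounds_for(70000))  # log2(70000) is about 16.09, so ceil_log2 is 17
```

Both values depend only on `n`, so they are symbolic expressions in the function rather than inputs supplied by the caller.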
   
   PR #18865 changed `T.alloc_buffer` to `T.sblock_alloc_buffer` for `Tmp`.
   Previously, `T.alloc_buffer` created an `Allocate(var, ..., DeclBuffer(buf,
   body))` node that nested all subsequent statements inside its body, so the
   binding of `ceil_log2` was scoped within the allocation. `T.sblock_alloc_buffer`
   instead adds the buffer to `SBlock.alloc_buffers` and creates no nesting body,
   so `ceil_log2` becomes a top-level `Bind` statement. `make_packed_api` then
   fails because `ceil_log2` is used in the function body but is not one of the
   function's parameters.
   
   ## Steps to Reproduce
   
   Compile any model that triggers `gpu_2d_continuous_cumsum` legalization on a 
GPU target (e.g., Metal). For example, compiling a Gemma 3 model with mlc-llm:
   
   ```bash
   python -m mlc_llm compile <model-path> --device metal
   ```
   
   ## Environment
   
   - TVM commit: d5fd1c7c1 (post-#18865)
   - Platform: macOS (Apple Silicon), Metal target
   - Discovered via mlc-llm model compilation
   
   ## Suggested Fix
   
   Both `ceil_log2` and `total_rounds` need to be handled differently now that
   `Tmp` uses `sblock_alloc_buffer`. One option is to use `T.alloc_buffer`
   (statement-level `AllocBuffer`) for `Tmp` instead, which would restore the
   body-nesting behavior. Alternatively, `ceil_log2` could be wrapped in a
   proper scope.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

