mitiskuma opened a new pull request, #18877:
URL: https://github.com/apache/tvm/pull/18877

   Benchmark results (Metal, M4 Max, MLC-LLM serve, temperature=0):
   
   256 decode tokens:
   Qwen2.5-0.5B-Instruct-q4f16_1: 238 t/s -> 466 t/s (1.95x)
   Qwen2.5-1.5B-Instruct-q4f16_1: 177 t/s -> 239 t/s (1.35x)
   Qwen2.5-3B-Instruct-q4f16_1: 114 t/s -> 139 t/s (1.21x)
   Llama-3.1-8B-Instruct-q4f16_1: 76 t/s -> 89 t/s (1.18x)
   
   1024 decode tokens:
   Qwen2.5-0.5B-Instruct-q4f16_1: 239 t/s -> 398 t/s (1.67x)
   Qwen2.5-1.5B-Instruct-q4f16_1: 137 t/s -> 190 t/s (1.38x)
   Qwen2.5-3B-Instruct-q4f16_1: 92 t/s -> 115 t/s (1.25x)
   Llama-3.1-8B-Instruct-q4f16_1: 70 t/s -> 80 t/s (1.14x)
   
   Baseline and optimized runs use the same MLC-LLM build and the same compiled
   models; only the TVM Metal runtime differs. The servers run sequentially
   (not in parallel) to avoid GPU contention, and each measurement is preceded
   by 2 warmup requests.
   
   The speedup is larger on smaller models because they are dispatch-bound:
   at 262 dispatches/token for the 0.5B model (vs 394 for 8B), each dispatch
   does very little compute, so per-dispatch submission overhead dominates.
   Larger models spend more time in actual compute, so the per-dispatch
   overhead is a smaller fraction of total time. At 1024 tokens the 0.5B
   speedup drops from 1.95x to 1.67x because KV cache growth increases
   per-token compute, shifting the bottleneck toward memory bandwidth.
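
   As a back-of-envelope check using only the numbers above (this attributes
   the entire per-token saving to dispatch overhead, which is a
   simplification, so the ~8 us figure is derived, not measured):

   ```cpp
   #include <cassert>
   #include <cstdio>

   // Rough per-dispatch overhead implied by the 0.5B result at 256 tokens:
   // 238 t/s -> 466 t/s with 262 dispatches per decode token.
   int main() {
     double baseline_ms = 1000.0 / 238.0;   // ~4.20 ms per decode token
     double optimized_ms = 1000.0 / 466.0;  // ~2.15 ms per decode token
     double saved_us_per_dispatch =
         (baseline_ms - optimized_ms) * 1000.0 / 262.0;
     std::printf("implied overhead: ~%.1f us per dispatch\n",
                 saved_us_per_dispatch);  // ~7.8 us
     assert(saved_us_per_dispatch > 7.0 && saved_us_per_dispatch < 9.0);
     return 0;
   }
   ```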
   
   What changed:
   
   1. Batched compute dispatch. Kernel dispatches are accumulated in a single
      MTLCommandBuffer via a shared MTLComputeCommandEncoder; previously,
      each dispatch created its own command buffer and committed it
      immediately. The pending encoder is ended and the command buffer
      flushed on GPU->CPU readback, buffer deallocation, or stream sync.
   
   2. Inline blit encoders for copies. CPU->GPU and GPU->GPU copies now use
      blit encoders on the same pending command buffer instead of creating a
      separate command buffer per copy. Metal guarantees sequential ordering
      of encoders within a command buffer, so no explicit sync is needed
      between compute and copy operations.
   
   3. Staging buffer pool for CPU->GPU copies. Each inlined CPU->GPU copy
      needs its own staging buffer, because the GPU reads the staged data
      asynchronously when the deferred command buffer executes. A per-device
      StagingBufferPool hands out shared-mode buffers and recycles them
      after flush/sync.
   
   4. Conditional sync in FreeDataSpace. Instead of always calling StreamSync,
      we check HasPendingWork() first. When the GPU->CPU readback path has
      already flushed and waited, FreeDataSpace can skip the redundant sync.
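
   A minimal sketch of the batching in item 1, with mock types standing in
   for the real MTLCommandBuffer/MTLComputeCommandEncoder objects (all class
   and method names here are illustrative, not the actual runtime API):

   ```cpp
   #include <cassert>
   #include <cstdio>
   #include <string>
   #include <vector>

   // Mock stand-in for a command buffer holding recorded dispatches.
   struct MockCommandBuffer {
     std::vector<std::string> encoded;  // kernels recorded, not yet committed
   };

   class MetalStream {
    public:
     // New behavior: record into one shared pending buffer, commit lazily.
     // (Old behavior created and committed a command buffer per dispatch.)
     void Dispatch(const std::string& kernel) {
       if (!pending_) pending_ = new MockCommandBuffer();
       pending_->encoded.push_back(kernel);
     }
     // Called on GPU->CPU readback, buffer deallocation, or StreamSync.
     int Flush() {
       if (!pending_) return 0;
       int n = static_cast<int>(pending_->encoded.size());
       delete pending_;  // one commit would cover all recorded dispatches
       pending_ = nullptr;
       return n;
     }
     bool HasPendingWork() const { return pending_ != nullptr; }

    private:
     MockCommandBuffer* pending_ = nullptr;
   };

   int main() {
     MetalStream stream;
     for (int i = 0; i < 8; ++i) stream.Dispatch("kernel_" + std::to_string(i));
     assert(stream.HasPendingWork());
     int batched = stream.Flush();  // one commit for all eight dispatches
     assert(batched == 8 && !stream.HasPendingWork());
     std::printf("one commit, %d dispatches\n", batched);
     return 0;
   }
   ```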
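
   Item 2 relies on Metal executing encoders in the order they are ended on
   a command buffer. A sketch of that interleaving (mock types again; the
   real code ends the compute encoder, runs a blit encoder on the same
   pending buffer, then reopens compute as needed):

   ```cpp
   #include <cassert>
   #include <string>
   #include <vector>

   // One command buffer holding a sequence of encoders. Metal runs encoders
   // in the order they were ended, so compute and blit work on the same
   // buffer needs no explicit fence between them.
   struct CommandBuffer {
     std::vector<std::string> encoders;  // e.g. "compute", "blit"
   };

   // Illustrative helpers, not the runtime's real functions.
   void EncodeCompute(CommandBuffer& cb) { cb.encoders.push_back("compute"); }
   void EncodeBlitCopy(CommandBuffer& cb) { cb.encoders.push_back("blit"); }

   int main() {
     CommandBuffer cb;
     EncodeCompute(cb);   // kernel writes buffer A
     EncodeBlitCopy(cb);  // copy A -> B observes the kernel's writes
     EncodeCompute(cb);   // next kernel observes the copy's result
     std::vector<std::string> expected = {"compute", "blit", "compute"};
     assert(cb.encoders == expected);
     return 0;
   }
   ```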
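
   Item 3's pool can be sketched as follows (simplified: real staging
   buffers would be shared-mode MTLBuffers, and size bucketing and buffer
   cleanup are elided; names are illustrative):

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Each deferred CPU->GPU copy gets its own buffer; once the command
   // buffer is flushed and waited on, every in-flight buffer can be
   // recycled at once.
   struct StagingBuffer {
     std::vector<unsigned char> data;
     explicit StagingBuffer(size_t n) : data(n) {}
   };

   class StagingBufferPool {
    public:
     StagingBuffer* Acquire(size_t nbytes) {
       // Reuse a free buffer that is large enough, else allocate a new one.
       for (size_t i = 0; i < free_.size(); ++i) {
         if (free_[i]->data.size() >= nbytes) {
           StagingBuffer* b = free_[i];
           free_.erase(free_.begin() + i);
           in_flight_.push_back(b);
           return b;
         }
       }
       StagingBuffer* b = new StagingBuffer(nbytes);
       ++allocations_;
       in_flight_.push_back(b);
       return b;
     }
     // Safe only after the pending command buffer has completed.
     void RecycleAll() {
       free_.insert(free_.end(), in_flight_.begin(), in_flight_.end());
       in_flight_.clear();
     }
     int allocations() const { return allocations_; }

    private:
     std::vector<StagingBuffer*> free_;
     std::vector<StagingBuffer*> in_flight_;
     int allocations_ = 0;
   };

   int main() {
     StagingBufferPool pool;
     pool.Acquire(1024);  // two copies before any flush: two distinct buffers
     pool.Acquire(1024);
     assert(pool.allocations() == 2);
     pool.RecycleAll();   // flush/sync completed; buffers return to the pool
     pool.Acquire(512);   // reuses an existing buffer, no new allocation
     assert(pool.allocations() == 2);
     return 0;
   }
   ```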
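
   And a sketch of item 4's fast path (mock stream; in the real runtime the
   check guards a Metal StreamSync before releasing device memory):

   ```cpp
   #include <cassert>

   // Mock stream tracking pending GPU work and how many syncs were issued.
   struct Stream {
     bool pending = false;
     int syncs = 0;
     void Dispatch() { pending = true; }
     bool HasPendingWork() const { return pending; }
     void StreamSync() { pending = false; ++syncs; }
   };

   // Free a device buffer: only sync if outstanding work might still touch
   // the buffer. (The buffer release itself is elided here.)
   void FreeDataSpace(Stream& s) {
     if (s.HasPendingWork()) s.StreamSync();
   }

   int main() {
     Stream s;
     s.Dispatch();
     s.StreamSync();    // e.g. a GPU->CPU readback already flushed and waited
     FreeDataSpace(s);  // no pending work: skips the redundant sync
     assert(s.syncs == 1);

     s.Dispatch();
     FreeDataSpace(s);  // pending work: must sync before releasing memory
     assert(s.syncs == 2);
     return 0;
   }
   ```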


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
