tqchen commented on code in PR #18877:
URL: https://github.com/apache/tvm/pull/18877#discussion_r2893230145


##########
src/runtime/metal/metal_common.h:
##########
@@ -133,8 +247,42 @@ class Stream {
   const std::string& ErrorDescription() const { return error_description_; }
 
  private:
+  /*! \brief Get or create the pending command buffer (shared by compute and 
blit). */
+  id<MTLCommandBuffer> GetOrCreatePendingCommandBuffer() {
+    if (pending_command_buffer_ == nil) {
+      pending_command_buffer_ = [[queue_ commandBuffer] retain];
+      pending_command_buffer_.label = @"TVMBatched";
+      [pending_command_buffer_ addCompletedHandler:^(id<MTLCommandBuffer> 
buffer) {
+        if (buffer.status == MTLCommandBufferStatusError) {
+          TVM_FFI_ICHECK(buffer.error != nil);
+          std::string msg = buffer.error.localizedDescription.UTF8String;
+          if (!this->last_dispatched_kernel_.empty()) {
+            msg = "GPUError after kernel " + this->last_dispatched_kernel_ + 
": " + msg;
+          }
+          this->SetError(msg);
+        }
+      }];
+    }
+    return pending_command_buffer_;
+  }
+
+  /*! \brief End the active compute encoder without committing the command 
buffer. */
+  void PauseComputeEncoder() {

Review Comment:
   "Pause" is a bit confusing; how about `EndPendingComputeEncoder()`?



##########
src/runtime/metal/metal_common.h:
##########
@@ -103,13 +103,34 @@ class AutoReleasePoolWrapper {
 };
 
 /*!
- * \brief Structure for error handling in queues
+ * \brief Metal command stream with batched dispatch support.
+ *
+ * Compute dispatches are batched into a single command buffer via
+ * GetPendingComputeEncoder(). Blit operations (copies) are interleaved
+ * on the same command buffer via GetBlitEncoderOnPendingBuffer().
+ * The command buffer is committed when FlushCommandBuffer() is called.
+ *
+ * Must call FlushCommandBuffer() before:
+ * - GPU→CPU readback (need data in CPU memory)
+ * - Buffer deallocation (FreeDataSpace, setPurgeableState:Empty on
+ *   a buffer referenced by an uncommitted CB crashes Metal)
+ * - Stream sync (StreamSync / Synchronize)
  */
 class Stream {
  public:
   explicit Stream(id<MTLDevice> device) { queue_ = [device newCommandQueue]; }
-  ~Stream() { [queue_ release]; }
-  id<MTLCommandBuffer> GetCommandBuffer(std::string label = "", bool 
attach_error_callback = true) {
+  ~Stream() {
+    FlushCommandBuffer();
+    [queue_ release];
+  }

Review Comment:
   It would be useful to document this via a comment for context. This indeed places
an implicit requirement that the Stream has to be destructed during teardown.



##########
src/runtime/metal/metal_common.h:
##########
@@ -201,22 +349,67 @@ class MetalThreadEntry {
   Device device;
   /*! \brief The current stream */
   std::vector<TVMStreamHandle> stream;
-  /*! \brief The shared buffer used for copy. */
+  /*! \brief The shared buffer used for GPU→CPU readback. */
   std::vector<id<MTLBuffer>> temp_buffer_;
+  /*!
+   * \brief Pool of staging buffers for CPU→GPU copies that are inlined
+   * into the pending command buffer. Each inlined copy needs its own
+   * staging buffer because the GPU reads them asynchronously.
+   * Buffers are recycled after FlushCommandBuffer()/Synchronize().
+   */
+  struct StagingBufferPool {

Review Comment:
   Per C++ coding style, it is better to keep the pool and next index private: `pool_`,
`next_index_`.



##########
src/runtime/metal/metal_common.h:
##########
@@ -201,22 +349,67 @@ class MetalThreadEntry {
   Device device;
   /*! \brief The current stream */
   std::vector<TVMStreamHandle> stream;
-  /*! \brief The shared buffer used for copy. */
+  /*! \brief The shared buffer used for GPU→CPU readback. */
   std::vector<id<MTLBuffer>> temp_buffer_;
+  /*!
+   * \brief Pool of staging buffers for CPU→GPU copies that are inlined
+   * into the pending command buffer. Each inlined copy needs its own
+   * staging buffer because the GPU reads them asynchronously.
+   * Buffers are recycled after FlushCommandBuffer()/Synchronize().
+   */
+  struct StagingBufferPool {

Review Comment:
   It would be worthwhile to have a size query, so CopyToGPU can choose to sync if the size
exceeds a limit counter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to