tqchen commented on code in PR #18871:
URL: https://github.com/apache/tvm/pull/18871#discussion_r2890579183


##########
web/src/webgpu.ts:
##########
@@ -402,10 +401,21 @@ export class WebGPUContext {
   // internal data
   private bufferTable: Array<GPUBuffer | undefined> = [undefined];
   private bufferTableFreeId: Array<number> = [];
-  private podArgStagingBuffers: Array<GPUBuffer> = [];
   private canvasRenderManager?: CanvasRenderManager = undefined;
-  // number of pod arg staging buffers
-  private maxNumPodArgsStagingBuffers = 2;
+  // Pool of MAP_READ staging buffers to avoid per-copy create/destroy overhead
+  private readStagingBufferPool: Array<{ buffer: GPUBuffer; size: number }> = [];
+  private maxReadStagingBuffers = 4;
+  // Pending mapAsync promise from the last GPU→CPU copy (faster than onSubmittedWorkDone)
+  private pendingRead: Promise<void> | null = null;

Review Comment:
   clarity: `pendingRead` => `pendingGPUToCPUCopy`



##########
web/src/runtime.ts:
##########
@@ -954,6 +956,9 @@ export class Instance implements Disposable {
   dispose(): void {
    this.deviceLostIsError = false;  // prevent dispose from triggering device.lost error
     // order matters
+    // dispose cached shape tuples before ctx
+    this.shapeTupleCache.forEach((obj) => obj.dispose());

Review Comment:
   consider introducing `cache_state.ts`, which includes a generic LRUCache with the following methods:
   - get(key, lambda constructor): allows LRU lookup by key; on a miss, use the lambda to construct the value, evict if the LRU limit is exceeded, and return the result
   - invalidate(): clear all existing cache items
   - needEviction(key) => poll and return whether eviction is needed, since we may need to have followup items like flush
   
   
   The CacheState class can contain members shape_cache and uniform_cache:
   - compute_shape_key()
   - compute_uniform_key()
   
   At each variable, document:
   
   - Why we need the cache
   - The rule of invalidation: e.g. the shape cache does not need invalidation, while the uniform cache needs invalidation during any de-allocation
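   For concreteness, a sketch of what such a generic cache could look like (the names, the `onEvict` hook, and the Map-based recency tracking are my assumptions, not part of the PR):

```typescript
// Sketch of a generic LRU cache for a hypothetical cache_state.ts.
// A Map preserves insertion order, so re-inserting an entry on access
// keeps the least-recently-used entry at the front for eviction.
export class LRUCache<K, V> {
  private entries = new Map<K, V>();

  constructor(private maxSize: number,
              private onEvict?: (key: K, value: V) => void) {}

  /** Look up by key; on a miss, build via the lambda and evict if over the limit. */
  get(key: K, construct: () => V): V {
    let value = this.entries.get(key);
    if (value !== undefined) {
      // Refresh recency by moving the entry to the back.
      this.entries.delete(key);
      this.entries.set(key, value);
      return value;
    }
    value = construct();
    this.entries.set(key, value);
    while (this.entries.size > this.maxSize) {
      const oldest = this.entries.keys().next().value as K;
      const evicted = this.entries.get(oldest)!;
      this.entries.delete(oldest);
      // Hook for followup actions, e.g. dispose() on the evicted object.
      this.onEvict?.(oldest, evicted);
    }
    return value;
  }

  /** Poll whether inserting `key` would trigger an eviction. */
  needEviction(key: K): boolean {
    return !this.entries.has(key) && this.entries.size >= this.maxSize;
  }

  /** Clear all existing cache items, invoking onEvict for cleanup. */
  invalidate(): void {
    for (const [k, v] of this.entries) {
      this.onEvict?.(k, v);
    }
    this.entries.clear();
  }
}
```

   A CacheState wrapper could then hold `shape_cache: LRUCache<string, ShapeTuple>` and `uniform_cache`, each constructed with its own limit and eviction hook, with the per-field invalidation rule documented next to each member.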
   



##########
web/src/webgpu.ts:
##########
@@ -695,9 +704,28 @@ export class WebGPUContext {
           });
         }
 
-        // push pod buffer
+        // Each dispatch in a batch needs its own uniform buffer since the
+        // encoder defers submission. We maintain a reusable pool indexed by
+        // dispatch position — buffers grow as needed but are never destroyed.
         const sizeOfI32 = 4;
-        const podArgBuffer = this.getPodArgsBuffer((podArgIndices.length + 1) * sizeOfI32);
+        const bufBytes = (podArgIndices.length + 1) * sizeOfI32;
+        const dispatchIdx = this.pendingDispatchCount++;
+        let podArgBuffer: GPUBuffer;
+        if (dispatchIdx < this.uniformBufferPool.length &&

Review Comment:
   seems we are missing the case when dispatchIdx runs past `this.uniformBufferPool.length`, at which point we should flush? This part of the code should be a subfunction, `getUniformFromPool`. We need to have a comment on the state-transition impact if we flush (e.g. if there was stale state that was created before this point)



##########
web/src/webgpu.ts:
##########
@@ -880,17 +916,51 @@ export class WebGPUContext {
     );
   }
 
+  /**
+   * Get a MAP_READ staging buffer from the pool, or create one if none fits.
+   * Uses first-fit-by-size: returns the first pooled buffer >= nbytes.
+   */
+  private getReadStagingBuffer(nbytes: number): GPUBuffer {
+    for (let i = 0; i < this.readStagingBufferPool.length; i++) {
+      if (this.readStagingBufferPool[i].size >= nbytes) {
+        const entry = this.readStagingBufferPool.splice(i, 1)[0];
+        return entry.buffer;
+      }
+    }
+    return tryCreateBuffer(this.device, {
+      size: nbytes,
+      usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
+    });
+  }
+
+  /**
+   * Return a MAP_READ staging buffer to the pool for reuse.
+   * Evicts the smallest buffer if the pool is full.
+   */
+  private returnReadStagingBuffer(buf: GPUBuffer): void {

Review Comment:
   `recycleReadStagingBuffer`



##########
web/src/webgpu.ts:
##########
@@ -426,26 +436,54 @@ export class WebGPUContext {
     this.device = device;
   }
 
+  /**
+   * Flush all pending compute passes by finishing and submitting the
+   * accumulated command encoder. Call before any operation that reads
+   * back GPU data or waits on the queue.
+   */
+  flushCommands(): void {
+    if (this.pendingEncoder) {
+      this.device.queue.submit([this.pendingEncoder.finish()]);
+      this.pendingEncoder = null;
+      this.pendingDispatchCount = 0;
+    }
+  }
+
   /**
    * Dispose context.
    */
   dispose() {
+    this.flushCommands();
     this.canvasRenderManager?.dispose();
     this.bufferTableFreeId = [];
     while (this.bufferTable.length != 0) {
       this.bufferTable.pop()?.destroy();
     }
-    while (this.podArgStagingBuffers.length != 0) {
-      this.podArgStagingBuffers.pop()?.destroy();
+    for (const buf of this.uniformBufferPool) {
+      buf.destroy();
+    }
+    this.uniformBufferPool.length = 0;
+    this.uniformBufferPoolSizes.length = 0;
+    while (this.readStagingBufferPool.length != 0) {
+      this.readStagingBufferPool.pop()?.buffer.destroy();
     }
     this.device.destroy();
   }
 
+
   /**
    * Wait for all pending GPU tasks to complete
    */
   async sync(): Promise<void> {
-    await this.device.queue.onSubmittedWorkDone();
+    // Flush any batched compute passes before waiting on the queue.
+    this.flushCommands();
+    if (this.pendingRead) {

Review Comment:
   semantics clarification: if there is a pendingGPUToCPUCopy and we have extra operations in the queue after it, will we sync on those operations too? I know that from an effects point of view it does not matter, but from a semantics standpoint it may be helpful to clarify, e.g. one approach would be to only use pendingGPUToCPUCopy iff the read is the last op. In that case, we may need some tracking



##########
web/src/webgpu.ts:
##########
@@ -880,17 +916,51 @@ export class WebGPUContext {
     );
   }
 
+  /**
+   * Get a MAP_READ staging buffer from the pool, or create one if none fits.
+   * Uses first-fit-by-size: returns the first pooled buffer >= nbytes.
+   */
+  private getReadStagingBuffer(nbytes: number): GPUBuffer {
+    for (let i = 0; i < this.readStagingBufferPool.length; i++) {

Review Comment:
   `getOrCreateReadStagingBuffer`



##########
web/src/webgpu.ts:
##########
@@ -426,26 +436,54 @@ export class WebGPUContext {
     this.device = device;
   }
 
+  /**
+   * Flush all pending compute passes by finishing and submitting the
+   * accumulated command encoder. Call before any operation that reads

Review Comment:
   list the operations:
   
   - read back GPU data
   - wait on queue
   - dealloc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

