@jdavies-huawei Thanks for creating this document. This is great. I just went through the same exercise to understand InferBound, and my notes are not nearly as comprehensive as yours.
Following are some diffs, which I hope will be useful to you.

### Suggested change 1

The following graph illustrates my mental picture of the IterVar hyper-graph, which I find a bit easier to understand than the circle.

### Suggested change 2 / Question 1

I'd suggest the following wording change to the 3rd paragraph of 'InferRootBound':

"These IntSets are used to create TensorDom of the ~~output~~ **input** tensors of the **consumer** stage (phase 3)."

The reason is that phase 3 computes the TensorDom of all input tensors of the consumer stage, not just the output tensor of the current stage. Is that right? I notice you also use the term `Phase 3: Propagate IntSets to consumer’s input tensors` later in the document.

### Suggested change 3

This is just a nit. Maybe move the explanation of `PassDownDomain` ahead of `InferRootBound`. This helps explain where the `Range`s of the `IterVar`s of the consumer stage come from.

### Question 2

How do you generate the output shown in Ex. 4? The following is what I got by using this [method](https://discuss.tvm.ai/t/show-unflattened-tensor-in-tvm-lower/1728/2?u=yuanlin).

```
// attr [compute(D, 0x15b6460)] realize_scope = ""
realize D([0, 4], [0, 5], [0, 16]) {
  produce D {
    for (di, 0, 4) {
      for (dj, 0, 5) {
        for (dk, 0, 16) {
          // attr [compute(C, 0x1a0a270)] realize_scope = ""
          realize C([dj, 1], [dk, 1]) {
            produce C {
              C(dj, dk) =5
            }
            D(di, dj, dk) =(C(dj, dk)*2)
          }
        }
      }
    }
  }
}
```

It misses the inner `i` and `j` loop nest shown in your example.
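For reference, this is roughly the snippet I used, following that post: it runs `InferBound` and `ScheduleOps` directly (before `StorageFlatten`), so the tensors stay unflattened. The schedule below is only a stand-in to make the snippet self-contained, not your actual Ex. 4 schedule:

```
import tvm

# Stand-in schedule, just to make the snippet runnable (not the Ex. 4 one).
A = tvm.placeholder((4, 5), name='A')
D = tvm.compute((4, 5), lambda i, j: A[i, j] * 2, name='D')
s = tvm.create_schedule(D.op)

# Run bound inference and ScheduleOps directly, before StorageFlatten,
# so the realize/produce nodes still refer to unflattened tensors.
s = s.normalize()
bounds = tvm.schedule.InferBound(s)
stmt = tvm.schedule.ScheduleOps(s, bounds)
print(stmt)
```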
### Suggested change 4

The following text describes how storage scope affects bound inference. It is adapted from my notes to fit your text flow.

---

The tensor a stage computes can have a `StorageScope`, which is one of `global` (the default), `shared`, `warp` or `local`. The `StorageScope` also affects the result of bound inference. It can be set explicitly by the `schedule.set_scope` operation, set by a `cache_write`/`cache_read` operation (if the stage is created by a cache operation), or inferred from the thread bound to an `IterVar` on the attach_path. The inference rule is:

* if any IterVar on the attach_path is bound to `threadIdx`, `vthread` or `cthread`, then the scope is `local`;
* otherwise, if any IterVar on the attach_path is bound to `blockIdx`, then the scope is `shared`;
* otherwise, the scope is `global`.

During bound inference, the `StorageScope` affects the decision whether relaxation is needed. From the above (i.e. case 3 of Phase 1 of 'InferBound with compute_at'), we know relaxation is needed for IterVars that are lower on the attach_path than the attach_ivar. When the storage scope is specified (explicitly or inferred), relaxation is also needed for an IterVar on the attach_path if (see the sketch after this list)

* the `StorageScope` is `global` and the IterVar is bound to any thread,
* the `StorageScope` is `shared` and the IterVar is bound to `threadIdx`, or
* the `StorageScope` is `warp` and the IterVar is bound to `threadIdx.x`.
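Putting the two rules together, here is a small Python sketch of the decision (my own paraphrase, not TVM's actual implementation; the function names and the `thread_tag` accessor are hypothetical):

```
# Hypothetical sketch of the two rules above; not TVM's actual code.

def infer_scope(attach_path_tags):
    """Infer the StorageScope from the thread tags bound along the attach_path."""
    if any(t.startswith(('threadIdx', 'vthread', 'cthread')) for t in attach_path_tags):
        return 'local'
    if any(t.startswith('blockIdx') for t in attach_path_tags):
        return 'shared'
    return 'global'

def need_relax(iter_var, found_attach, storage_scope):
    """Decide whether iter_var's Range must be relaxed to its full extent.

    iter_var      -- an IterVar on the attach_path
    found_attach  -- True once we are past the attach_ivar, i.e. iter_var
                     is lower on the attach_path than the attach point
    storage_scope -- 'global', 'shared', 'warp' or 'local'
    """
    # Case 3 of Phase 1: everything below the attach point is relaxed.
    if found_attach:
        return True
    tag = getattr(iter_var, 'thread_tag', '')  # e.g. 'threadIdx.x'; '' if unbound
    if not tag:
        return False
    if storage_scope == 'global':
        return True                         # bound to any thread
    if storage_scope == 'shared':
        return tag.startswith('threadIdx')  # threadIdx.{x,y,z}
    if storage_scope == 'warp':
        return tag == 'threadIdx.x'
    return False                            # 'local': only the attach rule applies
```

In Ex. 7 below, for instance, `need_relax(j, False, 'shared')` would return True because `j` is bound to `threadIdx.x` and the scope is `shared`.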
#### Ex. 6

In the following example, stage `B` is attached to `i` of `C`.

```
A = tvm.placeholder((200, 400), name='A')
B = tvm.compute((200, 400), lambda i,j: 3.14 * A[i, j], name='B')
C = tvm.compute((100, 200), lambda i,j: 2.72 * B[i, j], name='C')
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")
s = tvm.create_schedule(C.op)
i, j = C.op.axis
s[C].bind(i, block_x)
s[C].bind(j, thread_x)
s[B].set_scope("shared")
s[B].compute_at(s[C], i)
ib, jb = s[B].op.axis
s[B].bind(jb, thread_x)
```

`j` is lower on the attach_path than the attach point `i`, so it is relaxed and has extent 200. `B([blockIdx.x, 1], [0, 200])` needs to be realized.

```
realize C([0, 100], [0, 200]) {
  produce C {
    // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 100
    // attr [compute(B, 0x19def60)] realize_scope = "shared"
    realize B([blockIdx.x, 1], [0, 200]) {
      produce B {
        // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 200
        B(blockIdx.x, threadIdx.x) =(3.140000f*A(blockIdx.x, threadIdx.x))
      }
      // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 200
      C(blockIdx.x, threadIdx.x) =(2.720000f*B(blockIdx.x, threadIdx.x))
    }
  }
}

extern "C" __global__ void test_kernel0(float* __restrict__ A, float* __restrict__ C) {
  __shared__ float B[200];
  B[((int)threadIdx.x)] = (A[((((int)blockIdx.x) * 400) + ((int)threadIdx.x))] * 3.140000e+00f);
  C[((((int)blockIdx.x) * 200) + ((int)threadIdx.x))] = (B[((int)threadIdx.x)] * 2.720000e+00f);
}
```

#### Ex. 7

The following code is exactly the same as that in Ex. 6, except that stage `B` is attached to `j` of `C` instead of `i` of `C`.

```
A = tvm.placeholder((200, 400), name='A')
B = tvm.compute((200, 400), lambda i,j: 3.14 * A[i, j], name='B')
C = tvm.compute((100, 200), lambda i,j: 2.72 * B[i, j], name='C')
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")
s = tvm.create_schedule(C.op)
i, j = C.op.axis
s[C].bind(i, block_x)
s[C].bind(j, thread_x)
s[B].set_scope("shared")
s[B].compute_at(s[C], j)
ib, jb = s[B].op.axis
s[B].bind(jb, thread_x)
```

Without considering the storage scope, `j` would not be relaxed. In this case, however, the storage scope of `B` is `shared` and `j` is bound to `threadIdx`. Therefore `j` is relaxed and has extent 200, and `B([blockIdx.x, 1], [0, 200])` needs to be realized, as it does in Ex. 6.

```
realize C([0, 100], [0, 200]) {
  produce C {
    // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 100
    // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 200
    // attr [compute(B, 0x19af470)] realize_scope = "shared"
    realize B([blockIdx.x, 1], [0, 200]) {
      produce B {
        B(blockIdx.x, threadIdx.x) =(3.140000f*A(blockIdx.x, threadIdx.x))
      }
      C(blockIdx.x, threadIdx.x) =(2.720000f*B(blockIdx.x, threadIdx.x))
    }
  }
}

extern "C" __global__ void test_kernel0(float* __restrict__ A, float* __restrict__ C) {
  __shared__ float B[200];
  B[((int)threadIdx.x)] = (A[((((int)blockIdx.x) * 400) + ((int)threadIdx.x))] * 3.140000e+00f);
  C[((((int)blockIdx.x) * 200) + ((int)threadIdx.x))] = (B[((int)threadIdx.x)] * 2.720000e+00f);
}
```

#### Ex. 8

The following code is exactly the same as that in Ex. 6, except that the storage scope of `B` is not set explicitly but inferred.

```
A = tvm.placeholder((200, 400), name='A')
B = tvm.compute((200, 400), lambda i,j: 3.14 * A[i, j], name='B')
C = tvm.compute((100, 200), lambda i,j: 2.72 * B[i, j], name='C')
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")
s = tvm.create_schedule(C.op)
i, j = C.op.axis
s[C].bind(i, block_x)
s[C].bind(j, thread_x)
# s[B].set_scope("shared")
s[B].compute_at(s[C], i)
ib, jb = s[B].op.axis
s[B].bind(jb, thread_x)
```

The generated code is the same as that in Ex. 6, even though the `realize_scope` attribute is left empty in the lowered code.

```
realize C([0, 100], [0, 200]) {
  produce C {
    // attr [iter_var(blockIdx.x, , blockIdx.x)] thread_extent = 100
    // attr [compute(B, 0x198f070)] realize_scope = ""
    realize B([blockIdx.x, 1], [0, 200]) {
      produce B {
        // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 200
        B(blockIdx.x, threadIdx.x) =(3.140000f*A(blockIdx.x, threadIdx.x))
      }
      // attr [iter_var(threadIdx.x, , threadIdx.x)] thread_extent = 200
      C(blockIdx.x, threadIdx.x) =(2.720000f*B(blockIdx.x, threadIdx.x))
    }
  }
}

extern "C" __global__ void test_kernel0(float* __restrict__ A, float* __restrict__ C) {
  __shared__ float B[200];
  B[((int)threadIdx.x)] = (A[((((int)blockIdx.x) * 400) + ((int)threadIdx.x))] * 3.140000e+00f);
  C[((((int)blockIdx.x) * 200) + ((int)threadIdx.x))] = (B[((int)threadIdx.x)] * 2.720000e+00f);
}
```