[apache/incubator-tvm] [RFC][VTA] Support for Cloud Devices (OpenCL-compatible) (#5840)

2020-06-18 Thread ZHANG Hao

# Motivation
 
Cloud devices are more powerful than edge devices and provide higher
computational capability for deep learning workloads. For example, for the VTA
core, cloud devices offer more resources to support larger GEMM cores (e.g.,
32\*32 or even 64\*64) and larger device buffers, making it possible to boost
performance to a great extent. It is therefore worthwhile to provide a generic
framework that supports cloud devices under the TVM/VTA architecture.

However, extending VTA to cloud devices is non-trivial: the original Xilinx HLS
VTA core only works on Xilinx edge FPGA devices, and cloud devices expose a
different communication model (shared memory between the ARM cores and the FPGA
on edge devices vs. PCIe between the host and the FPGA on cloud devices) as
well as a different programming model. In this work, we propose a unified
framework that can be adapted to any OpenCL-compatible hardware accelerator
(e.g., FPGAs, ASICs) and works seamlessly with the TVM-VTA architecture. We
also provide an example OpenCL-based VTA implementation that has been tested on
Intel's high-end FPGAs.
 
 
# Proposal
 
We would like to extend VTA to OpenCL-compatible devices (e.g., the Intel
Programmable Acceleration Card). In particular, we provide a framework into
which any OpenCL-compatible device can be easily integrated. The reasons we
choose OpenCL-compatible devices are:
- OpenCL is generic enough to support a group of devices. For example, both
Xilinx and Intel are now in transition towards OpenCL-based HLS approaches.
- Vendor-specific optimizations are built into their respective OpenCL SDKs
(e.g., packing two 8-bit multiply-add units into one DSP slice), but the
framework we provide is not limited to a specific SDK.


In addition to the generic OpenCL framework, as a first attempt at a hardware
implementation, we target Intel cloud FPGAs (e.g., the Intel Programmable
Acceleration Card) using the Intel® FPGA SDK for OpenCL, which has proven
portability and scalability for both Intel® Programmable Acceleration Cards
(PAC) and other custom Intel-FPGA-based acceleration cards. The overall
framework remains generic, meaning that any OpenCL-compatible device can be
plugged in with only a little extra hardware-specific implementation.

### Major works
- Efficient communication between the host and PCIe devices, as PCIe
transmission is costly compared to memory copies
  - To avoid frequent PCIe copies, we propose to let all middle layers of a
computation graph run entirely on the FPGA device, without interleaved CPU
layers. Originally, the residual blocks in ResNet ran on the CPU (ARM cores),
which caused frequent copies in and out of device memory; the extra VTA
instructions we add are intended to move such residual blocks onto the FPGA
device.
  - Copy uops and instructions in batches. In particular, only synchronize
after all on-device layers have been queued, or when the queues overflow.

- Support automatic copies between layers running on different devices. We
propose to add a few more IR passes that:
  - annotate device types for the computation graph
  - tag and propagate device types among layers
  - insert copy operations (device_copy) automatically when adjacent layers
are not on the same device
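
The pass pipeline above can be sketched on a toy graph. The snippet below is a
minimal, self-contained illustration only (it does not use TVM's actual pass
API; the layer/tag representation is invented for the example): untagged layers
inherit the device of their producer, and a `device_copy` node is inserted
wherever adjacent layers disagree.

```python
# Toy sketch of the proposed IR passes (illustration only, not TVM's API):
# 1) annotate device types, 2) propagate tags, 3) insert device_copy ops.

def propagate_devices(layers, default="fpga"):
    """Fill in missing device tags by inheriting from the previous layer."""
    tagged = []
    prev = default
    for name, dev in layers:
        dev = dev or prev          # untagged layers inherit the producer's device
        tagged.append((name, dev))
        prev = dev
    return tagged

def insert_device_copies(layers):
    """Insert a device_copy node wherever adjacent layers disagree."""
    out = [layers[0]]
    for name, dev in layers[1:]:
        if dev != out[-1][1]:
            out.append(("device_copy", dev))   # host<->device PCIe transfer
        out.append((name, dev))
    return out

graph = [("conv1", "fpga"), ("relu1", None), ("argmax", "cpu")]
final = insert_device_copies(propagate_devices(graph))
# relu1 inherits "fpga"; a device_copy lands just before the CPU-side argmax.
```

In the real passes these steps operate on the IR of the computation graph, but
the flow is the same: annotate, propagate, then insert copies only at device
boundaries.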


- Driver development for OpenCL-compatible devices
  - The original PYNQ driver cannot be used, as we do not have direct access
to hardware registers
  - We implemented a middle-layer driver for OpenCL-compatible devices
  - This layer sits on top of the device's native driver stack and implements
an interrupt-based device driver


- OpenCL hardware implementation
  - Added extra Load/ALU instructions, such as Load int8 into the ACC buffer
(to support ALU-only nodes) and ALU Multiply and Left-shift, to support longer
chains of computation on the FPGA
  - Refactored the hardware implementation code to conform to the Intel® FPGA
SDK for OpenCL, as a sample hardware implementation
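
To give a sense of why an ALU multiply and shift are worth adding: a quantized
residual add must rescale one branch before accumulating, and if the ALU can do
integer multiply and shift, the whole block stays on the device. A minimal
arithmetic sketch (the values, multiplier, and shift are made up for the
example; real requantization parameters come from the quantization pass):

```python
# Hypothetical quantized residual add using only ALU-style ops
# (integer multiply + shift), so the block need not fall back to the CPU.

def rescale(acc, multiplier, shift):
    """Fixed-point rescale: (acc * multiplier) >> shift, as an integer ALU would do it."""
    return (acc * multiplier) >> shift

def residual_add(branch, shortcut, multiplier=3, shift=1):
    """Add a rescaled shortcut to the main branch, element-wise."""
    return [b + rescale(s, multiplier, shift) for b, s in zip(branch, shortcut)]

out = residual_add([10, 20, 30], [4, 8, 12])
# rescale: (4*3)>>1 = 6, (8*3)>>1 = 12, (12*3)>>1 = 18  ->  out = [16, 32, 48]
```

Here a right shift is shown; the left-shift instruction covers the symmetric
case where the shortcut must be scaled up rather than down.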
### Major changes to the existing TVM/VTA framework

- To run a workload on a cloud FPGA, there is no need to launch an additional
service on the device side (e.g., an RPC server); all driver and runtime
programs run on the host side.
- Change the VTA runtime to support batched queue synchronization. We intend to
only queue the instructions/uops when running a layer and return immediately
without synchronizing the device; synchronization and device execution happen
only when the queues overflow or the next layer is not on-device.
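
The queueing behaviour described above can be sketched as follows. This is a
minimal illustration only (the class, `flush` method, and queue capacity are
invented for the example, not the actual VTA runtime API):

```python
# Sketch of batched queue synchronization: instructions are only queued,
# and the device runs when the queue overflows or a sync is forced.

class BatchedInsnQueue:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = []
        self.flushes = 0   # counts host<->device synchronizations (PCIe round trips)

    def push(self, insn):
        """Queue an instruction; flush only if the queue would overflow."""
        if len(self.pending) == self.capacity:
            self.flush()
        self.pending.append(insn)

    def flush(self):
        """Copy the batch to the device and synchronize (expensive over PCIe)."""
        if self.pending:
            self.flushes += 1
            self.pending.clear()

q = BatchedInsnQueue(capacity=4)
for i in range(10):        # ten instructions across several on-device layers...
    q.push(f"insn{i}")
q.flush()                  # final sync once no more on-device layers follow
# ...cost only 3 synchronizations instead of 10.
```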
- We have to change the device propagation behaviour from post-DFS traversal to
a recursive method. Originally, device types were propagated over the post-DFS-
traversed graph, which may not be consistent if the argument order changes. In
addition, it may handle some cases incorrectly, e.g., the first residual block
in ResNet-50. The first few layers of ResNet-50 are depicted in the following
figure (top to bottom is

[TVM Discuss] [Meetup] uTVM (Embedded Focus) Online Meetup

2020-06-18 Thread Jason Knight via TVM Discuss


Thanks for the great meetup earlier today everyone. Video is up here: 
https://youtu.be/mW7dk-rXuy8





---
[Visit 
Topic](https://discuss.tvm.ai/t/utvm-embedded-focus-online-meetup/6908/12) to 
respond.

You are receiving this because you enabled mailing list mode.

To unsubscribe from these emails, [click 
here](https://discuss.tvm.ai/email/unsubscribe/646955932756a3f9901250be99a1cecd04bbdc1f25828b6e4f37d638c6813671).


[TVM Discuss] [Development/RFC] [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

2020-06-18 Thread Lianmin Zheng via TVM Discuss


Thanks for the discussion. Here are my thoughts.

### API Usage
The API for tuning a whole neural network will be the same as in autotvm
(extract tasks and tune all of them).
The API for writing templates is still under development, but it will be
similar to autotvm's.

### Performance in absolute time
We didn't run on c5.9xlarge. On our test machine (a 20-core Cascade Lake), we
get around 10% improvement on ResNet-50, which means around a 0.5 ms speedup.

### Dense schedule
   Ansor significantly outperforms autotvm on dense and can match MKLDNN. So 
this may not be a big issue. Combining MKLDNN and TVM is orthogonal to this 
RFC. 

### Quantized models
@FrozenGene got promising results on ARM CPUs, but we expect more work is
needed on tensorization.

### Replacing AutoTVM
Currently, I am confident that Ansor can replace all fp32 autotvm templates.
I agree that the current AutoTVM serves as a handy tool for manual exploration,
and we should not deprecate that functionality; we should support easy manual
customization in Ansor and then replace AutoTVM.

### Code generation without tuning
This is on my to-do list. We have a better and unified cost model (one model 
for all operators), so we should be able to get some results in this direction.

### Hybrid Script
This is not supported and not easy to support. Ansor only accepts tvm.compute 
as input.

### New backend
We need modifications to the search space. Ansor supports search-space
customization by allowing users to register composable rules, and the framework
is general across different backends.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/12)
 to respond.



[TVM Discuss] [Development/RFC] [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

2020-06-18 Thread Yao Wang via TVM Discuss


Thanks @merrymercy
The point of bringing up MKLDNN is that for the dense op these libraries have a
bag of tricks that might be difficult to reproduce in TVM. @haichen has done
nice work on TVM+MKLDNN for BERT, which has become the standard way we support
BERT on cloud CPUs. It would be nice to see whether Ansor can be another option
in this use case; it would give us more insight into the trade-off between the
manual method and the fully automatic one.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/13)
 to respond.



[TVM Discuss] [Development/RFC] [RFC][µTVM] Standalone µTVM Roadmap

2020-06-18 Thread Andrew Reusch via TVM Discuss


For #6 (export stats), I think you're absolutely right. There can be other
interesting on-device stats (e.g., IRQs triggered, number of function
executions, etc.). This is also the last item on the roadmap since it's a bit
less planned relative to the others.

On #2, I think some part should run in the pre-submit. I don't think we should 
include custom hardware in the TVM presubmit for a couple of reasons:
1. It's harder for contributors to reproduce errors. Only contributors with 
that hardware could resolve CI errors that happen there.
2. It's easier for hardware to run into heisenbugs, and we shouldn't use the 
presence or absence of those to gate TVM code submission.
3. There's some logistical challenge around hosting the hardware for a CI (this 
one we can overcome, but we should think about how to place the CI for some 
piece of hardware closer to those with specific knowledge of that hardware, in 
case some offline troubleshooting is needed).

Right now for the presubmit, I'm thinking that we should run a suite of 
"black-box acceptance tests" against an x86 RPC server running in a child 
process. Those can also serve to validate the C runtime on x86, when compiled 
standalone.

I do think some regular automated job against hardware is important, too. I
need to think a bit more about how we might put this together; open to thoughts
from the community as well!





---
[Visit Topic](https://discuss.tvm.ai/t/rfc-tvm-standalone-tvm-roadmap/6987/5) 
to respond.



[TVM Discuss] [Development/RFC] [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

2020-06-18 Thread Bing Xu via TVM Discuss


I support fully deprecating template-based AutoTVM. Technically, template-based
AutoTVM is a subset of Ansor's search space. We may temporarily keep both
AutoTVM and Ansor for one release, but in the long run I can't see any reason
we should keep AutoTVM.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/14)
 to respond.



[TVM Discuss] [Development/RFC] [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

2020-06-18 Thread Cody H. Yu via TVM Discuss


I agree. As long as we can demonstrate that Ansor's customized rules fully
cover the current AutoTVM templates in terms of semantics and performance, we
can deprecate AutoTVM. While we work toward this goal, we will definitely keep
both solutions for a period of time.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/15)
 to respond.



[TVM Discuss] [Development/RFC] [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

2020-06-18 Thread Balint Cristian via TVM Discuss


@merrymercy et al.,

First, Ansor is wonderful work, congratulations to all!

* Permit me to bring attention to: https://arxiv.org/pdf/2002.02145.pdf

They had a public repo (removed a few days ago); I am still wondering whether
polyhedral priors could bring any benefit to Ansor (e.g., helping to
reduce/optimize the search space)?





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/17)
 to respond.



[TVM Discuss] [Development/RFC] [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0)

2020-06-18 Thread Cody H. Yu via TVM Discuss


Good point. We (AWS) have a plan to move in this direction this summer.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/18)
 to respond.



[TVM Discuss] [Development] *Attrs not inheriting from Attrs

2020-06-18 Thread Thomas V via TVM Discuss


I noticed that while most Attrs inherit from Attrs, some don't and exist only
on the C++ side (and are thus mapped to Object). In particular, they don't have
the `keys` function.
Defining them with a short docstring like the others is easy, but is that an OK
patch?

Best regards

Thomas

The affected attrs are:
AdaptivePool2DAttrs
AdaptivePool3DAttrs
AffineGridAttrs
AllocStorageAttrs
AllocTensorAttrs
CastHintAttrs
Conv1DTransposeAttrs
DictAttrsNode
ExpandDimsAttrs
GridSampleAttrs
GroupNormAttrs
InstanceNormAttrs
LayerNormAttrs
NdarraySizeAttrs
OneHotAttrs
QuantizeAttrs
ReduceAttrs
RequantizeAttrs
Resize3dAttrs
ScatterAttrs
SequenceMaskAttrs
ShapeFuncAttrs
SimulatedQuantizeAttrs
SparseDenseAttrs
SparseToDenseAttrs
SparseTransposeAttrs
TestAttrs
TopKAttrs
TupleGetItemAttrs
WithFuncIdAttrs
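
The fallback behaviour can be illustrated without TVM. The sketch below is a
simplified, self-contained analogue of the FFI object registry (the names and
mechanics here are invented for illustration, not TVM's actual implementation):
type keys with a registered Python class wrap as `Attrs`, while unregistered
keys fall back to a generic `Object` without `keys`.

```python
# Simplified analogue of the Python-side object registry: C++ type keys map to
# Python wrapper classes, and unregistered keys fall back to a generic Object.

class Object:
    """Generic fallback wrapper; has no Attrs conveniences."""

class Attrs(Object):
    def keys(self):
        return sorted(k for k in vars(self) if not k.startswith("_"))

REGISTRY = {}

def register_object(type_key):
    """Decorator associating a C++ type key with a Python wrapper class."""
    def deco(cls):
        REGISTRY[type_key] = cls
        return cls
    return deco

@register_object("relay.attrs.Conv2DAttrs")
class Conv2DAttrs(Attrs):
    """Registered on the Python side: wraps as Attrs, so .keys() works."""

def wrap(type_key):
    return REGISTRY.get(type_key, Object)()   # unregistered -> plain Object

conv = wrap("relay.attrs.Conv2DAttrs")       # has .keys()
pool = wrap("relay.attrs.AdaptivePool2DAttrs")  # never registered: no .keys()
```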





---
[Visit Topic](https://discuss.tvm.ai/t/attrs-not-inheriting-from-attrs/7029/1) 
to respond.



[TVM Discuss] [Development] *Attrs not inheriting from Attrs

2020-06-18 Thread tqchen via TVM Discuss


I agree that it is good to add those attrs on the Python side so that they map
to Attrs.





---
[Visit Topic](https://discuss.tvm.ai/t/attrs-not-inheriting-from-attrs/7029/2) 
to respond.



[TVM Discuss] [Development/RFC] [RFC][VTA] Support for Cloud Devices (OpenCL-compatible)

2020-06-18 Thread zhang hao (4paradigm) via TVM Discuss


Formal RFC is here: https://github.com/apache/incubator-tvm/issues/5840

PRs are here:
https://github.com/apache/incubator-tvm-vta/pull/9
https://github.com/apache/incubator-tvm/pull/5842

@elnaz92 You may check out the code and try it first.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-vta-support-for-cloud-devices-opencl-compatible/6676/29)
 to respond.



[TVM Discuss] [Development/RFC] [RFC][VTA] Support for Cloud Devices (OpenCL-compatible)

2020-06-18 Thread Thierry via TVM Discuss


Thanks for the PRs, this is a very welcome contribution! Expect some initial
comments/reviews tomorrow.





---
[Visit 
Topic](https://discuss.tvm.ai/t/rfc-vta-support-for-cloud-devices-opencl-compatible/6676/30)
 to respond.
