# Background
The PyTorch framework is increasingly being adopted for research and production. At the same time, PyTorch lacks an effective inference acceleration toolchain, which is a major concern in industry. Existing acceleration paths include:

1. PyTorch -> ONNX -> TensorRT/TVM
2. PyTorch -> TorchScript -> TensorRT/TVM

From our perspective, both ONNX and TensorRT have limitations:

1. ONNX cannot cover all models with dynamic control flow (e.g. for loops)
2. TensorRT can only accelerate some standard networks

So we hope to use TVM to accelerate PyTorch model inference.

# Proposal

To make TVM more accessible to PyTorch users, we propose a PyTorchTVM module that supports the following workflow:

1. Convert a TorchScript module to a TVM graph
2. Build and tune the TVM graph
3. Export the well-tuned TVM graph as a PyTorch op
4. Trace the TVM PyTorch op together with the other PyTorch modules via torch.jit, then save/load/serve it as a normal PyTorch model

For example, consider an end-to-end ResNet classification model consisting of 3 parts:

1. Image reader
2. Image transforms
3. ResNet model inference

```python
from typing import List

import torch
from torch import nn
from torchvision import transforms as T
from torchvision.io import read_image
from torchvision.models import resnet18


class Predictor(nn.Module):

    def __init__(self, tvm_module=None):
        super().__init__()
        self.resnet18 = resnet18(pretrained=True, progress=False).eval()
        self.transforms = nn.Sequential(
            T.Resize([256, ]),  # We use a single int value inside a list due to TorchScript type restrictions
            T.CenterCrop(224),
            T.ConvertImageDtype(torch.half),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        )

    def forward(self, image_path: List[str]) -> torch.Tensor:
        with torch.no_grad():
            images: List[torch.Tensor] = []
            for path in image_path:
                img = read_image(path)
                images.append(img)
            x = torch.stack(images).cuda().half()
            x = self.transforms(x)
            print(x.shape)
            y_pred = self.resnet18(x)
            return y_pred.argmax(dim=1)
```

We choose to accelerate the ResNet part with PyTorchTVM:

```python
from tvm.contrib.pt_op import PyTorchTVMModule, compile

print("compile...")
option = {
    "input_infos": [
        ("x", (1, 3, 224, 224)),
    ],
    "default_dtype": "float16",
    "export_dir": "pytorch_compiled",
    "num_outputs": 1,
    "tuning_n_trials": 0,  # set zero to skip tuning
    "tuning_log_file": "tuning.log",
}
model = Predictor().cuda().half()  # the end-to-end model defined above
x = torch.randn(1, 3, 224, 224).cuda().half()
resnet_jit = torch.jit.trace(model.resnet18, x)
resnet_tvm = compile(resnet_jit, option)
```

Then we can use the accelerated TVM module directly in PyTorch, and script it together with the other 2 parts via `torch.jit.script`:

```python
import time

resnet_tvm = torch.jit.script(resnet_tvm)
print(resnet_tvm.graph)


class PredictorTVM(nn.Module):

    def __init__(self):
        super().__init__()
        self.resnet18 = resnet_tvm
        self.transforms = nn.Sequential(
            T.Resize([256, ]),  # We use a single int value inside a list due to TorchScript type restrictions
            T.CenterCrop(224),
            T.ConvertImageDtype(torch.half),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        )

    def forward(self, image_path: List[str]) -> torch.Tensor:
        with torch.no_grad():
            images: List[torch.Tensor] = []
            for path in image_path:
                img = read_image(path)
                images.append(img)
            x = torch.stack(images).cuda().half()
            x = self.transforms(x)
            # y_pred = self.resnet18(x)
            y_pred = self.resnet18([x])[0]  # the TVM module takes and returns a list of tensors
            return y_pred.argmax(dim=1)


print("run tvm...")
model_tvm = PredictorTVM().cuda().half()
image_path = "test.jpg"  # placeholder path to a test image
for i in range(20):
    t = time.time()
    model_tvm([image_path])
    torch.cuda.synchronize()
    print(time.time() - t)

torch.jit.script(model_tvm).save("model_tvm.pt")
```

Finally, we get a TVM-accelerated model that can be loaded and served in production.
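For completeness, loading and serving the saved model might look like the following minimal sketch. It assumes that importing `tvm.contrib.pt_op` registers the `torch.classes.tvm_class.*` custom classes needed to deserialize the embedded TVM op; `cat.jpg` is a placeholder image path.

```python
# Minimal serving sketch (assumption: importing tvm.contrib.pt_op registers
# the torch.classes.tvm_class.* custom classes before deserialization).
import torch
import tvm.contrib.pt_op  # noqa: F401

model = torch.jit.load("model_tvm.pt")
y = model(["cat.jpg"])  # placeholder path; same List[str] interface as Predictor
print(y)
```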
# Implementation

Our implementation is inspired by this RFC: https://discuss.tvm.apache.org/t/rfc-add-tensorflow-custom-op-to-embed-tvm-runtime-in-tensorflow-graph-and-session/4601

We have opened a PR: https://github.com/apache/tvm/pull/8777

The essential C++ code is as follows:

```cpp
// This is just a wrapper class of the TVM graph runtime module
class TvmGraphModulePack {
  ...
 private:
  tvm::runtime::Module module_;
  ...
};

// This is the base of our custom classes;
// we define some common helper functions in this class
class BaseTvmClass : public torch::jit::CustomClassHolder {
  ...
  // Converts a list of input tensor shapes to a std::string
  static std::string TvmShapeRepr(const c10::List<c10::List<int64_t>>& shapes);
  // Gets the shape list from input tensors
  static c10::List<c10::List<int64_t>> GetShapes(const c10::List<at::Tensor>& inputs);
  ...
};

// The custom class that embeds the TVM graph runtime module in TorchScript.
// There is also a TvmVMRuntimeClass that supports the VM runtime module,
// which is not shown here.
class TvmGraphRuntimeClass : public BaseTvmClass {
 public:
  TvmGraphRuntimeClass(const int64_t num_inputs, const int64_t num_outputs,
                       const std::string& device)
      : BaseTvmClass(num_inputs, num_outputs, device) {}

  // Load a TVM graph runtime module into tvm_modules_.
  void LoadTvmModule(const c10::List<c10::List<int64_t>>& shapes,
                     const std::string& lib_path,
                     const std::string& graph_path,
                     const std::string& params_path) {
    ...
    auto shape_repr = TvmShapeRepr(shapes);
    const auto it = tvm_modules_.emplace(
        shape_repr, TvmGraphModulePack(path, device_type_, device_id_)).first;
    ...
  }

  virtual c10::List<at::Tensor> forward(const c10::List<at::Tensor>& inputs) override {
    CHECK_EQ(inputs.size(), num_inputs_);
    auto shape_repr = TvmShapeRepr(GetShapes(inputs));
    auto iter = tvm_modules_.find(shape_repr);
    ...
  }

 private:
  // The key of this map is the shape repr string of the inputs
  std::map<std::string, TvmGraphModulePack> tvm_modules_;
};

// registry
static auto __tvm_class_graph_runtime_registry =
    torch::jit::class_<TvmGraphRuntimeClass>("tvm_class", "TvmGraphModule")
        .def(torch::init<const int64_t, const int64_t, const std::string&>())
        .def("load_tvm_module", &TvmGraphRuntimeClass::LoadTvmModule)
        .def("forward", &TvmGraphRuntimeClass::forward)
        .def("to", &TvmGraphRuntimeClass::to)
        .def_pickle(
            ...
        });
```

And we wrap the custom class in Python:

```python
from typing import List

import torch


class GraphModule(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs, device=None):
        ...
        self.engine = torch.classes.tvm_class.TvmGraphModule(num_inputs, num_outputs, self.device)

    def init(self, input_shapes, lib_path, graph_path, params_path):
        self.engine.load_tvm_module(input_shapes, lib_path, graph_path, params_path)

    def forward(self, inputs: List[torch.Tensor]):
        return self.engine.forward(inputs)

    ...
```

# Limitations

There are some limitations:

1. Dynamic shape support: currently we support multiple input shapes with a bucket policy, which is hacky; a more formal implementation is part of our future work (a sketch of the bucket policy is given at the end of this post).
2. Zero-overhead output: we currently only have `set_input_zero_copy`, while our `set_output` still incurs a memcpy.
3. Default performance of TVM: without auto-tuning, TVM's performance is most likely worse than native PyTorch's. To give users immediate feedback, we could make TVM use cuDNN/cuBLAS/CUTLASS as the default implementation.

Coauthor: @kongroo

We hope to further discuss the user API and the limitations above with the community.
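To make the bucket policy in limitation 1 concrete, here is a hypothetical sketch built on the `GraphModule` wrapper above. It assumes `GraphModule` is importable from `tvm.contrib.pt_op` and that `init` can be called once per shape bucket (each call emplaces another module into `tvm_modules_`, keyed by its shape repr). The bucket sizes and artifact paths are illustrative, not the actual layout produced by `compile`.

```python
# Hypothetical bucket-policy sketch: load one pre-compiled TVM module per
# batch-size bucket into a single GraphModule; forward() then dispatches on
# the shape repr of the actual inputs (see TvmShapeRepr in the C++ class).
# Paths and bucket sizes below are illustrative assumptions.
import torch
from tvm.contrib.pt_op import GraphModule

engine = GraphModule(num_inputs=1, num_outputs=1, device="cuda:0")
for bs in (1, 4, 8):  # shape buckets
    engine.init(
        input_shapes=[[bs, 3, 224, 224]],
        lib_path=f"pytorch_compiled/bs{bs}/model.so",
        graph_path=f"pytorch_compiled/bs{bs}/graph.json",
        params_path=f"pytorch_compiled/bs{bs}/params.bin",
    )

x = torch.randn(4, 3, 224, 224).cuda().half()
y = engine([x])[0]  # routed to the bs=4 module via its shape repr
```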
cc @tqchen @junrushao1994 @Laurawly