We do support generating OpenCL, so Ansor can run on Mali GPUs. However, we did not test it on a Mali GPU when we finished Ansor. There are some differences compared with Nvidia GPUs. For example, on Mali we should not use `cache_read("shared")`, because Mali GPUs do not have separate shared memory like Nvidia GPUs. We also need to apply `vectorize` explicitly, which is not required on Nvidia GPUs.

We have collected performance data for a TFLite quantized model on ARM CPUs, although we did not put it in the paper. I am glad to share it:

![image|360x217](upload://kOVtkrTGnilHXZF4aCFqSDGr3xR.png) 

The target is 4 cores of Cortex-A53, the QNNPACK commit is 
b7bacb1899e6fa3a934c1dd6128096f2e1abf071, and only the convolution layers are 
counted. As you can see, we achieve competitive performance compared with 
TFLite (2.1) and libraries like QNNPACK. However, there is still room to 
improve; for example, we should generate the paired instructions 
(`smlal` / `smlal2`), which could perhaps be done with tensorize.
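To illustrate what the paired instructions compute (a plain-Python sketch of the AArch64 NEON semantics, not TVM output): `smlal` widens the low four `int16` lanes of each operand, multiplies them, and accumulates into an `int32x4` register, while `smlal2` does the same for the high four lanes:

```python
def smlal_smlal2(acc_lo, acc_hi, a, b):
    """Emulate AArch64 smlal/smlal2 on two int16x8 operands.

    acc_lo, acc_hi: int32x4 accumulators (lists of 4 ints)
    a, b: int16x8 operands (lists of 8 ints)
    """
    # smlal: widening multiply-accumulate of the low 4 lanes
    lo = [acc_lo[i] + a[i] * b[i] for i in range(4)]
    # smlal2: the same for the high 4 lanes
    hi = [acc_hi[i] + a[4 + i] * b[4 + i] for i in range(4)]
    return lo, hi

lo, hi = smlal_smlal2([0] * 4, [0] * 4, [1, 2, 3, 4, 5, 6, 7, 8], [10] * 8)
print(lo, hi)  # [10, 20, 30, 40] [50, 60, 70, 80]
```

Because the pair consumes a full `int16x8` register per operand in one go, plain loop vectorization may not produce it; a tensorize intrinsic that matches this exact shape is a plausible way to force the codegen.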





---
[Visit Topic](https://discuss.tvm.ai/t/rfc-ansor-an-auto-scheduler-for-tvm-autotvm-v2-0/7005/10) to respond.
