We do support generating OpenCL, so we could run on Mali GPUs. However, we hadn't tested on Mali GPUs when we completed Ansor. There are some differences compared with NVIDIA GPUs: for example, on Mali we shouldn't use `cache_read("shared")`, because Mali GPUs don't have separate shared memory like NVIDIA GPUs do. We should also generate `vectorize` explicitly, which is not required on NVIDIA GPUs.
We have collected performance data for TFLite quantized models on ARM CPU, but we didn't put it in the paper. I am glad to share it: the target is 4 cores of Cortex-A53, the QNNPACK commit is b7bacb1899e6fa3a934c1dd6128096f2e1abf071, and only convolution layers are counted. As you can see, we have competitive performance compared with TFLite (2.1) and libraries like QNNPACK. However, there is still room to improve; for example, we should generate the paired instructions (`smlal` / `smlal2`), which could perhaps be done via tensorize.
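For readers unfamiliar with the pair, here is an illustrative Python sketch (not the actual tensorize intrinsic) of what the AArch64 instructions `smlal` / `smlal2` compute: a widening signed multiply-accumulate from int16 lanes into int32 accumulators, where `smlal` consumes the low half of a 128-bit vector of eight int16 lanes and `smlal2` the high half.

```python
def smlal(acc, a, b):
    # Low 4 lanes: acc[i] += a[i] * b[i], products widened int16 -> int32.
    return [acc[i] + a[i] * b[i] for i in range(4)]

def smlal2(acc, a, b):
    # High 4 lanes of the same 128-bit registers.
    return [acc[i] + a[4 + i] * b[4 + i] for i in range(4)]

a = [1, 2, 3, 4, 5, 6, 7, 8]  # eight int16 lanes
b = [2] * 8
acc_lo = smlal([0, 0, 0, 0], a, b)   # -> [2, 4, 6, 8]
acc_hi = smlal2([0, 0, 0, 0], a, b)  # -> [10, 12, 14, 16]
```

Emitting the two as a pair keeps the int32 accumulators in registers across the whole int16 vector, which is why a tensorize intrinsic covering both halves could help quantized convolution.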