@AndrewZhaoLuo I briefly looked at bfloat16. While fp16 vs bf16 makes no 
difference for the conversion pass, it seems it is going to take a lot of 
effort to compile and run a bf16 model end to end, for at least two reasons:
* The constant folding pass doesn't work on bfloat16 input
* Numpy doesn't understand bfloat16, but some topi schedules (e.g. winograd conv) 
try to create a numpy array of type `out_dtype`, which in this case is bfloat16 
(rough sketch below).
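To make the two points concrete, here is a rough repro sketch. The dense workload and the `ToMixedPrecision("bfloat16")` invocation are just my guesses at a minimal setup, not taken from an actual failing model:

```python
import numpy as np
import tvm
from tvm import relay

# 1. Numpy has no bfloat16 dtype, so any topi schedule that builds an
#    np.array(..., dtype=out_dtype) breaks when out_dtype == "bfloat16":
try:
    np.zeros((4, 4), dtype="bfloat16")
except TypeError as err:
    print(err)  # data type 'bfloat16' not understood

# 2. The conversion pass itself doesn't care which narrow float it targets,
#    but folding the constants it produces is where things stop working.
x = relay.var("x", shape=(1, 8), dtype="float32")
w = relay.const(np.ones((8, 8), dtype="float32"))
mod = tvm.IRModule.from_expr(relay.nn.dense(x, w))
mod = relay.transform.InferType()(mod)
mod = relay.transform.ToMixedPrecision("bfloat16")(mod)
# mod = relay.transform.FoldConstant()(mod)  # <- fails on the bf16 constants
```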

Since tensor cores can natively run bf16 workloads at the same rate as fp16, and 
bf16 on x86 servers is becoming a thing, it would be nice to have good 
support for bf16 across the stack in the future.
