@AndrewZhaoLuo I briefly looked at bfloat16. While fp16 vs bf16 makes no difference for the conversion pass, it seems it is going to take a lot of effort to compile and run a bf16 model end to end, for at least two reasons (see the sketch below for the second one):

* The constant folding pass doesn't work on bfloat16 input.
* Numpy doesn't understand bfloat16, but some topi schedules (winograd conv) try to create a numpy array of type `out_dtype`, which in this case is bfloat16.
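
A minimal sketch of the numpy limitation (just illustrating the dtype behavior, not TVM code): plain numpy has no native bfloat16 dtype, so any schedule that tries to materialize an array with `out_dtype = "bfloat16"` fails before we ever get to codegen, while the equivalent fp16 path works.

```python
import numpy as np

# Plain numpy rejects bfloat16 as a dtype string.
try:
    np.zeros((2, 2), dtype="bfloat16")
except TypeError as e:
    print("numpy rejects bfloat16:", e)

# float16 is understood natively, which is why the same schedules
# run end to end in fp16.
print(np.zeros((2, 2), dtype="float16").dtype)  # float16
```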
Since tensorcore can natively run bf16 workloads at the same rate as fp16, and bf16 on x86 servers is becoming a thing, it would be nice to have good support for bf16 across the stack in the future.