Hi, I forked https://github.com/GaryYuyjl/incubator-tvm/tree/int4tensorcore
for int4 computation with Tensor Cores. I found that packing int4 into int32
on the CPU takes too much time.
So I moved the packing step into the conv2d compute and schedule, and got good results.
But the packing data time sti
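
For readers unfamiliar with the packing step mentioned above, here is a minimal NumPy sketch of packing eight signed int4 values into one int32 word on the CPU. It is not the code from the fork: the function names `pack_int4_to_int32` and `unpack_int32_to_int4` are hypothetical, and the nibble order (element 0 in the lowest 4 bits) is an assumption that may differ from what the Tensor Core kernels expect.

```python
import numpy as np


def pack_int4_to_int32(values):
    """Pack signed int4 values (range [-8, 7]) into int32 words, 8 per word.

    Assumption: element i of each group of 8 goes into bits [4*i, 4*i + 3].
    """
    values = np.asarray(values, dtype=np.int64).reshape(-1)
    if values.size % 8 != 0:
        raise ValueError("number of int4 values must be a multiple of 8")
    if np.any((values < -8) | (values > 7)):
        raise ValueError("values must fit in signed int4, i.e. [-8, 7]")
    # Keep only the low 4 bits (two's complement nibble) of each value.
    nibbles = (values & 0xF).astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    # Shift each nibble into its slot and OR the 8 slots together.
    packed = np.bitwise_or.reduce(nibbles << shifts, axis=1)
    return packed.view(np.int32)


def unpack_int32_to_int4(packed):
    """Inverse of pack_int4_to_int32; returns an int8 array of int4 values."""
    words = np.ascontiguousarray(packed, dtype=np.int32).view(np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = ((words.reshape(-1, 1) >> shifts) & 0xF).astype(np.int8)
    # Nibbles 8..15 represent the negative values -8..-1.
    return np.where(nibbles >= 8, nibbles - 16, nibbles).reshape(-1)


if __name__ == "__main__":
    data = np.array([1, -1, 7, -8, 0, 2, -3, 5], dtype=np.int8)
    word = pack_int4_to_int32(data)
    print(hex(int(word[0]) & 0xFFFFFFFF))
    print(unpack_int32_to_int4(word))
```

Doing this per element on the host is the kind of overhead the post describes; folding the same shift-and-OR into the conv2d compute lets it run on the GPU instead.
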
Thank you @zhanghaohit, @remotego, @liangfu, @hjiang for the discussion.
This is a great step forward for VTA. A story for PCIe-based FPGAs is
much needed and has been somewhat overlooked lately, so I appreciate the
solid RFC and the hard work. The TVM community looks forward to y
Finally, some lower-level comments for @zhanghaohit and @remotego:
- I agree with @liangfu that leveraging Chisel would be ideal in the spirit of
minimizing the number of design sources. There is an initial scaffold of the
Chisel design to work on F1 FPGAs, which @vegaluis can share with you.