On Mon, Sep 13, 2021 at 3:54 PM Daniel Vetter <daniel.vet...@ffwll.ch> wrote:
> > One straightforward hardware independent low-level API would
> > be the traditional BLAS GEMM call[1] for matrix multiplication
> > and its variants (integer, float, bfloat16, ...). Most of the frameworks
> > are able to use SGEMM to do the actual calculation since that
> > has optimized versions for most CPUs and GPUs, and most
> > hardware accelerators should be able to provide an
> > implementation of this that doesn't completely suck. This
> > can be used for both inferencing and training.
>
> I think BLAS is too high-level for this. Sure, for perfect speed the
> vendor probably wants to have their own BLAS thing, their own NN
> optimizer and a heap of other things, but for the low-level userspace
> we're talking about here that pretty much doesn't matter.

I suppose high-level vs low-level is not the correct distinction here,
it's more like fixed-function vs programmable. As a fixed-function
interface, something like GEMM is probably as low-level as you would
want to get: it is big enough to make sense as a single atomic command,
but small enough to build on top of (see the SGEMM example below).

> I think a really good example of this is the compute stack Intel is
> building:
> - level0 is the absolute bare-bones low level driver. For this
>   discussion here that's enough of a userspace to make at least
>   Dave&me happy. In 3d this would be vulkan. In AI/NN space, there's
>   nothing here, at least nothing cross-vendor.
> - Then there's the entire OneApi ecosystem on top. Lots of this is
>   open, some of it is closed, but from the pov of an accel stack it's
>   all looking like applications, not like driver code. BLAS is sitting
>   here. For AI/NN this is pytorch, tensorflow and all these
>   higher-level frameworks (which often have quite sophisticated
>   optimizers of their own).

Looking at OneAPI, I see a BLAS implementation (oneMKL) next to a
somewhat higher-level abstraction (oneDNN). Which of the two are the
generic frameworks (pytorch/tensorflow/...) built on top of?

The oneDNN interface looks like it could be implemented not only on top
of level0, but also layered above some BLAS library, or as a thin
wrapper above a fixed-function kernel interface that provides similar
high-level abstractions. Is that a correct understanding? It also seems
like this is similar in purpose to Apple's BNNS library.

> Especially BLAS isn't the most impressive, since largely it's a fused
> multiply-add benchmark and not much else. Ok, enormous amounts of
> tuning to perfectly exploit the execution bw and interconnect/cache
> hierarchy of your chip, whatever it is. That's often something vendors
> don't like sharing (intel's math kernels are still closed afaik)
> because it leaks a bit much about actual implementation details of the
> chip as opposed to how it's programmed. Also not something I really
> care about with my maintainer hat on.

It's not /just/ benchmarks: it's actually used directly underneath the
high-level frameworks, precisely because it is simple, portable and
well optimized.

If there is a higher-level interface like oneDNN that is usable by the
common frameworks, using a subset of that as a fixed-function interface
for the kernel may be a good alternative (or at least complementary) to
a fully programmable interface. I realize that fixed-function is not
fashionable on GPUs, but such interfaces are widely used in other areas
(video codecs, crypto, ...) even when you are running precompiled code
on the accelerator hardware. A hypothetical sketch of what such a
kernel interface could look like follows below.
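To make that concrete, here is what the single atomic command looks
like from userspace today. A minimal sketch, assuming any standard
CBLAS implementation (OpenBLAS, for instance, linked with something
like -lopenblas); it computes C = alpha*A*B + beta*C for a small
row-major example:

#include <cblas.h>
#include <stdio.h>

int main(void)
{
	/* C = 1.0 * A * B + 0.0 * C, with A 2x3, B 3x2, C 2x2 */
	const float A[6] = { 1, 2, 3,
			     4, 5, 6 };
	const float B[6] = {  7,  8,
			      9, 10,
			     11, 12 };
	float C[4] = { 0 };

	cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
		    2, 2, 3,		/* M, N, K */
		    1.0f, A, 3,		/* alpha, A, lda */
		    B, 2,		/* B, ldb */
		    0.0f, C, 2);	/* beta, C, ldc */

	printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
	/* prints: 58 64 / 139 154 */
	return 0;
}

Everything the hardware needs (shapes, strides, transpose flags,
scaling factors) is carried by that one call, which is what makes it
plausible as an atomic fixed-function command.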
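And this is roughly the shape a fixed-function kernel interface for it
could take. To be clear, everything below is hypothetical, all the
names are made up, and it is only loosely modelled on how drm drivers
pass buffer handles through submit ioctls:

#include <stdint.h>

/* hypothetical: operation selector for the fixed-function unit */
#define ACCEL_GEMM_OP_SGEMM	0	/* fp32 */
#define ACCEL_GEMM_OP_BGEMM	1	/* bfloat16 */

/* hypothetical submit descriptor, passed by ioctl to the driver */
struct accel_gemm_submit {
	uint32_t op;		/* ACCEL_GEMM_OP_* */
	uint32_t flags;		/* transpose bits etc. */
	uint32_t m, n, k;	/* matrix dimensions */
	uint32_t pad;
	uint64_t a_handle;	/* GEM-style buffer handles */
	uint64_t b_handle;
	uint64_t c_handle;
	float	 alpha, beta;	/* scaling factors */
	uint64_t fence_out;	/* completion sync object */
};

/*
 * Userspace would then submit with something like
 *	ioctl(fd, ACCEL_IOCTL_GEMM_SUBMIT, &submit);
 * (again, a made-up ioctl), while the tiling and scheduling decisions
 * live in the firmware rather than in a vendor userspace compiler.
 */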
This would of course replace the question of open source user space
with the question of open source firmware, as the user side would
become mostly trivial, while the accelerator code goes from being
dynamically created to a firmware blob.

       Arnd