Hi Christian,

Did you have a chance to test int8 and int4? They rely heavily on newer
SIMD instructions, especially things like AVX512, so they may take an
even larger performance hit without -march=native. By the way, for
recent large language models int4 does not actually lose much in terms
of model quality[1], and it should be the default precision for running
locally, since it ought to be faster than CPU floating point anyway.
If llama.cpp really loses a lot of int4 performance without SIMD, that
would be even more demotivating, to be honest.

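To make the int4 point a bit more concrete, here is a toy C++ sketch of
block-wise 4-bit quantization (my own simplification for illustration,
not the actual ggml Q4_0 on-disk format): each block of 32 weights is
stored as one float scale plus signed 4-bit integers, and the round-trip
error stays small relative to the block's dynamic range.

  // Toy block-wise int4 quantization (simplified; not the real Q4_0 layout).
  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>

  constexpr int BLOCK = 32;

  struct BlockQ4 {
      float  scale;      // one scale per block
      int8_t q[BLOCK];   // values in [-8, 7], i.e. 4 bits of information each
  };

  BlockQ4 quantize_block(const float *x) {
      float amax = 0.0f;
      for (int i = 0; i < BLOCK; ++i) amax = std::max(amax, std::fabs(x[i]));
      BlockQ4 b;
      b.scale = amax / 7.0f;  // map the largest magnitude onto the int4 range
      for (int i = 0; i < BLOCK; ++i) {
          int q  = b.scale > 0.0f ? (int)std::lround(x[i] / b.scale) : 0;
          b.q[i] = (int8_t)std::clamp(q, -8, 7);
      }
      return b;
  }

  void dequantize_block(const BlockQ4 &b, float *out) {
      for (int i = 0; i < BLOCK; ++i) out[i] = b.q[i] * b.scale;
  }

  int main() {
      float x[BLOCK], y[BLOCK];
      for (int i = 0; i < BLOCK; ++i) x[i] = std::sin(0.3f * i);  // sample weights
      BlockQ4 b = quantize_block(x);
      dequantize_block(b, y);
      float err = 0.0f;
      for (int i = 0; i < BLOCK; ++i) err = std::max(err, std::fabs(x[i] - y[i]));
      std::printf("max round-trip error: %f\n", err);  // roughly 0.07 for |x| <= 1
      return 0;
  }

The real formats (Q4_0, Q4_K, ...) additionally pack two 4-bit values
per byte and use more elaborate per-block scaling, but the basic idea is
the same.
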
I'm also a llama.cpp user, through Ollama[2]'s convenient wrapper.
Ollama itself is too complicated to consider for packaging -- I mention
it here only to give you a better idea of how the ecosystem uses
llama.cpp, in case you had not seen it before.

From my point of view, llama.cpp is better suited to source-based
distributions like Gentoo. In the past I proposed something similar for
Debian, but the community was not interested.

As for the BLAS/MKL-like approach to SIMD capability dispatching ... I
suspect focusing on something else is more worthwhile.

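For reference, here is a minimal C++ sketch of what that kind of runtime
dispatch looks like (my own illustration using GCC/Clang builtins, not
code taken from llama.cpp or any BLAS): the same kernel is built twice,
once for the baseline ISA and once with AVX2/FMA enabled through a
target attribute, and a function pointer is resolved once based on the
CPU it actually runs on.

  // Minimal runtime SIMD dispatch sketch (GCC/Clang, x86).
  #include <cstdio>

  // Baseline kernel, compiled for the distribution's default ISA.
  static float dot_generic(const float *a, const float *b, int n) {
      float acc = 0.0f;
      for (int i = 0; i < n; ++i) acc += a[i] * b[i];
      return acc;
  }

  // Same kernel compiled with AVX2/FMA enabled just for this function,
  // so the whole binary does not need -march=native.
  __attribute__((target("avx2,fma")))
  static float dot_avx2(const float *a, const float *b, int n) {
      float acc = 0.0f;
      for (int i = 0; i < n; ++i) acc += a[i] * b[i];
      return acc;
  }

  using dot_fn = float (*)(const float *, const float *, int);

  // Pick the best variant once, based on what the running CPU supports.
  static dot_fn resolve_dot() {
      __builtin_cpu_init();
      if (__builtin_cpu_supports("avx2")) return dot_avx2;
      return dot_generic;
  }

  int main() {
      static const dot_fn dot = resolve_dot();
      float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
      std::printf("dot = %f\n", dot(a, b, 8));
      return 0;
  }

The sketch itself is small; the cost is in doing this (or the ifunc /
library-per-ISA equivalent) consistently across every hot kernel and
every ISA level, which is why I would rather see the effort spent
elsewhere.
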
Thanks for the update!

[1] https://arxiv.org/html/2402.16775v1
[2] https://ollama.com/

On 12/22/24 08:21, Christian Kastner wrote:
> Hi Cory,
>
> On 2024-12-15 08:45, Cordell Bloor wrote:
>> I would also argue that you're taking on too much responsibility trying
>> to enable -march=native optimizations. It's true that you can get
>> significantly more performance using AVX instructions available on most
>> modern computers
>
> I just tested a 3.2B model with f16, with AVX and other features
> turned off, and tokens/s went down by a factor of 25.
>
>> but if llama.cpp really wanted they could implement dynamic dispatch
>> themselves.
>
> Upstream seems to want people to just clone, configure, and build
> locally. I don't think we can infer much regarding other design choices.
>
>> Why not deliver the basics before we try to do something fancy?
>> [...] There's value in providing a working package to users today, even
>> if it's imperfect.
>
> Performance is crippled too badly for any practical use. We can't ship
> this, especially since it is so easy to use upstream.
>
> llama.cpp is intentionally designed to be trivial to deploy: no
> dependencies by default, and the simplest of all build processes. It
> doesn't benefit that much from packaging, compared to other software.
>
> The approach I plan to look into is to build and ship all backends and
> make them user-selectable, similar to what Mo does for MKL.
>
> Best,
> Christian