Hi Christian,

Did you have a chance to test int8 and int4? They rely heavily on newer
SIMD instructions, especially things like AVX512, so they may take an
even larger performance hit without -march=native. By the way, for
recent large language models int4 does not actually lose much in terms
of model quality[1], and it should be the default precision for running
locally, since it ought to be faster than CPU floating point anyway.
If llama.cpp really loses a lot of int4 performance without SIMD, that
would be even more demotivating, to be honest.

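To make the int4 point a bit more concrete, here is a toy C++ sketch of
block-wise 4-bit quantization (my own simplification for illustration,
not the actual ggml Q4_0 on-disk format): each block of 32 weights is
stored as one float scale plus signed 4-bit integers, and the round-trip
error stays small relative to the block's dynamic range.

  // Toy block-wise int4 quantization (simplified; not the real Q4_0 layout).
  #include <algorithm>
  #include <cmath>
  #include <cstdint>
  #include <cstdio>

  constexpr int BLOCK = 32;

  struct BlockQ4 {
      float  scale;      // one scale per block
      int8_t q[BLOCK];   // values in [-8, 7], i.e. 4 bits of information each
  };

  BlockQ4 quantize_block(const float *x) {
      float amax = 0.0f;
      for (int i = 0; i < BLOCK; ++i) amax = std::max(amax, std::fabs(x[i]));
      BlockQ4 b;
      b.scale = amax / 7.0f;  // map the largest magnitude onto the int4 range
      for (int i = 0; i < BLOCK; ++i) {
          int q  = b.scale > 0.0f ? (int)std::lround(x[i] / b.scale) : 0;
          b.q[i] = (int8_t)std::clamp(q, -8, 7);
      }
      return b;
  }

  void dequantize_block(const BlockQ4 &b, float *out) {
      for (int i = 0; i < BLOCK; ++i) out[i] = b.q[i] * b.scale;
  }

  int main() {
      float x[BLOCK], y[BLOCK];
      for (int i = 0; i < BLOCK; ++i) x[i] = std::sin(0.3f * i);  // sample weights
      BlockQ4 b = quantize_block(x);
      dequantize_block(b, y);
      float err = 0.0f;
      for (int i = 0; i < BLOCK; ++i) err = std::max(err, std::fabs(x[i] - y[i]));
      std::printf("max round-trip error: %f\n", err);  // roughly 0.07 for |x| <= 1
      return 0;
  }

The real formats (Q4_0, Q4_K, ...) additionally pack two 4-bit values
per byte and use more elaborate per-block scaling, but the basic idea is
the same.
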
I'm also a llama.cpp user, through Ollama[2]'s convenient wrapper.
Ollama itself is too complicated to consider for packaging -- I mention
it here only to give you a better idea of how the ecosystem uses
llama.cpp, in case you had not seen it before.

From my point of view, llama.cpp is better suited to source-based
distributions like Gentoo. In the past I proposed something similar for
Debian, but the community was not interested.

As for the BLAS/MKL-like approach to SIMD capability dispatching ... I
suspect focusing on something else is more worthwhile.

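For reference, here is a minimal C++ sketch of what that kind of runtime
dispatch looks like (my own illustration using GCC/Clang builtins, not
code taken from llama.cpp or any BLAS): the same kernel is built twice,
once for the baseline ISA and once with AVX2/FMA enabled through a
target attribute, and a function pointer is resolved once based on the
CPU it actually runs on.

  // Minimal runtime SIMD dispatch sketch (GCC/Clang, x86).
  #include <cstdio>

  // Baseline kernel, compiled for the distribution's default ISA.
  static float dot_generic(const float *a, const float *b, int n) {
      float acc = 0.0f;
      for (int i = 0; i < n; ++i) acc += a[i] * b[i];
      return acc;
  }

  // Same kernel compiled with AVX2/FMA enabled just for this function,
  // so the whole binary does not need -march=native.
  __attribute__((target("avx2,fma")))
  static float dot_avx2(const float *a, const float *b, int n) {
      float acc = 0.0f;
      for (int i = 0; i < n; ++i) acc += a[i] * b[i];
      return acc;
  }

  using dot_fn = float (*)(const float *, const float *, int);

  // Pick the best variant once, based on what the running CPU supports.
  static dot_fn resolve_dot() {
      __builtin_cpu_init();
      if (__builtin_cpu_supports("avx2")) return dot_avx2;
      return dot_generic;
  }

  int main() {
      static const dot_fn dot = resolve_dot();
      float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
      float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
      std::printf("dot = %f\n", dot(a, b, 8));
      return 0;
  }

The sketch itself is small; the cost is in doing this (or the ifunc /
library-per-ISA equivalent) consistently across every hot kernel and
every ISA level, which is why I would rather see the effort spent
elsewhere.
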
Thanks for the update!

[1] https://arxiv.org/html/2402.16775v1
[2] https://ollama.com/

On 12/22/24 08:21, Christian Kastner wrote:
> Hi Cory,
>
> On 2024-12-15 08:45, Cordell Bloor wrote:
>> I would also argue that you're taking on too much responsibility trying
>> to enable -march=native optimizations. It's true that you can get
>> significantly more performance using AVX instructions available on most
>> modern computers
>
> I just tested a 3.2B model with f16, with AVX and other features
> turned off, and tokens/s went down by a factor of 25.
>
>> but if llama.cpp really wanted they could implement dynamic dispatch
>> themselves.
>
> Upstream seems to want people to just clone, configure, and build
> locally. I don't think we can infer much regarding other design choices.
>
>> Why not deliver the basics before we try to do something fancy?
>> [...] There's value in providing a working package to users today, even
>> if it's imperfect.
>
> Performance is crippled too badly for any practical use. We can't ship
> this, especially since it is so easy to use upstream.
>
> llama.cpp is intentionally designed to be trivial to deploy: no
> dependencies by default, and the simplest of all build processes. It
> doesn't benefit that much from packaging, compared to other software.
>
> The approach I plan to look into is to build and ship all backends and
> make them user-selectable, similar to what Mo does for MKL.
>
> Best,
> Christian