Hi Cory,

On 2024-12-15 08:45, Cordell Bloor wrote:
> I would also argue that you're taking on too much responsibility trying
> to enable -march=native optimizations. It's true that you can get
> significantly more performance using AVX instructions available on most
> modern computers

I just tested a 3.2B model at f16: with AVX and the other CPU features
turned off, tokens/s dropped by a factor of 25.
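
For anyone who wants to reproduce the comparison, a rough sketch,
assuming the current GGML_* CMake option names (my recollection; they
were LLAMA_* before the mid-2024 rename, so check your checkout):

    # baseline: the default build enables -march=native
    cmake -B build-native
    cmake --build build-native --config Release

    # comparison: native tuning and the SIMD paths disabled
    cmake -B build-plain -DGGML_NATIVE=OFF -DGGML_AVX=OFF \
        -DGGML_AVX2=OFF -DGGML_FMA=OFF -DGGML_F16C=OFF
    cmake --build build-plain --config Release

    # measure tokens/s with each
    ./build-native/bin/llama-bench -m model-f16.gguf
    ./build-plain/bin/llama-bench -m model-f16.gguf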

> but if llama.cpp really wanted they could implement dynamic dispatch 
> themselves.

Upstream seems to want people to just clone, configure, and build
locally, so I don't think we can infer much from that about their
other design choices.
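
That said, the dispatch itself is not the hard part, which I take to
be Cordell's point. A minimal sketch of the technique using GCC/Clang's
__builtin_cpu_supports on x86; the toy kernel and its names are mine,
not anything from llama.cpp:

    // dispatch.cpp: select a kernel at runtime based on CPUID
    #include <cstddef>
    #include <cstdio>

    // the same toy kernel built twice: once allowed to use AVX2,
    // once restricted to the baseline ISA
    __attribute__((target("avx2")))
    static void add_vec_avx2(const float *a, const float *b,
                             float *c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + b[i];   // auto-vectorized with AVX2
    }

    static void add_vec_plain(const float *a, const float *b,
                              float *c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }

    using add_vec_fn = void (*)(const float *, const float *,
                                float *, std::size_t);

    // resolved once at startup; the builtin reads CPUID at runtime
    static add_vec_fn select_add_vec() {
        return __builtin_cpu_supports("avx2") ? add_vec_avx2
                                              : add_vec_plain;
    }

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, c[4];
        select_add_vec()(a, b, c, 4);
        std::printf("c[0] = %.1f\n", c[0]);   // 5.0 either way
    }

GCC can even generate the variants and the resolver automatically via
__attribute__((target_clones("avx2","default"))). Either way, the real
cost is building and testing every variant of every hot kernel, which
is exactly the burden upstream avoids by assuming local builds.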

> Why not deliver the basics before we try to do something fancy? 
> [...] There's value in providing a working package to users today, even
> if it's imperfect.

Performance is crippled too badly for any practical use; we can't ship
this, especially since it is so easy to just use upstream directly.

llama.cpp is intentionally designed to be trivial to deploy: no
dependencies by default, and the simplest of build processes. Compared
to other software, it doesn't benefit that much from packaging.

The approach I plan to look into is building and shipping all
backends and making them user-selectable, similar to what Mo does
for MKL.
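
I haven't settled on a mechanism yet, but one plausible shape is one
shared object per feature level, with the best one loaded at startup.
A sketch using dlopen; the library names are hypothetical
placeholders, not actual package contents:

    // backend_select.cpp: load the best available CPU backend
    // build with: g++ backend_select.cpp -ldl
    #include <dlfcn.h>
    #include <cstdio>

    int main() {
        const char *lib = __builtin_cpu_supports("avx2")
            ? "libggml-cpu-avx2.so"        // hypothetical name
            : "libggml-cpu-baseline.so";   // hypothetical name

        void *handle = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
        if (!handle) {
            std::fprintf(stderr, "cannot load %s: %s\n",
                         lib, dlerror());
            return 1;
        }
        std::printf("using backend %s\n", lib);
        // ... resolve the backend's entry points with dlsym() ...
        dlclose(handle);
        return 0;
    }

An update-alternatives approach would also work, but it makes the
choice system-wide rather than per-process.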

Best,
Christian
