On Mon, 2025-01-27 at 11:13 +0100, Christian Kastner wrote:
> Hi Cory,
> 
> On 2025-01-27 09:44, Cordell Bloor wrote:
> > Could we just sidestep this whole question of native instructions by
> > building llama.cpp with the BLAS backend?
> 
> I was going to ship BLAS as one of the backends, but you do raise an
> interesting point: why ship the "regular" backend at all if we have BLAS
> guaranteed on Debian.

BLAS itself only handles the float32, float64, complex float32, and complex
float64 datatypes, which correspond to the "s", "d", "c", and "z" prefixes
in the API. Quantized neural networks, however, typically do not run in
floating-point formats but in integer formats such as int4 and int8.
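
To illustrate (a rough sketch using only the standard CBLAS interface,
nothing specific to llama.cpp): a single-precision GEMM goes through
cblas_sgemm, and every operand must already be float32, so int4/int8
quantized weights would have to be dequantized first before BLAS could
touch them.

#include <cblas.h>

/* C = A * B with float32 ("s" prefix) operands; the float64 variant
 * would be cblas_dgemm ("d" prefix), the complex ones "c"/"z". */
void matmul_f32(int m, int n, int k,
                const float *a, const float *b, float *c)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, a, k,
                b, n,
                0.0f, c, n);
}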

Quoting from the llama.cpp documentation:
"""
Building the program with BLAS support may lead to some performance
improvements in prompt processing using batch sizes higher than 32
(the default is 512). Using BLAS doesn't affect the generation performance.
"""
https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#blas-build
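
For reference, the linked document enables that backend at configure time
with something along the lines of the following (the exact flag names have
changed between llama.cpp releases, so double-check against the doc):

  cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
  cmake --build build --config Release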

> And not just one, but many implementations of BLAS that can easily be
> switched, thanks to Mo's work with the alternatives subsystem.

You may take libtorch2.5 as a reference: while building against
libblas-dev, we can manually recommend high-performance BLAS
implementations for the user to install:
Recommends: libopenblas0 | libblis4 | libmkl-rt | libblas3

The actual Recommends line in libtorch2.5 is outdated, so don't copy it
verbatim; please use the one above instead.
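
As a usage note (assuming the standard libblas.so.3 alternatives group on
amd64), the user can then switch the active implementation with something
like:

  sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu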
