Hi Christian,

Could we just sidestep this whole question of native instructions by building llama.cpp with the BLAS backend? The OpenBLAS library does its own CPU feature detection, so the parts of llama.cpp that call out to BLAS will make good use of whatever vector instructions are available. My benchmarking suggests that this may be enough to reach reasonable, if still imperfect, CPU performance. To illustrate, I've included some benchmarks from my Ryzen 5950X workstation (64 GB of DDR4 @ 3600 MHz) running Debian Unstable.
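
Concretely, the kind of build I have in mind looks something like this on Debian (my exact invocations, including the extra flags I used to disable vector instructions for testing, are in the attached logs):

$ sudo apt install libopenblas-pthread-dev libopenblas64-pthread-dev
$ cmake -S. -Bbuild -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DCMAKE_BUILD_TYPE=Release
$ make -j16 -C build
$ update-alternatives --get-selections | grep libblas.so.3

The last command just confirms which OpenBLAS variant is currently providing libblas.so.3.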

First, the results of the OpenMP backend built with -march=native:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |     48.63 ± 0.04 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      9.73 ± 0.05 |

The above results set the baseline for CPU performance. If we instead disable all vector instructions beyond those available in baseline x86_64, we get the following (the flags involved are reproduced after the table):

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |      3.51 ± 0.00 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      3.34 ± 0.01 |
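
For reference, "all vector instructions disabled" here means configuring roughly like this (the full invocation is in the attached logs):

$ cmake -S. -Bbuild -DGGML_NATIVE=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF \
    -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF -DGGML_AVX512_VBMI=OFF \
    -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_BF16=OFF -DCMAKE_BUILD_TYPE=Release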

However, if we enable the BLAS backend and install the libopenblas-pthread-dev and libopenblas64-pthread-dev packages, prompt processing recovers:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     54.64 ± 0.64 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.34 ± 0.01 |

pp512 is the prompt processing benchmark, while tg128 is the text generation benchmark. So you can see that BLAS greatly improves prompt processing over the no-vector-instruction build, while leaving text generation unchanged.

In my opinion, this may be sufficient. When using llama.cpp as a chat server, the entire conversation history is passed as the prompt for each response, so prompt processing speed matters a great deal. Compared to the build with vector instructions disabled, enabling BLAS is a pure win: prompt processing jumps from 3.5 t/s to 54.6 t/s while text generation is unchanged, which at least brings the model into the realm of usable. Compared to the native build, PP: 54 t/s and TG: 3.3 t/s trades a roughly 3x slowdown in text generation for slightly faster prompt processing; that tradeoff only pays off once the accumulated history is much longer than each individual response, although it does keep looking better the longer a conversation runs.
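
As a back-of-the-envelope check (ignoring any prompt caching the server may do), the time per chat turn for a prompt of N_p tokens and a response of N_g tokens is roughly T = N_p/PP + N_g/TG, so the BLAS build comes out ahead of the native build when:

\[
  \frac{N_p}{54.64} + \frac{N_g}{3.34} \;<\; \frac{N_p}{48.63} + \frac{N_g}{9.73}
  \quad\Longleftrightarrow\quad
  N_p \gtrsim 87\, N_g .
\]

That is, the re-processed history has to be nearly a hundred times longer than the reply it produces.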

All that said, a GPU implementation blows the CPU implementation out of the water. With all host vector instructions disabled but hipBLAS enabled (the build configuration is noted after the results), this is what I get on my Radeon RX 6800 XT:

$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         pp512 |   1196.90 ± 1.27 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         tg128 |     60.28 ± 0.05 |

When compared to the CPU implementation with vector instructions, the prompt processing is >20x faster on the GPU and the text generation is 6x faster.
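
For completeness, that GPU run came from a hipBLAS build configured like so (full log attached):

$ HIPCXX=clang++-17 cmake -S. -Bbuild -DGGML_HIPBLAS=ON -DGGML_NATIVE=OFF \
    -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_BF16=OFF \
    -DCMAKE_BUILD_TYPE=Release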

Still, I know it's usable without the GPU. I set up llama-server and spent hours chatting with it on an adventure through a fantasy world. Only afterwards did I realize that I'd started the server without offloading any layers to the GPU, so I must have been getting roughly PP: 48 t/s and TG: 9.7 t/s. It was a bit slow, but still enjoyable and entirely usable.
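
For anyone repeating this: layer offload is controlled by the -ngl/--n-gpu-layers option, which I had simply left unset. Offloading the whole model looks something like:

$ build/bin/llama-server --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf --n-gpu-layers 99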

The full suite of benchmark data is attached.

Sincerely,
Cory Bloor
$ HIPCXX=clang++-17 cmake -S. -Bbuild -DGGML_HIPBLAS=ON -DGGML_NATIVE=OFF \
    -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_BF16=OFF \
    -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=16
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            198155;     58.25
        1;      16; 11008;  4096;   128;    11542724608;            196744;     58.67
        2;      16; 11008;  4096;   128;    11542724608;            196788;     58.66
        3;      16; 11008;  4096;   128;    11542724608;            197129;     58.55
        4;      16; 11008;  4096;   128;    11542724608;            197276;     58.51
        5;      16; 11008;  4096;   128;    11542724608;            196856;     58.64
        6;      16; 11008;  4096;   128;    11542724608;            196886;     58.63
        7;      16; 11008;  4096;   128;    11542724608;            196765;     58.66
        8;      16; 11008;  4096;   128;    11542724608;            196737;     58.67
        9;      16; 11008;  4096;   128;    11542724608;            196798;     58.65

Average                                                                         58.59
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         pp512 |   1196.90 ± 1.27 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         tg128 |     60.28 ± 0.05 |

build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_NATIVE=OFF \
    -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_BF16=OFF \
    -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ update-alternatives --get-selections | grep libblas.so.3
libblas.so.3-x86_64-linux-gnu  auto     /usr/lib/x86_64-linux-gnu/openblas-openmp/libblas.so.3
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=16
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            198664;     58.10
        1;      16; 11008;  4096;   128;    11542724608;            196818;     58.65
        2;      16; 11008;  4096;   128;    11542724608;            198156;     58.25
        3;      16; 11008;  4096;   128;    11542724608;            198221;     58.23
        4;      16; 11008;  4096;   128;    11542724608;            198144;     58.25
        5;      16; 11008;  4096;   128;    11542724608;            198221;     58.23
        6;      16; 11008;  4096;   128;    11542724608;            197440;     58.46
        7;      16; 11008;  4096;   128;    11542724608;            197713;     58.38
        8;      16; 11008;  4096;   128;    11542724608;            197042;     58.58
        9;      16; 11008;  4096;   128;    11542724608;            196785;     58.66

Average                                                                         58.38
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     52.51 ± 0.62 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.33 ± 0.02 |

build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_NATIVE=OFF \
    -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_BF16=OFF \
    -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ update-alternatives --get-selections | grep libblas.so.3
libblas.so.3-x86_64-linux-gnu  auto     /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=16
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            199061;     57.99
        1;      16; 11008;  4096;   128;    11542724608;            196941;     58.61
        2;      16; 11008;  4096;   128;    11542724608;            196986;     58.60
        3;      16; 11008;  4096;   128;    11542724608;            196851;     58.64
        4;      16; 11008;  4096;   128;    11542724608;            196756;     58.67
        5;      16; 11008;  4096;   128;    11542724608;            197119;     58.56
        6;      16; 11008;  4096;   128;    11542724608;            196825;     58.64
        7;      16; 11008;  4096;   128;    11542724608;            196788;     58.66
        8;      16; 11008;  4096;   128;    11542724608;            196762;     58.66
        9;      16; 11008;  4096;   128;    11542724608;            198143;     58.25

Average                                                                         58.53
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     54.64 ± 0.64 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.34 ± 0.01 |

build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=OFF -DGGML_OPENMP=ON -DGGML_NATIVE=OFF \
    -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF -DGGML_AVX512_BF16=OFF \
    -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=16
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            199019;     58.00
        1;      16; 11008;  4096;   128;    11542724608;            196736;     58.67
        2;      16; 11008;  4096;   128;    11542724608;            198137;     58.26
        3;      16; 11008;  4096;   128;    11542724608;            196764;     58.66
        4;      16; 11008;  4096;   128;    11542724608;            196758;     58.66
        5;      16; 11008;  4096;   128;    11542724608;            196747;     58.67
        6;      16; 11008;  4096;   128;    11542724608;            196750;     58.67
        7;      16; 11008;  4096;   128;    11542724608;            196704;     58.68
        8;      16; 11008;  4096;   128;    11542724608;            196738;     58.67
        9;      16; 11008;  4096;   128;    11542724608;            196737;     58.67

Average                                                                         58.56
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |      3.51 ± 0.00 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      3.34 ± 0.01 |

build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=OFF -DGGML_OPENMP=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=16
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;             17804;    648.32
        1;      16; 11008;  4096;   128;    11542724608;             16828;    685.92
        2;      16; 11008;  4096;   128;    11542724608;             16840;    685.43
        3;      16; 11008;  4096;   128;    11542724608;             16425;    702.75
        4;      16; 11008;  4096;   128;    11542724608;             15795;    730.78
        5;      16; 11008;  4096;   128;    11542724608;             15766;    732.13
        6;      16; 11008;  4096;   128;    11542724608;             15780;    731.48
        7;      16; 11008;  4096;   128;    11542724608;             15789;    731.06
        8;      16; 11008;  4096;   128;    11542724608;             15812;    730.00
        9;      16; 11008;  4096;   128;    11542724608;             15771;    731.90

Average                                                                        710.98
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |     48.63 ± 0.04 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      9.73 ± 0.05 |

build: 9ef07800 (3267)
