Hi Christian,
Could we just sidestep this whole question of native instructions by
building llama.cpp with the BLAS backend? OpenBLAS does its own CPU
feature detection at runtime, so the parts of llama.cpp that call out to
BLAS will still make good use of whatever vector instructions are
available. My benchmarking suggests that this may be enough to achieve
reasonable (if still imperfect) CPU performance. To back this up, I've
included some benchmarks from my Ryzen 5950X workstation (64 GB of DDR4
RAM @ 3600 MHz) running Debian Unstable.
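Concretely, the BLAS variant is just a matter of pointing ggml at
OpenBLAS at configure time. The exact invocations I used are in the
attached logs (they also pass a series of -DGGML_*=OFF flags, but those
are only there to simulate a build without native vector instructions);
stripped down, it amounts to roughly:

$ cmake -S. -Bbuild -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS \
      -DCMAKE_BUILD_TYPE=Release
$ make -j16 -C build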
First, the results of the OpenMP backend built with -march=native:
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |     48.63 ± 0.04 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      9.73 ± 0.05 |
The above results set the baseline for CPU performance. If we instead
disable all vector instructions beyond the x86_64 baseline, we get:
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |      3.51 ± 0.00 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      3.34 ± 0.01 |
However, if we enable the BLAS backend and install
libopenblas-pthread-dev and libopenblas64-pthread-dev, the picture
improves considerably:
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     54.64 ± 0.64 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.34 ± 0.01 |
pp512 is the prompt-processing benchmark and tg128 is the text-generation
benchmark. As you can see, BLAS greatly improves prompt processing over
the vector-less build while leaving token generation unchanged.
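For anyone reproducing this on Debian, the switch amounts to installing
the pthread OpenBLAS packages and making sure the libblas alternative
points at them. In my runs the alternatives system picked it up
automatically, but it can also be pinned explicitly; roughly (the
alternative name and path here are taken from the attached logs):

$ sudo apt install libopenblas-pthread-dev libopenblas64-pthread-dev
$ sudo update-alternatives --set libblas.so.3-x86_64-linux-gnu \
      /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3

(Leaving the alternative on the openmp build also works, but was
slightly slower in my runs: 52.51 vs 54.64 t/s on pp512.)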
In my opinion, this may be sufficient. When llama.cpp is used as a chat
server, the entire conversation history is passed as the prompt for each
server response, so prompt processing speed matters a great deal. While
a 3x slowdown in text generation is not ideal, it at least keeps the
model in the realm of the usable. Because the accumulated history grows
with every exchange while each response stays roughly the same length,
the prompt comes to dominate the per-turn work over time. For long
enough conversations, PP: 54 t/s and TG: 3.3 t/s may well come out ahead
of PP: 48 t/s and TG: 9.7 t/s, and the longer the conversation runs, the
more the tradeoff favors the BLAS build.
All that said, a GPU implementation blows the CPU implementation out of
the water. With all host vector instructions disabled but hipBLAS
enabled, this is what I get on my Radeon RX 6800 XT:
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no

| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         pp512 |   1196.90 ± 1.27 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         tg128 |     60.28 ± 0.05 |
Compared to the CPU implementation with native vector instructions
enabled, prompt processing is >20x faster on the GPU and text generation
is 6x faster.
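The hipBLAS build used above was configured the same way as the CPU
builds, just with the ROCm backend turned on. The full command line
(including the flags that disable the host vector instructions for this
experiment) is in the attachment; the essence is roughly:

$ HIPCXX=clang++-17 cmake -S. -Bbuild -DGGML_HIPBLAS=ON \
      -DCMAKE_BUILD_TYPE=Release
$ make -j16 -C build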
Still, I know it's usable without the GPU. I set up llama-server and
spent hours chatting with it on an adventure through a fantasy world.
Only afterwards did I realize that I'd started the server without
assigning any layers to the GPU, so I must have been getting PP: 48 t/s
and TG: 9.7 t/s the whole time. It was kind of slow, but still enjoyable
and entirely usable.
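For the record, offloading the model to the GPU should just be a matter
of passing the GPU-layer option when starting the server; something
along these lines, with -ngl 99 matching the llama-bench runs above:

$ build/bin/llama-server --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf -ngl 99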
The full suite of benchmark data is attached.
Sincerely,
Cory Bloor
$ HIPCXX=clang++-17 cmake -S. -Bbuild -DGGML_HIPBLAS=ON -DGGML_NATIVE=OFF \
      -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF \
      -DGGML_AVX512=OFF -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF \
      -DGGML_AVX512_BF16=OFF -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors
------ Test 1 - Matrix Mult via F32 code
n_threads=16
m11: type = 0 ( f32) ne = 11008 x 4096 x 1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
m2: type = 0 ( f32) ne = 11008 x 128 x 1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
gf->nodes[0]: type = 0 ( f32) ne = 4096 x 128 x 1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00
------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about 11.54 gFLOPS
Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            198155;     58.25
        1;      16; 11008;  4096;   128;    11542724608;            196744;     58.67
        2;      16; 11008;  4096;   128;    11542724608;            196788;     58.66
        3;      16; 11008;  4096;   128;    11542724608;            197129;     58.55
        4;      16; 11008;  4096;   128;    11542724608;            197276;     58.51
        5;      16; 11008;  4096;   128;    11542724608;            196856;     58.64
        6;      16; 11008;  4096;   128;    11542724608;            196886;     58.63
        7;      16; 11008;  4096;   128;    11542724608;            196765;     58.66
        8;      16; 11008;  4096;   128;    11542724608;            196737;     58.67
        9;      16; 11008;  4096;   128;    11542724608;            196798;     58.65
Average                                                                          58.59
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no

| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         pp512 |   1196.90 ± 1.27 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CUDA       |  99 |         tg128 |     60.28 ± 0.05 |
build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_NATIVE=OFF \
      -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF \
      -DGGML_AVX512=OFF -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF \
      -DGGML_AVX512_BF16=OFF -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ update-alternatives --get-selections | grep libblas.so.3
libblas.so.3-x86_64-linux-gnu  auto  /usr/lib/x86_64-linux-gnu/openblas-openmp/libblas.so.3
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors
------ Test 1 - Matrix Mult via F32 code
n_threads=16
m11: type = 0 ( f32) ne = 11008 x 4096 x 1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
m2: type = 0 ( f32) ne = 11008 x 128 x 1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
gf->nodes[0]: type = 0 ( f32) ne = 4096 x 128 x 1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00
------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about 11.54 gFLOPS
Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            198664;     58.10
        1;      16; 11008;  4096;   128;    11542724608;            196818;     58.65
        2;      16; 11008;  4096;   128;    11542724608;            198156;     58.25
        3;      16; 11008;  4096;   128;    11542724608;            198221;     58.23
        4;      16; 11008;  4096;   128;    11542724608;            198144;     58.25
        5;      16; 11008;  4096;   128;    11542724608;            198221;     58.23
        6;      16; 11008;  4096;   128;    11542724608;            197440;     58.46
        7;      16; 11008;  4096;   128;    11542724608;            197713;     58.38
        8;      16; 11008;  4096;   128;    11542724608;            197042;     58.58
        9;      16; 11008;  4096;   128;    11542724608;            196785;     58.66
Average                                                                          58.38
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     52.51 ± 0.62 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.33 ± 0.02 |
build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_NATIVE=OFF \
      -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF \
      -DGGML_AVX512=OFF -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF \
      -DGGML_AVX512_BF16=OFF -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ update-alternatives --get-selections | grep libblas.so.3
libblas.so.3-x86_64-linux-gnu  auto  /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors
------ Test 1 - Matrix Mult via F32 code
n_threads=16
m11: type = 0 ( f32) ne = 11008 x 4096 x 1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
m2: type = 0 ( f32) ne = 11008 x 128 x 1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
gf->nodes[0]: type = 0 ( f32) ne = 4096 x 128 x 1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00
------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about 11.54 gFLOPS
Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            199061;     57.99
        1;      16; 11008;  4096;   128;    11542724608;            196941;     58.61
        2;      16; 11008;  4096;   128;    11542724608;            196986;     58.60
        3;      16; 11008;  4096;   128;    11542724608;            196851;     58.64
        4;      16; 11008;  4096;   128;    11542724608;            196756;     58.67
        5;      16; 11008;  4096;   128;    11542724608;            197119;     58.56
        6;      16; 11008;  4096;   128;    11542724608;            196825;     58.64
        7;      16; 11008;  4096;   128;    11542724608;            196788;     58.66
        8;      16; 11008;  4096;   128;    11542724608;            196762;     58.66
        9;      16; 11008;  4096;   128;    11542724608;            198143;     58.25
Average                                                                          58.53
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         pp512 |     54.64 ± 0.64 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | BLAS       |      16 |         tg128 |      3.34 ± 0.01 |
build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=OFF -DGGML_OPENMP=ON -DGGML_NATIVE=OFF \
      -DGGML_F16C=OFF -DGGML_FMA=OFF -DGGML_AVX=OFF -DGGML_AVX2=OFF \
      -DGGML_AVX512=OFF -DGGML_AVX512_VBMI=OFF -DGGML_AVX512_VNNI=OFF \
      -DGGML_AVX512_BF16=OFF -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors
------ Test 1 - Matrix Mult via F32 code
n_threads=16
m11: type = 0 ( f32) ne = 11008 x 4096 x 1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
m2: type = 0 ( f32) ne = 11008 x 128 x 1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
gf->nodes[0]: type = 0 ( f32) ne = 4096 x 128 x 1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00
------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about 11.54 gFLOPS
Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;            199019;     58.00
        1;      16; 11008;  4096;   128;    11542724608;            196736;     58.67
        2;      16; 11008;  4096;   128;    11542724608;            198137;     58.26
        3;      16; 11008;  4096;   128;    11542724608;            196764;     58.66
        4;      16; 11008;  4096;   128;    11542724608;            196758;     58.66
        5;      16; 11008;  4096;   128;    11542724608;            196747;     58.67
        6;      16; 11008;  4096;   128;    11542724608;            196750;     58.67
        7;      16; 11008;  4096;   128;    11542724608;            196704;     58.68
        8;      16; 11008;  4096;   128;    11542724608;            196738;     58.67
        9;      16; 11008;  4096;   128;    11542724608;            196737;     58.67
Average                                                                          58.56
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |      3.51 ± 0.00 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      3.34 ± 0.01 |
build: 9ef07800 (3267)
$ cmake -S. -Bbuild -DGGML_BLAS=OFF -DGGML_OPENMP=ON -DGGML_NATIVE=ON \
      -DCMAKE_BUILD_TYPE=Release
<...>
$ make -j16 -C build
<...>
$ build/bin/llama-bench-matmult --threads 16
main: build = 3267 (9ef07800)
main: built with cc (Debian 14.2.0-12) 14.2.0 for x86_64-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors
------ Test 1 - Matrix Mult via F32 code
n_threads=16
m11: type = 0 ( f32) ne = 11008 x 4096 x 1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
m2: type = 0 ( f32) ne = 11008 x 128 x 1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
gf->nodes[0]: type = 0 ( f32) ne = 4096 x 128 x 1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00
------ Test 2 - Matrix Mult via q4_1 code
n_threads=16
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about 11.54 gFLOPS
Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;      16; 11008;  4096;   128;    11542724608;             17804;    648.32
        1;      16; 11008;  4096;   128;    11542724608;             16828;    685.92
        2;      16; 11008;  4096;   128;    11542724608;             16840;    685.43
        3;      16; 11008;  4096;   128;    11542724608;             16425;    702.75
        4;      16; 11008;  4096;   128;    11542724608;             15795;    730.78
        5;      16; 11008;  4096;   128;    11542724608;             15766;    732.13
        6;      16; 11008;  4096;   128;    11542724608;             15780;    731.48
        7;      16; 11008;  4096;   128;    11542724608;             15789;    731.06
        8;      16; 11008;  4096;   128;    11542724608;             15812;    730.00
        9;      16; 11008;  4096;   128;    11542724608;             15771;    731.90
Average                                                                         710.98
=====================================================================================
$ build/bin/llama-bench --threads 16 --model ~/ws/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf

| model                          |       size |     params | backend    | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         pp512 |     48.63 ± 0.04 |
| llama 7B Q5_K - Medium         |   4.78 GiB |     7.24 B | CPU        |      16 |         tg128 |      9.73 ± 0.05 |
build: 9ef07800 (3267)