Performance of llama.cpp with Vulkan #10879
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup:
build: 4da69d1 (4351)
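For reproducibility, the stock packages can be pulled in roughly like this (a sketch; the exact package set is my assumption of what "official APT packages" means here):

```sh
# Vulkan loader, headers, tools, and the shader compiler needed by the build.
sudo apt install libvulkan-dev vulkan-tools glslc
# CUDA toolkit for the comparison build.
sudo apt install nvidia-cuda-toolkit
```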
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require `-DCMAKE_BUILD_TYPE=Release`.
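For reference, a release build with the Vulkan backend looks something like this (a minimal sketch; the Vulkan flag is standard, the rest is boilerplate):

```sh
# Debug builds can easily halve tg numbers; force an optimized build.
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```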
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3 GB GDDR5.
build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run every bit from a script and also build every target from scratch. Each run is wrapped to pause everything else:

```sh
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
```
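A full wrapper along those lines might look like this (a sketch under my own assumptions about build flags and model path, not the poster's exact script):

```sh
#!/bin/sh
# Rebuild the target from scratch each time.
rm -rf build
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Pause every other process we can signal, run the bench under a hard
# timeout, then resume them (on Linux, kill -1 spares the calling process).
kill -STOP -1
timeout 240s ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 || true
kill -CONT -1
```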
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This build seems to underutilise both GPU and CPU in real conditions.
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan

Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)

Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)

Layer split
build: 2739a71 (4461)

Row split
build: 2739a71 (4461)

Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
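For anyone reproducing the multi-GPU numbers, llama-bench exposes the split strategy via -sm (model path assumed):

```sh
# Layer split (default): whole layers are distributed across GPUs.
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer

# Row split: weight matrices are sharded row-wise across GPUs
# (available on ROCm/CUDA; the Vulkan backend lacked it at this point).
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row
```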
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it could be interesting to add the flash attention results to the scoreboard (even if its support still isn't as mature as CUDA's).
-
I tried, but there was nothing after an hour... OK, maybe 40 minutes. Anyway, I ran llama-cli for a sample eval.
Meanwhile, OpenBLAS:
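For comparison runs like that, the OpenBLAS CPU path is enabled at configure time; a typical build (a sketch, model path assumed) is:

```sh
# Build the CPU backend against OpenBLAS for faster prompt processing.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build -j
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf
```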
-
To get an idea of the possible performance, I ran some more tests comparing backends. V1/V2 are a WIP test backend I'm creating with HIP for the RDNA3 iGPU, and for now they only compute the matmul (in BF16); the CPU results use BF16 too, while Vulkan uses FP16. Run on an up-to-date Fedora 41, on a Ryzen 9 7940HS (with Radeon 780M iGPU).
Llama-3.2-1B-Instruct/BF16.gguf
Llama-3.2-3B-Instruct
Meta-Llama-3.1-8B-Instruct
Mistral-Nemo-Instruct-2407
Mistral-Small-24B-Instruct-2501
As you can see, for now the Vulkan backend doesn't like big FP16 models (I need to make some OS changes to bench Mistral-Small on Vulkan...).
-
Radeon RX 9070 XT on Arch w/
build: d84635b (4920)
-
5700G, gfx90c, 8 CU, 2x32GB@3200
ggml_vulkan: Found 1 Vulkan devices:
build: d84635b (4920)
CPU results for reference:
build: d84635b (4920)
55% speedup for pp512 and lower power usage. ROCm v5.7 results for reference:
build: 8ba95dc (4896)
-
Also since I have it around - a laptop, Ryzen 7 7730U w/ Vega 8 iGPU:
build: d84635b (4920)
-
5800H, 2x16GB@3200, STAPM limit 80, basically the laptop version of the 5700G. To skip the dGPU:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
The dGPU, an RTX 3060 laptop max-q 80W, 6GB VRAM, Driver Version: 550.120:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
18% below the desktop 3060 in pp512, and with flash attention it's somehow much slower [edit: I later realized that only the beta, yet-to-be-released drivers v575 and up support coopmat2, and I tested with v550, which is limited to KHR_coopmat]:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
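One way to pin the bench to the iGPU (my assumption about how the dGPU was skipped, not a quote from the poster): the Vulkan backend honours GGML_VK_VISIBLE_DEVICES, so if the iGPU enumerates as device 0:

```sh
# Device index 0 is an assumption; check the ggml_vulkan device listing first.
GGML_VK_VISIBLE_DEVICES=0 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```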
-
AMD Ryzen 5 5600H
ggml_vulkan: Found 1 Vulkan devices:
build: 0bb2919 (4991)
-
All tested on Windows 11, A770 LE 16G, driver 32.0.101.6653.
.\source\repos\llama-cpp-vulkan> .\llama-bench.exe -m .\llama-2-7b.Q4_0.gguf -ngl 100
build: a8a1f33 (5010)
.\source\repos\llama-cpp-ipx> .\llama-bench.exe -m ..\llama-cpp-vulkan\llama-2-7b.Q4_0.gguf -ngl 100
build: 4cfa0b8 (1)
-
Cross-posted from the Mac thread: Mac Pro 2013 🗑️, 12-core Xeon E5-2697 v2, dual FirePro D700, 64 GB RAM, macOS Monterey.
Note: I've updated this post. When I posted the first time I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and checked that the models were working correctly before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s rates across the board. Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md
./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null
build: d3bd719 (5092)
The FP16 model was throwing garbage so I did not include it here; it will require some unique flags to run correctly. Additionally, here are the 8- and 4-bit Llama 2 7B runs on the CPU alone (using the -ngl 0 flag):
./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null
build: d3bd719 (5092)
-
AMD Radeon RX 6600M 8GB in a Mini PC (HX99G)
build: fe5b78c (5097)
-
Here are updated benchmarks from my hardware with the new integer dot extension for improved prompt processing speeds:
And the RTX 3090 with coopmat2 (which does not use integer dot):
-
I'm soliciting some specific Vulkan perf data in #12976. I'd appreciate any help.
-
ggml_vulkan: Found 2 Vulkan devices:
build: 7538246 (5083)
ggml_vulkan: Found 1 Vulkan devices:
build: 7538246 (5083)
-
Here are some results with the Vulkan backend running on Steam Deck:
ggml_vulkan: Found 1 Vulkan devices:
build: 5368ddd (5164)
-
RTX 5060 Ti 16GB, Driver Version: 575.51.02, CUDA Version: 12.9
ggml_vulkan: Found 1 Vulkan devices:
build: 658987c (5170)
w/ Flash Attention:
build: 658987c (5170)
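For reference, the flash attention runs in llama-bench are toggled with the -fa flag (model path assumed):

```sh
# -fa 1 enables flash attention; compare against the default -fa 0 run.
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1
```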
-
M3 Ultra (Mac Studio 2025), 24P+8E CPU cores, 80-core GPU, with Vulkan:
ggml_vulkan: Found 1 Vulkan devices:
build: 2d451c8 (5195)
Non-BLAS:
build: 2d451c8 (5195)
For comparison, Metal on the same machine:
build: 2d451c8 (5195)
It is interesting that TG in Vulkan is faster than Metal; faster PP in Metal is as expected.
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same.
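A typical sequence for the commands mentioned above (a sketch of the standard Vulkan build; see the repo's build docs for platform specifics):

```sh
# Build llama.cpp with the Vulkan backend.
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Benchmark Llama 2 7B Q4_0, fully offloaded.
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```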
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)
Currently FA only works properly with coopmat2.