Performance of llama.cpp with Vulkan #10879
-
AMD FirePro W8100
-
AMD RX 470
-
Ubuntu 24.04, Vulkan and CUDA installed from official APT packages.
build: 4da69d1 (4351)
vs CUDA on the same build/setup:
build: 4da69d1 (4351)
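For reproducibility, the stock packages can be pulled in roughly like this (a sketch; the exact package set is my assumption of what "official APT packages" means here):

```sh
# Vulkan loader, headers, tools, and the shader compiler needed by the build.
sudo apt install libvulkan-dev vulkan-tools glslc
# CUDA toolkit for the comparison build.
sudo apt install nvidia-cuda-toolkit
```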
-
MacBook Air M2 on Asahi Linux
ggml_vulkan: Found 1 Vulkan devices:
-
Gentoo Linux on ROG Ally (2023) Ryzen Z1 Extreme
ggml_vulkan: Found 1 Vulkan devices:
-
ggml_vulkan: Found 4 Vulkan devices:
-
build: 0d52a69 (4439)
NVIDIA GeForce RTX 3090 (NVIDIA)
AMD Radeon RX 6800 XT (RADV NAVI21) (radv)
AMD Radeon (TM) Pro VII (RADV VEGA20) (radv)
Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver)
-
@netrunnereve Some of the tg results here are a little low; I think they might be debug builds. The cmake step (at least on Linux) might require `-DCMAKE_BUILD_TYPE=Release`.
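For reference, a release build with the Vulkan backend looks something like this (a minimal sketch; the Vulkan flag is standard, the rest is boilerplate):

```sh
# Debug builds can easily halve tg numbers; force an optimized build.
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j
```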
-
Build: 8d59d91 (4450)
Lack of proper Xe coopmat support in the ANV driver is a setback honestly.
edit: retested both with the default batch size.
-
Here's something exotic: an AMD FirePro S10000 dual GPU from 2012 with 2x 3 GB GDDR5.
build: 914a82d (4452)
-
Latest Arch. For the sake of consistency I run every bit from a script and also build every target from scratch. Each run is wrapped to pause everything else:

```sh
kill -STOP -1
timeout 240s $COMMAND
kill -CONT -1
```
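A full wrapper along those lines might look like this (a sketch under my own assumptions about build flags and model path, not the poster's exact script):

```sh
#!/bin/sh
# Rebuild the target from scratch each time.
rm -rf build
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Pause every other process we can signal, run the bench under a hard
# timeout, then resume them (on Linux, kill -1 spares the calling process).
kill -STOP -1
timeout 240s ./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 || true
kill -CONT -1
```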
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics (TGL GT2) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | matrix cores: none
build: ff3fcab (4459)
This build seems to underutilise both GPU and CPU in real conditions.
-
Intel ARC A770 on Windows:
build: ba8a1f9 (4460)
-
Single GPU Vulkan

Radeon Instinct MI25
ggml_vulkan: 0 = AMD Radeon Instinct MI25 (RADV VEGA10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Multi GPU Vulkan
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)
ggml_vulkan: 0 = AMD Radeon Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
build: 2739a71 (4461)

Single GPU ROCm
Device 0: AMD Radeon Instinct MI25, compute capability 9.0, VMM: no
build: 2739a71 (4461)
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)

Multi GPU ROCm
Device 0: AMD Radeon Pro VII, compute capability 9.0, VMM: no
build: 2739a71 (4461)

Layer split
build: 2739a71 (4461)

Row split
build: 2739a71 (4461)

Single GPU speed is decent, but multi GPU trails ROCm by a wide margin, especially with large models, due to the lack of row split.
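For anyone reproducing the multi-GPU numbers, llama-bench exposes the split strategy via -sm (model path assumed):

```sh
# Layer split (default): whole layers are distributed across GPUs.
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm layer

# Row split: weight matrices are sharded row-wise across GPUs
# (available on ROCm/CUDA; the Vulkan backend lacked it at this point).
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -sm row
```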
-
AMD Radeon RX 5700 XT on Arch using mesa-git and setting a higher GPU power limit compared to the stock card.
I also think it could be interesting to add the flash attention results to the scoreboard (even if its support still isn't as mature as CUDA's).
-
I tried, but there was nothing after an hour... OK, maybe 40 minutes. Anyway, I ran llama-cli for a sample eval.
Meanwhile, OpenBLAS:
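For comparison runs like that, the OpenBLAS CPU path is enabled at configure time; a typical build (a sketch, model path assumed) is:

```sh
# Build the CPU backend against OpenBLAS for faster prompt processing.
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build -j
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf
```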
-
To get an idea of the possible performance, I ran some more tests comparing backends. V1/V2 are a WIP test backend I'm creating with HIP for the RDNA3 iGPU, and for now they only compute the matmul (in BF16); the CPU results use BF16 too, while Vulkan uses FP16. Run on an up-to-date Fedora 41, on a Ryzen 9 7940HS (with Radeon 780M iGPU).
Llama-3.2-1B-Instruct/BF16.gguf
Llama-3.2-3B-Instruct
Meta-Llama-3.1-8B-Instruct
Mistral-Nemo-Instruct-2407
Mistral-Small-24B-Instruct-2501
As you can see, for now the Vulkan backend doesn't like big FP16 models (I need to make some OS changes to bench Mistral-Small on Vulkan...).
-
Radeon RX 9070 XT on Arch w/
build: d84635b (4920)
-
5700G, gfx90c, 8 CU, 2x32GB@3200
ggml_vulkan: Found 1 Vulkan devices:
build: d84635b (4920)
CPU results for reference:
build: d84635b (4920)
55% speedup for pp512 and lower power usage. ROCm v5.7 results for reference:
build: 8ba95dc (4896)
-
Also since I have it around - a laptop, Ryzen 7 7730U w/ Vega 8 iGPU:
build: d84635b (4920)
-
5800H, 2x16GB@3200, STAPM limit 80, basically the laptop version of the 5700G. To skip the dGPU:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
The dGPU, an RTX 3060 laptop max-q 80W, 6GB VRAM, Driver Version: 550.120:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
18% below the desktop 3060 in pp512, and with flash attention it's somehow much slower [edit: I later realized that only the beta, yet-to-be-released drivers v575 and up support coopmat2, and I tested with v550, which is limited to KHR_coopmat]:
ggml_vulkan: Found 1 Vulkan devices:
build: dbb3a47 (4930)
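One way to pin the bench to the iGPU (my assumption about how the dGPU was skipped, not a quote from the poster): the Vulkan backend honours GGML_VK_VISIBLE_DEVICES, so if the iGPU enumerates as device 0:

```sh
# Device index 0 is an assumption; check the ggml_vulkan device listing first.
GGML_VK_VISIBLE_DEVICES=0 ./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```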
-
AMD Ryzen 5 5600H
ggml_vulkan: Found 1 Vulkan devices:
build: 0bb2919 (4991)
-
All tested on Windows 11, A770 LE 16G, driver 32.0.101.6653.
.\source\repos\llama-cpp-vulkan> .\llama-bench.exe -m .\llama-2-7b.Q4_0.gguf -ngl 100
build: a8a1f33 (5010)
.\source\repos\llama-cpp-ipx> .\llama-bench.exe -m ..\llama-cpp-vulkan\llama-2-7b.Q4_0.gguf -ngl 100
build: 4cfa0b8 (1)
-
Cross-posted from the Mac thread: Mac Pro 2013 🗑️, 12-core Xeon E5-2697 v2, dual FirePro D700, 64 GB RAM, macOS Monterey.
Note: I've updated this post. When I posted the first time I was so excited to see the GPUs doing stuff that I didn't check whether they were working right. Turns out they were not! So I recompiled MoltenVK and llama.cpp with some tweaks and checked that the models were working correctly before re-benchmarking. When the system was spitting garbage it was running about 30% higher t/s rates across the board. Full HOWTO on getting the Mac Pro D700s to accept layers here: https://github.com/lukewp/TrashCanLLM/blob/main/README.md
./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 99 2> /dev/null
build: d3bd719 (5092)
The FP16 model was throwing garbage so I did not include it here; it will require some unique flags to run correctly. Additionally, here are the 8- and 4-bit Llama 2 7B runs on the CPU alone (using the -ngl 0 flag):
./build/bin/llama-bench -m ../llm-models/llama2-7b-chat-q8_0.gguf -m ../llm-models/llama-2-7b-chat.Q4_0.gguf -p 512 -n 128 -ngl 0 2> /dev/null
build: d3bd719 (5092)
-
AMD Radeon RX 6600M 8GB in a Mini PC (HX99G)
build: fe5b78c (5097)
-
Here are updated benchmarks from my hardware with the new integer dot extension for improved prompt processing speeds:
And the RTX 3090 with coopmat2 (which does not use integer dot):
-
I'm soliciting some specific Vulkan perf data in #12976. I'd appreciate any help.
-
ggml_vulkan: Found 2 Vulkan devices:
build: 7538246 (5083)
ggml_vulkan: Found 1 Vulkan devices:
build: 7538246 (5083)
-
Here are some results with the Vulkan backend running on Steam Deck:
ggml_vulkan: Found 1 Vulkan devices:
build: 5368ddd (5164)
-
RTX 5060 Ti 16GB, Driver Version: 575.51.02, CUDA Version: 12.9
ggml_vulkan: Found 1 Vulkan devices:
build: 658987c (5170)
w/ Flash Attention:
build: 658987c (5170)
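For reference, the flash attention runs in llama-bench are toggled with the -fa flag (model path assumed):

```sh
# -fa 1 enables flash attention; compare against the default -fa 0 run.
./llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99 -fa 1
```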
-
M3 Ultra (Mac Studio 2025), 24P+8E CPU cores, 80-core GPU, with Vulkan:
ggml_vulkan: Found 1 Vulkan devices:
build: 2d451c8 (5195)
Non-BLAS:
build: 2d451c8 (5195)
For comparison, Metal on the same machine:
build: 2d451c8 (5195)
It is interesting that TG in Vulkan is faster than Metal; faster PP in Metal is as expected.
-
This is similar to the Apple Silicon benchmark thread, but for Vulkan! Many improvements have been made to the Vulkan backend and I think it's good to consolidate and discuss our results here.
We'll be testing the Llama 2 7B model like the other thread to keep things consistent, and use Q4_0 as it's simple to compute and small enough to fit on a 4GB GPU. You can download it here.
Instructions
Either run the commands below or download one of our Vulkan releases.
Share your llama-bench results along with the git hash and Vulkan info string in the comments. Feel free to try other models and compare backends, but only valid runs will be placed on the scoreboard.
If multiple entries are posted for the same device, newer commits with substantial Vulkan updates are prioritized; otherwise the one with the highest tg128 score will be used. Performance may vary depending on driver, operating system, board manufacturer, etc., even if the chip is the same.
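A typical sequence for the commands mentioned above (a sketch of the standard Vulkan build; see the repo's build docs for platform specifics):

```sh
# Build llama.cpp with the Vulkan backend.
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Benchmark Llama 2 7B Q4_0, fully offloaded.
./build/bin/llama-bench -m llama-2-7b.Q4_0.gguf -ngl 99
```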
Vulkan Scoreboard for Llama 2 7B, Q4_0 (no FA)
Vulkan Scoreboard for Llama 2 7B, Q4_0 (with FA)
Currently FA only works properly with coopmat2.