I'm totally new to using AI, and recently bought an RTX 4060 as a secondary GPU for stuff like this #1142
-
I was curious: since the card is already handling quite a bit of Windows stuff, I have around 4-5GB of VRAM remaining. Is that enough to run the model and have a good experience, or would you recommend having more VRAM? I want to use the model to accelerate .NET app development with Visual Studio. Of course I'd like to avoid running the model on the CPU as much as possible, that goes without saying.
-
llama.cpp can do partial offloading, where some of the model runs on the GPU and the rest runs on the CPU. If you've got 4-5GB of VRAM then a smallish model (8B or less) at 4-bit quantisation should fit, as long as you keep the context size reasonably small.
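
As a rough sketch, partial offloading with llama.cpp's `llama-cli` looks something like the command below. The model filename and the layer count are placeholders you'd tune to your own setup; the idea is just that `-ngl` controls how many layers go to the GPU and everything else stays on the CPU.

```bash
# Hypothetical example: offload as many layers as fit in ~4 GB of VRAM,
# and let the remaining layers run on the CPU.
#   -m    path to a quantised GGUF model (placeholder filename)
#   -ngl  number of layers to offload to the GPU
#   -c    context size in tokens (keep it small to save VRAM)
./llama-cli -m ./models/llama-3.1-8b-instruct-Q4_K_M.gguf -ngl 20 -c 4096 -p "Hello"
```

If a given `-ngl` value runs out of GPU memory, lower it until the model loads cleanly, then nudge it back up to find the most you can offload.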