I'm totally new to using AI, and recently bought an RTX 4060 as a secondary GPU for stuff like this #1142

Answered by martindevans
Deus-nsf asked this question in Q&A

llama.cpp can do partial offloading, where some of the model runs on the GPU and the rest runs on the CPU. If you've got 4-5 GB of VRAM, then a smallish model (8B or less) at 4-bit quantisation should fit, as long as you keep the context size reasonably small.
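For illustration, here is a minimal sketch of what that looks like with the llama-cpp-python bindings (one common way to drive llama.cpp; LLamaSharp exposes a similar setting as `GpuLayerCount`). The model path, layer count, and context size below are placeholder assumptions to adapt to your own setup:

```python
# Sketch only: partial offloading via llama-cpp-python.
# The model path and numbers are illustrative assumptions; tune n_gpu_layers
# so that VRAM usage stays within your card's limit.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # ~4-5 GB at 4-bit quantisation
    n_gpu_layers=20,   # offload this many layers to the GPU; the rest run on the CPU
    n_ctx=2048,        # keep the context small to limit the KV cache's memory use
)

out = llm("Explain partial offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

If generation fails with an out-of-memory error, lower `n_gpu_layers`; if you still have VRAM headroom, raise it (or set it to -1 to offload every layer).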

Answer selected by Deus-nsf