What kind of documentation are you thinking about? What the flag does is very straightforward: it lets you choose where to store the tensors of a model. I don't think this needs a lot of explanation. Beyond that, my view is that it is up to the community to find interesting ways to take advantage of it. Of course, that requires you to understand what tensors are in a model and where they may be stored. This is inherently a very low-level feature and most people probably shouldn't be using it, but it might be nice to have some documentation about typical use cases, and perhaps in the future we can add simpler flags to enable them once it is clear what the applications are. I am not aware of any case where using this flag causes the model to produce incorrect results. That should not happen; if you find a case, open an issue about it.
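For reference, a minimal sketch of a typical invocation (the `ffn_.*_exps` pattern, the `CPU` buffer name, and the model filename here are illustrative and depend on your model and backend; check `--help` on your build for the exact syntax):

```sh
# Offload everything with -ngl, then override the tensors whose names match
# the regex so they are stored in the CPU buffer instead of on the GPU.
./llama-server -m model.gguf -ngl 99 --override-tensor "ffn_.*_exps=CPU"
```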
llama-server / llama-cli, build 5184 (still applies as of the latest build, b5205)
The override-tensor flag has become increasingly popular as a cost-effective way to run Mixture-of-Experts models with single-user hybrid CPU/GPU inference, particularly DeepSeek-style models with a shared expert that can very reliably be pinned to the GPU for a low VRAM overhead.
As an example, on a Ryzen 9950X (192GB 4400MHz RAM) with an RTX 4000 SFF and an RTX 2000 Ada, I'm able to run Maverick at anywhere between a q4 and a q6 quant at roughly 10 tokens per second (around 33 t/s prompt processing with `-ubatch-size 4`), by explicitly leaving all conditional experts on the CPU (a sketch of such an invocation follows the PR link below). If I wanted to run a model that feels roughly as intelligent, I'd be looking at something like a 70B Llama 3.3 finetune, which runs at about 1.7 tokens per second even with an optimized speculative decoding setup. I suspect this approach may even become something of a meta, similar to how the P40 shaped early consumer LLM deployments, when hobbyists would pick up four P40s for less than the price of a single 3090.

A major issue at the moment is that the flag is essentially undocumented; the only real documentation is the discussion around the PR that introduced the feature:
#11397
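For concreteness, here is a hedged sketch of the kind of command I mean. The tensor-name pattern (`ffn_.*_exps` for the routed/conditional experts), the buffer name (`CPU`), and the model filename are assumptions to verify against the GGUF tensor names of your particular model and against `--help` on your build:

```sh
# Offload all layers to GPU by default (-ngl 99), then override the routed
# ("conditional") expert tensors so they stay in the CPU buffer; attention,
# norm, and shared-expert tensors remain on the GPUs.
./llama-server -m Llama-4-Maverick-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps=CPU" \
  --ubatch-size 4
```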
I'd like to put forward the argument that, given the release of recent high-profile MoE models such as Scout and Maverick (which will likely become more of a platform than a pair of models, much as Llama 3.x became a platform for finetunes), the Ling MoE, and presumably the Qwen 3 MoE series that recently leaked slightly ahead of launch, we're going to see many more cases where somebody wants to be very intentional about the placement of tensors on their available hardware, and understanding how `-ot` works is a major part of that strategy.

To that end, I'm hoping to foster some discussion about this in this thread. I'm not sure it's appropriate to file this as a formal issue, so I've started with a discussion instead.
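To seed that discussion, here is one more hedged sketch of the kind of deliberate placement I mean, this time across two GPUs. The layer ranges, regexes, and device names (`CUDA0`, `CUDA1`) are illustrative assumptions, whether `-ot` can be repeated (and which rule wins when several match) is something to verify on your build, and dumping the tensor names first (for example with the `gguf-dump` script from the repo's gguf-py package, if you have it installed) is the safest way to build the patterns:

```sh
# Inspect the tensor names of the model first (illustrative; any GGUF viewer works).
gguf-dump Llama-4-Maverick-Q4_K_M.gguf | less

# Pin the routed experts of the first 16 blocks to the second GPU and leave the
# remaining routed experts on the CPU; everything else follows -ngl as usual.
# (Precedence when multiple patterns match the same tensor should be verified.)
./llama-server -m Llama-4-Maverick-Q4_K_M.gguf -ngl 99 \
  --override-tensor "blk\.([0-9]|1[0-5])\.ffn_.*_exps=CUDA1" \
  --override-tensor "ffn_.*_exps=CPU"
```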
Additionally: prior to the fixes for the Llama 4 series there was another major issue with `-ot`: it broke multi-GPU setups, leaving me in a strange situation where my primary GPU would show 19.5GB of usage while my secondary would only show... 5GB (possibly it was only being used for prompt processing?). Recent updates have fixed this, but with no documentation on the flag's behaviour it is difficult to troubleshoot issues like this, and I still have no idea which change actually fixed multi-GPU usage. Leaving edge-case functionality to luck is not a great situation for end users, so I kindly hope that contributors will pitch in to this discussion.