Is it possible to do something similar in LLamaSharp now?
https://huggingface.co/blog/assisted-generation

Replies: 1 comment
This looks like a description of "speculative decoding"; there are a couple of llama.cpp examples implementing it here: https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative and https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative-simple. It's not currently supported at all in the high-level executors. It's probably possible to implement it using the BatchedExecutor (I sketched out a prototype a while ago, though I never quite got it working). It should definitely be possible to implement using the low-level/native API (we directly expose all the llama.cpp calls).
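To make the control flow concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is written against a hypothetical `ILogitModel` interface rather than any real LLamaSharp type; binding that interface to the BatchedExecutor or the native API is exactly the open part described above, so the interface, its `GreedyPredictions` method, and all names below are assumptions for illustration only.

```csharp
using System.Collections.Generic;

// Hypothetical model abstraction for illustration; NOT a LLamaSharp type.
// A real implementation would wrap a BatchedExecutor conversation or the
// low-level native API and return the argmax token at every position of
// one forward pass.
public interface ILogitModel
{
    // For each position i, the model's greedy prediction for the token
    // that should follow tokens[0..i], all computed in a single evaluation.
    IReadOnlyList<int> GreedyPredictions(IReadOnlyList<int> tokens);
}

public static class SpeculativeDecoder
{
    // Greedy speculative decoding: `draft` cheaply proposes `draftLength`
    // tokens per step, and `target` verifies the whole proposal in one
    // pass. Assumes a non-empty prompt.
    public static List<int> Generate(
        ILogitModel target,
        ILogitModel draft,
        IReadOnlyList<int> prompt,
        int maxNewTokens,
        int draftLength = 4)
    {
        var tokens = new List<int>(prompt);
        int generated = 0;

        while (generated < maxNewTokens)
        {
            int n = tokens.Count;

            // 1. Draft phase: run the small model autoregressively to
            //    propose a short continuation.
            var proposal = new List<int>(tokens);
            for (int i = 0; i < draftLength; i++)
            {
                var preds = draft.GreedyPredictions(proposal);
                proposal.Add(preds[preds.Count - 1]);
            }

            // 2. Verify phase: a single target pass scores every proposed
            //    position at once (this is where the speed-up comes from).
            var targetPreds = target.GreedyPredictions(proposal);

            // 3. Accept the longest prefix of draft tokens the target
            //    agrees with. targetPreds[n + k - 1] is the target's
            //    choice for the token at position n + k.
            int accepted = 0;
            while (accepted < draftLength &&
                   targetPreds[n + accepted - 1] == proposal[n + accepted])
            {
                accepted++;
            }

            // Keep the accepted draft tokens...
            for (int k = 0; k < accepted && generated < maxNewTokens; k++)
            {
                tokens.Add(proposal[n + k]);
                generated++;
            }

            // ...plus one "free" token from the target: its correction at
            // the first mismatch, or its continuation after a fully
            // accepted draft. This guarantees progress on every iteration.
            if (generated < maxNewTokens)
            {
                tokens.Add(targetPreds[n + accepted - 1]);
                generated++;
            }
        }

        return tokens;
    }
}
```

This sketch is greedy-only for clarity. The full sampling-based variant of speculative decoding accepts draft tokens probabilistically by comparing the two models' distributions, which requires access to the raw logits rather than just the argmax token, i.e. working at the level the low-level/native API exposes.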