Is it possible to do something similar in LLamaSharp now?
https://huggingface.co/blog/assisted-generation

Replies: 1 comment
This looks like a description of "speculative decoding"; there are a couple of llama.cpp examples implementing it here: https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative and https://github.com/ggml-org/llama.cpp/tree/master/examples/speculative-simple. It's not currently supported at all in the high-level executors. It's probably possible to implement it using the BatchedExecutor (I sketched out a prototype a while ago, though I never quite got it working). It should definitely be possible to implement using the low-level/native API (we directly expose all the llama.cpp calls).
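To make the control flow concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is written against a hypothetical `ILogitModel` interface rather than any real LLamaSharp type; binding that interface to the BatchedExecutor or the native API is exactly the open part described above, so the interface, its `GreedyPredictions` method, and all names below are assumptions for illustration only.

```csharp
using System.Collections.Generic;

// Hypothetical model abstraction for illustration; NOT a LLamaSharp type.
// A real implementation would wrap a BatchedExecutor conversation or the
// low-level native API and return the argmax token at every position of
// one forward pass.
public interface ILogitModel
{
    // For each position i, the model's greedy prediction for the token
    // that should follow tokens[0..i], all computed in a single evaluation.
    IReadOnlyList<int> GreedyPredictions(IReadOnlyList<int> tokens);
}

public static class SpeculativeDecoder
{
    // Greedy speculative decoding: `draft` cheaply proposes `draftLength`
    // tokens per step, and `target` verifies the whole proposal in one
    // pass. Assumes a non-empty prompt.
    public static List<int> Generate(
        ILogitModel target,
        ILogitModel draft,
        IReadOnlyList<int> prompt,
        int maxNewTokens,
        int draftLength = 4)
    {
        var tokens = new List<int>(prompt);
        int generated = 0;

        while (generated < maxNewTokens)
        {
            int n = tokens.Count;

            // 1. Draft phase: run the small model autoregressively to
            //    propose a short continuation.
            var proposal = new List<int>(tokens);
            for (int i = 0; i < draftLength; i++)
            {
                var preds = draft.GreedyPredictions(proposal);
                proposal.Add(preds[preds.Count - 1]);
            }

            // 2. Verify phase: a single target pass scores every proposed
            //    position at once (this is where the speed-up comes from).
            var targetPreds = target.GreedyPredictions(proposal);

            // 3. Accept the longest prefix of draft tokens the target
            //    agrees with. targetPreds[n + k - 1] is the target's
            //    choice for the token at position n + k.
            int accepted = 0;
            while (accepted < draftLength &&
                   targetPreds[n + accepted - 1] == proposal[n + accepted])
            {
                accepted++;
            }

            // Keep the accepted draft tokens...
            for (int k = 0; k < accepted && generated < maxNewTokens; k++)
            {
                tokens.Add(proposal[n + k]);
                generated++;
            }

            // ...plus one "free" token from the target: its correction at
            // the first mismatch, or its continuation after a fully
            // accepted draft. This guarantees progress on every iteration.
            if (generated < maxNewTokens)
            {
                tokens.Add(targetPreds[n + accepted - 1]);
                generated++;
            }
        }

        return tokens;
    }
}
```

This sketch is greedy-only for clarity. The full sampling-based variant of speculative decoding accepts draft tokens probabilistically by comparing the two models' distributions, which requires access to the raw logits rather than just the argmax token, i.e. working at the level the low-level/native API exposes.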