Replies: 2 comments
-
I'm not 100% sure I understand the situation, sorry if any of this misses the mark! Normally you'll only be in the RequiresInference state if you've prompted a token but not yet run inference. If you don't prompt that final token before deciding to cancel, you should be in the state you want (I think). If you're doing …
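Roughly, something like this (an untested sketch assuming the usual executor/conversation/sampler setup, not your exact code):

```csharp
while (true)
{
    // Run any queued inference.
    await executor.Infer();
    if (conversation.RequiresInference)
        continue;

    var token = sampler.Sample(executor.Context.NativeHandle, conversation.GetSampleIndex());

    // Decide about cancellation *before* prompting the sampled token.
    // Nothing has been prompted since the last Infer(), so the conversation
    // is not left in RequiresInference and can take a fresh prompt later.
    if (cancellationToken.IsCancellationRequested)
        break;

    conversation.Prompt(token);
}
```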
-
I'm sorry, maybe it's unclear without some code. The BatchedExecutor is in an async inference loop, which I want to be able to cancel. Here is the loop:

```csharp
while (true)
{
    // Run inference
    var decodeResult = await executor.Infer(cancellationToken).ConfigureAwait(false);
    if (decodeResult == DecodeResult.NoKvSlot)
    {
        throw new Exception("Out of memory");
    }

    // Check if inference needs to continue
    if (conversation.RequiresInference) continue;

    var token = sampler.Sample(executor.Context.NativeHandle, conversation.GetSampleIndex());
    if (token.IsEndOfGeneration(model.Vocab))
    {
        break;
    }

    decoder.Add(token);
    string decoded = decoder.Read();
    response.Content += decoded;
    yield return new InferenceResult(decoded, AuthorRole.Assistant);

    conversation.Prompt(token);
}
```

I cancel the loop using the cancellation token that is passed to the Infer function. However, even if I try to continue inference after cancellation, the BatchedExecutor never leaves the RequiresInference state:

```csharp
while (conversation.RequiresInference)
{
    await executor.Infer().ConfigureAwait(false);
}
```

If I understand the first part of your response correctly, you're suggesting I should cancel the loop like this?

```csharp
while (true)
{
    // Run inference
    var decodeResult = await executor.Infer().ConfigureAwait(false);
    if (decodeResult == DecodeResult.NoKvSlot)
    {
        throw new Exception("Out of memory");
    }

    // Check if inference needs to continue
    if (conversation.RequiresInference) continue;

    var token = sampler.Sample(executor.Context.NativeHandle, conversation.GetSampleIndex());
    if (token.IsEndOfGeneration(model.Vocab))
    {
        break;
    }

    decoder.Add(token);
    string decoded = decoder.Read();
    response.Content += decoded;
    yield return new InferenceResult(decoded, AuthorRole.Assistant);

    // Stop before prompting the sampled token, so the conversation stays promptable.
    if (cancellationToken.IsCancellationRequested) break;

    conversation.Prompt(token);
}
```

And then continue with the next prompt? I wouldn't be adding the last generated token to the context, and I'm unsure what sort of internal state the BatchedExecutor is in. It feels icky, and it is also unclear why I can't use the cancellation token with the Infer function.

And with forking the conversation: looking at the code, it will simply create a copy of the current conversation, right? It will also copy the _requiredEpoch variable, so the fork will fail when trying to prompt, since RequiresInference => _requiredEpoch > Executor.Epoch;
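To make the forking concern concrete, this is my reading of it (untested):

```csharp
// As I read it, the fork copies the parent's _requiredEpoch, so a fork taken
// while the parent is stuck in RequiresInference is stuck in it as well:
var fork = conversation.Fork();
if (fork.RequiresInference)
{
    // fork.Prompt(...) would fail here until another Infer() completes
}
```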
-
I am using the BatchedExecutor, and the workflow I'm trying to accomplish is: send prompt A, stream response A, possibly cancel response A partway through, then send prompt B.
The problem is that the state after cancellation is "Requires Inference [to finish Response A]" rather than "Ready for Prompt B".
I could dispose everything and rebuild the entire context before prompting B, but that would be slow and wasteful.
Another way would be to save the state after each generation. Then, after a cancellation, dispose the conversation and reload the saved state, so that only the previous prompt, its cancelled partial answer, and the new prompt have to be added before starting inference. That is better than the first approach, but it requires saving the state after every generation, which could potentially become slow and take up a lot of space.
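Roughly what I have in mind (the Save/Load method names here are from memory of the save-and-load example and may not match the current API exactly; statePath, partialAnswer and newPrompt are just placeholders):

```csharp
// After each *completed* generation, snapshot the conversation to disk.
conversation.Save(statePath);

// When a generation gets cancelled mid-response:
conversation.Dispose();
var restored = executor.Load(statePath);

// Re-prompt only what happened after the snapshot: the cancelled partial
// answer plus the new user prompt, then run inference as usual.
restored.Prompt(executor.Context.Tokenize(partialAnswer + newPrompt));
await executor.Infer();
```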
So my question is: is there a better way to get out of this RequiresInference state after cancellation that still allows me to add a new prompt?