Skip to content

fix: broken invariants in agent-loop, "400 No tool output found" #490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jonchurch
Copy link
Contributor

@jonchurch jonchurch commented Apr 21, 2025

I'm not certain this fully fixes the issue, but these changes address several clear problems in the current cancellation flow

This is a draft because I think we can do better still, and I welcome pushes to this PR or publishing if yall wanna run with it.

There is an issue of broken invariants, where we are running logic such as flush even though the code as written expects it will not run after cancel.

related to #114 aka 400 No tool output found for function call.

Problem 1: Abort Controller is null

Upon cancel, we attempt to abort the stream via abortController.

After #178 we are nulling out the handle to the stream BEFORE we call the abort method on its controller. TS assertions hide that fact from the compiler, giving us a silent no-op instead of an abort:

// Reset the current stream to allow new requests
this.currentStream = null;
log(
`AgentLoop.cancel() invoked – currentStream=${Boolean(
this.currentStream,
)} execAbortController=${Boolean(this.execAbortController)} generation=${
this.generation
}`,
);
(
this.currentStream as { controller?: { abort?: () => void } } | null
)?.controller?.abort?.();

This means cancel will not abort the stream, and we continue processing events.

Problem 2: Broken control flow expectations

Problem 1 is an issue, but not one that should lead to the 400 No tool output found for function call.

The agent loop is written in such a way that it appears to expect to break out of the async iterator loop when the stream is aborted. That never actually happens even after fixing Problem 1 and you can confirm that w/ a console.log in the catch handler.

} catch (err: unknown) {
// Gracefully handle an abort triggered via `cancel()` so that the
// consumer does not see an unhandled exception.
if (err instanceof Error && err.name === "AbortError") {
if (!this.canceled) {
// It was aborted for some other reason; surface the error.
throw err;
}
this.onLoading(false);
return;
}

That is the broken invariant. When we do not break out of the loop on abort, we continue processing events, and our cleanup code later runs.

"Fix Problem 1, and Problem 2 heals itself then" you may say. Under testing, it does not! I have not observed the catch executing in response to a proper abort.

The implementation of the Stream in the openai sdk does not seem to bubble up aborts, but will instead return silently. (assuming that i've located the codepath correctly here) https://github.com/openai/openai-node/blob/2785c1186b528e4ab3a2a7c9282e041aaa4c13f6/src/streaming.ts#L82-L85

Problem 3 -- Inconsistent State Guards

There are inconsistent canceled state guards in the event processing and leads to state drift when we realize that:

  1. Problem 1 meant we continue to process the stream
  2. Problem 2 means that we will run cleanup code flush even though we may have this.canceled === true impacting the processing of events.

With probelm 2 being the broken invariant, the rest of the flow gets unpredictable. Most places are guarded, like

private async processEventsWithoutStreaming(
output: Array<ResponseInputItem>,
emitItem: (item: ResponseItem) => void,
): Promise<Array<ResponseInputItem>> {
// If the agent has been canceled we should short‑circuit immediately to
// avoid any further processing (including potentially expensive tool
// calls). Returning an empty array ensures the main run‑loop terminates
// promptly.
if (this.canceled) {
return [];

But we will still increment the lastReponseId even when we are in a cancel situation, and stopped processing function calls.

I focused on the flush code. It will unconditionally clear the pendingAborts array, despite it expecting that we are in a cancel free state. I didn't trace the exact specifics to limit my brain running out of context space for while debugging these async flows.

Solution

Well, I ran out of time to really come up with a bullet proof solution. Testing this is difficult, as it relies on timing. Asking to prefix tool runs w/ sleep can help a bit, but I had a harder time reproing today vs last week.

For now, ensuring Abort actually is triggered, and guarding the flush is an improvement

Ideally, we can get the throw to happen on an abort, or otherwise return early from the processing. That's the smallest surgical fix until this logic can be reworked to be entirely signal aware. I propose that in the near future instead of a canceled flag, we observe the AbortSignal as the single source of truth for cancellation state and esnure it propogates consistently through all operations.

bugbears

Stream processing can be finicky, as you may still receive buffered events after a stream has been canceled. Idk how this maps exactly to async generators

Once the abort is successful, assuming we have no more events that leak out, we can't easily break from the loop and return early. I wasn't able to consistently repro today getting another event even when we are in the canceled state, but throwing a manually constructed AbortError if we process an event during this.canceled === true would fix the control flow.

p.s. I include this note whenever I’ve spent significant time on a contribution to a project I don’t maintain. It’s part disclosure, part invitation. ❤️

this investigation took a few days of work to fully unwind and validate. working on open source full-time has given me the freedom to do the kind of deep, slow, careful work that rarely fits into a sprint. that freedom made this fix possible.

if this kind of work resonates with you, or you want to connect about open source systems work more broadly, feel free to reach out.

@jonchurch jonchurch marked this pull request as draft April 21, 2025 22:00
@jonchurch jonchurch changed the title fix: broken invariants in agent-loop fix: broken invariants in agent-loop, "400 No tool output found" Apr 21, 2025
@jonchurch
Copy link
Contributor Author

Im east coast and taking off for the day, will circle back tmr

@jonchurch jonchurch force-pushed the cancel-invariant-fix branch from 64dde7f to 4f200f2 Compare April 21, 2025 22:31
@tibo-openai
Copy link
Collaborator

Hey @jonchurch, thank you for the incredible sleuthing here, lots of good insights here with subtle issues in the current logical flow. I'm excited to see if we can land this, can I help in some way?

@jonchurch
Copy link
Contributor Author

jonchurch commented Apr 25, 2025

Hey @tibo-openai, after spending more time analyzing the agent loop, I'm confident in the surgical fix this diff provides for the 400 "No function output" issues.

However, fixing the stream abort reveals a hidden trade off in the current implementation.

With this diff, we could potentially lose conversation state if a user cancels midstream (before the response.completed event is processed).

Here's why: the current code never truly aborts the stream upon cancel. This bug actually has an unintended benefit, it ensures that even during cancellation, we still receive the response.completed event and update lastResponseId.
This preserves the user's input and model responses in the server-side conversation state.

When we fix the abort mechanism, we'll properly stop the stream before response.completed if user cancels during response streaming, meaning we won't get a new lastResponseId. The next request will use the previous conversation state, effectively "losing" the user's most recent input and any partial model responses they already saw between the conclusion of that last complete response and the new turn they canceled.

For a complete solution, we should consider implementing a partial transcript approach alongside this fix (very similar to what the Zero Data Retention feature implemented). This would track user inputs and partial responses during a canceled turn and include them in the next request to maintain conversation continuity.

Given this trade off, do you think we should:

  1. Proceed with this surgical fix and accept the potential conversation discontinuity (prolly not, as that is confusing UX, and "data loss")
  2. Expand the fix to include a basic partial transcript implementation
  3. Merge this fix but open an issue for the transcript implementation as follow-up work

I'm inclined toward option 3 since the current fix addresses the critical 400 errors, but I'd appreciate your thoughts.

I am picking at a reworking of the agent loop, as this diff is still sort of a band aid. The current implementation is clear that flush should not run on cancelation. The guard of pendingAborts.clear() in the flush method is only required because the abort (even after this diff) doesn't throw and land us in the actually expected termination branch, the catch block which would prevent flush from running via an early return. So band aids which are a code smell IMO until control flow is better managed.

unfortunately Im having too much fun reimplementing the loop, so progress has been slow 🤣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants