fix: broken invariants in agent-loop, "400 No tool output found" #490
+7
−3
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I'm not certain this fully fixes the issue, but these changes address several clear problems in the current cancellation flow
This is a draft because I think we can do better still, and I welcome pushes to this PR or publishing if yall wanna run with it.
There is an issue of broken invariants, where we are running logic such as
flush
even though the code as written expects it will not run aftercancel
.related to #114 aka
400 No tool output found for function call
.Problem 1: Abort Controller is null
Upon cancel, we attempt to abort the stream via abortController.
After #178 we are nulling out the handle to the stream BEFORE we call the abort method on its controller. TS assertions hide that fact from the compiler, giving us a silent no-op instead of an abort:
codex/codex-cli/src/utils/agent/agent-loop.ts
Lines 122 to 133 in 09f0ae3
This means cancel will not abort the stream, and we continue processing events.
Problem 2: Broken control flow expectations
Problem 1 is an issue, but not one that should lead to the
400 No tool output found for function call
.The agent loop is written in such a way that it appears to expect to break out of the async iterator loop when the stream is aborted. That never actually happens even after fixing Problem 1 and you can confirm that w/ a console.log in the catch handler.
codex/codex-cli/src/utils/agent/agent-loop.ts
Lines 790 to 800 in 09f0ae3
That is the broken invariant. When we do not break out of the loop on abort, we continue processing events, and our cleanup code later runs.
"Fix Problem 1, and Problem 2 heals itself then" you may say. Under testing, it does not! I have not observed the
catch
executing in response to a proper abort.Problem 3 -- Inconsistent State Guards
There are inconsistent
canceled
state guards in the event processing and leads to state drift when we realize that:flush
even though we may havethis.canceled === true
impacting the processing of events.With probelm 2 being the broken invariant, the rest of the flow gets unpredictable. Most places are guarded, like
codex/codex-cli/src/utils/agent/agent-loop.ts
Lines 1112 to 1121 in 09f0ae3
But we will still increment the
lastReponseId
even when we are in a cancel situation, and stopped processing function calls.I focused on the
flush
code. It will unconditionally clear thependingAborts
array, despite it expecting that we are in a cancel free state. I didn't trace the exact specifics to limit my brain running out of context space for while debugging these async flows.Solution
Well, I ran out of time to really come up with a bullet proof solution. Testing this is difficult, as it relies on timing. Asking to prefix tool runs w/ sleep can help a bit, but I had a harder time reproing today vs last week.
For now, ensuring Abort actually is triggered, and guarding the flush is an improvement
Ideally, we can get the throw to happen on an abort, or otherwise return early from the processing. That's the smallest surgical fix until this logic can be reworked to be entirely signal aware. I propose that in the near future instead of a canceled flag, we observe the AbortSignal as the single source of truth for cancellation state and esnure it propogates consistently through all operations.
bugbears
Stream processing can be finicky, as you may still receive buffered events after a stream has been canceled. Idk how this maps exactly to async generators
Once the abort is successful, assuming we have no more events that leak out, we can't easily break from the loop and return early. I wasn't able to consistently repro today getting another event even when we are in the canceled state, but throwing a manually constructed
AbortError
if we process an event duringthis.canceled === true
would fix the control flow.