Description
Problem description
During our service deployments, the client rejects a portion of requests with CANCELLED errors. Our servers are configured to send a GOAWAY frame to clients during deployments to gracefully terminate connections. On May 11, 2023, we bumped our gRPC-js version from 1.7.3 to 1.8.14. We reproduced our deployment scenario on multiple versions of our SDK, and that gRPC version bump is where we start seeing the errors during deployments. We are not seeing this behavior with other SDKs.
I haven't bisected the gRPC-js versions between 1.7.3 and 1.8.14, and I see there are 3k commits between them, so it is hard to pinpoint a specific commit. I also see an open issue related to "New streams cannot be created after receiving a GOAWAY", but that is only one part of the problem we witness. There's another ticket that seemed relevant. In our logs, we see two flavors of errors.
One is when the transport layer of gRPC detects a GOAWAY frame and logs it, but this is still followed by a few CANCELLED errors. The other flavor is when the transport layer doesn't detect the frame explicitly, but we see "new stream cannot be created after GOAWAY" messages, followed by more CANCELLED errors. Sometimes we see both of these behaviors.
Here's a log stream from a run where we see both of them, with HTTP and gRPC traces enabled:
with_cancelled_new_stream_errors.log
Below is a log stream where we don't see any CANCELLED or new stream errors, taken with a version of gRPC earlier than 1.8.14. Notice that these runs do sometimes report INTERNAL errors (not as frequently as CANCELLED), but those can be safely retried, so they aren't an issue. If there's a way to know whether the CANCELLED errors are safe to retry for non-idempotent operations, that would be awesome.
without_cancelled_new_stream_errors_old_grpc_version.log
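To make the retry question concrete, here is a minimal sketch of the conservative client-side policy we would like to be able to relax. It assumes INTERNAL is always retried (as described above) and CANCELLED is retried only for idempotent operations, since we don't know whether a cancelled request reached the server. The names `RpcError`, `isRetryable` and `withRetry` are hypothetical helpers for illustration, not part of gRPC-js; only the numeric status codes come from the gRPC specification.

```typescript
// gRPC status codes used here; the numeric values are fixed by the gRPC spec.
const Status = { CANCELLED: 1, INTERNAL: 13 } as const;

// Hypothetical minimal shape of an error surfaced by a client call.
interface RpcError {
  code: number;
  details?: string;
}

// Type guard so that unknown thrown values can be inspected safely.
function isRpcError(err: unknown): err is RpcError {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { code?: unknown }).code === "number"
  );
}

// Assumed policy: INTERNAL is always retried; CANCELLED is retried only when
// the operation is idempotent, because the server may already have processed it.
function isRetryable(err: RpcError, idempotent: boolean): boolean {
  if (err.code === Status.INTERNAL) return true;
  if (err.code === Status.CANCELLED) return idempotent;
  return false;
}

// Runs `call` up to `maxAttempts` times, rethrowing as soon as an error is
// not retryable under the policy above.
async function withRetry<T>(
  call: () => Promise<T>,
  opts: { idempotent: boolean; maxAttempts: number },
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (!isRpcError(err) || !isRetryable(err, opts.idempotent)) throw err;
    }
  }
  throw lastError;
}
```

With this policy, a non-idempotent write that fails with CANCELLED during a deployment surfaces to the caller instead of being retried, which is exactly the case we would like guidance on.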
Let me know if you need more logs! I have attached them as files to avoid cluttering this space. I can easily reproduce this in my setup, so I'll be happy to add any more traces/scenarios that you'd like!
Reproduction steps
- We have reproduced this on every deployment since that gRPC-js version bump in our private infrastructure. I haven't reproduced it yet with a toy/hello-world example, but I have attached logs from our repros.
Environment
- OS name, version and architecture: Tried on both macOS and Amazon Linux
- Node version [e.g. 8.10.0]: v18
- Node installation method [e.g. nvm]: nodenv
- Package name and version [e.g. gRPC@1.12.0]: since at least gRPC-js 1.8.14