"New streams cannot be created after receiving a GOAWAY" and CANCELLED errors during service deployments #2694

Open
@pratik151192

Description

Problem description

During our service deployments, the client rejects a certain number of requests with CANCELLED errors. Our servers are configured to send a GOAWAY frame to clients during deployments to gracefully terminate connections. On May 11, 2023, we bumped our gRPC-js version from 1.7.3 to 1.8.14. We reproduced our deployment scenario on multiple versions of our SDK, and the version containing that gRPC-js bump is the first one where we see the errors during deployments. We are not seeing this behavior with other SDKs.

I haven't bisected the gRPC-js versions within that range, and I see that there are 3k commits between the two versions, so it is hard to pinpoint a specific commit. I also see an open issue related to "New streams cannot be created after receiving a GOAWAY", but that is only one part of the problem we are seeing. There's another ticket that seems relevant. In our logs, we see two flavors of errors.

In the first, the gRPC transport layer detects the GOAWAY frame and logs it, but a few CANCELLED errors still follow. In the second, the transport layer doesn't explicitly detect the frame, but we see "New streams cannot be created after receiving a GOAWAY" messages, followed by more CANCELLED errors. Sometimes we see both behaviors.

Here's a log stream from a run where we see both, with HTTP and gRPC traces:

with_cancelled_new_stream_errors.log

The log stream below is from a gRPC-js version earlier than 1.8.14, where we don't see any CANCELLED or new-stream errors. Note that it does sometimes report INTERNAL errors (less frequently than the CANCELLED errors), but those can be safely retried, so they aren't an issue. If there is a way to know whether the CANCELLED errors are safe to retry for non-idempotent operations, that would be awesome.
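To make the retry question concrete, here is a minimal sketch of the kind of client-side policy described above: retry INTERNAL, and retry CANCELLED only when the caller marks the operation as idempotent. The names `shouldRetry`, `callWithRetry`, and `RetryOptions` are hypothetical and not part of gRPC-js; the only gRPC facts assumed are the standard numeric status codes (CANCELLED = 1, INTERNAL = 13). Treating INTERNAL as always retryable reflects our setup and may not hold elsewhere.

```typescript
// Standard gRPC status code values.
const CANCELLED = 1;
const INTERNAL = 13;

// Hypothetical options for this sketch; not a gRPC-js type.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the first retry, doubled on each retry
  idempotent: boolean; // whether repeating the operation is safe
}

// Read a numeric `code` field from an unknown error value, if present.
function statusCodeOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code: unknown }).code;
    return typeof code === "number" ? code : undefined;
  }
  return undefined;
}

// Assumption: INTERNAL is safe to retry in our setup; CANCELLED is only
// retried for idempotent operations, since the request may have been processed.
function shouldRetry(err: unknown, idempotent: boolean): boolean {
  const code = statusCodeOf(err);
  if (code === INTERNAL) return true;
  if (code === CANCELLED) return idempotent;
  return false;
}

// Run `op`, retrying with exponential backoff while the policy allows it.
async function callWithRetry<T>(
  op: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      const lastAttempt = attempt === opts.maxAttempts;
      if (lastAttempt || !shouldRetry(err, opts.idempotent)) throw err;
      const delayMs = opts.baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

With a policy like this, non-idempotent calls that fail with CANCELLED still surface to the caller, which is exactly the case we would like to be able to classify as safe or unsafe.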

without_cancelled_new_stream_errors_old_grpc_version.log

Let me know if you need more logs! I have attached them as files to avoid cluttering this space. I can easily reproduce this in my setup, so I will be happy to add any more traces or scenarios that you'd like!

Reproduction steps

  • We have reproduced this on every deployment since that gRPC-js version bump in our private infrastructure. I haven't reproduced this yet with a toy/hello-world example, but I have added enough logs from our repros.

Environment

  • OS name, version and architecture: Tried on both macOS and Amazon Linux
  • Node version [e.g. 8.10.0]: v18
  • Node installation method [e.g. nvm]: nodenv
  • Package name and version [e.g. gRPC@1.12.0]: since at least gRPC-js 1.8.14
