Dgraph internal client RST_STREAM errors and timeouts

Has anyone else been seeing sporadic errors like these from the dgraph-js grpc client recently?

13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT
13 INTERNAL: Failed to start HTTP/2 stream with error: The session has been destroyed

We’ve seen this with two dgraph-js versions…

  • v20.11.0
  • v21.3.1

…in both of our Dgraph Cloud environments:

  • A single dedicated Dgraph Cloud instance running Dgraph v20.11.2-rc1-25-g4400610b2
  • A high-availability Dgraph Cloud cluster running Dgraph v20.11.2-rc1-29-gff3c84328

I opened a Jira ticket with Dgraph support, but they haven’t responded in 6 days, so I have nowhere else to turn! This is killing us right now. These periodic timeouts seem to block all queries until the dgraph-js client resets itself.

Any help, ideas, thoughts, or recommendations would be appreciated. We’re desperate for a solution.

We’ve recently made changes in Dgraph Cloud to reduce the frequency of RST_STREAM and timeout errors: we now set idle timeouts, and we restart our proxy less often.
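On the client side, one related mitigation that may be worth trying is gRPC keepalive, so the channel notices a dead connection instead of waiting for a read to time out. This is only a sketch: the option names are standard gRPC channel arguments, but the values are guesses and not settings recommended for Dgraph Cloud, and passing the options as the third argument to the dgraph-js stub constructor is an assumption to check against the dgraph-js version in use. Pinging too aggressively can itself cause the server or proxy to close the connection, so keep the interval conservative.

```typescript
// Hypothetical values; tune them against the idle timeout of the server or proxy.
const KEEPALIVE_TIME_MS = 30000;    // send an HTTP/2 PING after 30 s without activity
const KEEPALIVE_TIMEOUT_MS = 10000; // treat the connection as dead if no PING ack arrives within 10 s

// Standard gRPC channel argument names.
const channelOptions: Record<string, number> = {
  "grpc.keepalive_time_ms": KEEPALIVE_TIME_MS,
  "grpc.keepalive_timeout_ms": KEEPALIVE_TIMEOUT_MS,
  "grpc.keepalive_permit_without_calls": 1, // ping even when no call is in flight
};

// Assumed usage (not verified against every dgraph-js release):
// const stub = new dgraph.DgraphClientStub(address, credentials, channelOptions);
```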

We’re still seeing some errors, though infrequently. We’re looking to replace our L4 load balancer with an L7 load balancer that can gracefully handle the cases in which these errors are returned to the client.
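Until then, a client-side retry around read-only queries may soften the remaining errors. The sketch below is not part of dgraph-js: `isTransientGrpcError`, `backoffDelayMs`, and `withRetry` are hypothetical helpers, and the attempt count and delays are arbitrary. It assumes the error object carries a numeric gRPC `code` and a `message` like the ones quoted at the top of this thread.

```typescript
// gRPC status codes that commonly indicate a transient transport failure.
const GRPC_INTERNAL = 13;
const GRPC_UNAVAILABLE = 14;

interface GrpcLikeError {
  code?: number;
  message?: string;
}

// Returns true for errors that look like the transport failures in this thread.
function isTransientGrpcError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const { code, message } = err as GrpcLikeError;
  if (code === GRPC_UNAVAILABLE) return true;
  // INTERNAL covers many failures; only retry the transport-level ones.
  return (
    code === GRPC_INTERNAL &&
    typeof message === "string" &&
    (message.includes("RST_STREAM") || message.includes("session has been destroyed"))
  );
}

// Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 2000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Runs op, retrying only on transient errors, up to maxAttempts in total.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt + 1 >= maxAttempts || !isTransientGrpcError(err)) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}

// Assumed usage with a read-only transaction:
// const res = await withRetry(() => client.newTxn({ readOnly: true }).query(q));
```

Retrying is safest for read-only queries. A mutation that failed mid-stream may or may not have committed, so retrying it blindly could apply it twice.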