Unhealthy connection when querying

We’re pretty consistently seeing queries fail with an “Unhealthy connection” error. Anecdotally, it does appear to happen more frequently when the query result is large (a million or more edges). Is there any advice on how to start debugging what the real issue is and how to resolve the connection problem?

Thanks

Caused by: io.grpc.StatusRuntimeException: UNKNOWN : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection
   at io.grpc.Status.asRuntimeException(Status.java:533)
   at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
   at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:617)
   at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
   at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:803)
   at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:782)
   at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)   
   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
   at java.base/java.lang.Thread.run(Thread.java:829)

The “Unhealthy connection” error during large queries with extensive result sets usually points to one of a few areas. To start debugging, I’d look at:
1. Connection stability (client to Alpha, and Alpha to Alpha)
2. Resource allocation (memory and CPU on the Alphas while the query runs)
3. Error analysis (server logs around the time of the failure)
4. Timeout settings (client deadlines and any proxy or load-balancer timeouts in between)
5. Network configuration (firewalls or load balancers dropping idle or long-lived connections)
6. gRPC configuration (keepalive settings and maximum message sizes)

Here are some steps that might help:
Check Network Stability
Increase Timeout Settings (see the sketch after this list)
Monitor Server Resources
Optimize Queries
Examine Server Logs
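
On the timeout point specifically, here is a minimal sketch of raising the per-call deadline with dgraph4j (the address, the ten-minute value, and the query are placeholders, not recommendations):

import java.util.concurrent.TimeUnit;

import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.dgraph.DgraphProto.Response;
import io.dgraph.Transaction;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class DeadlineExample {
    public static void main(String[] args) {
        // Placeholder address of a single Alpha.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 9080)
                .usePlaintext()
                .build();

        // withDeadlineAfter starts its clock when it is called, so create the
        // deadline-bearing stub right before the query rather than once globally.
        DgraphGrpc.DgraphStub stub = DgraphGrpc.newStub(channel)
                .withDeadlineAfter(10, TimeUnit.MINUTES);
        DgraphClient client = new DgraphClient(stub);

        Transaction txn = client.newReadOnlyTransaction();
        try {
            // Placeholder query; substitute the large query that fails.
            Response res = txn.query("{ q(func: has(name)) { count(uid) } }");
            System.out.println(res.getJson().toStringUtf8());
        } finally {
            txn.discard();
            channel.shutdown();
        }
    }
}

If the failure were a client-side deadline, the status would normally be DEADLINE_EXCEEDED rather than UNKNOWN, so this mainly helps rule that out.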

I’m assuming you’re using dgraph4j. It’s been a while since I’ve worked with Java gRPC, but did you check out the keepAliveWithoutCalls(boolean enable) method of ManagedChannelBuilder? I seem to recall that being useful with regard to unhealthy connections.
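
Something along these lines, for example (the address and intervals below are illustrative only, not tuned values):

import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class KeepAliveChannel {
    // Send HTTP/2 keepalive pings even while no RPC is in flight, so a dead
    // connection is detected and replaced before the next large query tries
    // to use it. Address and intervals are placeholders.
    public static ManagedChannel build() {
        return ManagedChannelBuilder.forAddress("localhost", 9080)
                .usePlaintext()
                .keepAliveTime(30, TimeUnit.SECONDS)     // ping interval while idle
                .keepAliveTimeout(10, TimeUnit.SECONDS)  // how long to wait for the ping ack
                .keepAliveWithoutCalls(true)             // the method mentioned above
                .build();
    }
}

The same keepAlive* methods exist on NettyChannelBuilder, so this applies there as well.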

Yes, we’re using dgraph4j. We’re using NettyChannelBuilder.

We’re seeing this error reliably now. We were previously struggling with a “max edges exceeded” error, so we raised the limit to 5 million. Now that the limit is raised, instead of receiving the max edges exceeded error, we’re receiving the unhealthy connection error.

The timing between the start of the query and the stack trace isn’t always consistent either, which makes me think it’s not a timeout (I’ve seen 30 seconds, 33 seconds, 58 seconds, 62 seconds, etc.).

One important question I’m trying to figure out: is the unhealthy connection error referring to my client’s connection to dgraph, or to a connection issue between the dgraph nodes?
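
One way to narrow that down from my side would be to log the client-side gRPC channel state around the failing call; nothing dgraph-specific, just a sketch:

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

public class ChannelStateProbe {
    // Log the client channel's connectivity state before and after the query.
    // If the channel reports READY while the query still fails with
    // "Unhealthy connection", that would point at something inside the
    // cluster rather than at the client-to-alpha connection.
    public static void logState(ManagedChannel channel, String label) {
        ConnectivityState state = channel.getState(false); // false = don't force a connect
        System.out.println(label + ": channel state = " + state);
    }
}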

Small quirk about the edge limits:
When the limit was set to 1 million, dgraph believed there were 2.3 million edges.
When the limit was raised to 2.5 million, dgraph believed there were 3.5 million edges.

If the limit is 1 million, how does it know how many total edges there are? Does it perform the full query anyway but not return the data? Does it just go out one more hop? Or does it run a partial query and cancel once it sees it has hit the limit?

Why does dgraph report a different count of total edges when the limit changes?

Thanks,
Ryan

I went onto the server running a dgraph alpha and curl’d the same query that produces the above error via the dgraph4j client. I get the same error.

{"errors":[{"message":": dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection","extensions":{"code":"ErrorInvalidRequest"}}],"data":null}

Does this mean it’s an error internal to the dgraph server? Initially I was thinking this may have been a client to server connection error.
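
For completeness, the same gRPC-bypass check can be scripted from Java as well, by posting the query straight to the Alpha’s HTTP endpoint (a sketch; host, port, and query are placeholders, and older Dgraph versions expect Content-Type application/graphql+- instead of application/dql):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpQueryCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder query; substitute the failing query.
        String dql = "{ q(func: has(name)) { count(uid) } }";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/query"))
                .header("Content-Type", "application/dql")
                .POST(HttpRequest.BodyPublishers.ofString(dql))
                .build();

        // Getting the same "Unhealthy connection" JSON here, with no gRPC
        // client involved, points at the server/cluster side.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}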

I’ve enabled jaeger tracing and raised the log verbosity to --v=3.

Error in the alpha container log:

server.go:1467] Finished a query that started at: 2024-07-22T21:45:33Z
server.go:1473] Error processing query: dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection

For the query that causes the Unhealthy connection error, the Duration and Total Spans frequently differ from run to run. Here’s an example:
Duration: 25.46s
Services: 1
Depth: 5
Total Spans: 112,035

Looking at the processTask in Jaeger, a few things stick out to me.

  1. The Start Time of each successive operation is greater than the one before. For example:
dgraph.alpha   Server.Query
   dgraph.alpha processTask.0-<predicate1>  Duration 2.67ms  Start time 2.2ms
   dgraph.alpha processTask.0-<predicate2>  Duration 2.78ms  Start time 2.2ms
   dgraph.alpha processTask.0-<predicate3>  Duration 2.68ms  Start time 2.21ms
   dgraph.alpha processTask.0-<predicate4>  Duration 2.39ms  Start time 2.23ms
   dgraph.alpha processTask.0-<predicate5>  Duration 3.2ms  Start time 2.26ms
   dgraph.alpha processTask.0-<predicate6>  Duration 33us  Start time 4.73ms
... 
   dgraph.alpha Sent.pb.Worker.ServeTask   Duration 6.39ms  Start time 6.51ms
      dgraph.alpha Recv.pb.Worker.ServeTask   Duration 1.66ms  Start time 8.13ms
         dgraph.alpha Recv.pb.Worker.ServeTask   Duration 1.47ms  Start time 8.24ms
            dgraph.alpha processTask.0-<predicate7>   Duration 1.34ms  Start time 8.34ms
...
   dgraph.alpha Sent.pb.Worker.ServeTask   Duration 2.46s  Start time 6.08s
      [Client:true, error:true, FailFast:true, internal.span.format:jaeger, status.code:1, status.message: context canceled]
...

The query starts at a single node and recurses until it hits a depth of 31.

Because each span appears to start only after it’s found the edge/node, it makes me think there aren’t enough goroutines/threads to process the edges and nodes it’s traversing. For large recurse queries, should any settings be set higher?

  2. There are two gaps where there are no spans at all: from about 4.5 seconds to 6.2 seconds, and again from 10.5 seconds to 11.5 seconds. Why is this?

  3. There are many context canceled errors within the spans - 2,082 out of 112,035. Is this normal?