We’re pretty consistently seeing queries fail with an Unhealthy Connection error. Anecdotally, it appears to happen more frequently when the query result is large (a million or more edges). Is there any advice on how to start debugging what the real issue is and resolving the connection problem?
Thanks
Caused by: io.grpc.StatusRuntimeException: UNKNOWN : dispatchTaskOverNetwork: while retrieving connection.: Unhealthy connection
at io.grpc.Status.asRuntimeException(Status.java:533)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:617)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl.ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:803)
at io.grpc.internal.ClientCallImpl.ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:782)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
The “Unhealthy Connection” error, especially during large queries with extensive result sets, is concerning. To start debugging and resolving this issue effectively, here are some steps we can take:
1. Connection Stability
2. Resource Allocation
3. Error Analysis
4. Timeout Settings
5. Network Configuration
6. gRPC Configuration
I will share some steps that might help you:
Check Network Stability
Increase Timeout Settings (see the sketch below this list)
Monitor Server Resources
Optimize Queries
Examine Server Logs
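On the timeout point above, one concrete thing to try is a per-call gRPC deadline on the dgraph4j stub. This is only a sketch, assuming you build the channel yourself; the host, port, and 300-second value are placeholders to adjust for your environment:

import io.dgraph.DgraphClient;
import io.dgraph.DgraphGrpc;
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.MethodDescriptor;
import java.util.concurrent.TimeUnit;

public class DgraphClientFactory {
  public static DgraphClient build() {
    // Placeholder address -- substitute one of your Alpha gRPC endpoints.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("alpha-host", 9080).usePlaintext().build();

    // An interceptor applies the deadline per call, not from stub-creation time.
    ClientInterceptor deadlineInterceptor = new ClientInterceptor() {
      @Override
      public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
          MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
        // 300 seconds is an arbitrary placeholder value.
        return next.newCall(method, callOptions.withDeadlineAfter(300, TimeUnit.SECONDS));
      }
    };

    DgraphGrpc.DgraphStub stub =
        DgraphGrpc.newStub(channel).withInterceptors(deadlineInterceptor);
    return new DgraphClient(stub);
  }
}

If slow queries then fail with DEADLINE_EXCEEDED instead of UNKNOWN, a timeout really is involved; if the Unhealthy connection error persists unchanged, the problem likely lies elsewhere.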
I’m assuming you’re using dgraph4j. It’s been a while since I’ve worked with Java gRPC, but did you check out the keepAliveWithoutCalls(boolean enable) method of ManagedChannelBuilder? I seem to recall that being useful with regard to unhealthy connections.
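Roughly along these lines when building the channel (just a sketch; the address and keepalive intervals are placeholders to tune for your network):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

// Placeholder address and intervals.
ManagedChannel channel = ManagedChannelBuilder
    .forAddress("alpha-host", 9080)
    .usePlaintext()
    .keepAliveTime(30, TimeUnit.SECONDS)     // ping the server when the connection is idle
    .keepAliveTimeout(10, TimeUnit.SECONDS)  // drop the connection if a ping goes unanswered
    .keepAliveWithoutCalls(true)             // keep pinging even with no active RPCs
    .build();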
Yes, we’re using dgraph4j. We’re using NettyChannelBuilder.
We’re seeing this error reliably now. We were previously struggling with max edges exceeded so we raised the limit to 5 million. Now that the limit is raised, instead of receiving the max edges exceeded error, we’re receiving the unhealthy connection error.
The timing between the start of the query and the stack trace isn’t always consistent either, which makes me think it’s not a timeout (I’ve seen 30 seconds, 33 seconds, 58 seconds, 62 seconds, etc.).
One important question I’m trying to figure out… Is the unhealthy connection error referring to my client’s connection to Dgraph, or a connection issue between the Dgraph nodes?
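For reference, the limit we raised is the per-query edge limit on the Alphas, set along these lines with the --limit superflag (other Alpha flags omitted; double-check the option name against your version):
$ dgraph alpha --limit "query-edge=5000000"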
Small quirk about the edge limits:
When the limit was set to 1 million, Dgraph believed there were 2.3 million edges.
When the limit was raised to 2.5 million, Dgraph believed there were 3.5 million edges.
If the limit is 1 million, how does it know how many total edges there are? Does it perform the full query anyway, but not return the data? Does it just go out one more hop? Or does it run a partial query and cancel once it sees it has hit the limit?
Why does Dgraph report a different total edge count when the limit changes?
I’ve enabled Jaeger tracing and raised the log verbosity to --v=3.
Error in the alpha container log:
server.go:1467] Finished a query that started at: 2024-07-22T21:45:33Z
server.go:1473] Error processing query: dispatchTaskOverNetwork: while retrieving connection.: Unhealthly connection
For the query that causes the Unhealthy connection error, the Duration and Total Spans frequently differ from run to run. Here’s an example:
Duration: 25.46s
Services: 1
Depth: 5
Total Spans: 112,035
Looking at the processTask in Jaeger, a few things stick out to me.
The Start Time of each successive operation is greater than the one before. For example:
dgraph.alpha Server.Query
dgraph.alpha processTask.0-<predicate1> Duration 2.67ms Start time 2.2ms
dgraph.alpha processTask.0-<predicate2> Duration 2.78ms Start time 2.2ms
dgraph.alpha processTask.0-<predicate3> Duration 2.68ms Start time 2.21ms
dgraph.alpha processTask.0-<predicate4> Duration 2.39ms Start time 2.23ms
dgraph.alpha processTask.0-<predicate5> Duration 3.2ms Start time 2.26ms
dgraph.alpha processTask.0-<predicate6> Duration 33us Start time 4.73ms
...
dgraph.alpha Sent.pb.Worker.ServeTask Duration 6.39ms Start time 6.51ms
dgraph.alpha Recv.pb.Worker.ServeTask Duration 1.66ms Start time 8.13ms
dgraph.alpha Recv.pb.Worker.ServeTask Duration 1.47ms Start time 8.24ms
dgraph.alpha processTask.0-<predicate7> Duration 1.34ms Start time 8.34ms
...
dgraph.alpha Sent.pb.Worker.ServeTask Duration 2.46s Start time 6.08s
[Client:true, error:true, FailFast:true, internal.span.format:jaeger, status.code:1, status.message: context canceled]
...
The query starts at a single node and recurses until it hits a depth of 31.
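For context, the query is shaped roughly like this (the uid and the predicate names are placeholders, as in the spans above):
{
  result(func: uid(0x1)) @recurse(depth: 31, loop: false) {
    uid
    <predicate1>
    <predicate2>
    <predicate3>
  }
}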
Because each span appears to start only after it has found the edge/node, it makes me think there aren’t enough goroutines/threads to process the edges and nodes it’s traversing. For large recurse queries, should any settings be set higher?
There are two gaps where there are no spans… At about 4.5 seconds, there are no more spans until 6.2 seconds. At 10.5 seconds there are no more spans until 11.5 seconds. Why is this?
There are many context canceled errors within the spans - 2,082 out of 112,035. Is this normal?
@KVG @rft @Shivaji_Kharse + anyone willing to help! I work with Ryan and wanted to highlight how important it is for us to track down the cause of this blocking Unhealthy connection issue. We’re ready to deploy a year-long effort involving Dgraph schema and processing updates, but the queries for our production dataset are no longer returning, as outlined above. Because this error is coming directly from the Alpha logs, we aren’t sure how to make the proper adjustments to get this fixed. Are there any settings you can specifically suggest? Any help you can give is much appreciated! Thanks!
Happy to help -- can you tell me more about the setup y’all have going? Self-hosted on AWS? How many nodes (Alpha/Zero)? How large is the dataset?
Reading above, it looks like you’re using dgraph4j; we just rolled out (or will be rolling out very shortly) some updates there. It might be worthwhile to grab the most recent version and see if that resolves it. Though I’d still like to figure out why you’re seeing this.
Our self-managed production Dgraph cluster is running on six on-premises servers:
Our current setup is a 9-node cluster: 3 Zeros and 6 Alphas (two groups). Three machines run both a single Zero and a single Alpha, whereas the other three run just a single Alpha. We have the latest version (v24.0.0) installed and running using Docker Swarm mode.
Our dataset is growing daily (although not as fast as we’d like, given the number of exceptions); in our present p directories, there’s roughly 75GB in the Alpha Group #1 directories and 100GB in the Alpha Group #2 directories.
We’re seeing the Unhealthy connection error in several different places. We see it most often (and very consistently) from our results service API, which uses dgraph4j to query Dgraph from our application. We’ve occasionally seen it from a separate client in our custom NiFi processor used to ingest the data, which also uses dgraph4j under the hood. However, as @rahst12 mentioned above, we also see this when cURLing directly using no client at all:
$ curl -H "Content-Type: application/dql" -X POST http://x.x.x.x:8080/query --data-binary '@query.json'
I think that debugging over the phone next week could really help us out; but let us know if we can help with any more background information in the meantime!
We’ve taken a look at the code base and believe we know where the error is coming from, but not why.
conn/pool.go
conn/pool.go defines the ErrorUnhealthyConnection message (“Unhealthy connection”), hard-codes the echoDuration as 500ms, and then allows up to a 2-second delay in the health check.
edgraph/server.go
The edgraph/server.go file calls Process; when processing fails, it surfaces the ErrorUnhealthyConnection error and prints it on line 1473.
We don’t know why the health check is failing inside qr.Process(ctx). We do know it’s a large query with a lot of edges. Could large queries possibly cause comms issues between the nodes? We’re open to other thoughts too. We’re thinking about recompiling Dgraph with a longer timeout to see if that helps.
@rahst12 Yeah, larger queries can definitely cause comms issues, since the internal queries might time out. That could cause the unhealthy connection issue.
The reason you are seeing empty time in the trace is that we haven’t added spans everywhere. Adding them everywhere could slow the system down, because it would generate too many spans.
Let me know when you guys want to do a debugging session. We can schedule something.