Unable to connect with dgraph-0.dgraph.dgraph.svc.cluster.local:7080

I1223 12:20:02.567665      18 raft.go:807] Skipping creating a snapshot. Num groups: 1, Num checkpoints: 0
E1223 12:24:50.033333      18 pool.go:311] CONN: Unable to connect with dgraph-0.dgraph.dgraph.svc.cluster.local:7080 : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.244.1.135:7080: connect: connection refused"
E1223 12:24:51.037742      18 pool.go:311] CONN: Unable to connect with dgraph-0.dgraph.dgraph.svc.cluster.local:7080 : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.244.1.135:7080: connect: connection refused"
E1223 12:24:52.041051      18 pool.go:311] CONN: Unable to connect with dgraph-0.dgraph.dgraph.svc.cluster.local:7080 : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.244.1.135:7080: connect: connection refused"
I1223 12:24:52.179575      18 zero.go:506] Got connection request: cluster_info_only:true 
I1223 12:24:52.180567      18 zero.go:531] Connected: cluster_info_only:true 
I1223 12:24:52.182355      18 zero.go:506] Got connection request: id:1 addr:"dgraph-0.dgraph.dgraph.svc.cluster.local:7080" 
I1223 12:24:52.182938      18 zero.go:653] Connected: id:1 addr:"dgraph-0.dgraph.dgraph.svc.cluster.local:7080" 
I1223 12:24:53.042535      18 pool.go:327] CONN: Re-established connection with dgraph-0.dgraph.dgraph.svc.cluster.local:7080.

Can someone help me understand this error and why it is happening?

That looks fine. Perhaps the Alpha was not yet available to the cluster and eventually connected.

This is a repeatable error and it blocks our deployments. Is there any case where the zero loses connectivity with the alpha? Just keep in mind that they are running in the same pod, so a networking problem should be out of the question.

It is possible for the zero node to lose connectivity with the alpha node in certain situations. Some possible causes of this issue include:

  • The alpha node is unhealthy: If the alpha node is experiencing problems such as being overloaded, panicking, or crashing, the zero node can receive a connection refused error when attempting to connect to it.
  • The alpha node is being restarted: If the alpha node is being restarted or replaced by the Kubernetes cluster, it may temporarily become unavailable, causing the zero node to receive a connection refused error when attempting to connect.
  • Networking issues: If there are networking issues between the zero node and the alpha node, such as a network partition or a routing problem, the zero node may be unable to establish the connection, resulting in a connection refused error.
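
To narrow down which of these is happening, a few standard kubectl checks usually show whether the Alpha crashed, was OOM-killed, or was rescheduled. This is only a sketch: the namespace and pod name are taken from the logs above, and the container name is a placeholder you will need to adjust.

  # Restart counts and current status of the Dgraph pods
  kubectl -n dgraph get pods
  # Check "Last State", the termination reason (e.g. OOMKilled) and the Events section
  kubectl -n dgraph describe pod dgraph-0
  # Logs from the previous (crashed) container, if there was a restart
  kubectl -n dgraph logs dgraph-0 -c alpha --previous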

It is worth noting that, to ensure optimal performance and reliability, it is generally recommended to run the zero and alpha nodes on separate physical machines or K8s workers. This helps avoid resource contention and other issues that can arise when the nodes run on the same machine or in the same pod. If you are experiencing repeated connectivity issues between the zero and alpha nodes, it may be worth separating them onto different machines or workers to see if that resolves the problem.
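
If you do split them up, a quick way to confirm that the zero and alpha pods actually landed on different workers is the wide pod listing (namespace assumed from the logs above):

  # The NODE column shows which worker each pod is scheduled on
  kubectl -n dgraph get pods -o wide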

@MichelDiz Thanks for the reply. I tried to redeploy Dgraph in high availability mode with 3 zeros and 3 alphas in separate pods. While trying to figure out why Dgraph drains our nodes, I found out that dgraph-alpha is using 12 GB of RAM for just one query. This is absurd!

What can I do to reduce that RAM usage? My database is not large, and I don’t understand why Dgraph uses that many resources.

Dgraph uses a lot of memory by design to improve performance, and some of the RAM issues come from the way Go handles garbage collection. Try to balance the load on your alpha nodes: analyze the queries being run and see whether the work can be distributed more evenly across the nodes. Using multiple query blocks helps, so break your query up into “pipelines”. You may also want to check whether all of the predicates in your cluster are grouped together or whether they can be divided into multiple groups; spreading the predicates across groups will help.
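
As a rough sketch of both ideas (the predicate names, index assumptions, and default ports below are illustrative, not taken from your schema): a single DQL request can contain several blocks, with var blocks feeding later blocks, so a heavy traversal can be staged; and a predicate can be moved to another group through Zero’s HTTP admin endpoint.

  # One request, two blocks: the var block collects uids, the second block paginates over them.
  # "user.name" and "user.follows" are hypothetical predicates; eq() assumes an index on user.name.
  curl -s -H "Content-Type: application/dql" localhost:8080/query -d '{
    var(func: eq(user.name, "alice")) {
      follows as user.follows
    }

    result(func: uid(follows), first: 100) {
      user.name
    }
  }'

  # Move a (hypothetical) predicate to another group via Zero's HTTP port (6080 by default)
  curl "localhost:6080/moveTablet?tablet=user.follows&group=2"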

Badger has several configuration options that can affect memory usage, including the compression level. Setting compression to none might help.
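
Flag names vary by Dgraph release, so treat this as a sketch and confirm against dgraph alpha --help for your version:

  # Recent releases (superflag form): turn Badger compression off
  dgraph alpha --badger "compression=none"   # plus your usual flags
  # Older releases exposed --badger.compression / --badger.compression_level instead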

Another option is the badger.tables flag, which tells Dgraph to keep the Badger LSM tree on disk rather than in RAM. This can help reduce the amount of memory that Dgraph uses.
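
A sketch of that flag; it existed in older Dgraph releases (newer ones handle table loading differently), so again check dgraph alpha --help first:

  # Keep the Badger LSM tables on disk instead of in RAM (options were ram, mmap, disk)
  dgraph alpha --badger.tables=disk   # plus your usual flags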

BTW, Dgraph keeps the full WAL in RAM: the more data the w directory holds, the more RAM it will use. I’m not sure whether this can be removed or controlled; I believe not.

You can also try reducing the cache size by adjusting the cache_mb flag. By default, Dgraph sets the cache size to half of the available RAM, but setting it to a smaller value may reduce memory usage.
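
A hedged sketch of that setting; the flag name depends on the version (cache_mb in older releases, a cache superflag in newer ones), and the 2048 MB value is just an example:

  # Older flag form
  dgraph alpha --cache_mb=2048   # plus your usual flags
  # Newer superflag form
  dgraph alpha --cache "size-mb=2048"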

Please note that any change to these parameters has pros and cons; the default values were chosen to keep the cluster performant.

Thanks, @MichelDiz! I will try those suggestions and get back to you. Btw, the RAM usage I posted was from only one query… I am not sure if this is optimizable.

@MichelDiz I successfully re-deployed my Dgraph cluster with 3 alphas and 3 zeros. The problem is that I am still getting this error and my prod is going down.

E0110 11:12:21.822240 23 pool.go:311] CONN: Unable to connect with dgraph-alpha-1.dgraph-alpha.dgraph-prod.svc.cluster.local:7080 : rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup dgraph-alpha-1.dgraph-alpha.dgraph-prod.svc.cluster.local: no such host"

All alphas are up, yet the zeros keep generating this error when we initiate a specific query. Something here is fishy, and it looks like a bug on your side.

In addition, the alpha pods are generating logs like:

E0110 11:37:00.034304      19 groups.go:1224] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": while receiving from stream: rpc error: code = Unavailable desc = transport is closing. closer err: <nil>

@MichelDiz

Here is a screenshot from our Grafana.

Can you please let me know why Dgraph is using more than 10 GB of RAM to return a specific query? This does not look normal. My database is really small, and I don’t think this is related to our data. Also, the query itself doesn’t try to expand predicates deeper into the graph.

This is a BIG blocker for us. In your previous comment you mentioned that there is an option to change the Badger configuration. Can you please elaborate more on that and on how I can apply those changes?

@matthewmcneely Can you also drop some eyes on this issue?