Dgraph Alpha Node unresponsive

We have a 6-node cluster - 3 Zeros and 3 Alphas.

There’s an Alpha instance in our Test environment that’s non-responsive. The other two Alphas are fine. I logged onto the instance and the dgraph process was still up and running. I can see the other nodes can disconnect/re-connect to it when I stop/restart the dgraph process on the bad Alpha. However, when attempting to run a query against it (using curl localhost), the query just doesn’t return. I’ve had this happen once before and just terminated and rebuilt the instance because I didn’t have the time to investigate, but I’m curious what causes this to happen.
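(For reference, the check was just a trivial DQL query run on the bad Alpha itself; the query body below is only illustrative, anything simple hangs the same way:)

curl -s -H "Content-Type: application/dql" localhost:8080/query \
  -d '{ q(func: has(dgraph.type), first: 1) { uid dgraph.type } }'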

I tried restarting the dgraph process, but it didn’t help.

The ERROR log has:
Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": Unable to find any servers for group: 1. closer err: <nil>

When I look at the localhost:8080/state, it has the correct cluster metadata. The other five nodes are up and reachable.
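(Roughly the check I ran, assuming jq is installed on the instance:)

curl -s localhost:8080/state | jq '.zeros, .groups'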

In the INFO log, when it restarts, I can see where it gets the first state update from the Zero successfully and displays the schema predicates.

Is there any other debugging that I can try to get the instance fixed before I just terminate and let it rebuild?

Thanks!

Dgraph version   : v21.03.2
Dgraph codename  : rocket-2
Dgraph SHA-256   : 00a53ef6d874e376d5a53740341be9b822ef1721a4980e6e2fcb60986b3abfbf
Commit SHA-1     : b17395d33
Commit timestamp : 2021-08-26 01:11:38 -0700
Branch           : HEAD
Go version       : go1.16.2
jemalloc enabled : true

Did your data come from a shared instance or some multi-tenant environment?

This prefix \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\ looks like it came from a multi-tenant setup.

No, it’s not a multi-tenant environment, and the data did not come from a shared instance. The instance had been up and running as part of the cluster for a couple of weeks and working fine.

If you don’t use GraphQL, ACL, and such, remove those predicates from the dataset and reimport it. At some point they got created and they have remained there. My best suggestion would be to start truly from scratch, because there are several steps involved and it’s hard to guarantee no other small mistake is in place. So start from scratch: clean the whole thing from top to bottom and reimport the data.

Feel free to share all the steps, configs, YAML files, and so on that you are using so I can check for any typo or misconfiguration.

We’re not using GraphQL or ACL. Once the cluster is up, I have a shell script that I run manually to deploy the DQL schema via localhost:8080/alter.
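Roughly, the deploy step in that script is just a POST of the DQL schema to the alter endpoint; the predicates below are placeholders, not our real schema:

curl -X POST localhost:8080/alter --data-binary '
  name: string @index(exact) .
  createdAt: datetime .
'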

I went ahead and removed the bad node from the cluster and allowed it to recreate. It seems like the new third node is fine now.

The data in the Test environment was imported from an export of the Dev environment. I just compared the predicates in Ratel and noticed these differences in the dgraph.* predicates:

Test contains:
dgraph.acl.rule
dgraph.drop.op
dgraph.graphql.p_query
dgraph.graphql.schema
dgraph.graphql.xid
dgraph.rule.permission
dgraph.rule.predicate

Dev contains only:
dgraph.drop.op
dgraph.graphql.p_query
dgraph.graphql.schema
dgraph.graphql.xid

I’m not sure how predicates like dgraph.acl.rule or dgraph.rule.permission would have gotten created in Test since we’re not using a GraphQL schema or touching ACL at all.

Is there anything that could have inadvertently created these predicates?

When I try to drop the dgraph.rule.permission predicate, I get the message:
“Could not drop predicate: Error: predicate dgraph.rule.permission is pre-defined and is not allowed to be dropped”

I’ll drop the Test data and schema, recreate the schema, reimport the data, and verify the schema after each step.
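Roughly, the reset will look like this (using the documented drop_op payload against the alter endpoint; the schema file and export file names below are placeholders):

curl -X POST localhost:8080/alter -d '{"drop_op": "ALL"}'      # drop all data and the schema
curl -X POST localhost:8080/alter --data-binary @schema.dql    # re-apply the DQL schema
dgraph live -f export.rdf.gz -a localhost:9080 -z localhost:5080   # reimport the exported data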

Thanks

All clusters have a 30-day trial of the enterprise features. That explains part of it.

Maybe you should start Dgraph with all EE features disabled.

Yeah, you know, that’s a bit odd, because this doesn’t happen too often. Maybe you’re the fifth person to run into this in a year.

The Alpha communication issue happened to me again yesterday, in the same environment. I was going through a process to update the EC2 instances with a newer AMI.

All of the Zero nodes replaced successfully.

I removed the first Alpha node from the cluster and terminated it. The new instance started, appeared to communicate with the Zero leader and pull down the schema, but it couldn’t communicate with the other Alphas.

When looking at the cluster state, I noticed that the IP address assigned to the new Alpha had previously been used by another node that had since been removed. I also noticed that a second Alpha node had picked up a reused IP, and I’m wondering if that’s what caused the first issue in the cluster that I described in my original post above.

I ended up rebuilding the cluster and reimporting the schema and data from an export with no issues, so there’s nothing off with the networking or security groups. I’m wondering if IP address re-use of removed nodes could somehow potentially cause issues?

Thanks!

Ephemeral vs State: EC2 instances are by nature ephemeral, so they come up with a random hostname and IP address. This can cause problems, because the state (hostname or IP address) must stay consistent with the data used by those systems. So if any data accidentally got baked into the image, the node would be trying to connect back to the older state.

Baking Images: When baking new AMI images, I would recommend not putting Dgraph on them; keep just the base OS and packages, and instead use userdata, such as pull-based configuration management or a simple shell script, to install Dgraph. This way you know you do not have existing state.
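A minimal userdata sketch of that approach (using the public install script; pinning an exact version and adding your own config handling is up to you):

#!/bin/bash
# Installed at first boot via userdata, so the AMI itself carries no Dgraph state.
curl -sSf https://get.dgraph.io | bash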

Configuring vs Remediation (Recovery): I should point out that this scenario is only good for the initial installation of Dgraph, but it will not work for remediation, such as when a Dgraph Alpha or Dgraph Zero fails. Because the data state must be married to the configuration state, change-configuration tools almost always fail in this setup, as they only support synchronous service discovery (Chef, Puppet, SaltStack, Ansible). In that scenario, rebuilding from scratch and doing a restore or import is the only available option for remediation.

Minimum Automation Needed for Recovery: The minimum automation needed for a stateful distributed cluster with a higher level of HA would be some sort of asynchronous service discovery with automated DNS updates, plus references to external volumes, so that if alpha-01 is replaced, for example, it comes back with the same hostname and external volume.

Recommended Solution: For this reason, we recommend Kubernetes, as it has all of this automation bundled into the platform through the StatefulSet object and is the most well-supported platform for this use case. A StatefulSet automates external volumes and ties them to the identity of the node, along with automated DNS updates and asynchronous service discovery.

Segue: There may be other solutions, like HashiCorp Nomad or a custom homegrown solution with consul-template and Portworx, but so far no one has reported using them. A few people used Docker Swarm in the past, but Mirantis (the current owners) are putting their energy and business toward Kubernetes.

@joaquin Thank you for the response. The AMIs don’t have Dgraph baked into them. As you recommended, Dgraph is installed via a shell script that is invoked by the userdata. After installation, another shell script, managed by monit, checks the state API, dynamically sets the IP address of the Zero leader, and starts the Alpha process. IPs are used exclusively for internal connections, not hostnames.
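(For the curious, the discovery step is essentially the following, assuming Zero’s HTTP port 6080 and jq on the box; the flags and addresses are simplified here:)

ZERO_LEADER=$(curl -s localhost:6080/state | jq -r '.zeros[] | select(.leader == true) | .addr')
dgraph alpha --my "$(hostname -i):7080" --zero "$ZERO_LEADER"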

Terminating the bad instance seems to be fine for now, but I’m just curious whether there are any known scenarios that might cause this. Agreed that Kubernetes would be the ideal solution, but it wasn’t an option for us at the time our implementation was set up. It might be possible for us to use K8s in the near future, so I’ll start down that path as soon as possible. Thanks again.

There can be all sorts of reasons, including hardware failing (memory or disk), so there will always be a need for recovery. I’m sure you’ve heard this before: it’s not a matter of if, but a matter of when.

I can help you get set up in Kubernetes in 33-35 minutes (10 minutes of instruction, 20 minutes for provisioning the VPC with public/private subnets + EKS, and 3-5 minutes for Dgraph using the helm chart).

You can just use https://eksctl.io/:

eksctl create cluster \
  --name $EKS_CLUSTER_NAME \
  --region $EKS_CLUSTER_REGION
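and then the Dgraph helm chart (the release name is just an example):

helm repo add dgraph https://charts.dgraph.io
helm repo update
helm install my-release dgraph/dgraph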

I wrote an article on the topic of automating Route53 during deployment, so it has a how-to with eksctl plus an IRSA (IAM Role for Service Account) setup, should anything running on K8s need access to a cloud resource, such as an S3 bucket for backups:

On monitoring, Dgraph has built-in metrics, which can be scraped by a tool that can poll the service, so this might be useful. I have some boilerplate code to get started on this in the contrib directory. The Monit setup sounds interesting.
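For example, a quick way to see what an Alpha exposes (8080 being the Alpha HTTP port; Zero serves the same path on 6080):

curl -s localhost:8080/debug/prometheus_metrics | head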


Thank you, I will start looking at this next week. I also came across your article here, which I’m very interested in.