Alpha fails on startup

I’ve got a Dgraph cluster, consisting of a single Alpha and a single Zero, deployed in ECS. It’s been happily running away for the past couple of months. We recently updated the schema, which is now significantly larger than it was previously.

The Alpha node just won’t start any more. The significant line in the logs is:

2021/02/05 15:53:12 Buffer length: 285038378 greater than file size: 14093. Manifest file might be corrupted

The Alpha has 16384 MB memory and 4096 vCPU. Alpha is started with:

	["dgraph","alpha","--my=alpha.develop.dgraph.imaging:7080","--zero=zero.develop.dgraph.imaging:5080","--lru_mb=5460","--whitelist=10.250.0.0:10.250.2.254"]

Can anyone help? Let me know if you need any more information.

What version of Dgraph are you running? Based on the error message it looks like the MANIFEST file is corrupted. Do you have a data export or backup that you can restore from?

We’re not in production, so any loss of data doesn’t matter. We’re using Dgraph version 20.07.2.

What is the Manifest file? If necessary we could purge the EFS that persists our data, but it feels slightly heavy-handed.

The MANIFEST file is a Badger file in the p directory of the Dgraph Alpha node. For example, when listing a p directory you’ll see the MANIFEST file as one of the files:

$ ls ./p
000001.vlog  00001.mem  DISCARD  KEYREGISTRY  LOCK  MANIFEST
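If it helps, here is a minimal sketch for checking the size of the MANIFEST on disk so you can compare it with the "file size" figure in the log line. The `manifest_size` helper name and the `./p` default path are assumptions for illustration only; point it at wherever your Alpha's p directory is actually mounted.

```shell
# Hypothetical helper: print the size in bytes of the MANIFEST file
# inside a given Badger directory (defaults to ./p).
manifest_size() {
  dir="${1:-./p}"
  if [ ! -f "$dir/MANIFEST" ]; then
    echo "no MANIFEST in $dir" >&2
    return 1
  fi
  # wc -c prints the byte count; tr strips the padding some platforms add
  wc -c < "$dir/MANIFEST" | tr -d ' '
}
```

Running `manifest_size ./p` prints the byte count. This only reports what is on disk; it does not validate or repair the file.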

I’d recommend upgrading to the latest Dgraph release (currently v20.11.1) if you’re going to redeploy your cluster. Also, we don’t usually recommend EFS/NFS for a proper Dgraph setup (see our production checklist docs). It could be that persisting data over EFS caused the issue here.

Are you running the Alpha node with four thousand CPU cores? Or does “4096 vCPU” here mean 4 cores?

Sorry, ECS measures CPU in CPU units (1024 units per vCPU), so 4096 means 4 vCPUs.
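For anyone else reading: ECS task definitions express CPU in units, where 1024 units correspond to one vCPU. A quick sketch of the conversion (the `units_to_vcpus` name is just made up for this example):

```shell
# ECS expresses CPU in units, where 1024 units = 1 vCPU.
# units_to_vcpus is a hypothetical helper name.
# Note: shell integer division truncates, so fractional
# allocations such as 512 units would print 0 here.
units_to_vcpus() {
  echo $(( $1 / 1024 ))
}

units_to_vcpus 4096   # prints 4
```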

I’ve given it more CPU and memory. If the problem occurs again, I’ll swap out EFS for EBS.

Out of interest, how do you guys recommend deploying in AWS? This is in Fargate with EFS. Do you suggest an EC2 cluster with EBS?

Thanks for clarifying, @SamJBentley. Our Production Checklist docs go over the recommended setup, which uses EBS volumes with high IOPS for performance.

Unless you’re using EFS for some specific reason, I wouldn’t recommend it for running Dgraph.

You can also use Dgraph Cloud hosted on AWS instead of having to host Dgraph yourself.