I’ve got a Dgraph cluster, consisting of a single Alpha and a single Zero, deployed in ECS. It’s been running happily for the past couple of months. We recently updated the schema, which is now significantly larger than it was previously.
The Alpha node just won’t start any more. The significant line in the logs is:
2021/02/05 15:53:12 Buffer length: 285038378 greater than file size: 14093. Manifest file might be corrupted
The Alpha has 16384 MB memory and 4096 vCPU. Alpha is started with:
What version of Dgraph are you running? Based on the error message it looks like the MANIFEST file is corrupted. Do you have a data export or backup that you can restore from?
The MANIFEST file is a Badger file in the p directory of the Dgraph Alpha node. For example, when you list the contents of a p directory, you’ll see the MANIFEST file as one of the files:
$ ls ./p
000001.vlog 00001.mem DISCARD KEYREGISTRY LOCK MANIFEST
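To compare what is on disk with the "file size: 14093" figure in the log line, you could check the MANIFEST size directly. The snippet below is only a rough sketch: `manifest_size` is a hypothetical helper name, not a Dgraph or Badger command, and it assumes you can get a shell on the volume that holds the p directory. It only reads the file and does not repair anything.

```shell
# Hypothetical helper (not part of Dgraph or Badger): print the size in bytes
# of the MANIFEST file inside the given p directory.
manifest_size() {
  manifest="$1/MANIFEST"
  if [ ! -f "$manifest" ]; then
    # Report a missing file on stderr and signal failure to the caller.
    echo "no MANIFEST found in $1" >&2
    return 1
  fi
  # wc -c counts bytes; tr strips the padding some platforms add to the output.
  wc -c < "$manifest" | tr -d ' '
}
```

For example, `manifest_size ./p` prints the byte count, which you can compare against the file size reported in the Alpha log.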
I’d recommend upgrading to the latest Dgraph release (currently v20.11.1) if you’re going to redeploy your cluster. And, we don’t usually recommend EFS/NFS for a proper Dgraph setup (see our production checklist docs). It could be that persisting data over EFS caused the issue here.
Are you running the Alpha node with four thousand CPU cores? Or does “4096 vCPU” here mean 4 cores?
Thanks for clarifying, @SamJBentley. Our Production Checklist docs go over the recommended setup, which uses EBS volumes with high IOPS for performance.
Unless you’re using EFS for some specific reason, I wouldn’t recommend running Dgraph on it.
You can also use Dgraph Cloud hosted on AWS instead of having to host Dgraph yourself.