Alpha fails on startup

SamJBentley · February 5, 2021, 4:00pm

I’ve got a Dgraph cluster, consisting of a single Alpha and a single Zero, deployed in ECS. It’s been happily running away for the past couple of months. We recently updated the schema, which is now significantly larger than it was previously.

The Alpha node just won’t start any more. The significant line in the logs is:

2021/02/05 15:53:12 Buffer length: 285038378 greater than file size: 14093. Manifest file might be corrupted
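
For scale, the two numbers in that message can be pulled out and compared. This is only a rough Python sketch: the function name `parse_manifest_error` and the regular expression are illustrative assumptions based on the single log line above, not part of Dgraph or Badger.

```python
import re
from typing import Optional, Tuple

# Pattern modelled on the one log line quoted in this thread (an assumption;
# other Dgraph/Badger versions may word the message differently).
_PATTERN = re.compile(r"Buffer length: (\d+) greater than file size: (\d+)")


def parse_manifest_error(line: str) -> Optional[Tuple[int, int]]:
    """Return (buffer_length, file_size) in bytes, or None if the line does not match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log_line = (
    "2021/02/05 15:53:12 Buffer length: 285038378 greater than "
    "file size: 14093. Manifest file might be corrupted"
)
parsed = parse_manifest_error(log_line)
if parsed is not None:
    buffer_length, file_size = parsed
    print(f"reported length: {buffer_length} bytes, file size: {file_size} bytes")
```

The reported length (about 285 MB) is far larger than the roughly 14 KB file, which is consistent with the log’s own suggestion that the manifest may be corrupted.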

The Alpha has 16384 MB memory and 4096 vCPU. Alpha is started with:

	["dgraph","alpha","--my=alpha.develop.dgraph.imaging:7080","--zero=zero.develop.dgraph.imaging:5080","--lru_mb=5460","--whitelist=10.250.0.0:10.250.2.254"]

Can anyone help? Let me know if you need any more information.

dmai · February 5, 2021, 5:18pm

What version of Dgraph are you running? Based on the error message it looks like the MANIFEST file is corrupted. Do you have a data export or backup that you can restore from?

SamJBentley · February 5, 2021, 5:24pm

We’re not in production, so any loss of data doesn’t matter. We’re using Dgraph version 20.07.2.

What is the Manifest file? If necessary we could purge the EFS that persists our data, but it feels slightly heavy-handed.

dmai · February 5, 2021, 6:57pm

The MANIFEST file is a Badger file in the p directory of a Dgraph Alpha node. For example, when you ls a p directory, you’ll see the MANIFEST file as one of the files:

$ ls ./p
000001.vlog  00001.mem  DISCARD  KEYREGISTRY  LOCK  MANIFEST
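
To compare the MANIFEST size on disk with the file size in the error message, something like the following Python sketch could be used. The `./p` path and the helper name `list_file_sizes` are assumptions for illustration; point it at wherever your Alpha keeps its p directory.

```python
import os
from typing import Dict


def list_file_sizes(directory: str) -> Dict[str, int]:
    """Return {file name: size in bytes} for regular files directly inside `directory`."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


if __name__ == "__main__":
    p_dir = "./p"  # assumed location of the Alpha's p directory
    if os.path.isdir(p_dir):
        for name, size in sorted(list_file_sizes(p_dir).items()):
            print(f"{size:>12}  {name}")
    else:
        print(f"{p_dir} not found; adjust p_dir for your deployment")
```

This only reads file metadata, so it should be safe to run against the directory without modifying anything.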

I’d recommend upgrading to the latest Dgraph release (currently v20.11.1) if you’re going to redeploy your cluster. And, we don’t usually recommend EFS/NFS for a proper Dgraph setup (see our production checklist docs). It could be that persisting data over EFS caused the issue here.

Are you running the Alpha node with four thousand CPU cores? Or does “4096 vCPU” here mean 4 cores?

SamJBentley · February 8, 2021, 5:33pm

Sorry, ECS measures CPU in CPU units (1024 units per vCPU), so 4096 means 4 CPUs.
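
As a quick check of that conversion, here is a minimal Python sketch, assuming the ECS convention of 1024 CPU units per vCPU. The function and constant names are illustrative only.

```python
# ECS task definitions express CPU in "CPU units"; 1024 units correspond to one vCPU.
ECS_CPU_UNITS_PER_VCPU = 1024


def ecs_cpu_units_to_vcpus(cpu_units: int) -> float:
    """Convert an ECS CPU-unit value to the equivalent number of vCPUs."""
    return cpu_units / ECS_CPU_UNITS_PER_VCPU


print(ecs_cpu_units_to_vcpus(4096))  # prints 4.0
```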

I’ve given it more CPU and memory. If the problem occurs again, I’ll swap out EFS for EBS.

Out of interest, how do you guys recommend deploying in AWS? This is in Fargate with EFS. Do you suggest an EC2 cluster with EBS?

dmai · February 9, 2021, 12:09am

Thanks for clarifying, @SamJBentley. Our Production Checklist docs go over the recommended setup, which uses EBS volumes with high IOPS for performance.

Unless you’re using EFS for some specific reason, I wouldn’t recommend it for running Dgraph.

You can also use Dgraph Cloud hosted on AWS instead of having to host Dgraph yourself.
