Concurrent Backups Can Lead to Issues

joaquin · August 31, 2020, 6:34pm

Report a Dgraph Bug

The issue arose when incremental backups were made without a full backup.

What version of Dgraph are you using?

v20.03.4

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

Amazon Linux EKS nodes with Dgraph Ubuntu containers

Steps to reproduce the issue (command/config used to run Dgraph).

We are still researching the exact steps to reproduce this issue, but one probable way it can occur is the following:

Do backups simultaneously on two different alpha nodes, e.g. alpha-0 and alpha-2.
Use the same backup destination for all backups, e.g. s3 bucket or NFS mount.

Expected behavior and actual result.

In summary, there is an assumption that after you start a new backup, all subsequent backups belong to the same series until a new full backup is forced. If this assumption is broken, we can get into this scenario.
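To make the broken assumption concrete, here is a toy model of two alphas writing to one shared destination. It is only a sketch under stated assumptions: `Destination`, `plan_backup`, `commit_backup`, and `find_problems` are hypothetical names, and the timestamp bookkeeping is a simplification, not Dgraph's actual manifest format or backup code.

```python
# Toy model only: this is NOT Dgraph's backup code. It mimics a shared
# destination where each backup records the timestamp range it covers.

class Destination:
    def __init__(self):
        self.manifests = []  # backup records, in the order they were written

    def last_ts(self):
        # Timestamp covered by the most recent backup, or 0 if there is none.
        return self.manifests[-1]["until"] if self.manifests else 0


def plan_backup(dest, now):
    """Read the destination and decide what the next backup should cover."""
    since = dest.last_ts()
    kind = "full" if since == 0 else "incremental"
    return {"type": kind, "since": since, "until": now}


def commit_backup(dest, plan):
    """Write the planned backup record to the shared destination."""
    dest.manifests.append(plan)


def find_problems(manifests):
    """Return indices of records that do not continue from the previous one."""
    problems = []
    expected_since = 0
    for i, m in enumerate(manifests):
        if m["since"] != expected_since:
            problems.append(i)
        expected_since = m["until"]
    return problems


dest = Destination()
commit_backup(dest, plan_backup(dest, now=10))  # initial full backup

# alpha-0 and alpha-2 both read the destination before either one writes.
plan_alpha0 = plan_backup(dest, now=20)
plan_alpha2 = plan_backup(dest, now=21)
commit_backup(dest, plan_alpha0)
commit_backup(dest, plan_alpha2)

# The last record starts from timestamp 10 instead of 20, so the
# destination no longer holds a single linear series.
problems = find_problems(dest.manifests)
```

If the two backups had run one after the other (plan, commit, plan, commit), `find_problems` would return an empty list in this model.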

As a workaround, we can use a separate folder for each backup series and make sure that backups which may occur concurrently are always executed on the same alpha node.

LGalatin · September 16, 2020, 11:28pm

Fixed by this PR: Fix(Dgraph): Add a lock to backups to process one request at a time. … · dgraph-io/dgraph@5b79260 · GitHub

mrjn · September 17, 2020, 3:32am

Won’t work if backup requests are sent to different alphas. To have only one backup at a time across the cluster, we need to involve zero.
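A rough sketch of why a per-alpha lock is not enough: each alpha is a separate process, so a lock held inside one is invisible to the others. The `Alpha` class below is a hypothetical stand-in, and the non-blocking acquire is used only to make the outcome visible; it is not a description of how the linked fix is implemented.

```python
import threading


class Alpha:
    """Stand-in for one alpha process with its own in-process backup lock."""

    def __init__(self, name, lock=None):
        self.name = name
        # Hypothetical: a lock created here is known only to this "process".
        # Passing one in models an external, cluster-wide coordinator.
        self.backup_lock = lock if lock is not None else threading.Lock()

    def try_start_backup(self):
        # Non-blocking acquire: True means this alpha believes it may proceed.
        return self.backup_lock.acquire(blocking=False)

    def finish_backup(self):
        self.backup_lock.release()


# Per-alpha locks: both alphas start a backup at the same time.
a0, a2 = Alpha("alpha-0"), Alpha("alpha-2")
per_alpha = (a0.try_start_backup(), a2.try_start_backup())

# A second request to the SAME alpha is held back, which is the case
# a per-alpha lock does cover.
same_alpha_again = a0.try_start_backup()

# One shared lock (the role a coordinator such as zero would play):
# only the first alpha may proceed.
shared = threading.Lock()
b0, b2 = Alpha("alpha-0", shared), Alpha("alpha-2", shared)
coordinated = (b0.try_start_backup(), b2.try_start_backup())
```

In this model `per_alpha` shows both backups starting, while `coordinated` shows the second one being refused until the first calls `finish_backup()`.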

martinmr · September 17, 2020, 3:23pm

Currently backups and restores do not involve zero at all and moving to using zero for coordination would be a bigger change. That’s why I wasn’t sure if we should involve zero.

@mrjn, should I work on that?

joaquin · September 17, 2020, 4:01pm

In the interim, the dgraph helm chart sends backup requests only to a single alpha, rather than to an svc that can hit any alpha0…n pod and run into this scenario.

https://github.com/dgraph-io/charts/blob/backup_support/charts/dgraph/templates/backups/cronjob-full.yaml#L47-L49

I’ll have to keep it this way until fix w/ zero (or other coordination, like a backup-mgr) is cherry-picked to all supported versions, e.g. 1.2.x, etc.
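As an illustration of pinning backups to one alpha, the sketch below builds (but does not send) a backup request addressed to a specific alpha pod instead of a load-balanced service. The release name, namespace, and headless service hostname pattern are assumptions for the example, not values taken from the chart. The `backup` mutation on the alpha `/admin` endpoint is written as I understand it for v20.03, so check the documentation for your version before relying on it.

```python
import json


def alpha_admin_url(index, release="my-release", namespace="default"):
    """Build the admin URL of one specific alpha pod.

    The hostname pattern is a hypothetical StatefulSet/headless-service
    layout; adjust it to whatever your deployment actually uses.
    """
    host = (
        f"{release}-dgraph-alpha-{index}."
        f"{release}-dgraph-alpha-headless.{namespace}.svc"
    )
    return f"http://{host}:8080/admin"


def backup_request_body(destination, force_full=False):
    """Return the JSON body for a GraphQL backup mutation."""
    query = (
        "mutation { backup(input: {destination: %s, forceFull: %s}) "
        "{ response { code message } } }"
    ) % (json.dumps(destination), "true" if force_full else "false")
    return json.dumps({"query": query})


# Always target alpha-0, so two scheduled jobs cannot land on different alphas.
url = alpha_admin_url(0)
body = backup_request_body("s3://s3.us-west-2.amazonaws.com/my-bucket", force_full=True)
```

The body would be sent as an HTTP POST with `Content-Type: application/json`. Keeping the alpha index fixed is the point of the sketch: every scheduled job talks to the same node.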

martinmr · September 17, 2020, 4:15pm

@LGalatin I talked to Manish and he was fine with keeping the solution as is for now because using zero to sync backups would require a redesign of backups since zero is not being used at all right now.

I added a section on how to automate backups to our documentation explaining this caveat.
