Concurrent Backups Can Lead to Issues

Report a Dgraph Bug

The issue arose when incremental backups were made without a full backup.

What version of Dgraph are you using?

v20.03.4

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

  • Amazon Linux EKS nodes with Dgraph Ubuntu containers

Steps to reproduce the issue (command/config used to run Dgraph).

I am still researching the exact steps to reproduce this issue, but one probable way it can occur is the following:

  • Do backups simultaneously on two different alpha nodes, e.g. alpha-0 and alpha-2.
  • Use the same backup destination for all backups, e.g. s3 bucket or NFS mount.
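The two steps above could be scripted roughly as follows. This is a minimal sketch, assuming the alphas expose the GraphQL admin endpoint at `/admin` with a `backup` mutation, as documented for v20.03; the alpha URLs and the destination path are hypothetical placeholders, and `sender` is an illustrative hook so the function can be exercised without a live cluster.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_backup_request(alpha_url, destination):
    """Build a POST to an alpha's /admin endpoint asking for a backup."""
    # json.dumps quotes the destination so it is a valid string literal.
    query = (
        "mutation { backup(input: {destination: %s}) "
        "{ response { message code } } }" % json.dumps(destination)
    )
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        alpha_url + "/admin",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.read().decode("utf-8")


def trigger_concurrently(alpha_urls, destination, sender=send):
    """Fire one backup request per alpha at the same time, same destination."""
    requests = [build_backup_request(url, destination) for url in alpha_urls]
    with ThreadPoolExecutor(max_workers=len(requests)) as pool:
        return list(pool.map(sender, requests))


# Hypothetical usage against a test cluster (not executed here):
# trigger_concurrently(
#     ["http://alpha-0:8080", "http://alpha-2:8080"], "/dgraph/backups"
# )
```

Whether this reliably reproduces the problem is unconfirmed; it only automates the suspected trigger.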

Expected behavior and actual result.

In summary, there is an assumption that after you start a new backup, all the backups after that belong to the same series of backups until a new full backup is forced. If this assumption is broken, we can get into this scenario.

As a workaround, we can use a different folder for each backup series and make sure that any backups that occur concurrently are executed on the same alpha node.
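One way to keep the series apart is to derive the destination from a series name, so two series never write into the same folder. A minimal sketch; the base path and series names are hypothetical, and the helper is for illustration only, not part of Dgraph.

```python
def series_destination(base, series):
    """Return a per-series backup folder under a shared base destination."""
    # Strip slashes at the join so the result has exactly one separator.
    return base.rstrip("/") + "/" + series.strip("/")


# Hypothetical series names: each gets its own folder on the same NFS mount.
nightly = series_destination("/dgraph/backups", "nightly")
adhoc = series_destination("/dgraph/backups/", "adhoc")
```

Each folder then holds one full backup followed only by its own incremental backups.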

Fixed by this PR: https://github.com/dgraph-io/dgraph/commit/5b7926018adc5ed2173d0df5b1ac5796517566ed

This won’t work if backup requests are sent to different alphas. To allow only one backup at a time across the cluster, we need to involve zero.
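To illustrate the limitation, here is a toy model, assuming the guard is local to each alpha (which this reply implies but the thread does not spell out). It is not the code from the PR; the `Alpha` class and its methods are hypothetical.

```python
import threading


class Alpha:
    """Hypothetical stand-in for an alpha with a process-local backup guard."""

    def __init__(self, name):
        self.name = name
        self._backup_lock = threading.Lock()

    def try_start_backup(self):
        # Non-blocking acquire: False means this alpha is already backing up.
        return self._backup_lock.acquire(blocking=False)

    def finish_backup(self):
        self._backup_lock.release()


alpha0, alpha2 = Alpha("alpha-0"), Alpha("alpha-2")

first = alpha0.try_start_backup()   # alpha-0 starts a backup
second = alpha0.try_start_backup()  # rejected: alpha-0 is already busy
third = alpha2.try_start_backup()   # accepted: alpha-2 cannot see alpha-0's lock
```

The third request succeeds even though a backup is already running elsewhere, which is why cluster-wide exclusion needs a shared coordinator.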

Currently backups and restores do not involve zero at all and moving to using zero for coordination would be a bigger change. That’s why I wasn’t sure if we should involve zero.

@mrjn, should I work on that?

In the interim, the dgraph helm chart sends backup requests only to a single alpha, rather than to an svc that can hit any of the alpha0…n pods and run into this scenario.
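Pinning requests to one pod can rely on the stable DNS names that Kubernetes gives StatefulSet pods behind a headless service. A minimal sketch of building such a URL; the pod, service, and namespace names below are hypothetical and may not match what the chart actually generates, and port 8080 is assumed to be the alpha HTTP port.

```python
def pod_url(pod, headless_service, namespace="default", port=8080):
    """Build the URL of one specific StatefulSet pod via its stable DNS name."""
    host = f"{pod}.{headless_service}.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}"


# Always target the first alpha pod instead of a load-balanced service.
backup_target = pod_url("dgraph-alpha-0", "dgraph-alpha-headless", "dgraph")
```

Because every backup request goes to the same pod, concurrent requests cannot land on two different alphas.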

I’ll have to keep it this way until a fix w/ zero (or other coordination, like a backup-mgr) is cherry-picked to all supported versions, e.g. 1.2.x.

@LGalatin I talked to Manish and he was fine with keeping the solution as is for now because using zero to sync backups would require a redesign of backups since zero is not being used at all right now.

I added a section on how to automate backups to our documentation explaining this caveat.
