The issue arose when incremental backups were made without a full backup.
What version of Dgraph are you using?
v20.03.4
Have you tried reproducing the issue with the latest release?
yes
What is the hardware spec (RAM, OS)?
Amazon Linux EKS nodes with Dgraph Ubuntu containers
Steps to reproduce the issue (command/config used to run Dgraph).
At this moment I am still researching the steps to reproduce this issue, but one probable way it can occur is the following (a repro sketch follows the steps):

1. Run backups simultaneously on two different alpha nodes, e.g. alpha-0 and alpha-2.
2. Use the same backup destination for all backups, e.g. an S3 bucket or NFS mount.
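A minimal sketch of those two steps, assuming the `/admin/backup` HTTP endpoint that enterprise alphas expose in this version; the hostnames and the bucket URL are hypothetical placeholders:

```python
# Repro sketch: trigger backups on two alphas at the same time, pointing
# both at the same destination. Hostnames and bucket are placeholders.
import threading
import urllib.parse
import urllib.request

DESTINATION = "s3://s3.us-east-1.amazonaws.com/my-backups"  # shared by both alphas

def trigger_backup(alpha_host: str) -> None:
    # POST destination=... to this alpha's /admin/backup endpoint.
    data = urllib.parse.urlencode({"destination": DESTINATION}).encode()
    req = urllib.request.Request(f"http://{alpha_host}:8080/admin/backup", data=data)
    with urllib.request.urlopen(req) as resp:
        print(alpha_host, resp.status, resp.read().decode())

# Fire both backups concurrently against two different alphas.
threads = [threading.Thread(target=trigger_backup, args=(host,))
           for host in ("alpha-0.dgraph-alpha", "alpha-2.dgraph-alpha")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```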
Expected behavior and actual result.
To summarize the situation: there is an assumption that after you start a new backup, all subsequent backups belong to the same series until a new full backup is forced. When this assumption is broken, we can get into this scenario.
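To make the failure mode concrete, here is a toy model of that assumption (illustrative only, not Dgraph code): a backup is full only when the destination is empty, so a second alpha writing into an already-used destination produces an incremental that depends on a full backup it never took.

```python
# Toy model of the series assumption; illustrative, not Dgraph code.
from dataclasses import dataclass, field

@dataclass
class Destination:
    """A shared backup destination, modeled as an ordered list of backups."""
    backups: list = field(default_factory=list)

    def backup(self, alpha: str) -> str:
        # A backup is full only when the destination is empty; otherwise it
        # is treated as the next incremental in the existing series.
        kind = "full" if not self.backups else "incremental"
        entry = f"{kind} from {alpha}"
        self.backups.append(entry)
        return entry

shared = Destination()
print(shared.backup("alpha-0"))  # full from alpha-0: starts the series
print(shared.backup("alpha-2"))  # incremental from alpha-2: rides on a full
                                 # backup that alpha-2 never took
```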
As a workaround, we can use a separate folder for each backup series and make sure that backups which may run concurrently are always executed against the same alpha node (sketched below).
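Sketching that workaround, with hypothetical series names, folder layout, and alpha hostname:

```python
# Workaround sketch: one destination folder per backup series, and every
# series pinned to a single alpha. All names here are illustrative.
BUCKET = "s3://s3.us-east-1.amazonaws.com/my-backups"

# series name -> (destination folder, alpha that runs every backup in it)
SERIES = {
    "daily": (f"{BUCKET}/daily", "alpha-0.dgraph-alpha"),
    "hourly": (f"{BUCKET}/hourly", "alpha-0.dgraph-alpha"),
}

def backup_target(series: str) -> tuple:
    # Every backup in a series uses the same folder and the same alpha, so
    # concurrent series can never interleave manifests in one destination.
    return SERIES[series]

folder, alpha = backup_target("daily")
print(f"POST http://{alpha}:8080/admin/backup  destination={folder}")
```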
Currently, backups and restores do not involve zero at all, and moving to using zero for coordination would be a bigger change. That's why I wasn't sure whether we should involve zero.
In the interim, the Dgraph Helm chart sends backups only to a single alpha, rather than to a service that can hit any of the alpha-0…n pods and run into this scenario.
I'll have to keep it this way until a fix with zero (or other coordination, like a backup manager) is cherry-picked to all supported versions, e.g. 1.2.x, etc.
@LGalatin I talked to Manish, and he was fine with keeping the solution as is for now, because using zero to sync backups would require a redesign of backups, since zero is not being used at all right now.
I added a section to our documentation on how to automate backups, which explains this caveat.