Broken database, unable to recover from VM restore


I’ve been running Dgraph on a virtual server for a few months. Now, after I messed something up, I had to restore the server to an earlier point in time. Since then, however, the database appears to be broken:

  • several edges and nodes have disappeared (checked with Ratel)
  • some queries fail with an error message mentioning incorrect timestamps, e.g.:
    Could not query predicate statistics: ": cannot retrieve default value from list with key 0000046e616d65000000000000000010: cannot retrieve posting for UID 18446744073709551615 from list with key 0000046e616d65000000000000000010: readTs: 1400037 less than minTs: 1411132 for key: “\x00\x00\x04name\x00\x00\x00\x00\x00\x00\x00\x00\x10"”

Anyway, I thought I could consider myself lucky because I had created a tarball of the p, w and zw folders just before I restored the backup. But no luck: even after I restore those files and restart the Alpha, the database stays just as broken.

I’m running out of ideas for how to rescue the data. Any help will be appreciated!


This error happens when you delete the Zero WAL, which contains the UIDs and timestamps. Are you using the same instance? Did you include the Zero files in the tarball? The Alphas and Zero can’t work without each other’s information.

It’s a simple setup, with the Zero and a single Alpha running on the same machine. The contents of the tarball:

$ ls -l p
insgesamt 393008
-rw------- 1 n n 12260922 24. Jul 22:41 000017.vlog
-rw------- 1 n n 71205867  5. Sep 19:09 000018.vlog
-rw------- 1 n n 15687857 24. Jul 20:28 000120.sst
-rw------- 1 n n 16983415 24. Jul 20:28 000121.sst
-rw------- 1 n n 15768404 24. Jul 20:28 000122.sst
-rw------- 1 n n 17723682 24. Jul 21:43 000125.sst
-rw------- 1 n n 14174462 24. Jul 21:43 000127.sst
-rw------- 1 n n 22393783 24. Jul 21:43 000128.sst
-rw------- 1 n n 33308285 24. Jul 21:43 000129.sst
-rw------- 1 n n 21686818 24. Jul 21:43 000130.sst
-rw------- 1 n n 17937816 24. Jul 21:43 000131.sst
-rw------- 1 n n 20334864 24. Jul 21:43 000132.sst
-rw------- 1 n n 12954608 24. Jul 21:43 000133.sst
-rw------- 1 n n 11196490 24. Jul 21:43 000134.sst
-rw------- 1 n n 14817536 24. Jul 21:43 000135.sst
-rw------- 1 n n 19241264 24. Jul 21:43 000136.sst
-rw------- 1 n n 26678501 24. Jul 21:43 000137.sst
-rw------- 1 n n 16111800 24. Jul 21:43 000138.sst
-rw------- 1 n n  5847503 24. Jul 21:43 000139.sst
-rw------- 1 n n 16069438 24. Jul 22:37 000140.sst
-rw------- 1 n n       28 24. Jul 19:04 KEYREGISTRY
-rw------- 1 n n     1669 24. Jul 22:37 MANIFEST
$ ls -l w
insgesamt 32336
-rw------- 1 n n 8638935 24. Jul 22:41 000277.sst
-rw------- 1 n n 9830701 16. Aug 13:55 000282.sst
-rw------- 1 n n 2405244 16. Aug 13:55 000439.vlog
-rw------- 1 n n 3072083 18. Aug 12:36 000440.vlog
-rw------- 1 n n 1048308 18. Aug 13:17 000441.vlog
-rw------- 1 n n 2587909 20. Aug 16:27 000442.vlog
-rw------- 1 n n 1635725 29. Aug 23:31 000443.vlog
-rw------- 1 n n 1667854  3. Sep 18:07 000444.vlog
-rw------- 1 n n  982334  3. Sep 19:06 000445.vlog
-rw------- 1 n n 1209720  5. Sep 23:07 000446.vlog
-rw------- 1 n n      28 24. Jul 19:04 KEYREGISTRY
-rw------- 1 n n    1187 24. Jul 22:41 MANIFEST
$ ls -l zw
insgesamt 49656
-rw------- 1 n n 35139355 24. Jul 22:41 000002.vlog
-rw------- 1 n n 15698609 24. Jul 22:41 000007.sst
-rw------- 1 n n       28 24. Jul 19:04 KEYREGISTRY
-rw------- 1 n n       62 24. Jul 22:41 MANIFEST

What’s peculiar is that only a few files are dated newer than 24 Jul, although there have been many and frequent updates since then.

Running dgraph v20.03.3

What exactly did you do?

Did you unpack the tarball into a fresh setup, or on top of the already running Dgraph?

You can’t do point-in-time backups that way, though.

I did:

systemctl stop dgraph-zero
systemctl stop dgraph-alpha
cd /var/lib/dgraph
tar -xjf backup.tar.bz2
systemctl start dgraph-zero
systemctl start dgraph-alpha

Which is the right way then? If I can’t recover from failures, I can’t run a production environment :slight_smile:

We have point-in-time backups (I’m not sure whether they already ship in the binaries or are still in development), but that is an Enterprise Edition (EE) feature. What you can do is create a cron job that regularly exports the data as RDF or JSON. That way you get backups similar to point-in-time ones.
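A scheduled export could be set up roughly like this. It is only a sketch under a few assumptions: the Alpha's HTTP port is the default 8080, your Dgraph version accepts the `export` mutation on the `/admin` GraphQL endpoint (older releases used a plain GET on `/admin/export` instead, so check the docs for your version), and the script path in the crontab comment is made up. The request is only sent when the script is called with `run`; otherwise it just prints the payload.

```shell
#!/usr/bin/env bash
# Sketch of a periodic RDF export. Assumptions: Alpha listens on
# localhost:8080 and supports the /admin GraphQL "export" mutation.
ALPHA_ADMIN="http://localhost:8080/admin"

# JSON body carrying the GraphQL mutation that requests an RDF export.
export_payload() {
  printf '%s' '{"query":"mutation { export(input: {format: \"rdf\"}) { response { message code } } }"}'
}

# Send the export request; run this on the machine hosting the Alpha.
request_export() {
  curl -s -H "Content-Type: application/json" -d "$(export_payload)" "$ALPHA_ADMIN"
}

# Only contact Alpha when explicitly asked, e.g. from a crontab entry:
#   0 3 * * * /usr/local/bin/dgraph-export.sh run   (hypothetical path)
if [ "${1:-}" = "run" ]; then
  request_export
else
  export_payload; echo
fi
```

The exported files end up in the Alpha's export directory, so the cron job would still need to copy them somewhere off the machine.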

Another way to do this is to use OpenZFS. Ubuntu and BSD support it, and there is a package that you can install on macOS or Windows.

OpenZFS has snapshots that give you the result you want. You can schedule it to create snapshots of the disks periodically, and if something fails, you can recover from one of them. The problem is that this is outside Dgraph’s scope. You have to look for support for your own setup (BTW, I think there’s no ZFS for containers).

Personally, I use TrueNAS, which has built-in controls for snapshots and all the other ZFS features. I don’t use it with Dgraph, but it would be easy to create a jail (BSD Jails) just for Dgraph. I think you would have to build Dgraph for BSD, though.

TrueNAS would be your best shot at a really good backup approach out of the box. And ZFS, for me, is by far the best file system out there.
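For the periodic snapshots mentioned above, a cron-driven script might look something like this. It is a sketch only: `tank/dgraph` is a hypothetical dataset name, and it assumes the p, w and zw directories all live on that one dataset so that they are captured at the same instant. The script prints the `zfs` commands instead of running them, since they need root and a real pool.

```shell
#!/usr/bin/env bash
# Sketch of a timestamped ZFS snapshot. "tank/dgraph" is a hypothetical
# dataset assumed to hold the p, w and zw directories together.
DATASET="tank/dgraph"

# Build a snapshot name such as tank/dgraph@auto-20200906-030000.
snapshot_name() {
  printf '%s@auto-%s' "$DATASET" "$(date +%Y%m%d-%H%M%S)"
}

# Print the commands rather than executing them.
snap="$(snapshot_name)"
echo "zfs snapshot $snap"   # create the snapshot
echo "zfs rollback $snap"   # later: roll back (needs -r if newer snapshots exist)
```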

This is the issue. Did you retry the query?

Thank you, this is really helpful!

Yeah, the gap between readTs and minTs keeps shrinking. I’ll let you know when it hits the threshold!

You can accelerate that with this call to Zero.


curl http://localhost:6080/assign?what=timestamps&num=100

results in

{"errors":[{"message":"num not passed","extensions":{"code":"ErrorInvalidRequest"}}]}

You need quotes around the URL to keep bash happy.
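For anyone hitting the same thing: unquoted, bash treats the `&` as "run in the background", so curl only gets `...?what=timestamps` and `num=100` becomes a separate shell variable assignment. That is why Zero answers "num not passed". A small sketch of the quoted form, using the same host and port as above (it prints the command rather than sending it, so it works without a running Zero):

```shell
#!/usr/bin/env bash
# Double quotes keep "?" and "&" as part of the URL, so the whole string,
# including "&num=100", stays a single value.
url="http://localhost:6080/assign?what=timestamps&num=100"

# Print the properly quoted command; paste the printed line into a terminal
# on the machine where Zero runs to actually send the request.
echo curl "\"$url\""
```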

If bash is happy, I’m happy :sweat_smile:

OK, the error has disappeared and I can query my data again. Great!

It still has to be noted that some data loss has occurred. I can’t tell exactly which parts of the data are affected, but at first sight it seems as if a random set of edges had been removed.

Anyway, thanks for now!

FYI. I’ve made a few observations on the implications of what has happened.

  1. After the recovery, all nodes from the initial data import I did in July seem to have survived. However, all the mutations I committed in the few days after the import seem to be gone. The mutations committed during the past few weeks, on the other hand, seem to have survived. It is as if there were a gap in the sequence of mutations.

  2. What led to this situation in the first place was that I added a fulltext index on the “name” predicate (type “string @lang”) through Ratel. After that I realized that a lot of names had disappeared, which made me restore a backup, which led to the aforementioned situation. Another funny thing is that some of the ‘name’ fields are gone in all languages, even those that had been added during the initial import.

Some points to note.

The way you created those backups is not recommended.
Dgraph does not suffer from data loss. What you are seeing may be related to the way you handled the backup manually, or to some other step that you may not have noticed.

It is recommended that you use the export to RDF instead of making manual backups with tar.

That doesn’t really happen. No data is lost or modified: whenever a user tries to forcefully modify something, they are discouraged from doing so. Either Ratel or the Dgraph API reports that it is not possible.

Generally, the lang directive requires that the values in the dataset carry language tags. If the dataset had values without a language tag, they are still there, just without any tag. You have to query for them without a lang tag.