BUG in incremental backups using vector predicates

Hello,

we noticed a problem with incremental backups. Since Dgraph v24 we have been testing vector predicates and found a problem after restoring (live) our backups.

The following behaviour could be observed:

  1. Created a new Dgraph v24.0.4 instance with an initial schema containing a node type with two predicates, name and vector (schema and an example mutation are sketched after this list)
  2. Set data on the name and vector predicates for some nodes and created an initial (full) backup
  3. After the first backup we changed / deleted some vector predicate values and then created an incremental backup
  4. Restored this incremental backup onto a new, clean Dgraph v24.0.4 system. There the vector predicates were inconsistent: all changes / deletions made before the incremental backup had disappeared and the state from the full backup was restored instead. Changes on non-vector predicates (like ‘name’) were restored as expected, also after the incremental backup.
  5. If we only run full backups and restore that data, all changes are restored as expected.
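
For reference, this is roughly the schema and the kind of mutation we used. Take it as a sketch of our test setup rather than the exact reproduction script; the type name, predicate names, vector length and metric are just what we happened to use:

    # DQL schema (simplified): a node type with a name and an HNSW-indexed vector
    type Item {
        name
        vector
    }
    name:   string        @index(term)                      .
    vector: float32vector @index(hnsw(metric: "euclidean")) .

    # Example mutation setting both predicates on a node
    {
      set {
        _:n <dgraph.type> "Item" .
        _:n <name> "node-1" .
        _:n <vector> "[0.1, 0.2, 0.3, 0.4]"^^<float32vector> .
      }
    }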

Could anybody confirm this behaviour?

Possibly related to the incremental backup and restore problem described above, we got an error when we tried to delete and rewrite (renew) some vectors (other vectors could be deleted and renewed without problems).
The database then crashed with the following log:

panic: runtime error: makeslice: len out of range

goroutine 278 [running]:
github.com/dgraph-io/dgraph/v24/tok/hnsw.decodeUint64MatrixUnsafe({0xc0001a2d00, 0x84e, 0x3c5b?}, 0xc00b44a8e0)
        /home/runner/work/dgraph/dgraph/tok/hnsw/helper.go:482 +0x45
github.com/dgraph-io/dgraph/v24/tok/hnsw.populateEdgeDataFromKeyWithCacheType({0xc00b4f65e0?, 0xc00b44a7a8?}, 0x3c5b?, {0x270eb40?, 0xc000015c20?}, 0xc00b44a8e0)
        /home/runner/work/dgraph/dgraph/tok/hnsw/helper.go:326 +0xa5
github.com/dgraph-io/dgraph/v24/tok/hnsw.(*persistentHNSW[...]).fillNeighborEdges(0xc000015d58, 0x3c5b?, {0x270eb40, 0xc000015c20}, 0xc00b44a8e0)
        /home/runner/work/dgraph/dgraph/tok/hnsw/persistent_hnsw.go:150 +0xa9
github.com/dgraph-io/dgraph/v24/tok/hnsw.(*persistentHNSW[...]).searchPersistentLayer(0x2736860, {0x270eb40, 0xc000015c20}, 0x0, 0x3c5b, {0xc00ba12000, 0x600, 0x600}, {0xc00b4e0800, 0x600, ...}, ...)
        /home/runner/work/dgraph/dgraph/tok/hnsw/persistent_hnsw.go:208 +0x647
github.com/dgraph-io/dgraph/v24/tok/hnsw.(*persistentHNSW[...]).insertHelper(0x2736860, {0x2712ab8, 0xc00b4f2300}, 0xc000015c20, 0x56ff62, {0xc00b4e0800, 0x600, 0x600})
        /home/runner/work/dgraph/dgraph/tok/hnsw/persistent_hnsw.go:462 +0x2a7
github.com/dgraph-io/dgraph/v24/tok/hnsw.(*persistentHNSW[...]).Insert(0xc00b48fb00?, {0x2712ab8?, 0xc00b4f2300?}, {0x270eb40?, 0xc000015c20?}, 0xc00b44ac01?, {0xc00b4e0800, 0x600, 0x600})
        /home/runner/work/dgraph/dgraph/tok/hnsw/persistent_hnsw.go:422 +0x5f
github.com/dgraph-io/dgraph/v24/posting.(*Txn).addIndexMutations(0xc00b48cd00, {0x2712ab8, 0xc00b4f2300}, 0xc00b44af50)
        /home/runner/work/dgraph/dgraph/posting/index.go:178 +0x5e7
github.com/dgraph-io/dgraph/v24/posting.(*List).AddMutationWithIndex(0xc0004ffec0, {0x2712ab8, 0xc00b4f2300}, 0xc00b4da090, 0xc00b48cd00)
        /home/runner/work/dgraph/dgraph/posting/index.go:604 +0x585
github.com/dgraph-io/dgraph/v24/worker.runMutation({0x2712a48?, 0x38721c0?}, 0xc00b4da090, 0xc00b48cd00)
        /home/runner/work/dgraph/dgraph/worker/mutation.go:125 +0x558
github.com/dgraph-io/dgraph/v24/worker.(*node).applyMutations.func3({0xc00b498780, 0x9, 0xc00b4c76b0?})
        /home/runner/work/dgraph/dgraph/worker/draft.go:520 +0x167
github.com/dgraph-io/dgraph/v24/worker.(*node).applyMutations(0x126499dea004f?, {0x2712a48, 0x38721c0}, 0xc00b498680)
        /home/runner/work/dgraph/dgraph/worker/draft.go:539 +0x10b5
github.com/dgraph-io/dgraph/v24/worker.(*node).applyCommitted(0xc0003e7480, 0xc00b498680, 0x126499dea004f)
        /home/runner/work/dgraph/dgraph/worker/draft.go:584 +0xe32
github.com/dgraph-io/dgraph/v24/worker.(*node).processApplyCh.func1({0xc00b4d6000, 0x3, 0xc00b62e670?})
        /home/runner/work/dgraph/dgraph/worker/draft.go:784 +0x57f
github.com/dgraph-io/dgraph/v24/worker.(*node).processApplyCh(0xc0003e7480)
        /home/runner/work/dgraph/dgraph/worker/draft.go:825 +0x212
created by github.com/dgraph-io/dgraph/v24/worker.(*node).InitAndStartNode in goroutine 250
        /home/runner/work/dgraph/dgraph/worker/draft.go:1884 +0x59e

The affected Dgraph instance was broken afterwards and could not be restarted.

We would be grateful if the development team could fix this issue and provide some recommendations on how we can get our system back on track without data loss (export as RDF and import into a new system?).
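
If export is the way to go, would something like the following against the /admin GraphQL endpoint be the recommended approach? Just a sketch on our side, so please correct the options if needed:

    mutation {
      export(input: { format: "rdf" }) {
        response {
          message
          code
        }
      }
    }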

cheers
Michael

Thanks a lot for the issue. Are you able to reproduce this easily / can you automate the failure?

Yes, we can reproduce the backup problem as described very easily on a new Dgraph instance.
Sorry, I don’t quite follow: what do you mean by automating the failure? Do you mean the panic? Not yet, but we still have a test instance running where we can reproduce it just by replacing a vector predicate.
Could this be related to the automatic indexing on the vector predicate? We increased the vector length between backups.

Yeah, changing the vector length can definitely cause this. Our vector predicate is very sensitive to length: the inserted vectors fix the length, and changing the length afterwards can cause issues. Normally we want to throw errors to surface this, but it looks like we missed some places.
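
To illustrate the failure mode (a minimal standalone sketch, not the actual decoder in tok/hnsw/helper.go): the edge data is decoded using a length derived from the stored bytes, and when the writer and the reader disagree about the vector length, that value can end up out of range and make() panics exactly as in your log.

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    func main() {
        // Hypothetical buffer whose length prefix no longer matches the data it
        // describes, e.g. written for vectors of a different dimension than the
        // reader now assumes.
        buf := make([]byte, 16)
        binary.LittleEndian.PutUint64(buf, ^uint64(0)) // absurd element count

        defer func() {
            // Prints: recovered: runtime error: makeslice: len out of range
            if r := recover(); r != nil {
                fmt.Println("recovered:", r)
            }
        }()

        n := binary.LittleEndian.Uint64(buf[:8])
        _ = make([]uint64, n) // panics: makeslice: len out of range
    }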