[21.03.2] Inconsistent edge/reverse edge generation with large mutations

Report a Dgraph Bug

We’re adding edges to a node, and the edge is defined with @reverse.
If the mutation contains more than ~700 N-Quads, the resulting state of the relationship is inconsistent.
For example, when always adding the same 700 relationships: navigating from parent to children over the direct edge I count 697 edges (this number can change on each upsert execution, but is consistent across read queries), yet if I count the edges via the reverse relationship, I find all 700 expected.
In other words, we end up with a reverse edge without the corresponding direct edge. More details below.

For now we ended up limiting mutations to a maximum of 500 N-Quads and sending multiple batches. This seems to give consistent results.
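The batching workaround can be sketched with a small helper that splits the uid list into chunks before building the requests. This is a minimal illustration, not our actual code; `chunkUids` is a hypothetical name:

```go
package main

import "fmt"

// chunkUids splits uids into consecutive batches of at most batchSize
// elements, so each upsert request stays under the empirically safe limit.
func chunkUids(uids []string, batchSize int) [][]string {
	var batches [][]string
	for len(uids) > 0 {
		n := batchSize
		if len(uids) < n {
			n = len(uids)
		}
		batches = append(batches, uids[:n])
		uids = uids[n:]
	}
	return batches
}

func main() {
	uids := make([]string, 1200)
	for i := range uids {
		uids[i] = fmt.Sprintf("0x%x", i+1)
	}
	for _, batch := range chunkUids(uids, 500) {
		fmt.Println(len(batch)) // prints 500, 500, 200
	}
}
```

Each batch is then sent in its own transaction and committed before the next one starts.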

What version of Dgraph are you using?

Dgraph Version Running in local docker:
  • dgraph/standalone:v21.03.2
  • dgraph/ratel:v21.03.2
Golang Client github.com/dgraph-io/dgo/v200

Have you tried reproducing the issue with the latest release?

No, we had lots of other issues with Zion so we gave up, but we’re looking forward to the next version :sweat_smile:

What is the hardware spec (RAM, OS)?

MacBook Pro 16″
2.4 GHz 8-Core Intel Core i9
32 GB 2667 MHz DDR4

Steps to reproduce the issue (command/config used to run Dgraph).

n.b. Name of entities and fields have been changed for the sake of the example.

I have the following schema:

type Collection {
	nid
	last_modified
	modified_by
	collection.books
}

type Book {
	nid
	name
}

nid: string @index(hash) @upsert .
collection.books: [uid] @count @reverse .

Our golang code prepares a request to append to the list of book uids in the collection like this:

func (p UpdateCollection) ToUpsertRequest() api.Request {
	var fields strings.Builder
	fmt.Fprintf(&fields, "uid(target) <last_modified> %q .\n", p.LastModified)
	fmt.Fprintf(&fields, "uid(target) <modified_by> %q .\n", p.ModifiedBy)

	for _, v := range p.TargetBookUids {
		fmt.Fprintf(&fields, "uid(target) <collection.books> <%s> .\n", v) // adds one triple for each book
	}

	mutations := []*api.Mutation{{
		SetNquads: []byte(fields.String()),
	}}

	request := api.Request{
		Query: `
		query getByNid($nid: string!) {
			target as target_query (func:eq(nid,$nid)) {
		        uid
		    }
		}
		`,
		Vars: map[string]string{
			"$nid": p.Nid,
		},
		Mutations: mutations,
	}

	return request
}

The request is then sent via the dgo library inside a transaction, and the transaction is committed at the end if the request succeeds.

We call this method with a configurable number of book uids, but it seems that if we send more than ~700 uids, we create an inconsistent state in the nodes.

I noticed that if I sent 700 items and then ran a count in Ratel (best effort disabled) on the collection, I got fewer items: sometimes 698, sometimes 697, sometimes as low as ~650, with no apparent pattern. The exact same items were sent each time, with varying results.

Then I ran a query in the code to fetch the items attached to the collection and identified the missing ones. I took the uid of one of the missing books and ran this in Ratel:

{
  collection(func: eq(nid, "collection1")) {
    uid
    nid
    last_modified
    countBooks: count(collection.books)
    collection.books @filter(uid(0x5d1)) {
      uid
      nid
    }
  }
  missingBook(func: uid(0x5d1)) {
    uid
    nid
    ~collection.books @cascade {
      uid
      nid
    }
  }
}

where 0x5d1 is the uid of one of the missing books.
The result of the query is as follows:

"collection": [
  {
    "uid": "0xabd47a",
    "nid": "collection1",
    "last_modified": "2022-09-02T08:34:45+09:00",
    "countBooks": 697
  }
],
"missingBook": [
  {
    "uid": "0x5d1",
    "nid": "HgkANjsRIQE",
    "~collection.books": [
      {
        "uid": "0xabd47a",
        "nid": "collection1"
      }
    ]
  }
]

So the collection has a lower count, as I saw before (697 vs 700), and the node with uid 0x5d1 is not found when navigating the direct edge from the collection.
BUT when I fetch the missing book, I can find the collection by navigating the reverse edge ~collection.books.
This behavior is consistent for all 3 books missing from the count, and it holds every time some books go missing: after every update, all the reverse edges are present, but a few direct edges are missing.
The problem disappears if we reduce the batch size; by trial and error we currently set it to 500 elements.

Expected behaviour and actual result.

The expected behavior is all or nothing: if the request contains too many mutations, I would expect it to fail and roll back the transaction, with a message saying there are too many nodes to handle in a single request.
But if the request is accepted, I would always expect the direct and reverse edges to be consistent.


Hi @Alex_Pedini

I’m attempting to recreate your issue. Can you verify that you’re importing github.com/dgraph-io/dgo/v200?

It might not make a difference, but from the dgo repo, the recommended (?) version of dgo for 21.03.2 is github.com/dgraph-io/dgo/v210.

Hi @matthewmcneely
thanks for the reply, yes I can confirm the version we’re using is v200
This is from our go.mod file:

github.com/dgraph-io/dgo/v200 v200.0.0-20210401091508-95bfd74de60e

I will also try v210 and see if I can reproduce it with that version. As soon as I have some time, I’ll try to create a snippet that reproduces the issue with the demo/tutorial database models.

Hi @matthewmcneely,
I apologize: the example above is apparently not enough to reproduce the issue by itself. I think the issue might only appear in our environment, which has a relatively big schema and a lot of content, as I was not able to reproduce it myself using a simple Go executable and a clean Dgraph with just those two types.

Would you be willing to join a Google Meet or Zoom call, at whatever time you’re available, so that I can share my screen and show you the issue in our environment?

And just for more info: I can also reproduce the issue with the suggested dgo/v210 version.

Hey @Alex_Pedini,

Sure. Send me an email at [email protected] and let’s take it from there.


Hi @matthewmcneely ,
thanks again for your time today!
As discussed, this is the schema/queries we used to replicate the issue:

type BookCollection {
	nid
	collection.books
}

type Book {
	nid
	book_name
}

nid: string @index(hash) @upsert .
collection.books: [uid] @count @reverse .
book_name: string .

Read query:

{
  coll(func: eq(nid, "BookColl_13_a")) {
    nid
    count(collection.books)
  }
  books(func: type(Book)) @cascade {
    count(uid)
    ~collection.books @filter(eq(nid, "BookColl_13_a"))
  }
  # these 2 queries should always return the same count
}
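To automate the consistency check, the two counts can be compared programmatically. This is a minimal sketch assuming the response JSON follows the shape of the read query above, with the counts aliased as `direct` and `reverse` (hypothetical aliases, not in the query as written):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// countsMatch reports whether the direct-edge count and the
// reverse-edge count in a query response agree.
func countsMatch(resp []byte) (bool, error) {
	var r struct {
		Coll []struct {
			Direct int `json:"direct"`
		} `json:"coll"`
		Books []struct {
			Reverse int `json:"reverse"`
		} `json:"books"`
	}
	if err := json.Unmarshal(resp, &r); err != nil {
		return false, err
	}
	if len(r.Coll) == 0 || len(r.Books) == 0 {
		return false, fmt.Errorf("missing coll or books in response")
	}
	return r.Coll[0].Direct == r.Books[0].Reverse, nil
}

func main() {
	// Sample response mirroring the inconsistent state from the report.
	sample := []byte(`{"coll":[{"direct":697}],"books":[{"reverse":700}]}`)
	ok, err := countsMatch(sample)
	fmt.Println(ok, err) // prints: false <nil>
}
```

A check like this can run after each upsert to detect the divergence immediately instead of by manual inspection in Ratel.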

Upsert query:

upsert {
  query {
    target as var(func: eq(nid, "BookColl_13_a")) {
      uid
    }
    books as var(func: type(Book), first: 1000, offset: 0) {
      uid
    }
  }

  # first delete all old edges in the collection
  mutation {
    delete {
      uid(target) <collection.books> * .
    }
  }
  # then update the collection with the new book edges
  mutation {
    set {
      uid(target) <nid> "BookColl_13_a" .
      uid(target) <dgraph.type> "BookCollection" .
      uid(target) <collection.books> uid(books) .
    }
  }
}

To add the books to the graph I used a simple golang script that adds books in batches like this:

func addBooks(dgraph *dgo.Dgraph) error {
	const batchSize = 1000
	for i := 0; i < 25; i++ {
		txn := dgraph.NewTxn()
		_, err := txn.Do(context.Background(), setupBooksRequest(batchSize, i*batchSize))
		if err != nil {
			txn.Discard(context.Background())
			fmt.Println(err)
			return err
		}
		if err := txn.Commit(context.Background()); err != nil {
			return err
		}
	}
	return nil
}

func setupBooksRequest(limit int, offset int) *api.Request {
	fmt.Printf("Limit: %d, offset: %d\n", limit, offset)
	var setMutation strings.Builder
	for i := 0; i < limit; i++ {
		fmt.Fprintf(&setMutation, "_:book_%d <nid> %q .\n", i, strconv.Itoa(i+offset))
		fmt.Fprintf(&setMutation, "_:book_%d <book_name> %q .\n", i, uuid.NewString())
		fmt.Fprintf(&setMutation, "_:book_%d <dgraph.type> %q .\n", i, "Book")
	}
	mutations := []*api.Mutation{}
	mutations = append(mutations, &api.Mutation{SetNquads: []byte(setMutation.String())})
	request := api.Request{
		Mutations: mutations,
	}
	return &request
}

The issue seems to happen from the second upsert on a collection onward. When a new collection is created, the count matches correctly; when it is updated again, the counts start to diverge.
Since the delete is not actually executed the first time a collection is created, the issue might have to do with the delete mutation interfering with the set mutation.
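If the delete/set interference hypothesis holds, one possible workaround (an untested sketch, not something we have verified) would be to commit the delete in its own transaction before running the set upsert, instead of combining both mutations in a single request:

```dql
# Transaction 1: clear the old edges, then commit.
upsert {
  query {
    target as var(func: eq(nid, "BookColl_13_a")) { uid }
  }
  mutation {
    delete {
      uid(target) <collection.books> * .
    }
  }
}

# Transaction 2, committed separately: set the new edges.
upsert {
  query {
    target as var(func: eq(nid, "BookColl_13_a")) { uid }
    books as var(func: type(Book), first: 1000, offset: 0) { uid }
  }
  mutation {
    set {
      uid(target) <collection.books> uid(books) .
    }
  }
}
```

The trade-off is that the two steps are no longer atomic: a reader between the two commits would see an empty collection.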

Hey @Alex_Pedini,

Finally had some time to get to a minimal, reproducible test suite, results of which you can see here: GitHub - matthewmcneely/dgraph-v21.03-sandbox at issue/large-upsert-mutation-reverse-edge

As I continued to dig in, it seems the reverse predicate is not really a factor in the issue, as you can see from the HEAD of this branch. I’ll keep digging this week and will update this thread as appropriate.

Thanks again for all your work here in identifying this issue.

Tracking: [BUG]: Upsert that sets a uid array consisting of a large (>~600) number of elements fails if the predicate was star deleted in the same mutation · Issue #8324 · dgraph-io/dgraph · GitHub