Upsert Will duplicate nodes be generated in the process of multithreading data update using the upsert block

tss · December 15, 2020, 3:24am

Will duplicate nodes be generated when data is updated from multiple threads using the upsert block?

anand · December 15, 2020, 4:18am

This is a popular question, so I will explain at length.

Dgraph does not have a notion of unique attribute values across nodes. The only thing guaranteed to be distinct for every node is its uid, and that attribute is controlled by Dgraph. Thus, when we talk about duplicates, we can only mean nodes that share the same attribute value.

Let’s imagine that Dgraph has predicates fullName and accountBalance, and that we want fullName to be unique within our dataset. The simplest way to avoid duplicates is to follow the logic below.

Path A
  if a node X with the given fullName exists
    update node X with the new accountBalance
Path B
  if no node with the given fullName exists
    create node X
    set fullName and accountBalance on node X

This, of course, is the upsert block, a construct supported by Dgraph. The upsert block can be invoked via Ratel as well as Dgraph clients.
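
As a rough sketch of Path A and Path B above, the helper below builds the text of a DQL upsert block for the fullName / accountBalance example. The function name build_upsert and the variable name v are illustrative choices, not something from this thread. It assumes fullName has an index so that eq() can be used, and it uses JSON escaping as an approximation for quoting the string literal. When the query finds no match, uid(v) refers to a newly created node (Path B); otherwise the existing node is updated (Path A).

```python
import json


def build_upsert(full_name: str, balance: float) -> str:
    """Return a DQL upsert block that updates or creates a node by fullName."""
    # json.dumps yields a double-quoted, escaped literal (an approximation
    # of N-Quad string escaping that is adequate for ordinary names).
    name = json.dumps(full_name)
    return (
        "upsert {\n"
        "  query {\n"
        f"    q(func: eq(fullName, {name})) {{\n"
        "      v as uid\n"
        "    }\n"
        "  }\n"
        "  mutation {\n"
        "    set {\n"
        f"      uid(v) <fullName> {name} .\n"
        f'      uid(v) <accountBalance> "{balance}" .\n'
        "    }\n"
        "  }\n"
        "}"
    )


if __name__ == "__main__":
    print(build_upsert("Alice", 100.0))
```

The resulting text can be pasted into Ratel's mutate tab or sent through a client, as mentioned above.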

Multi-threading / Concurrency
If our transactions are spaced out, with no concurrency, the upsert block will help in avoiding duplicates. But if transactions happen concurrently, we could still end up with duplicates. We need an additional mechanism to help avoid duplicates for this particular concurrent update scenario.
This is exactly where the @upsert directive in the schema helps. The @upsert directive checks whether concurrent transactions are modifying nodes with the same attribute value and, if so, aborts one of the transactions. In our scenario, we can set the @upsert directive on the fullName attribute.
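
For illustration, a schema for this scenario might look like the lines below, kept here in a Python string so they can be handed to whichever client you use. The exact index and the float type for accountBalance are assumptions made for the example. Note that Dgraph expects a predicate carrying @upsert to also have an index.

```python
# Assumed schema for the example predicates; the tokenizer (exact) and
# the float type are illustrative choices, not taken from the thread.
SCHEMA = """
fullName: string @index(exact) @upsert .
accountBalance: float .
"""

if __name__ == "__main__":
    print(SCHEMA.strip())
```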

From the client's perspective, all it needs to do when a transaction is aborted is retry. When duplicate transactions arrive concurrently, the first one will take Path B and the one that retries will take Path A.
Here is a video on the upsert directive.
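
The retry described above can be sketched with a small generic helper. The name run_with_retry and its parameters are hypothetical. The caller supplies a function that opens a fresh transaction, runs the upsert, and commits, plus the exception type its client raises on an abort (with the Python client pydgraph this is, to my knowledge, pydgraph.AbortedError).

```python
import time


def run_with_retry(attempt, retry_on, max_attempts=5, base_delay=0.01):
    """Call attempt() until it succeeds or max_attempts is reached.

    attempt  -- callable that runs one complete transaction (new txn each call)
    retry_on -- exception type (or tuple of types) that signals an abort
    """
    for n in range(1, max_attempts + 1):
        try:
            return attempt()
        except retry_on:
            if n == max_attempts:
                raise  # give up and surface the last abort to the caller
            time.sleep(base_delay * n)  # simple linear backoff before retrying
```

On the retried attempt the query inside the upsert block sees the node committed by the winning transaction, so the retry follows Path A instead of creating a duplicate.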

tss · December 15, 2020, 6:27am

The result of my upsert test is different from the result of the test on the video.

tss · December 15, 2020, 6:29am

anand · December 15, 2020, 6:51am

Hi @tss, thanks for sharing the schema. Please also share the steps you followed in a bit more detail. If you follow the steps exactly as shown in the video, you should see a transaction abort.

tss · December 15, 2020, 6:56am

I ran the test3 function twice. The first run produced a node. After the second run, I found that two nodes with the same topic_type had been produced.

anand · December 15, 2020, 7:01am

In the code, you are committing immediately. When you run this code twice, Dgraph simply processes it as two back-to-back transactions, so nothing is aborted.

Please note that in the video, the transactions were first sent to Dgraph and then committed one after another. This was done to illustrate two transactions overlapping with each other (which is likely to happen if you run your example in multithreaded mode).
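
The difference between back-to-back and overlapping transactions can be illustrated with a toy in-memory model. This is only a sketch of the general idea of conflict detection at commit time, not Dgraph's implementation, and every name in it (ToyStore, ToyTxn, Aborted) is made up for the example. A transaction is aborted if another transaction committed a node with the same value after it started. Both transactions use a plain mutation that always creates a node, as in the test described above.

```python
class Aborted(Exception):
    """Raised when a toy transaction loses a conflict at commit time."""


class ToyStore:
    def __init__(self):
        self.nodes = []        # one entry per committed node (its topic_type)
        self.clock = 0         # logical clock for start and commit timestamps
        self.last_commit = {}  # topic_type value -> timestamp of last commit

    def begin(self):
        self.clock += 1
        return ToyTxn(self, self.clock)


class ToyTxn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.pending = []      # values buffered until commit

    def add_node(self, topic_type):
        self.pending.append(topic_type)

    def commit(self):
        store = self.store
        # Abort if someone committed the same value after this txn started.
        for value in self.pending:
            if store.last_commit.get(value, 0) > self.start_ts:
                raise Aborted(value)
        store.clock += 1
        for value in self.pending:
            store.nodes.append(value)
            store.last_commit[value] = store.clock


def back_to_back():
    """Two transactions, each committed before the next one starts."""
    store = ToyStore()
    for _ in range(2):
        txn = store.begin()
        txn.add_node("topic_type1")
        txn.commit()
    return len(store.nodes)


def overlapping():
    """Two transactions that are both open before either commits."""
    store = ToyStore()
    first, second = store.begin(), store.begin()
    first.add_node("topic_type1")
    second.add_node("topic_type1")
    first.commit()
    try:
        second.commit()
        aborted = False
    except Aborted:
        aborted = True
    return len(store.nodes), aborted


if __name__ == "__main__":
    print(back_to_back())  # 2
    print(overlapping())   # (1, True)
```

In this model the back-to-back run ends with two nodes and no abort, while the overlapping run ends with one node and one aborted transaction.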

tss · December 15, 2020, 7:13am

I used 100 threads to insert and found that 5 nodes were successfully inserted!

anand · December 15, 2020, 7:25am

I suggest you use the same 100 threads, but write a conditional upsert block instead of simple mutations. You should see a better result.
Pseudo code for conditional upsert:
  check for a node with topic_type1
  if not present
    create the node

Please see if it helps. Here is the Python documentation for conditional upsert.
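
A minimal sketch of that pseudo code as a DQL conditional upsert, built as a Python string. The helper name build_conditional_upsert, the variable v, and the blank node label _:topic are illustrative. It assumes topic_type is an indexed string predicate so that eq() works. The @if condition lets the mutation run only when the query found no matching node.

```python
import json


def build_conditional_upsert(topic_type: str) -> str:
    """Return a DQL upsert block that creates the node only if it is absent."""
    # json.dumps gives a quoted, escaped literal (approximate N-Quad escaping).
    value = json.dumps(topic_type)
    return (
        "upsert {\n"
        "  query {\n"
        f"    q(func: eq(topic_type, {value})) {{\n"
        "      v as uid\n"
        "    }\n"
        "  }\n"
        "  mutation @if(eq(len(v), 0)) {\n"
        "    set {\n"
        f"      _:topic <topic_type> {value} .\n"
        "    }\n"
        "  }\n"
        "}"
    )


if __name__ == "__main__":
    print(build_conditional_upsert("topic_type1"))
```

Each of the 100 threads would send this same request. Overlapping transactions can still be aborted when @upsert is set on topic_type, so each thread should retry on abort as described earlier in the thread.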

seanlaff · March 13, 2021, 1:12am

I’ve been confused about this for over a year, but this comment really clears things up. Thanks!
