Will duplicate nodes be created when multiple threads update data using the upsert block?
This is a common question, so I will explain at length.
Dgraph does not have a notion of unique attribute values across nodes. It only guarantees that every node has a distinct uid (this attribute is controlled by Dgraph). Thus, when we talk about duplicates, we can only mean nodes that share the same attribute value.
Let's imagine that Dgraph has the predicates fullName and accountBalance, and we declare that fullName is unique within our dataset. The simplest way to avoid duplicates is to follow the logic below.
Path A: if a node X with the given fullName exists
    update node X with the new accountBalance
Path B: if a node X with the given fullName does not exist
    create node X
    set fullName and accountBalance on node X
This, of course, is the upsert block, a construct supported by Dgraph. The upsert block can be invoked via Ratel as well as Dgraph clients.
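As a rough sketch of what such an upsert block can look like in DQL, the helper below builds one for the fullName / accountBalance example. The function name `build_upsert_block` and the sample values are made up for illustration, and the `eq(fullName, ...)` lookup assumes fullName is indexed in the schema.

```python
import json


def build_upsert_block(full_name: str, account_balance: int) -> str:
    """Return a DQL upsert block: update the node matching fullName, or create it."""
    # json.dumps is used as a simple way to quote and escape the literals;
    # this is an assumption that holds for ordinary names.
    name = json.dumps(full_name)
    balance = json.dumps(str(account_balance))
    return f"""upsert {{
  query {{
    q(func: eq(fullName, {name})) {{
      v as uid
    }}
  }}
  mutation {{
    set {{
      uid(v) <fullName> {name} .
      uid(v) <accountBalance> {balance} .
    }}
  }}
}}"""


# Hypothetical sample values.
block = build_upsert_block("Alice", 100)
print(block)
```

When the query finds a node, `uid(v)` refers to it and the mutation updates it. When the query finds nothing, `uid(v)` is empty and Dgraph creates a new node for the mutation.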
Multi-threading / Concurrency
If our transactions are spaced out, with no concurrency, the upsert block will help in avoiding duplicates. But if transactions happen concurrently, we could still end up with duplicates. We need an additional mechanism to help avoid duplicates for this particular concurrent update scenario.
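To see why concurrency breaks the check-then-create logic, here is a toy in-memory simulation. It does not talk to Dgraph. The `nodes` list stands in for the graph, and a barrier forces both threads to finish their "query" before either one writes, which is the overlap that causes trouble.

```python
import threading

nodes = []  # toy in-memory "graph": one dict per node
both_checked = threading.Barrier(2)


def naive_upsert(full_name, balance):
    # Step 1: query for an existing node with this fullName.
    exists = any(n["fullName"] == full_name for n in nodes)
    # Force the overlap: neither thread writes until both have queried.
    both_checked.wait()
    # Step 2: no node was seen, so create one.
    if not exists:
        nodes.append({"fullName": full_name, "accountBalance": balance})


threads = [
    threading.Thread(target=naive_upsert, args=("Alice", 100)) for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(nodes))  # 2: both threads took the "create" path
```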
This is exactly where the @upsert directive in the schema helps. With @upsert set, Dgraph checks at commit time whether concurrent transactions are modifying nodes with the same attribute value and, if so, aborts one of them. In our scenario, we can set the @upsert directive on the fullName attribute.
From the client's perspective, all it needs to do when a transaction is aborted is retry. When duplicate transactions arrive concurrently, the first one will take Path B and the one that retries will take Path A.
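For the running example, the schema could look like the sketch below. The `exact` index type and the `int` type for accountBalance are assumptions made for illustration; @upsert needs the predicate to be indexed. The `apply_schema` helper is hypothetical and assumes `client` is a pydgraph `DgraphClient`.

```python
# Schema for the running example. The index type ("exact") and the int type
# for accountBalance are assumptions; @upsert needs an index on the predicate.
SCHEMA = """
fullName: string @index(exact) @upsert .
accountBalance: int .
"""


def apply_schema(client):
    """Apply SCHEMA through a pydgraph client (assumed pydgraph.DgraphClient)."""
    import pydgraph  # imported lazily so SCHEMA is usable without the package

    return client.alter(pydgraph.Operation(schema=SCHEMA))
```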
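A retry loop can be as small as the sketch below. `TransactionAborted` is a stand-in defined here so the example runs on its own; with pydgraph you would catch `pydgraph.AbortedError` instead. The `do_upsert` callable is assumed to open a fresh transaction and re-run the query on every call, so a retry sees the node that the winning transaction created.

```python
class TransactionAborted(Exception):
    """Stand-in for the client's abort error (hypothetical name)."""


def run_with_retry(do_upsert, max_attempts=5):
    """Call do_upsert() until it commits or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return do_upsert()
        except TransactionAborted:
            if attempt == max_attempts:
                raise
            # A real client would typically back off briefly here.


# Usage: a fake upsert that is aborted once, then succeeds.
calls = {"n": 0}


def flaky_upsert():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransactionAborted("conflict with a concurrent transaction")
    return "committed"


result = run_with_retry(flaky_upsert)
print(result, calls["n"])
```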
Here is a video on the upsert directive.
The result of my upsert test is different from the result of the test on the video.
Hi @tss, thanks for sharing the schema. Please also share the steps you followed in a bit more detail. If you follow the steps exactly as shown in the video, you should see a transaction abort.
I ran the test3 function twice. The first run produced a node, and after the second run I found that two nodes with the same topic_type had been produced.
In the code, you are committing immediately. When you run this code twice, Dgraph simply processes it as two back-to-back transactions, and nothing is aborted.
Please note that in the video, the transactions were sent to Dgraph and then committed one after another. This was done to illustrate two transactions overlapping with each other (which is likely to happen if you would run your example in a multithreaded mode).
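The difference between back-to-back and overlapping transactions can be illustrated with a toy model of commit-time conflict checking. This is a simplified sketch, not how Dgraph is implemented: `ToyStore`, its timestamps, and the sample value "news" are all made up for illustration.

```python
import itertools


class Aborted(Exception):
    pass


class ToyStore:
    """Toy model of commit-time conflict checks; not Dgraph's implementation."""

    def __init__(self):
        self.clock = itertools.count(1)
        self.last_commit = {}  # conflict key -> timestamp of last commit
        self.nodes = []

    def begin(self):
        return {"start_ts": next(self.clock), "writes": []}

    def mutate(self, txn, topic_type):
        txn["writes"].append(topic_type)

    def commit(self, txn):
        # Abort if another transaction committed the same key after we started.
        for key in txn["writes"]:
            if self.last_commit.get(key, 0) > txn["start_ts"]:
                raise Aborted(key)
        commit_ts = next(self.clock)
        for key in txn["writes"]:
            self.last_commit[key] = commit_ts
            self.nodes.append(key)


# Back-to-back: each transaction commits before the next one starts.
store = ToyStore()
for _ in range(2):
    t = store.begin()
    store.mutate(t, "news")
    store.commit(t)
back_to_back_nodes = len(store.nodes)  # both commit, so a duplicate appears

# Overlapping: both transactions start before either commits.
store = ToyStore()
t1, t2 = store.begin(), store.begin()
store.mutate(t1, "news")
store.mutate(t2, "news")
store.commit(t1)
try:
    store.commit(t2)
    second_aborted = False
except Aborted:
    second_aborted = True
overlapping_nodes = len(store.nodes)
print(back_to_back_nodes, second_aborted, overlapping_nodes)
```

In the back-to-back case nothing overlaps, so the conflict check never fires. That is why a plain mutation committed immediately can still produce duplicates, and why the query step of an upsert is needed.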
I used 100 threads to insert and found that 5 nodes were successfully inserted!
I suggest you use the same 100 threads, but write a conditional upsert block instead of simple mutations. You should see a better result.
Pseudo code for conditional upsert:
check for node with
if not present
Please see if it helps. Here is the python documentation for conditional upsert.
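A minimal sketch of a conditional upsert with pydgraph might look like the following. It assumes `txn` is a transaction obtained from `client.txn()`, that topic_type is indexed with @upsert in the schema, and that the goal is to create the node only when none exists. The function name `conditional_upsert` and the variable `u` are chosen for illustration.

```python
import json


def conditional_upsert(txn, topic_type):
    """Create a node with this topic_type only if the query finds none."""
    value = json.dumps(topic_type)  # simple quoting/escaping; an assumption
    query = f"{{ u as var(func: eq(topic_type, {value})) }}"
    nquads = f"_:new <topic_type> {value} ."
    # The mutation runs only when the query matched zero nodes.
    mutation = txn.create_mutation(cond="@if(eq(len(u), 0))", set_nquads=nquads)
    request = txn.create_request(
        query=query, mutations=[mutation], commit_now=True
    )
    return txn.do_request(request)
```

Each of the 100 threads would call this with its own transaction and retry when the transaction is aborted.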
I’ve been confused about this for over a year, but this comment really clears things up. Thanks!