I’m trying to load initial data into Dgraph with the bulk loader and afterwards add new nodes or modify existing ones with the live loader. The problem is that when the live loader uploads nodes that the bulk loader already uploaded, it creates duplicate nodes with new uids. I don’t want duplicate nodes. I need those nodes either not to be loaded at all (if they bring no new edges for bulk-loaded nodes) or to modify the already existing nodes (if they do bring new edges for old nodes).
If I use only the live loader, this is trivial: I add -x dirname to all my dgraph live commands and get an xid directory named dirname, so later live loads don’t create duplicate nodes with new uids. The issue with the bulk loader is that its -x option does not create a directory for xids. Therefore, when I live load nodes that the bulk loader uploaded before, I get duplicate nodes with new uids.
How do I prevent duplicate nodes when I use bulk load first and live load afterwards?
The --store_xids flag for bulk loader writes xid edges into your database. This is different from the --xidmap flag for live loader, which writes out the xid-uid mapping to a separate directory.
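To make the role of the xid-uid mapping concrete, here is a toy Python model of the behaviour described in this thread. It is not Dgraph code: UidAllocator, run_loader and the sample xids are made up for illustration, and the real loaders are certainly more involved. It only shows why a loader run that cannot see the earlier mapping hands out fresh uids for xids it has already loaded, while a run that shares the mapping reuses them.

```python
class UidAllocator:
    """Hypothetical stand-in for the cluster handing out ever-increasing uids."""

    def __init__(self):
        self.next_uid = 1

    def new(self):
        uid = self.next_uid
        self.next_uid += 1
        return uid


def run_loader(xids, allocator, xid_to_uid):
    """One loader run: reuse the uid of a known xid, allocate a new uid otherwise."""
    uids = []
    for xid in xids:
        if xid not in xid_to_uid:
            xid_to_uid[xid] = allocator.new()
        uids.append(xid_to_uid[xid])
    return uids


def demo():
    # Case 1: each run starts with an empty mapping (no shared xid directory).
    allocator = UidAllocator()
    first = run_loader(["alice", "bob"], allocator, {})
    second = run_loader(["alice", "carol"], allocator, {})

    # Case 2: both runs share one mapping (like pointing both at the same directory).
    shared_allocator = UidAllocator()
    shared_map = {}
    first_shared = run_loader(["alice", "bob"], shared_allocator, shared_map)
    second_shared = run_loader(["alice", "carol"], shared_allocator, shared_map)
    return first, second, first_shared, second_shared


if __name__ == "__main__":
    for result in demo():
        print(result)
```

In the first case "alice" ends up with two different uids, which is the duplicate-node symptom from the question. In the second case she keeps the uid from the first run.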
Okay. How do I use the fact that xid edges are stored in the database to avoid duplicate nodes when a bulk load is followed by one or more live loads? The docs don’t give much insight into that.
What is the purpose of writing xid directly to the database?
I am sorry, still interested in the answer, so I reply to bring the topic to the top. Feel free to
How can I use the xid edges stored in Dgraph after a bulk load to avoid duplicates in future dgraph live loads?
@aamrtv thanks for raising this. It has been resolved in the current master. You can use the --xidmap dirname flag while doing a bulk upload to save xids in a directory named dirname.
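A rough sketch of how the two steps might then be wired together, written as a small Python helper that only builds the command lines and does not run them. The --xidmap flag for bulk and -x for live come from this thread. The -f and -s flags, the file names, and the helper name bulk_then_live_commands are assumptions for illustration, and connection flags are left out, so check dgraph bulk --help and dgraph live --help for your version.

```python
import shlex


def bulk_then_live_commands(xid_dir, bulk_rdf, live_rdf, schema):
    """Build (not run) a bulk command and a live command that share one xid directory."""
    # Bulk load writes the xid -> uid mapping into xid_dir (per the reply above).
    bulk = ["dgraph", "bulk", "-f", bulk_rdf, "-s", schema, "--xidmap", xid_dir]
    # Live load points at the same directory so known xids keep their uids.
    live = ["dgraph", "live", "-f", live_rdf, "-s", schema, "-x", xid_dir]
    return bulk, live


if __name__ == "__main__":
    # File names below are placeholders, not taken from the thread.
    for cmd in bulk_then_live_commands("dirname", "initial.rdf.gz", "update.rdf.gz", "data.schema"):
        print(shlex.join(cmd))
```

The point of the helper is only that both commands receive the same directory name. Every later live load would need the same -x dirname as well.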