Bulk importing data from CSVs/JSON using pre-determined UIDs

Hi, I’m importing about 200,000 records into dgraph. Here is a part of the schema:

type Mnemonic {
  full_name: String! @id
}

type MnemonicCode {
  phonem: String! @search(by: [exact])
  mnemonic: Mnemonic!
}

I would first import the Mnemonic nodes using JSON like this:

  {
  	"set": [
  		{
  			"dgraph.type": "Mnemonic",
  			"Mnemonic.full_name": "eg1"
  		},
  		{
  			"dgraph.type": "Mnemonic",
  			"Mnemonic.full_name": "eg2"
  		}
  	]
  }

And then import the MnemonicCodes assigning them to the correct UIDs like so:

  {
  	"set": [
  		{
  			"dgraph.type": "MnemonicCode",
  			"MnemonicCode.phonem": "ASDF",
  			"MnemonicCode.mnemonic": {
  				"uid": "0x27dd"
  			}
  		}
  	]
  }
My issue is that I have to import all of the first type, collect the 60,000 UIDs from the response, assign those UIDs to the correct MnemonicCodes, and re-manipulate all the JSON, which seems like it could lead to a lot of human error. Or is it safe to just set my own custom UIDs? As I also understand from the docs, it is only possible to link nodes in Ratel using UIDs, not the @id attribute of a type?

Dgraph supports a notion of a blank node. If you are open to using RDF triples instead of JSON, you could generate RDFs that look like this:

_:eg1 <dgraph.type> "Mnemonic" .
_:eg1 <Mnemonic.full_name> "eg1" . 

_:eg2 <dgraph.type> "Mnemonic" .
_:eg2 <Mnemonic.full_name> "eg2" . 

_:asdf <dgraph.type> "MnemonicCode" .
_:asdf <MnemonicCode.phonem> "ASDF" .
_:asdf <MnemonicCode.mnemonic> _:eg1 .

I have been vocal in the past about why JSON mutation notation is not good. This is one place where RDF shines more.
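Generating these triples doesn't require an RDF library; plain string formatting is enough. A minimal sketch in Python, assuming your records arrive as dicts with hypothetical keys `full_name` and `phonem` (real data may need the blank-node labels sanitized, since they must not contain spaces):

```python
def mnemonic_rdf(rows):
    """Build RDF N-Quad lines with blank nodes for each record.

    Each row is assumed to have 'full_name' and 'phonem' keys
    (hypothetical names; adapt to your CSV/JSON columns).
    """
    lines = []
    for row in rows:
        m = f"_:{row['full_name']}"    # blank node for the Mnemonic
        c = f"_:code_{row['phonem']}"  # blank node for the MnemonicCode
        lines.append(f'{m} <dgraph.type> "Mnemonic" .')
        lines.append(f'{m} <Mnemonic.full_name> "{row["full_name"]}" .')
        lines.append(f'{c} <dgraph.type> "MnemonicCode" .')
        lines.append(f'{c} <MnemonicCode.phonem> "{row["phonem"]}" .')
        # Link the code to its mnemonic by blank node, no real UID needed.
        lines.append(f'{c} <MnemonicCode.mnemonic> {m} .')
    return lines

print("\n".join(mnemonic_rdf([{"full_name": "eg1", "phonem": "ASDF"}])))
```

Because both nodes are created in the same load, Dgraph resolves the shared blank node to the same UID, so no UID round-trip between imports is needed.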

(on a broader note: json, yaml, and unprincipled “human readable config/data” languages should die in a fire. Things like Dhall can stay)

So, let me make sure I understand the steps you are taking.

I believe this is what you are doing:

  1. You have the “Base” dataset and you import it via Bulk Loader.
  2. Later you have more datasets that take the previous dataset as context, and you load them via Live Loader.

Is that correct?

I would recommend the following.

  1. The same as before, but use one of the XID-storing approaches:
dgraph bulk -h | grep xid
--store_xids.        Generate an xid edge for each node.
--xidmap string        Directory to store xid to uid mapping

I like the mapping to a directory, but you can use the first one (--store_xids) to store the XID in the node itself.

  2. The same as before, but continue the XID approach:
dgraph live -h | grep xid

-U, --upsertPredicate string
run in upsertPredicate mode. the value would be used to 
store blank nodes as an xid

-x, --xidmap string
Directory to store xid to uid mapping

The upsertPredicate approach is new, and I haven’t used it so far. But the idea is that it will take your data, analyze the XIDs, and then generate several upsert blocks internally.

The xidmap flag is the continuation of the previous approach of storing XIDs in a directory. Here you would pass the path where you stored the mapping during previous loads. Always use the same path/files and make sure that all XIDs are unique. An XID is basically a “_:BLANKNODE”.

I think it is fine; I don’t see any problem with this. XIDs in JSON syntax are just "uid": "_:BlankNode1" - I think JSON is really friendly for new users. RDF is really cool, but very few devs out there really understand it. I prefer RDF, but JSON is totally fine.
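To illustrate: with blank-node uids, both node types can go into a single JSON mutation and be linked without ever seeing a real UID. A sketch in Python (field names taken from the schema above; the blank-node label is arbitrary):

```python
import json

# One mutation that creates a Mnemonic and a MnemonicCode and links
# them via the shared blank node "_:eg1" - no UID round-trip required.
mutation = {
    "set": [
        {
            "uid": "_:eg1",
            "dgraph.type": "Mnemonic",
            "Mnemonic.full_name": "eg1",
        },
        {
            "dgraph.type": "MnemonicCode",
            "MnemonicCode.phonem": "ASDF",
            "MnemonicCode.mnemonic": {"uid": "_:eg1"},
        },
    ]
}
print(json.dumps(mutation, indent=2))
```

Within one mutation (or one Live Loader run with --xidmap), every occurrence of "_:eg1" resolves to the same node.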


Hmmm, well, at a glance it seems great, but from a quick Google it seems like a whole new topic and format to learn just to import data, especially generating the triples using rdflib in Python - or maybe I’m just overlooking something. If you have any links or tips, that would be great. One more thing to note: I’m using Dgraph Cloud, if that makes a difference for importing.