Generate blank UIDs based on JSON fields on mutations and loaders

Moved from GitHub dgraph/2848

Posted by MichelDiz:

Experience Report

What you wanted to do

Import JSON data produced by any other database while making sure that the nodes will be unique,
by telling the bulk loader which keys of my JSON represent uniqueness.

And also be able to set up a direct import from APIs serving JSON.

Something like this for a local JSON file:
dgraph bulk --jsons="./json.json" --setunique="mission_name, rocket_id, payload_id, site_id"

That way Bulkload would generate a “uid” key throughout JSON before sending it to Dgraph. This ensures that objects are unique.

Also, it would be interesting to have an option to import JSON served by an API. This can help in other situations, such as database migration: many people could use it to migrate from older databases to Dgraph.

From APIs serving JSON:
dgraph bulk --bulkapi="https://api.spacexdata.com/v3/launches/, https://api.spacexdata.com/v1/nasa_launches/" --setunique="mission_name, rocket_id, payload_id, site_id"
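
Until something like --bulkapi exists, the same effect can be approximated with a small preprocessing script that downloads the JSON and writes it to a local file for the loader. This is only a sketch: the function names (fetch_json, collect_records, write_for_bulk_loader) are made up for illustration and are not part of Dgraph.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download one URL and decode its body as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)


def collect_records(urls, fetch=fetch_json):
    """Fetch every URL and flatten the results into one list of records.

    `fetch` is injectable so the function can be exercised without network access.
    """
    records = []
    for url in urls:
        payload = fetch(url)
        # An endpoint may return a single object or a list of objects.
        records.extend(payload if isinstance(payload, list) else [payload])
    return records


def write_for_bulk_loader(records, path):
    """Write the records as a single JSON array on disk."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)
```

The resulting file could then be annotated with uid keys and passed to the loader like any other local JSON file.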

Any external references to support your case

e.g. I have this API: https://api.spacexdata.com/v3/launches/ . It provides me with a dataset. However, since the objects have no “uid: _:blank” key (a blank node), the rockets returned by this API will not be unique in Dgraph when I import them.

This was discussed with @codexnull

manishrjain commented :

Didn’t we already build support for JSON in bulk loader?

MichelDiz commented :

Yes, we did. After talking to Javier in late 2018, we believe this enhancement is plausible.

codexnull commented :

Yes, the bulk loader already supports JSON. I believe this request is for supporting a way to automatically add a uid field to the data. Currently it requires it to be present already.

manishrjain commented :

Ok. After talking to Javier, my understanding is that this would create a blank node for the UID field in the JSON map (if UID field isn’t present), based on the fields mentioned in --setunique. That way, all the records holding the same mission_name and/or rocket_id, would get the same UID.

This seems like an easy and useful change.
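
To make the behaviour concrete, here is a rough sketch of what that preprocessing could look like. It is not the bulk loader's code. It assumes the first field from --setunique that is present in a record decides its blank node, and that the field name is included in the label so that equal values in different fields do not collide; both choices are assumptions, as are the function names.

```python
import re


def blank_uid(record, unique_fields):
    """Return a blank node label built from the first unique field present, or None."""
    for field in unique_fields:
        value = record.get(field)
        if value is not None:
            # Reduce the value to letters, digits and underscores.
            slug = re.sub(r"\W+", "_", str(value)).strip("_")
            return f"_:{field}_{slug}"
    return None


def add_uids(records, unique_fields):
    """Add a "uid" key to every record that lacks one and has a unique field."""
    for record in records:
        if "uid" in record:
            continue  # respect a uid that is already present
        uid = blank_uid(record, unique_fields)
        if uid is not None:
            record["uid"] = uid
    return records
```

With this, two records that carry the same mission_name end up with the same blank node label, so the loader would treat them as one node.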

campoy commented :

For reference, in order to load all the data from pokeapi.com I had to play around with UIDs in order to make this work.

In addition to generating blank UIDs from existing IDs (id, name, etc) I also had to make some fields into objects.

For instance given:

{
  "url": "/pokemon/1234",
  "name": "pikachu",
  "types": [
    "electric"
  ]
}

I had to modify it into:

{
  "uid": "_:pokemon1234",
  "url": "/pokemon/1234",
  "name": "pikachu",
  "types": [
    {
      "uid": "_:electric",
      "name": "electric"
    }
  ]
}

You can see the code in github.com/campoy/pokegraph.
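
As a rough illustration of that kind of rewrite (a sketch written for this thread, not the code from the repository above), the transformation could look like the following. The helper names and the rule of stripping non-word characters from the label are assumptions.

```python
import re


def label(value):
    """Build a blank node label by dropping every non-word character."""
    return "_:" + re.sub(r"\W+", "", str(value))


def to_node(record, id_field, list_fields):
    """Return a copy of `record` with a uid and with string lists turned into objects."""
    node = dict(record)
    node.setdefault("uid", label(record[id_field]))
    for field in list_fields:
        if field not in record:
            continue
        node[field] = [
            # Each plain string becomes its own node, keeping the text as "name".
            {"uid": label(item), "name": item} if isinstance(item, str) else item
            for item in record[field]
        ]
    return node
```

Calling to_node(record, "url", ["types"]) on the first object above gives the second one.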

MichelDiz commented :

BTW, there’s another, more obvious use case for this feature.

Take the JSON below:

[{
		"name": "SpaceX",
		"Industry": "Aerospace",
		"Founded": "May 6, 2002",
		"Services": "Orbital rocket launch",
		"Owner": "Elon Musk Trust",
		"employees": "7,000"
	},
	{
		"name": "Falcon 9",
		"type": "Family",
		"Stages": "2"
	},
	{
		"Engine": "Merlin 1D",
		"name": "Falcon Heavy",
		"Family": 	"Falcon 9",
		"Manufacturer":  "SpaceX"
	},
	{
		"Engine": "Raptor",
		"name": "SpaceX Starship",
		"Family": 	"Starship",
		"Manufacturer":  "SpaceX"
	}
]

The command

dgraph bulk -f ./json.json --setunique="name" --setedgeunique="Manufacturer:name,Family:name"

I added the idea of a new feature. “setedgeunique” would transform a scalar-valued edge into an edge pointing to a node. It would look through the dataset for “name” and check whether it has the same value as “Manufacturer”, e.g. if the Manufacturer field of one object has the same value as the name field of some node, the two are related. So it would resolve "Manufacturer": "SpaceX" to the node with "name": "SpaceX".

Maybe we could add a special schema for this:

dgraph bulk -f ./json.json --unique=./schema.json 
[{
	"setunique": {
		"name": "Name"  // Every object that has a "name" field uses its value as the identifier. The edge could also be renamed here, e.g. "name" to "Name" or another convention.
	},
	"setedgeunique": {
		"Manufacturer": "name", // No renaming here, only an indication that the two fields hold the same identifier.
		"Family": "name" // What matters is that the value inside this field matches.
	}
}]

Result

[{
		"uid": "_:SpaceX",
		"name": "SpaceX",
		"Industry": "Aerospace",
		"Founded": "May 6, 2002",
		"Services": "Orbital rocket launch",
		"Owner": "Elon Musk Trust",
		"employees": "7,000"
	},
	{
		"uid": "_:Falcon_9",
		"name": "Falcon 9",
		"type": "Family",
		"Stages": "2"
	},
	{
		"uid": "_:Falcon_Heavy",
		"Engine": "Merlin 1D",
		"name": "Falcon Heavy",
		"Family": [{
			"uid": "_:Falcon_9"
		}],
		"Manufacturer": [{
			"uid": "_:SpaceX"
		}]
	},
	{
		"uid": "_:SpaceX_Starship",
		"Engine": "Raptor",
		"name": "SpaceX Starship",
		"Family": [{
			"uid": "_:Starship"
		}],
		"Manufacturer": [{
			"uid": "_:SpaceX"
		}]
	}
]
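
The transformation from the input to the result above could be sketched roughly like this. It is only an illustration of the idea, not an existing Dgraph feature: the function names are invented, labels are derived by replacing non-word characters with underscores, and edges are only resolved against the single unique field.

```python
import re


def blank_uid(value):
    """Turn a value such as "Falcon 9" into a blank node label such as "_:Falcon_9"."""
    return "_:" + re.sub(r"\W+", "_", str(value)).strip("_")


def link_records(records, unique_field, edge_fields):
    """Add uids from `unique_field` and turn scalar edge fields into uid references.

    `edge_fields` maps an edge field (e.g. "Manufacturer") to the field it
    should be matched against (e.g. "name").
    """
    out = []
    for record in records:
        node = dict(record)  # leave the input untouched
        if unique_field in node and "uid" not in node:
            node = {"uid": blank_uid(node[unique_field]), **node}
        for field, target in edge_fields.items():
            if target != unique_field:
                continue  # this sketch only matches against the unique field
            value = node.get(field)
            if isinstance(value, str):
                # Same label rule as above, so the edge points at the matching node.
                node[field] = [{"uid": blank_uid(value)}]
        out.append(node)
    return out
```

Calling link_records(data, "name", {"Manufacturer": "name", "Family": "name"}) on the input JSON yields the uid values and edges shown in the result. Note that an edge is created even when no object with that name exists in the dataset, as with "_:Starship".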

campoy commented :

This actually starts to look like a pretty well-defined feature.

Similarly to how @animesh2049 is working on a validator tool + library for the bulk loader in pull request #3838 (“Add bulk loader validator”, dgraph-io/dgraph), we should create a tool and library for this.

This might be a good candidate to be added as a small feature, somehow independent of the rest of the release.

Keeping it as P2 and waiting for someone to express interest in working on this project.