Generate blank UIDs based on JSON fields on mutations and loaders

diggy · December 27, 2018, 1:59pm

Moved from GitHub dgraph/2848

Experience Report

What you wanted to do

Import JSON data exported from any other database while making sure that the nodes will be unique, by telling the bulk loader which keys in my JSON represent uniqueness.

I would also like to be able to set up direct ingestion from APIs that serve JSON.

Something like this for a local JSON file:
dgraph bulk --jsons="./json.json" --setunique="mission_name, rocket_id, payload_id, site_id"

That way the bulk loader would generate a “uid” key throughout the JSON before sending it to Dgraph. This ensures that objects are unique.

Also, it would be interesting to have an option to import JSON served by an API. This could help in other situations, such as database migration: many people could use it to migrate from older databases to Dgraph.

From APIs serving JSON:
dgraph bulk --bulkapi="https://api.spacexdata.com/v3/launches/, https://api.spacexdata.com/v1/nasa_launches/" --setunique="mission_name, rocket_id, payload_id, site_id"
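Until something like the proposed --bulkapi option exists, the same effect can be approximated by downloading the JSON first and handing the loader a local file. The following is only a minimal Python sketch of that step; the function name fetch_records and the choice to merge all responses into one list are assumptions made for illustration, not part of Dgraph.

```python
import json
import urllib.request


def fetch_records(urls, timeout=30):
    """Download JSON from each URL and merge the results into one list.

    Hypothetical helper: a response that is a JSON array contributes all
    of its items, any other JSON value contributes a single item.
    """
    records = []
    for url in urls:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
        if isinstance(payload, list):
            records.extend(payload)
        else:
            records.append(payload)
    return records
```

The merged list can then be written to disk with json.dump and post-processed (for example to add uid keys) before loading.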

Any external references to support your case

For example, I have this API: https://api.spacexdata.com/v3/launches/. It provides me with a dataset. However, since its objects have no “uid” key holding a blank node (something like “uid: _:blank”), the rockets in this API's response will not be unique in Dgraph when I import it.

This was discussed with @codexnull

diggy · January 16, 2019, 2:41am

manishrjain commented :

Didn’t we already build support for JSON in bulk loader?

diggy · January 16, 2019, 3:02am

MichelDiz commented :

Yes, we did. After talking to Javier in late 2018, we believe that this enhancement would be plausible.

diggy · January 16, 2019, 5:39pm

codexnull commented :

Yes, the bulk loader already supports JSON. I believe this request is for supporting a way to automatically add a uid field to the data. Currently the loader requires that field to be present already.

diggy · January 16, 2019, 5:52pm

manishrjain commented :

Ok. After talking to Javier, my understanding is that this would create a blank node for the UID field in the JSON map (if the UID field isn’t present), based on the fields mentioned in --setunique. That way, all the records holding the same mission_name and/or rocket_id would get the same UID.

This seems like an easy and useful change.
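The behaviour described above can be sketched as a preprocessing pass over the JSON. This is a hedged illustration in Python, not the loader's code: the helper names are hypothetical, and two choices are assumptions, namely that the blank node label is a hash of whichever listed fields an object contains, and that nested objects are processed too.

```python
import hashlib
import json


def blank_uid(record, unique_fields):
    """Derive a blank-node label from the unique fields present in record.

    Returns None when the record has none of the fields (assumption:
    such records are left without a uid).
    """
    parts = [
        (field, record[field])
        for field in unique_fields
        if isinstance(record.get(field), (str, int, float))
    ]
    if not parts:
        return None
    digest = hashlib.sha1(json.dumps(parts).encode("utf-8")).hexdigest()
    return "_:" + digest[:16]


def add_uids(node, unique_fields):
    """Return a copy of the JSON tree with uid keys added where missing."""
    if isinstance(node, list):
        return [add_uids(item, unique_fields) for item in node]
    if isinstance(node, dict):
        out = {key: add_uids(value, unique_fields) for key, value in node.items()}
        if "uid" not in out:  # an existing uid is never overwritten
            uid = blank_uid(out, unique_fields)
            if uid is not None:
                out["uid"] = uid
        return out
    return node
```

Two objects with the same values for the listed fields receive the same label, so they refer to the same blank node within one load.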

diggy · September 13, 2019, 3:18pm

campoy commented :

For reference, to load all the data from pokeapi.com I had to play around with UIDs to make it work.

In addition to generating blank UIDs from existing IDs (id, name, etc.), I also had to turn some fields into objects.

For instance, given:

{
  "url": "/pokemon/1234",
  "name": "pikachu",
  "types": [
    "electric"
  ]
}

I had to modify it into:

{
  "uid": "_:pokemon1234",
  "url": "/pokemon/1234",
  "name": "pikachu",
  "types": [
    {
      "uid": "_:electric",
      "name": "electric"
    }
  ]
}

You can see the code in github.com/campoy/pokegraph.
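As a rough illustration of the rewrite shown above (this is not the code from the pokegraph repository), the two steps can be sketched in Python. The function names, and the rule of building the label from the URL path segments, are assumptions chosen to reproduce the example.

```python
def uid_from_url(url):
    """Build a blank-node label from URL path segments.

    Example: "/pokemon/1234" becomes "_:pokemon1234".
    """
    return "_:" + "".join(part for part in url.split("/") if part)


def expand_record(record, list_fields=("types",)):
    """Add a uid to the record and turn listed string arrays into objects."""
    out = dict(record)
    out["uid"] = uid_from_url(record["url"])
    for field in list_fields:
        if field in record:
            # Each string becomes its own node, shared by every record using it.
            out[field] = [{"uid": "_:" + value, "name": value} for value in record[field]]
    return out
```

Because every record with the type "electric" points at the label "_:electric", the loader sees a single shared node rather than one string per record.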

diggy · September 13, 2019, 4:26pm

MichelDiz commented :

BTW, there’s another, more obvious use case for this feature.

Take the JSON below:

[{
		"name": "SpaceX",
		"Industry": "Aerospace",
		"Founded": "May 6, 2002",
		"Services": "Orbital rocket launch",
		"Owner": "Elon Musk Trust",
		"employees": "7,000"
	},
	{
		"name": "Falcon 9",
		"type": "Family",
		"Stages": "2"
	},
	{
		"Engine": "Merlin 1D",
		"name": "Falcon Heavy",
		"Family": 	"Falcon 9",
		"Manufacturer":  "SpaceX"
	},
	{
		"Engine": "Raptor",
		"name": "SpaceX Starship",
		"Family": 	"Starship",
		"Manufacturer":  "SpaceX"
	}
]

The command

dgraph bulk -f ./json.json --setunique="name" --setedgeunique="Manufacturer:name,Family:name"

I added the idea of a new feature. “setedgeunique” would transform a scalar-valued edge into a node edge. It would look in the dataset for “name” and check whether it has the same value as “Manufacturer”. For example, if the Manufacturer field has the same value as the name field of some node, the two are related. So it would turn "Manufacturer": "SpaceX" into an edge pointing at the node with "name": "SpaceX".

Maybe we could add a special schema for this:

dgraph bulk -f ./json.json --unique=./schema.json

[{
	"setunique": {
		"name": "Name"  // All data with name will be the identifier of that data. Also, we could "rename" the edge. "name" to "Name" or other convention.
	},
	"setedgeunique": {
		"Manufacturer": "name", // In this field, there is no renaming, only the indication that two fields represent the same identifier.
		"Family": "name" // What matters here is the value inside the field matches.
	}
}]

Result

[{
		"uid": "_:SpaceX",
		"name": "SpaceX",
		"Industry": "Aerospace",
		"Founded": "May 6, 2002",
		"Services": "Orbital rocket launch",
		"Owner": "Elon Musk Trust",
		"employees": "7,000"
	},
	{
		"uid": "_:Falcon_9",
		"name": "Falcon 9",
		"type": "Family",
		"Stages": "2"
	},
	{
		"uid": "_:Falcon_Heavy",
		"Engine": "Merlin 1D",
		"name": "Falcon Heavy",
		"Family": [{
			"uid": "_:Falcon_9"
		}],
		"Manufacturer": [{
			"uid": "_:SpaceX"
		}]
	},
	{
		"uid": "_:SpaceX_Starship",
		"Engine": "Raptor",
		"name": "SpaceX Starship",
		"Family": [{
			"uid": "_:Starship"
		}],
		"Manufacturer": [{
			"uid": "_:SpaceX"
		}]
	}
]
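To make the proposal concrete, here is a minimal Python sketch of the transformation from the input JSON to the result. None of this exists in Dgraph: the function names are hypothetical, only a single unique field is handled, and the label rule (replace spaces with underscores) is inferred from the example output.

```python
def blank(value):
    """Turn a scalar into a blank-node label, e.g. "Falcon 9" -> "_:Falcon_9"."""
    return "_:" + str(value).replace(" ", "_")


def apply_unique(records, unique_field, edge_fields):
    """Add uids from unique_field and turn edge_fields scalars into node edges.

    edge_fields maps a field name to the unique field it refers to, for
    example {"Manufacturer": "name", "Family": "name"}.
    """
    result = []
    for record in records:
        out = {}
        if unique_field in record:
            out["uid"] = blank(record[unique_field])
        for key, value in record.items():
            if edge_fields.get(key) == unique_field:
                # The edge points at whichever node carries this value.
                out[key] = [{"uid": blank(value)}]
            else:
                out[key] = value
        result.append(out)
    return result
```

As in the result above, an edge is created even when no record carries the referenced value (for example "Starship"), so the target label may refer to a node that has no other data.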

diggy · September 17, 2019, 3:54pm

campoy commented :

This actually starts to look like a pretty well-defined feature.

Similarly to how @animesh2049 is working on a validator tool + library for the bulk loader in pull request #3838 (“Add bulk loader validator”, dgraph-io/dgraph), we should create a tool and library for this.

This might be a good candidate to be added as a small feature, somewhat independent of the rest of the release.

Keeping it as P2 and waiting for someone to express interest in working on this project.
