Loading data using pydgraph

I’m new to Dgraph and I wanted to know whether I can use something like the approach shown in the images below to load my dataset using the Python library pydgraph. The images show how I do the same thing in Neo4j. Or can this only be done with the live loader and bulk loader in Dgraph? (PS: my data is in CSV format.)

Hi @Kashifa_Khursheed ! Welcome to Dgraph.

You can use pydgraph to load data into Dgraph. The only catch is that you need a way to transform the CSV data into RDF. If you can share more details about your schema and data, I can probably suggest a way to do that.

Once you have a schema and the RDFs in place, it’s easy to do a live load or a bulk load, depending on the data size.

A simple script is here. If you need help let me know.
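For a CSV shaped like the movie dataset in this thread, a minimal sketch of that CSV-to-RDF step could look like the following. The predicate names (`movie_id`, `person_name`, `ACTED_IN`, `DIRECTED`) and the blank-node scheme are my own assumptions based on the columns, not the output of any official converter:

```python
import ast

def row_to_nquads(row):
    """Turn one movie CSV row (a dict of strings) into N-Quad lines.

    Blank nodes (_:m<movie_id>, _:p<person_id>) are temporary identifiers,
    so a person or movie repeated across rows maps to the same node,
    provided all rows go through Dgraph in the same transaction or load.
    """
    m = f'_:m{row["movie_id"]}'
    nquads = [
        f'{m} <movie_id> "{row["movie_id"]}" .',
        f'{m} <title> "{row["title"]}" .',
    ]
    # genres arrives as a stringified Python list, e.g. "['Drama','Romance']"
    for genre in ast.literal_eval(row["genres"]):
        nquads.append(f'{m} <genres> "{genre}" .')
    d = f'_:p{row["director_id"]}'
    nquads.append(f'{d} <person_name> "{row["director_name"]}" .')
    nquads.append(f'{d} <DIRECTED> {m} .')
    # cast columns are parallel stringified lists
    ids = ast.literal_eval(row["cast_ids"])
    names = ast.literal_eval(row["cast_names"])
    chars = ast.literal_eval(row["cast_characters"])
    for pid, name, char in zip(ids, names, chars):
        p = f'_:p{pid}'
        nquads.append(f'{p} <person_name> "{name}" .')
        # edge attributes become facets on the ACTED_IN edge
        nquads.append(f'{p} <ACTED_IN> {m} (character="{char}") .')
    return nquads
```

The resulting lines can be written to an `.rdf` file for the live loader, or sent directly from pydgraph with `txn.mutate(set_nquads="\n".join(...))`.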

The CSVs I have are of the form:

movie_id, title, genres, budget, revenue, director_name, director_id, cast_ids, cast_names, cast_characters

1, "Sunshine", ['Drama','Romance'], 100000, 8000, "David", 154, [2,5,8], ['Hannah','Laura','Carlos'], ['A','B','C']

(genres, cast_ids, cast_names are in the form of lists)
So, I would like to have mainly two node types, Person and Movie:

Person (person_id, person_name)–[:ACTED_IN or :DIRECTED_BY]–>Movie (movie_id, title, budget, revenue, genres (in list format))

  • the ACTED_IN relationship has edge attributes ‘character’ and ‘count’
  • the DIRECTED_BY relationship has only a ‘count’ edge attribute
    For example, for the above row, the nodes and relationships would look like this:

PERSON (person_id:154, person_name:‘David’) – [e:DIRECTED_BY (e.count)] → MOVIE (movie_id:1, title:‘Sunshine’, budget:100000, revenue:8000, genres:[‘Drama’,‘Romance’])

PERSON (person_id:2, person_name:‘Hannah’) – [r:ACTED_IN (r.character:“A”, r.count)] → MOVIE (movie_id:1, title:‘Sunshine’, budget:100000, revenue:8000, genres:[‘Drama’,‘Romance’])

PERSON (person_id:5, person_name:‘Laura’) – [r:ACTED_IN (r.character:“B”, r.count)] → MOVIE (movie_id:1, title:‘Sunshine’, budget:100000, revenue:8000, genres:[‘Drama’,‘Romance’])

PERSON (person_id:8, person_name:‘Carlos’) – [r:ACTED_IN (r.character:“C”, r.count)] → MOVIE (movie_id:1, title:‘Sunshine’, budget:100000, revenue:8000, genres:[‘Drama’,‘Romance’])

The count attribute would be incremented every time the same data is loaded. (I’m keeping this extra count attribute just to check whether it really is incremented when the same data is loaded again.)
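One way to keep repeated loads from duplicating nodes is a DQL upsert block sent through pydgraph: the query part finds existing nodes by an external ID and the mutation only links them. This is a sketch under the assumption that `person_id` and `movie_id` are the predicates holding the external IDs; as far as I know, incrementing a facet like `count` cannot be expressed inside a plain upsert, so that part has to be a read-modify-write from the client:

```python
def build_upsert(person_id, movie_id, relation="ACTED_IN"):
    """DQL upsert block: the query finds existing nodes by their external
    IDs and the mutation links them. Re-running it will not create
    duplicate nodes (it is a no-op if either side is missing)."""
    query = f"""{{
      p as var(func: eq(person_id, {person_id}))
      m as var(func: eq(movie_id, {movie_id}))
    }}"""
    nquads = f"uid(p) <{relation}> uid(m) ."
    return query, nquads

def run_upsert(client, query, nquads):
    """Execute query and mutation in one request with a pydgraph client."""
    txn = client.txn()
    try:
        mutation = txn.create_mutation(set_nquads=nquads)
        request = txn.create_request(query=query, mutations=[mutation],
                                     commit_now=True)
        txn.do_request(request)
    finally:
        txn.discard()
```

`create_mutation`/`create_request`/`do_request` is the upsert pattern documented in the pydgraph README.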

Try this converter

Related: Neo4JConvertor (test): Fixed some issues regarding conversion of Neo4J CSV to RDF by AvinashRatnam608 · Pull Request #7892 · dgraph-io/dgraph · GitHub

You can also try

I have also done this myself.

But that one is for JSON files exported from Neo4J.


Can you please elaborate on how to use the Neo4j CSV to RDF converter? I would like to convert a lot of CSV files in a folder to RDF. It would be of much help. Thank you.

I never used them, just my own code. But basically it comes down to checking the flags in the code and following your gut.

Note that none of this code is officially supported, and the tools behave differently: you can get different results in terms of entities, because Neo4J is weird…


Update: I have exploded the columns that contain lists, to make it simpler to import the data into Dgraph (the image below shows this). So, when running the ‘dgraph live’ command, am I supposed to also provide a schema file (written in DQL)? If yes, are the schema file’s predicate names the same as my column names? Can you draw out a sample schema file for the following data? I’ve converted it into a .json file; I’m only attaching a snapshot of the CSV to make it easier to understand.

You don’t need to; it can be empty. If it asks, providing an empty file is fine.
The schema is its own thing; it is not exactly DQL.

They will be: Dgraph infers the predicate names from the RDF file, and those are generated from the columns.

A schema is only needed when you need language tags or indexes.
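For illustration only, a minimal schema file for data like this could look as follows. The predicate names and index choices are assumptions based on the columns; none of it is required just to load the data:

```
movie_id: int @index(int) .
title: string @index(term) .
genres: [string] @index(exact) .
revenue: float .
budget: float .
person_id: int @index(int) .
person_name: string @index(term) .
ACTED_IN: [uid] @reverse .
DIRECTED_BY: [uid] @reverse .
```

If I recall correctly, it can be passed to the live loader with the `-s`/`--schema` flag, or applied from pydgraph via `client.alter(pydgraph.Operation(schema=...))`.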

1 Like

My idea right now is to have separate files for Movies, Directors, and Actors, and a separate file for Relationships. I want to compare the results between Neo4j and Dgraph, so I need to settle on a common file format beforehand.

(Movies file will contain only rows of movies data like movie_id, movie_name, genres, revenue, budget)
(Directors file will contain data like id, name)
(Actors file will contain id, name, character)
(Relationship file will contain relationships between Movie-Director and Movie-Actor)

I would like to have the data in a format that I can load into both Neo4j and Dgraph to test things out. Can you please suggest such a common data format?

Neo4J’s modeling is different from Dgraph’s. Both are directed graphs, but Neo4J works differently, and every case is different. A query in Neo4J can be 70% different in Dgraph, and so can the response. But conceptually they can follow a similar structure.

The simpler the modeling, the better. But if you have spent months modeling your graph in Neo4J, this gets a lot more complicated: some things need to be remodeled.

When I was playing with exporting data from Neo4J to Dgraph, I had to massage the data a lot, either via a loop in my code or using an Upsert Block in the Dgraph query. CSV is a bad choice for a graph database; it doesn’t make sense, because the relationships are detached. You would be better off staying with SQL.

Send them separately and handle the relations with some logic, e.g. https://dgraph.io/docs/mutations/upsert-block/#example-of-val-function if you don’t want to write code.


So, I’ll just stick to CSV for Neo4j and JSON for Dgraph.

The problem now is that I’ll have the data as separate JSONs (Movies, Actors, Directors, Relationships). I also want to have an edge attribute called “strength” that changes when edges overlap (say there’s a formula to increment or update strength values). Note that on the initial load the edges are unique, so strength is set to a default value (such as 0.5 or 1). When loading more data later, should I loop over the relationships file and perform an upsert query for each relation to update the strength? Because with the live loader or bulk loader I don’t think I can plug in a formula anywhere to update the ‘strength’ edge attribute.
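A client-side loop is indeed the straightforward way to apply a formula the loaders cannot express. Here is a sketch with pydgraph, assuming `person_id`/`movie_id` predicates and a hypothetical halve-the-distance-to-1.0 formula: read the current facet, compute the new value, and rewrite the edge. One caveat: a mutation replaces all facets on an edge, so real code would restate `character` alongside `strength`:

```python
import json

DEFAULT_STRENGTH = 0.5

def next_strength(current=None):
    """Hypothetical formula: new edges get the default; repeated loads
    move the strength halfway towards 1.0."""
    if current is None:
        return DEFAULT_STRENGTH
    return current + (1.0 - current) / 2.0

def update_strength(client, person_id, movie_id, relation="ACTED_IN"):
    """Read-modify-write the `strength` facet of one edge via pydgraph."""
    txn = client.txn()
    try:
        query = f"""{{
          p(func: eq(person_id, {person_id})) {{
            uid
            {relation} @filter(eq(movie_id, {movie_id})) @facets(strength) {{
              uid
            }}
          }}
          m(func: eq(movie_id, {movie_id})) {{ uid }}
        }}"""
        data = json.loads(txn.query(query).json)
        p_uid = data["p"][0]["uid"]
        m_uid = data["m"][0]["uid"]
        edges = data["p"][0].get(relation, [])
        # facets come back as "<predicate>|<facet>" keys on the child node
        current = edges[0].get(f"{relation}|strength") if edges else None
        nquad = (f'<{p_uid}> <{relation}> <{m_uid}> '
                 f'(strength={next_strength(current)}) .')
        txn.mutate(set_nquads=nquad, commit_now=True)
    finally:
        txn.discard()
```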

Data is not getting loaded when using the live loader; I’m encountering this error. I have separate files for Movies, Persons, and Relationships (JSON format):

Movies – movie_id, title, revenue, budget, genres
Persons – person_id, person_name
Relationships – movie_id, RELATIONSHIP, person_id

I’m getting this error while trying to load the Movies.json file. Zero and Alpha are already running…

NOTE: I’ve not created a schema anywhere; since the live loader detects predicates automatically, I wanted to rely on that.

You are using v20.03, but only v21.03 or newer is compatible with v22.0.x. Upgrade the live loader.

Can you share an example of your JSON?

[
  {
    "movie_id": 86838,
    "title": "Seven Psychopaths",
    "revenue": 19422261,
    "budget": 15000000,
    "genres": "['Comedy', 'Crime']"
  },
  {
    "movie_id": 44154,
    "title": "A Touch of Zen",
    "revenue": 0,
    "budget": 0,
    "genres": "['Action', 'Adventure']"
  },
  {
    "movie_id": 211798,
    "title": "Wallace & Gromit's Cracking Contraptions",
    "revenue": 0,
    "budget": 0,
    "genres": "['Animation', 'Comedy']"
  },
  {
    "movie_id": 84066,
    "title": "Bonnie and Clyde Italian Style",
    "revenue": 0,
    "budget": 0,
    "genres": "['Crime', 'Comedy']"
  },
  {
    "movie_id": 150473,
    "title": "Bad Hair Friday",
    "revenue": 0,
    "budget": 0,
    "genres": "['Thriller']"
  },
  {
    "movie_id": 40478,
    "title": "Baby Doll",
    "revenue": 0,
    "budget": 0,
    "genres": "['Drama']"
  }
]

Thanks, I could load all 3 files and it was pretty fast. But now I would like to form relationships between the nodes. Should I define a schema and then import all the files again?

I’d like movie_id, title, genres, revenue, and budget as node properties for the MOVIE node,
person_id and person_name as node properties for the PERSON node,
and the relationship as the edge between PERSON and MOVIE nodes… how can I achieve this?

Relationships don’t come from the schema; the schema just defines the types.

Which edges? What are the references?

If you have something like a “relationship table”, you can use https://dgraph.io/docs/mutations/upsert-block/ to link them.
But you’d be better off trying the tools I mentioned before, because they try their best to create valid RDF with the relations already in place.


I’ve used the live loader to load the movies.rdf, persons.rdf, and relationships.rdf files, but the edges aren’t created. Why is that? (Attaching snapshots of the three files below.)

Long story short: you have to send them all together, or use the upsert flag in live load: `-U, --upsertPredicate string` runs in upsertPredicate mode; the value will be used to store blank nodes as an xid.

The blank node context isn’t saved globally, so in each new mutation the blank nodes are treated as new UIDs.

You can also try xid mapping (I think it is easier):

"xidmap", "x", "", "Directory to store xid to uid mapping"

It will save the UID context to the path given by the flag, and you have to pass this flag on every load.


It’s not working out for me. Could we connect over Zoom and sort this out? It’s alright if you can’t; I’ll post more information about the problem.

Not sure, due to time zones. I have some meetings on Monday.

PS: if this data is public, send me the link and I will run it on my side.

The logic is simple.

You need to make sure that all entities are using the same BlankNode.

BlankNode is a unique temporary identifier that we use in RDF and mutations in general.

If you run two different transactions for the blank node <_:New01>, Dgraph will not reuse the UID from the first transaction, because when the transaction is committed, the context of the leased UIDs is lost. So the solution is as follows.

  1. Run all RDFs in one transaction, or upload them all in one batch via the Live Loader, which can read all RDF files within a given path.

  2. Or start the first transaction with xidmap. (The Bulk Loader also generates a map of xids; XID stands for “External ID”.)
    When running the command:
    dgraph live -x "~/pathTo/dgraphData/XIDs"
    you will save all UIDs mapped to each blank node identifier from your dump data.

You need to ensure that blank nodes are unique per entity. That is, Bob and all his attributes must share the same blank node, while Alice and her attributes use a different one.
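As a sketch of option 1 from pydgraph (assuming a connected client and RDF files like the three mentioned above), the files can be merged and sent as a single mutation, so each blank node resolves to one UID:

```python
def merge_rdf(chunks):
    """Join RDF fragments into one mutation body, dropping blank lines."""
    return "\n".join(line for chunk in chunks
                     for line in chunk.splitlines() if line.strip())

def load_in_one_txn(client, rdf_paths):
    """Send every file in a single pydgraph transaction, so blank nodes
    like _:m1 resolve to the same UID across all files."""
    nquads = merge_rdf(open(p, encoding="utf-8").read() for p in rdf_paths)
    txn = client.txn()
    try:
        resp = txn.mutate(set_nquads=nquads, commit_now=True)
        return resp.uids  # maps each blank-node name to its assigned UID
    finally:
        txn.discard()
```

For datasets too large for a single transaction, the `-x`/xidmap route above is the safer choice.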
