I was looking for a large real-world graph dataset to load into a Dgraph cluster to ultimately test
my spark-dgraph-connector. I want to share the dataset that met my requirements, together with all processing steps needed to reproduce it and get it loaded into Dgraph: https://github.com/EnricoMi/dgraph-dbpedia.
A large real-world dataset
Dgraph organizes the graph around predicates, so the dataset should contain predicates with these characteristics:
- a predicate that links a deep hierarchy of nodes
- a predicate that links a deep network of nodes
- a predicate that links strongly connected components
- a predicate with a lot of data, ideally a long string that exists for every node, in multiple languages
- a predicate with geo coordinates
- numerous predicates, to have a large schema
- a long-tail predicate frequency distribution:
  a few predicates have a high frequency (and low selectivity),
  most predicates have a low frequency (and high selectivity)
- predicates that, if they exist for a node, either
  - have a single occurrence (single value) or
  - have multiple occurrences (value list)
- real-world predicate names in multiple languages
- various data types and strings in multiple languages
A dataset that checks all these boxes can be found at the DBpedia project, which extracts structured information from Wikipedia and provides it in RDF format. However, that RDF data requires some preparation before it can be loaded into Dgraph, and given the size of the datasets, a scalable pre-processing step is required.
For this, I used Apache Spark to bring the real-world graph data into a Dgraph-compatible shape. Read the detailed tutorial on the pre-processing steps.
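To give a flavour of that pre-processing, here is a minimal Spark (Scala) sketch that parses the line-based Turtle files into a triples DataFrame. This is not the code from the repository; the input path and the parsing regex are simplified assumptions.

```scala
import org.apache.spark.sql.SparkSession

object TtlToTriples {
  // Matches one N-Triples-style line: <subject> <predicate> <object or "literal"> .
  private val Triple = """^(<[^>]+>|_:\S+)\s+(<[^>]+>)\s+(.+)\s+\.\s*$""".r

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dbpedia-ttl-to-triples").getOrCreate()
    import spark.implicits._

    // hypothetical input path; DBpedia ships one .ttl file per dataset and language
    val lines = spark.read.textFile("dbpedia/labels_en.ttl")

    val triples = lines
      .filter(line => !line.startsWith("#"))    // drop Turtle comment lines
      .flatMap {
        case Triple(s, p, o) => Seq((s, p, o))  // keep subject, predicate, object
        case _               => Seq.empty       // skip anything that does not parse
      }
      .toDF("subject", "predicate", "object")

    triples.show(5, truncate = false)
    spark.stop()
  }
}
```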
DBpedia Datasets
I have combined the following datasets from the DBpedia project into one graph:
dataset | filename | description |
---|---|---|
labels | labels_{lang}.ttl | Each article has a single title in the article’s language. |
category | article_categories_{lang}.ttl | Some articles link to categories, multiple categories allowed. |
skos | skos_categories_{lang}.ttl | Categories link to broader categories. Forms a deep hierarchy. |
inter-language links | interlanguage_links_{lang}.ttl | Articles link to the same article in all other languages. Forms strongly connected components. |
page links | page_links_{lang}.ttl | Articles link to other articles or other resources. Forms a network of articles. |
infobox | infobox_properties_{lang}.ttl | Some articles have infoboxes. Provides structured information as key-value tables. |
geo coordinates | geo_coordinates_{lang}.ttl | Some articles have geo coordinates of type Point. |
en_uris | {dataset}_en_uris_{lang}.ttl | Non-English labels, infobox and category predicates for English articles. Provides multiple language strings and predicates for articles. |
The infobox dataset provides real-world, user-generated, multi-language predicates. The other datasets each provide a fixed set of predicates.
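That long-tail predicate distribution is easy to inspect with Spark. The following sketch assumes the `triples` DataFrame from the parsing sketch above and simply counts triples per predicate; it is illustrative, not taken from the repository.

```scala
import org.apache.spark.sql.functions.desc

// Group by predicate to expose the long tail: a handful of very frequent
// predicates (labels, page links) next to a huge number of rare,
// user-generated infobox predicates.
val predicateFrequencies = triples
  .groupBy("predicate")
  .count()
  .orderBy(desc("count"))

predicateFrequencies.show(20, truncate = false)
println(s"distinct predicates: ${predicateFrequencies.count()}")
```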
Dataset Statistics
dataset | triples | nodes | predicates | dbpedia size | dgraph size | schema |
---|---|---|---|---|---|---|
labels | 94,410,463 | 76,478,687 | 1 | 1 GB | 1 GB | Article --rdfs:label-> lang string |
article_categories | 149,254,994 | 41,655,032 | 1 | 1 GB | 2 GB | Article --dcterms:subject-> Category |
skos_categories | 32,947,632 | 8,447,863 | 4 | 0.3 GB | 0.4 GB | Category --skos-core:broader-> Category |
interlanguage_links | 546,769,314 | 49,426,513 | 1 | 5 GB | 5 GB | Article --owl:sameAs-> Article |
page_links | 1,042,567,811 | 76,392,179 | 1 | 7 GB | 10 GB | Article --dbpedia:wikiPageWikiLink-> Article |
geo_coordinates | 1,825,817 | 1,825,817 | 1 | 0.05 GB | 0.03 GB | Article --georss:point-> geoJSON |
infobox_properties | 596,338,417 | 29,753,821 | 1,050,875 | 4 GB | 6 GB | Article --property-> literal or uri |
all | 2,396,517,559 | 86,737,376 | 1,050,884 | 19 GB | 24 GB | |
Read https://github.com/EnricoMi/dgraph-dbpedia/blob/master/README.md for detailed instructions on how to generate this dataset or a subset of it. See https://github.com/EnricoMi/dgraph-dbpedia/blob/master/SPARK.md to learn how Spark is used to scale the pre-processing of this dataset.
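As a final illustration of how Spark scales the last step, here is a hedged sketch of writing the prepared triples back out as gzipped N-Triples text files in parallel, in a shape that Dgraph's loaders can consume. It assumes the `triples` DataFrame from the sketches above; the output path is a placeholder and this is not the repository's actual output code.

```scala
import org.apache.spark.sql.functions.{col, concat_ws, lit}

// Serialise each row back into an N-Triples line ("<s> <p> <o> .") and let
// Spark write many gzipped part files in parallel.
triples
  .select(concat_ws(" ", col("subject"), col("predicate"), col("object"), lit(".")))
  .write
  .option("compression", "gzip")
  .text("dgraph-rdf")
```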