Pre-processing DBpedia dataset for Dgraph

I was looking for a large real-world graph dataset to load into a Dgraph cluster to ultimately test
my spark-dgraph-connector. I want to share the dataset that met my requirements, together with all processing steps needed to reproduce it and get it loaded into Dgraph:

A large real-world dataset

Dgraph organizes the graph around predicates, so the dataset should contain predicates with these characteristics:

  • a predicate that links a deep hierarchy of nodes
  • a predicate that links a deep network of nodes
  • a predicate that links strongly connected components
  • a predicate with a lot of data, ideally a long string that exists for every node, in multiple languages
  • a predicate with geo coordinates
  • numerous predicates, to have a large schema
  • a long-tail predicate frequency distribution:
    a few predicates have high frequency (and low selectivity),
    most predicates have low frequency (and high selectivity)
  • predicates that, if they exist for a node:
    • have a single occurrence (single value)
    • have multiple occurrences (value list)
  • real-world predicate names in multiple languages
  • various data types and strings in multiple languages

A dataset that checks all these boxes can be found at the DBpedia project. It extracts structured information from the Wikipedia project and provides it in RDF format. However, that RDF data requires some preparation before it can be loaded into Dgraph. Given the size of the datasets, a scalable pre-processing step is required.

For this, I used Apache Spark to bring real-world graph data into a Dgraph-compatible shape. Read the detailed tutorial on the pre-processing steps.
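To give a flavour of what this pre-processing looks like, here is a minimal sketch (plain Python, not the actual Spark pipeline): DBpedia dumps are line-based Turtle files, so the first step is to split each line into its subject, predicate and object. In Spark this would be a simple map over a text file; the function name and regex below are illustrative and do not cover every Turtle edge case (e.g. literals containing ` . `).

```python
import re

# One triple per line: <subject> <predicate> object .
# The lazy (.+?) leaves the trailing " ." for the anchor to consume.
TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.$')

def parse_triple(line: str):
    """Return (subject, predicate, object) or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    m = TRIPLE.match(line)
    return m.groups() if m else None

line = '<http://dbpedia.org/resource/Graph> ' \
       '<http://www.w3.org/2000/01/rdf-schema#label> "Graph"@en .'
print(parse_triple(line))
```

From here, the per-dataset transformations (renaming predicates, filtering languages, re-encoding values) operate on these three columns.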

DBpedia Datasets

I have combined the following datasets from the DBpedia project into one graph:

| dataset | filename | description |
|---|---|---|
| labels | labels_{lang}.ttl | Each article has a single title in the article’s language. |
| category | article_categories_{lang}.ttl | Some articles link to categories; multiple categories are allowed. |
| skos | skos_categories_{lang}.ttl | Categories link to broader categories. Forms a deep hierarchy. |
| inter-language links | interlanguage_links_{lang}.ttl | Articles link to the same article in all other languages. Forms strongly connected components. |
| page links | page_links_{lang}.ttl | Articles link to other articles or other resources. Forms a network of articles. |
| infobox | infobox_properties_{lang}.ttl | Some articles have infoboxes. Provides structured information as key-value tables. |
| geo coordinates | geo_coordinates_{lang}.ttl | Some articles have geo coordinates of type Point. |
| en_uris | {dataset}_en_uris_{lang}.ttl | Non-English labels, infobox and category predicates for English articles. Provides multiple language strings and predicates for articles. |

The infobox dataset provides real-world user-generated multi-language predicates.
The other datasets provide a fixed set of predicates each.
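Given the filename patterns in the table above, the set of dump files to download for a selection of languages can be enumerated mechanically. The helper below is a hypothetical sketch; it assumes the en_uris variant exists for the labels, category and infobox datasets (as described in the table) and only for non-English languages.

```python
# Dataset name stems taken from the filename column of the table above.
datasets = [
    "labels", "article_categories", "skos_categories",
    "interlanguage_links", "page_links", "infobox_properties",
    "geo_coordinates",
]
# Datasets that also exist as {dataset}_en_uris_{lang}.ttl (assumption
# based on the en_uris row: labels, categories and infobox predicates).
en_uris_datasets = ["labels", "article_categories", "infobox_properties"]

def dump_files(languages):
    """List the DBpedia dump filenames needed for the given languages."""
    files = []
    for lang in languages:
        files += [f"{d}_{lang}.ttl" for d in datasets]
        if lang != "en":
            files += [f"{d}_en_uris_{lang}.ttl" for d in en_uris_datasets]
    return files

print(dump_files(["en", "de"]))
```

For English plus one other language this already yields 17 files, which is why a scripted download and a scalable pre-processing step pay off quickly.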

Dataset Statistics

| dataset | triples | nodes | predicates | dbpedia | dgraph | schema |
|---|---:|---:|---:|---:|---:|---|
| labels | 94,410,463 | 76,478,687 | 1 | 1 GB | 1 GB | Article --rdfs:label-> lang string |
| article_categories | 149,254,994 | 41,655,032 | 1 | 1 GB | 2 GB | Article --dcterms:subject-> Category |
| skos_categories | 32,947,632 | 8,447,863 | 4 | 0.3 GB | 0.4 GB | Category --skos-core:broader-> Category |
| interlanguage_links | 546,769,314 | 49,426,513 | 1 | 5 GB | 5 GB | Article --owl:sameAs-> Article |
| page_links | 1,042,567,811 | 76,392,179 | 1 | 7 GB | 10 GB | Article --dbpedia:wikiPageWikiLink-> Article |
| geo_coordinates | 1,825,817 | 1,825,817 | 1 | 0.05 GB | 0.03 GB | Article --georss:point-> geoJSON |
| infobox_properties | 596,338,417 | 29,753,821 | 1,050,875 | 4 GB | 6 GB | Article --property-> literal or uri |
| all | 2,396,517,559 | 86,737,376 | 1,050,884 | 19 GB | 24 GB | |

Read the tutorial for detailed instructions on how to generate this dataset or a subset of it, and to learn how Spark is used to scale the pre-processing of this dataset.
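One concrete pre-processing step is implied by the geo_coordinates row of the statistics table: DBpedia stores coordinates as a georss:point literal like "51.5 -0.12" (latitude, then longitude), while Dgraph expects geo values as geoJSON. A minimal sketch of that conversion, with a function name of my own choosing:

```python
import json

def point_to_geojson(georss_point: str) -> str:
    """Convert a georss:point literal ("lat lon") into a geoJSON Point string."""
    lat, lon = map(float, georss_point.split())
    # geoJSON coordinates are ordered [longitude, latitude]
    return json.dumps({"type": "Point", "coordinates": [lon, lat]})

print(point_to_geojson("51.5 -0.12"))
# {"type": "Point", "coordinates": [-0.12, 51.5]}
```

Note the axis swap: getting the longitude/latitude order wrong is a classic mistake when feeding geo data into Dgraph.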