I was looking for a large real-world graph dataset to load into a Dgraph cluster to ultimately test
my spark-dgraph-connector. I want to share the dataset that met my requirements, together with all processing steps needed to reproduce it and get it loaded into Dgraph: https://github.com/EnricoMi/dgraph-dbpedia.
A large real-world dataset
Dgraph organizes the graph around predicates, so the dataset should contain predicates with these characteristics:
- a predicate that links a deep hierarchy of nodes
- a predicate that links a deep network of nodes
- a predicate that links strongly connected components
- a predicate with a lot of data, ideally a long string that exists for every node, in multiple languages
- a predicate with geo coordinates
- numerous predicates, to have a large schema
- a long-tail predicate frequency distribution:
  a few predicates have a high frequency (and low selectivity),
  most predicates have a low frequency (and high selectivity)
- predicates that, if they exist for a node, either
  - have a single occurrence (single value) or
  - have multiple occurrences (value list)
- real-world predicate names in multiple languages
- various data types and strings in multiple languages
A dataset that checks all these boxes can be found at the DBpedia project, which extracts structured information from Wikipedia and provides it in RDF format. However, that RDF data requires some preparation before it can be loaded into Dgraph, and given the size of the datasets, a scalable pre-processing step is required.
For this, I used Apache Spark to bring the real-world graph data into a Dgraph-compatible shape. Read the detailed tutorial on the pre-processing steps.
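To give a flavour of that pre-processing, here is a minimal Spark (Scala) sketch that parses the line-based Turtle files into a triples DataFrame. This is not the code from the repository; the input path and the parsing regex are simplified assumptions.

```scala
import org.apache.spark.sql.SparkSession

object TtlToTriples {
  // Matches one N-Triples-style line: <subject> <predicate> <object or "literal"> .
  private val Triple = """^(<[^>]+>|_:\S+)\s+(<[^>]+>)\s+(.+)\s+\.\s*$""".r

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dbpedia-ttl-to-triples").getOrCreate()
    import spark.implicits._

    // hypothetical input path; DBpedia ships one .ttl file per dataset and language
    val lines = spark.read.textFile("dbpedia/labels_en.ttl")

    val triples = lines
      .filter(line => !line.startsWith("#"))    // drop Turtle comment lines
      .flatMap {
        case Triple(s, p, o) => Seq((s, p, o))  // keep subject, predicate, object
        case _               => Seq.empty       // skip anything that does not parse
      }
      .toDF("subject", "predicate", "object")

    triples.show(5, truncate = false)
    spark.stop()
  }
}
```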
DBpedia Datasets
I have combined the following datasets from the DBpedia project into one graph:
dataset | filename | description |
---|---|---|
labels | labels_{lang}.ttl | Each article has a single title in the article’s language. |
category | article_categories_{lang}.ttl | Some articles link to categories, multiple categories allowed. |
skos | skos_categories_{lang}.ttl | Categories link to broader categories. Forms a deep hierarchy. |
inter-language links | interlanguage_links_{lang}.ttl | Articles link to the same article in all other languages. Forms strongly connected components. |
page links | page_links_{lang}.ttl | Articles link to other articles or other resources. Forms a network of articles. |
infobox | infobox_properties_{lang}.ttl | Some articles have infoboxes. Provides structured information as key-value tables. |
geo coordinates | geo_coordinates_{lang}.ttl | Some articles have geo coordinates of type Point. |
en_uris | {dataset}_en_uris_{lang}.ttl | Non-English labels, infobox and category predicates for English articles. Provides multiple language strings and predicates for articles. |
The infobox dataset provides real-world, user-generated, multi-language predicates. The other datasets each provide a fixed set of predicates.
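That long-tail predicate distribution is easy to inspect with Spark. The following sketch assumes the `triples` DataFrame from the parsing sketch above and simply counts triples per predicate; it is illustrative, not taken from the repository.

```scala
import org.apache.spark.sql.functions.desc

// Group by predicate to expose the long tail: a handful of very frequent
// predicates (labels, page links) next to a huge number of rare,
// user-generated infobox predicates.
val predicateFrequencies = triples
  .groupBy("predicate")
  .count()
  .orderBy(desc("count"))

predicateFrequencies.show(20, truncate = false)
println(s"distinct predicates: ${predicateFrequencies.count()}")
```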
Dataset Statistics
dataset | triples | nodes | predicates | dbpedia size | dgraph size | schema |
---|---|---|---|---|---|---|
labels | 94,410,463 | 76,478,687 | 1 | 1 GB | 1 GB | Article --rdfs:label-> lang string |
article_categories | 149,254,994 | 41,655,032 | 1 | 1 GB | 2 GB | Article --dcterms:subject-> Category |
skos_categories | 32,947,632 | 8,447,863 | 4 | 0.3 GB | 0.4 GB | Category --skos-core:broader-> Category |
interlanguage_links | 546,769,314 | 49,426,513 | 1 | 5 GB | 5 GB | Article --owl:sameAs-> Article |
page_links | 1,042,567,811 | 76,392,179 | 1 | 7 GB | 10 GB | Article --dbpedia:wikiPageWikiLink-> Article |
geo_coordinates | 1,825,817 | 1,825,817 | 1 | 0.05 GB | 0.03 GB | Article --georss:point-> geoJSON |
infobox_properties | 596,338,417 | 29,753,821 | 1,050,875 | 4 GB | 6 GB | Article --property-> literal or uri |
all | 2,396,517,559 | 86,737,376 | 1,050,884 | 19 GB | 24 GB | |
Read https://github.com/EnricoMi/dgraph-dbpedia/blob/master/README.md for detailed instructions on how to generate this dataset or a subset of it. See https://github.com/EnricoMi/dgraph-dbpedia/blob/master/SPARK.md to learn how Spark is used to scale the pre-processing of this dataset.
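As a final illustration of how Spark scales the last step, here is a hedged sketch of writing the prepared triples back out as gzipped N-Triples text files in parallel, in a shape that Dgraph's loaders can consume. It assumes the `triples` DataFrame from the sketches above; the output path is a placeholder and this is not the repository's actual output code.

```scala
import org.apache.spark.sql.functions.{col, concat_ws, lit}

// Serialise each row back into an N-Triples line ("<s> <p> <o> .") and let
// Spark write many gzipped part files in parallel.
triples
  .select(concat_ws(" ", col("subject"), col("predicate"), col("object"), lit(".")))
  .write
  .option("compression", "gzip")
  .text("dgraph-rdf")
```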