Discussion: Wikipedia backed by DGraph

I am interested to learn whether anyone has attempted to load a Wikipedia data dump into Dgraph.

How would a Wikipedia site built on top of Dgraph perform?
Is Dgraph a good fit for solving this problem?
Could loading this data into Dgraph enable a more dynamic and richer experience?
How much would it cost in terms of resources (storage/processing) to attempt this?
What would the bulk loading strategy be for such an attempt?

Hmm, maybe.

Which problem?

Dgraph is a graph database, so any kind of linked data should work fine. But you have to master the query language first, and also understand Wikipedia’s data structure.

I see tons of small pieces of data; Wikipedia surely has TBs of it. Costs are hard to estimate, but they would probably be very similar to what Wikipedia already spends today. The pros and cons are hard to tell, though, even if we were aware of their setup.

BTW, the data dumps I see are XML. That won’t work in Dgraph directly; you have to convert it first. OpenRefine might be a good tool for that task.
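To illustrate what that conversion could look like, here is a minimal sketch in Python. The predicate names `<title>` and `<text>` are made up, and the element names assume the standard MediaWiki export format; it streams the XML dump and writes RDF triples the live or bulk loader could consume:

```python
# Minimal sketch: stream a MediaWiki XML dump and emit Dgraph-compatible
# RDF triples (one per line). Element names like "page" and "title" follow
# the MediaWiki export format; adjust if your dump differs.
import sys
import xml.etree.ElementTree as ET

def local(tag):
    # Drop the XML namespace, e.g. "{http://...}title" -> "title".
    return tag.rsplit('}', 1)[-1]

def escape(value):
    # Escape backslashes, quotes and newlines for an RDF string literal.
    return value.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

def convert(dump_path, out):
    for _, elem in ET.iterparse(dump_path, events=('end',)):
        if local(elem.tag) != 'page':
            continue
        title = text = page_id = None
        for child in elem.iter():
            name = local(child.tag)
            if name == 'title':
                title = child.text
            elif name == 'id' and page_id is None:
                page_id = child.text  # first <id> in document order is the page id
            elif name == 'text':
                text = child.text or ''
        if page_id and title:
            subject = f'_:page{page_id}'
            out.write(f'{subject} <title> "{escape(title)}" .\n')
            out.write(f'{subject} <text> "{escape(text)}" .\n')
        elem.clear()  # keep memory bounded while streaming

if __name__ == '__main__':
    convert(sys.argv[1], sys.stdout)
```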


At Google, the Knowledge Graph was extracted from Wikipedia as well, and that is what Dgraph was originally designed to serve. So it can surely be used to serve Wikipedia. We are currently doing testing with a TB of data; perhaps this dataset could be used there.


Do you think the dataset can be exposed via a GraphQL endpoint? I would be keen to build a Wikipedia replica on top of that endpoint.
This would also be a great reference for the scale at which Dgraph can operate.
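For example, I imagine querying it from Python along these lines. This is just a sketch: it assumes a local Alpha serving the generated GraphQL API on :8080 and a hypothetical Article type with a term-indexed title and a linksTo edge:

```python
# Sketch of hitting Dgraph's GraphQL endpoint from Python.
# Assumes a local Alpha serving GraphQL on :8080 and a hypothetical
# "Article" type with "title" and "linksTo" fields in the GraphQL schema.
import requests

QUERY = """
query {
  queryArticle(filter: { title: { anyofterms: "graph database" } }, first: 10) {
    title
    linksTo {
      title
    }
  }
}
"""

resp = requests.post(
    "http://localhost:8080/graphql",
    json={"query": QUERY},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["queryArticle"])
```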

Wikipedia backed by Dgraph is what I had in mind as a problem statement :slight_smile:

I have loaded DBpedia, an RDF version of Wikipedia, into Dgraph, specifically the 2016-10 dump. In contrast to Wikipedia, DBpedia is much less text-heavy (at most there is a long abstract, but not the entire article, iirc).

The goal was to get a large graph (500M triples) with a wide long-tail schema (230k predicates) and to query those data with simple single-step path queries, so it works as a benchmark dataset. The benchmark is meant not to measure how fast Dgraph is, but how performance degrades with scale (constant, linear, polynomial). Loading that data was very painful with 20.03.3 (memory-wise), but discussions with the core devs gave me the impression that the next version is much more stable and performant.
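To give a feeling for the pre-processing such a schema requires (this is only a rough sketch, not my actual conversion code): with 230k predicates you cannot write the Dgraph schema by hand, so one pass over the N-Triples has to collect the predicates and emit a schema entry for each of them:

```python
# Rough sketch (not the actual conversion code): collect the predicates
# occurring in an N-Triples file and emit a Dgraph schema entry for each,
# so the bulk loader has a schema line for every predicate.
import sys
from collections import Counter

def collect_predicates(ntriples_path):
    counts = Counter()
    with open(ntriples_path, encoding='utf-8') as f:
        for line in f:
            parts = line.split(' ', 2)
            if len(parts) == 3:
                # The predicate is the second whitespace-separated token, e.g. <http://...>.
                counts[parts[1].strip('<>')] += 1
    return counts

def emit_schema(counts, out):
    for predicate, _ in counts.most_common():
        # Treat everything as a string here; the real schema would also
        # distinguish uid edges, datatypes, and language-tagged strings.
        out.write(f'<{predicate}>: string .\n')

if __name__ == '__main__':
    emit_schema(collect_predicates(sys.argv[1]), sys.stdout)
```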

Query performance in my use case is satisfying, except for some issues around pagination.
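To illustrate the kind of single-step path query I mean, here is a sketch using the pydgraph client; the birthPlace predicate is just an example, and any of the 230k predicates can be queried the same way, paginated with first/offset:

```python
# Sketch of a single-step path query with pagination via pydgraph.
# The predicate <http://dbpedia.org/ontology/birthPlace> is only an example.
import json
import pydgraph

stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(stub)

query = """
{
  people(func: has(<http://dbpedia.org/ontology/birthPlace>), first: 100, offset: 200) {
    uid
    <http://dbpedia.org/ontology/birthPlace> {
      uid
    }
  }
}
"""

txn = client.txn(read_only=True)
try:
    resp = txn.query(query)
    print(json.dumps(json.loads(resp.json), indent=2))
finally:
    txn.discard()
    stub.close()
```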

I think your problem statement needs a “why”: why would a graph database be beneficial, and what is the use case / access pattern that Dgraph could improve?


Did you start a Discuss post about all the issues you came across? Can you link it here if there is one already? Also, it would be great if you could share the code with us if it is open source; that would help us identify more issues and make the next releases more stable.

I will open-source the code to transform the DBpedia dataset into Dgraph RDF triples in a few weeks and link it here. I will then redo the load with the latest release of Dgraph.


I am primarily interested in surfacing the new capabilities that naturally become possible when this dataset is loaded into Dgraph.
I think this dataset can serve as a large enough public reference that showcases the capabilities of Dgraph.

I am also interested in estimating the expense involved in the ETL of this dataset into Dgraph, as well as in running it.


I have put the code to generate the DBpedia dataset online, linked from here: Pre-processing DBpedia dataset for Dgraph.

I have also rerun the whole procedure against 20.07.1 and can confirm that the memory issues that hit me have been fixed. I could reproduce the issue with 20.03.0.


Thanks for getting back to us on this. A lot of work on memory management has been done recently. For instance: Dgraph crashes after predicate movement (another oom crash?) - #6 by praneelrathore
Glad that it works for you now! :slight_smile: