Discussion: Wikipedia backed by DGraph

I am interested to learn whether anyone has attempted to load a Wikipedia data dump into Dgraph.

How would a Wikipedia site built on top of Dgraph perform?
Is Dgraph a good fit for solving this problem?
Could loading this data into Dgraph enable a more dynamic and richer experience?
How much would it cost in terms of resources (storage/processing) to attempt this?
What would the bulk loading strategy be for such an attempt?

Hmm, maybe.

Which problem?

Dgraph is a graph database, so any kind of linked data should work fine. But you have to master the query language first, and also understand Wikipedia’s data structure.

I see tons of small pieces of data; Wikipedia surely has TBs of it. Costs are hard to estimate, but they would probably be very similar to what Wikipedia already spends today. The pros and cons are hard to tell, though, even if we were aware of their setup.

BTW, the data dumps I see are XML. That won’t work in Dgraph directly; you have to convert it first. OpenRefine might be a good tool for that task.
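To illustrate what that conversion could look like, here is a minimal sketch in Python. The predicate names `<title>` and `<text>` are made up, and the element names assume the standard MediaWiki export format; it streams the XML dump and writes RDF triples the live or bulk loader could consume:

```python
# Minimal sketch: stream a MediaWiki XML dump and emit Dgraph-compatible
# RDF triples (one per line). Element names like "page" and "title" follow
# the MediaWiki export format; adjust if your dump differs.
import sys
import xml.etree.ElementTree as ET

def local(tag):
    # Drop the XML namespace, e.g. "{http://...}title" -> "title".
    return tag.rsplit('}', 1)[-1]

def escape(value):
    # Escape backslashes, quotes and newlines for an RDF string literal.
    return value.replace('\\', '\\\\').replace('"', '\\"').replace('\n', '\\n')

def convert(dump_path, out):
    for _, elem in ET.iterparse(dump_path, events=('end',)):
        if local(elem.tag) != 'page':
            continue
        title = text = page_id = None
        for child in elem.iter():
            name = local(child.tag)
            if name == 'title':
                title = child.text
            elif name == 'id' and page_id is None:
                page_id = child.text  # first <id> in document order is the page id
            elif name == 'text':
                text = child.text or ''
        if page_id and title:
            subject = f'_:page{page_id}'
            out.write(f'{subject} <title> "{escape(title)}" .\n')
            out.write(f'{subject} <text> "{escape(text)}" .\n')
        elem.clear()  # keep memory bounded while streaming

if __name__ == '__main__':
    convert(sys.argv[1], sys.stdout)
```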


At Google, the Knowledge Graph was extracted from Wikipedia as well, and that is what Dgraph was originally designed to serve. So it can surely be used to serve Wikipedia. We are currently doing testing with a TB of data; perhaps this dataset could be used there.


Do you think the dataset can be exposed via a GraphQL endpoint? I would be keen to build a Wikipedia replica on top of that endpoint.
This would also be a great reference for the scale at which Dgraph can operate.
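For example, I imagine querying it from Python along these lines. This is just a sketch: it assumes a local Alpha serving the generated GraphQL API on :8080 and a hypothetical Article type with a term-indexed title and a linksTo edge:

```python
# Sketch of hitting Dgraph's GraphQL endpoint from Python.
# Assumes a local Alpha serving GraphQL on :8080 and a hypothetical
# "Article" type with "title" and "linksTo" fields in the GraphQL schema.
import requests

QUERY = """
query {
  queryArticle(filter: { title: { anyofterms: "graph database" } }, first: 10) {
    title
    linksTo {
      title
    }
  }
}
"""

resp = requests.post(
    "http://localhost:8080/graphql",
    json={"query": QUERY},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["queryArticle"])
```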

Wikipedia backed by Dgraph is what I had in mind as a problem statement :slight_smile:

I have loaded DBpedia, an RDF version of Wikipedia, into Dgraph, specifically the 2016-10 dump. In contrast to Wikipedia, DBpedia is much less text-heavy (at most there is a long abstract, but not the entire article, iirc).

The goal was to get a large graph (500M triples) with a wide long-tail schema (230k predicates) and to query those data with simple single-step path queries, so it works as a benchmark dataset. The benchmark is meant not to measure how fast Dgraph is, but how performance degrades with scale (constant, linear, polynomial). Loading that data was very painful with 20.03.3 (memory-wise), but discussions with the core devs gave me the impression that the next version is much more stable and performant.
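To give a feeling for the pre-processing such a schema requires (this is only a rough sketch, not my actual conversion code): with 230k predicates you cannot write the Dgraph schema by hand, so one pass over the N-Triples has to collect the predicates and emit a schema entry for each of them:

```python
# Rough sketch (not the actual conversion code): collect the predicates
# occurring in an N-Triples file and emit a Dgraph schema entry for each,
# so the bulk loader has a schema line for every predicate.
import sys
from collections import Counter

def collect_predicates(ntriples_path):
    counts = Counter()
    with open(ntriples_path, encoding='utf-8') as f:
        for line in f:
            parts = line.split(' ', 2)
            if len(parts) == 3:
                # The predicate is the second whitespace-separated token, e.g. <http://...>.
                counts[parts[1].strip('<>')] += 1
    return counts

def emit_schema(counts, out):
    for predicate, _ in counts.most_common():
        # Treat everything as a string here; the real schema would also
        # distinguish uid edges, datatypes, and language-tagged strings.
        out.write(f'<{predicate}>: string .\n')

if __name__ == '__main__':
    emit_schema(collect_predicates(sys.argv[1]), sys.stdout)
```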

Query performance in my use case is satisfying, except for some issues around pagination.
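To illustrate the kind of single-step path query I mean, here is a sketch using the pydgraph client; the birthPlace predicate is just an example, and any of the 230k predicates can be queried the same way, paginated with first/offset:

```python
# Sketch of a single-step path query with pagination via pydgraph.
# The predicate <http://dbpedia.org/ontology/birthPlace> is only an example.
import json
import pydgraph

stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(stub)

query = """
{
  people(func: has(<http://dbpedia.org/ontology/birthPlace>), first: 100, offset: 200) {
    uid
    <http://dbpedia.org/ontology/birthPlace> {
      uid
    }
  }
}
"""

txn = client.txn(read_only=True)
try:
    resp = txn.query(query)
    print(json.dumps(json.loads(resp.json), indent=2))
finally:
    txn.discard()
    stub.close()
```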

I think your problem statement needs a “why”: why would a graph database be beneficial, and what is the use case / access pattern that Dgraph could improve?


Did you start a Discuss post about all the issues you came across? Can you link it here if there is one already? Also, it would be great if you could share the code with us if it is open source; that would help us identify more issues and make the next releases more stable.

I will open-source the code to transform the DBpedia dataset into Dgraph RDF triples in a few weeks and link it here. I will then redo the load with the latest release of Dgraph.


I am primarily interested in surfacing the new capabilities that naturally become possible when this dataset is loaded into Dgraph.
I think this dataset can serve as a large enough public reference that showcases the capabilities of Dgraph.

I am also interested in estimating the expense involved in the ETL of this dataset into Dgraph, as well as in running it.


I have put the code to generate the DBpedia dataset online, linked from here: Pre-processing DBpedia dataset for Dgraph.

I have also rerun the whole procedure against 20.07.1 and can confirm that the memory issues that hit me have been fixed. I could reproduce the issue with 20.03.0.


Thanks for getting back to us on this. A lot of work on memory management has been done recently. For instance: Dgraph crashes after predicate movement (another oom crash?) - #6 by praneelrathore
Glad that it works for you now! :slight_smile: