XID in other graph DBs

(Michael Compton) #1

I did a quick survey of what other graph DBs are doing with IDs: whether they allow external IDs, and whether they handle RDF-style URIs, e.g. <http://dbpedia.org/resource/Category:German_money_launderers>.

The short summary: RDF stores allow URI-style IDs, but largely these aren't distributed stores (though they must somehow be mapping URIs to an internal ID if they allow concurrent writes), while non-RDF stores, like Neo4j, generally don't allow external IDs for nodes.

Neo4j, for example, has some recent material on importing RDF (http://connected-data.london/2016/06/09/importing-rdf-data-neo4j/), in which they add a string-valued edge to each node to store its URI - the same as I think we are considering. This is quick, but doesn't enforce uniqueness of nodes for a given URI: i.e. concurrent writes could mint two nodes with the same URI. I'm not sure how much of a problem that is, though, because I'd expect that bulk uploads of existing RDF matter most, so if we handle those client side, that's OK.

Some noSQL+graph stores allow keys in the data and thus have mechanisms for dealing with concurrent writes: e.g. JanusGraph allows turning off consistency checking during bulk uploads so it doesn't have to check the keys.
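
To make the client-side idea concrete, here's a minimal sketch (hypothetical Python, not any store's actual API) of how a bulk-upload client could deduplicate URIs itself, mapping each external URI to an internal 64-bit-style ID so that two writes with the same URI never mint two nodes:

```python
import itertools
import threading

class XidMap:
    """Client-side map from external URIs (XIDs) to internal IDs.

    Hypothetical sketch: dedupes URIs before they reach the store, so a
    bulk upload never mints two internal nodes for the same URI.
    """

    def __init__(self):
        self._ids = {}                   # uri -> internal id
        self._next = itertools.count(1)  # internal id allocator
        self._lock = threading.Lock()    # guard concurrent lookups

    def uid(self, uri):
        with self._lock:
            if uri not in self._ids:
                self._ids[uri] = next(self._next)
            return self._ids[uri]

xids = XidMap()
a = xids.uid("http://dbpedia.org/resource/Category:German_money_launderers")
b = xids.uid("http://dbpedia.org/resource/Category:German_money_launderers")
assert a == b  # same URI always resolves to the same internal ID
```

The lock makes the lookup safe for a multithreaded uploader; in a real client the map would also need to be persisted or rebuilt between runs.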

Graph DBs I looked at:

Neo4j https://neo4j.com/

AllegroGraph https://franz.com/agraph/allegrograph/

  • RDF quad store
  • single machine - but also Federation (“When a user creates an AllegroGraph federated repository, a virtual index of the constituent stores is created and maintained in the client session…”)
  • SPARQL queries

IBM Graph https://www.ibm.com/bb-en/marketplace/graph

  • built over TinkerPop

GraphBase http://graphbase.net/

  • distributed graph database
  • own structure (Graph Simple Form)
  • each vertex 128 bit ID
  • builtin query API (?)
  • can layer RDF on top using Jena - not clear what the encoding is. Assume no XIDs
  • office in same building as Dgraph?

BlazeGraph https://www.blazegraph.com/

  • RDF quad store
  • single machine to 50B quads

Cray Graph Engine http://www.cray.com/products/analytics/cray-graph-engine

  • RDF store
  • Single machine on Cray hardware

OntoText GraphDB https://ontotext.com/products/graphdb/

  • RDF store
  • Enterprise version is distributed
  • Master-worker distribution with single DB copied on all instances.

ArangoDB https://www.arangodb.com

  • distributed graph/noSQL DB
  • data sharded across DB nodes in the cluster
  • SQL-like query language
  • graphs built on their noSQL infrastructure (?), all on RocksDB
  • each document is indexed by a key - can be user specified (?)

JanusGraph http://janusgraph.org/

  • distributed graph database
  • has an option to turn off ID checking during batch loading
  • seems like there is an ID manager that allocates blocks of 64-bit IDs to instances
  • vertex-based storage with random sharding across storage backends
  • no external ID allocation, but seems to have other keys (?)
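
Roughly how a block-based allocator like that might work (a hypothetical sketch, not JanusGraph's actual implementation): a central manager hands out disjoint ranges of IDs, and each instance then allocates locally from its range without per-ID coordination.

```python
class IdBlockManager:
    """Central allocator handing out disjoint blocks of IDs."""

    def __init__(self, block_size=10_000):
        self.block_size = block_size
        self._next_block = 0

    def acquire_block(self):
        start = self._next_block * self.block_size
        self._next_block += 1
        return range(start, start + self.block_size)

class Instance:
    """A DB instance that allocates IDs locally from its current block."""

    def __init__(self, manager):
        self._manager = manager
        self._block = iter(manager.acquire_block())

    def new_id(self):
        try:
            return next(self._block)
        except StopIteration:  # block exhausted: fetch another from the manager
            self._block = iter(self._manager.acquire_block())
            return next(self._block)

mgr = IdBlockManager(block_size=2)
a, b = Instance(mgr), Instance(mgr)
ids = {a.new_id(), a.new_id(), a.new_id(), b.new_id()}
assert len(ids) == 4  # IDs never collide across instances
```

The point of the scheme is that instances only contact the manager once per block, so ID allocation scales with writes while still guaranteeing global uniqueness.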

StarDog http://www.stardog.com/

  • RDF store
  • SPARQL query
  • clustering with master server distribution with Apache ZooKeeper
  • replicated store with 2PC coordination for distributed writes

OrientDB http://orientdb.com/orientdb/

  • noSQL + graph DB
  • SQL-like queries
  • distribution: multimaster and sharding
  • not sure how ID and keys work

DataStax Enterprise Graph https://www.datastax.com/products/datastax-enterprise-graph

  • Graph DB over Apache Cassandra
  • TinkerPop/Gremlin
  • not clear how IDs and keys work in the graph

cayley https://cayley.io/

  • unknown - couldn’t find the docs

(Michael Compton) #2

Another note on URIs in RDF, and their use in things like linked data, schema.org, etc.

URIs are often slow-moving things, created and managed by humans. The kinds of URIs from https://github.com/dgraph-io/dgraph/issues/1047, or in the linked-data web, or on schema.org, aren’t created on the fly by a machine. They are minted by humans, agreed on by consensus, and bulk uploaded into a machine. Even when we do mint a URI on the fly, the process that ensures it’s unique generally happens outside the DB, or is based on some data property of the node that’s meant to be unique.

For example, a use of RDF might be to have an existing schema with URIs that’s bulk uploaded, say about people, while data about individuals is added/modified on the fly; the people themselves don’t need URIs - they can be blank nodes. Even if for a particular application you want them to be proper URIs, the process that guarantees their uniqueness has to be external to the triple store’s node-ID handling anyway - e.g. a process that mints different URIs for two people with the same name.
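
As an illustration of that external minting process (a hypothetical sketch; the `http://example.org/person/` namespace is made up), a URI could be derived from a property that's meant to be unique, falling back to a numeric suffix when two people share a name:

```python
from urllib.parse import quote

def mint_person_uri(name, minted, base="http://example.org/person/"):
    """Mint a unique URI for a person, outside the triple store.

    `minted` tracks URIs already handed out; two people with the same
    name get distinct URIs via a numeric suffix. `base` is a made-up
    namespace for illustration.
    """
    candidate = base + quote(name)
    n = 1
    while candidate in minted:
        n += 1
        candidate = f"{base}{quote(name)}-{n}"
    minted.add(candidate)
    return candidate

minted = set()
u1 = mint_person_uri("Ada Lovelace", minted)
u2 = mint_person_uri("Ada Lovelace", minted)
assert u1 != u2  # same name, distinct URIs
```

Nothing here touches the store's ID handling - which is the point: uniqueness of application-level URIs is the application's job either way.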

So, in terms of RDF support, we don’t really miss out on anything if we don’t have XIDs.

(Manish R Jain) #3

Thanks for the analysis, @michaelcompton. That’s pretty informed.

Yeah, I think XIDs don’t fit into our architecture – we’re not really a triple store, we just chose to use the RDF format for data input. Over time, that might evolve into a more JSON-y way to import data, if that’s easier for developers.

We can bake XID support into our client; so RDFs can still be imported into Dgraph – sounds like that should be sufficient for now.
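
A hedged sketch of what baking XID support into the client could look like (hypothetical Python, not the actual Dgraph client): the client rewrites URI subjects/objects into blank-node labels before sending triples to the server, keeping the URI-to-blank-node mapping on its side, so the server never sees XIDs.

```python
def rewrite_triples(triples, xid_to_blank=None):
    """Rewrite URI terms in (s, p, o) triples into blank-node labels.

    Hypothetical sketch: the server only ever sees blank nodes, while
    the client keeps the URI -> blank-node mapping in `xid_to_blank`.
    """
    xid_to_blank = {} if xid_to_blank is None else xid_to_blank

    def to_blank(term):
        if term.startswith("<") and term.endswith(">"):  # a URI term
            uri = term[1:-1]
            if uri not in xid_to_blank:
                xid_to_blank[uri] = f"_:b{len(xid_to_blank)}"
            return xid_to_blank[uri]
        return term                                      # literal etc.

    return [(to_blank(s), p, to_blank(o)) for s, p, o in triples]

mapping = {}
out = rewrite_triples(
    [("<http://example.org/alice>", "knows", "<http://example.org/bob>"),
     ("<http://example.org/alice>", "name", '"Alice"')],
    mapping,
)
assert out[0][0] == "_:b0" and out[1][0] == "_:b0"  # same URI, same blank node
```

Imported RDF then round-trips through the client with consistent identities, while the server's internal UID allocation stays untouched.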

I know that’s what the Neo4j website says, but are they really? I think it’s more like replicas holding entire copies of the DB. Vertical scaling, not horizontal.

(Michael Compton) #4

Yes, sorry - it’s master-slave replication of a single DB.
