Normalizing Dgraph's SPO (triplet) model

Continuing the discussion from The Good, The Bad, The Ugly - State of Dgraph:

And maybe having more normalized triples might help with a query planner. And let me try to explain this again…

S-P-O is how types and every kind of values works.

  _:foo1 <dgraph.type> "Foo" .
  _:foo1 <dgraph.type> "Base" .
  _:bar1 <dgraph.type> "Bar" .
  _:bar1 <dgraph.type> "Base" .
  _:foo1 <Base.name> "Baz" .
  _:bar1 <Base.name> "Baz" .
  _:foo1 <Foo.bars> _:bar1 .
  _:bar1 <Bar.foos> _:foo1 .

Ok, this is basically data at the rawest level besides looking at the badger posting lists for:

# GraphQL Schema
interface Base {
  id: ID!
  name: String @search
}
type Foo implements Base {
  bars: [Bar] @hasInverse(field: "foos")
}
type Bar implements Base {
  foos: [Foo] @hasInverse(field: "bars")
}
# DQL Generated Schema (not including dgraph.*)
type Base {
  Base.name
}
type Foo {
  Foo.bars
}
type Bar {
  Bar.foos
}
Base.name: string @index(term) .
Foo.bars: [uid] .
Bar.foos: [uid] .
# GraphQL Mutation
mutation {
  addFoo(
    input: [{
      name: "Baz"
      bars: [{
        name: "Baz"
      }]
    }]
  ) { numUids }
}

But you see here how there is now multiple values stored for the same type string:

  _:foo1 <dgraph.type> "Base" .
  _:bar1 <dgraph.type> "Base" .

If you needed to update this type name to a new name, you would have to update for this very small dataset two values.

And in the same way multiple of the same “Baz” string stored:

  _:foo1 <Base.name> "Baz" .
  _:bar1 <Base.name> "Baz" .

If you needed to know all of the nodes having this same value, you would have to find this value twice. Or if you wanted to update this value for both nodes at once, you would have to update two distinct values.

What if (learning from typeDB) it would convert to an even lower level of triples (making up some new predicate names here):

  _:foo <schema.types> "Foo" .
  _:bar <schema.types> "Bar" .
  _:base <schema.types> "Base" .
  _:foo1 <dgraph.type>  _:foo .
  _:foo <dgraph.nodes>  _:foo1 .
  _:foo1 <dgraph.type> _:base .
  _:base <dgraph.nodes> _:foo1 .
  _:bar1 <dgraph.type> _:bar .
  _:bar <dgraph.nodes> _:bar1 .
  _:bar1 <dgraph.type> _:base .
  _:base <dgraph.nodes> _:bar1 .
  _:baz <dgraph.values> "Baz" .
  _:foo1 <Base.name> _:baz .
  _:baz <dgraph.valueUsedBy> _:foo1 .
  _:bar1 <Base.name> _:baz .
  _:baz <dgraph.valueUsedBy> _:bar1 .
  _:foo1 <Foo.bars> _:bar1 .
  _:bar1 <Bar.foos> _:foo1 .

This would be equivalent to a 2NF for triplet store now. There is only one type of the string “Base” and changing a type name would simply be changing a single value in the database. Values are only stored once so that they can be quickly searched backwards traversing to every node that uses that value, etc.

This could be taken to my theory of a 3NF of a triplet store by taking the name out of the predicate value and normalizing that to a singular instance, so renaming an edge/predicate could be done with one string value change. But this would require facets to be first class citizens and facets to support edges not just values at this lower level for the 3+NF in my theory.

some people are building this kind of logic right now in a higher layer on top of Dgraph supporting a low-code database with a strict GraphQL schema. So this then would beg the question if that would be needed if the lower level could be made better like this.

EDIT: Another requirement of being able to do this would be being able to split predicates. Whatever singularity predicate stored the values might grow very large. But maybe not as large if values are often repeated. For a simple use case, a person’s name dictionary of every person in a country would be equal to the population of that country, but taking into consideration singularity and then being able to normalize it into firstName, middleName, lastName, etc. you would be able to reduce this number greatly to only unique names, and if each of these were stored uniquely in a single predicate that would shrink the data size again as there are commonalities between first (given), middle, and last (sir) names.

this might help fix the problem of type overflow/splits that was inherited by 21.12

1 Like