Performance question: "x is_type boolean" or "x <type> value"

Just wondering what the best practice is for typing the nodes in my Dgraph database. Nodes can be either ‘topics’ or ‘phrases’ (there might be more types in the future). There will be over 100 million nodes in production.

What would be the best approach? Using a single indexed predicate that takes a string value holding the node type:

_:0 <xid> "abc" .
_:0 <type> "topic" .
_:1 <xid> "xyz" .
_:1 <type> "phrase" .

Or using an indexed boolean predicate for each type:

_:0 <xid> "abc" .
_:0 <is_topic> "true" .
_:1 <xid> "xyz" .
_:1 <is_phrase> "true" .

For those who have an idea: which is the most suitable, performant, and scalable solution?

Hey @lazharichir

That’s a good question. We have been recommending the first method, i.e. defining a string type edge. The advantage is that you can use the same predicate in all your queries when checking for type, instead of checking a different predicate each time.
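To illustrate the difference in query shape, here is a minimal Python sketch that only builds query strings and never contacts a Dgraph server. The helper names `type_query` and `flag_query` are hypothetical, and the sketch assumes that `type` carries a string index (for example `hash`) and that the boolean predicates follow an `is_<type>` naming pattern.

```python
# Approach 1: one query text serves every type; only the value of the
# query variable $t changes between calls.
TYPE_QUERY = """
query nodes($t: string) {
  nodes(func: eq(type, $t)) {
    uid
    xid
  }
}
"""


def type_query(node_type):
    """Return (query, variables) for the single `type` predicate."""
    return TYPE_QUERY, {"$t": node_type}


def flag_query(node_type):
    """Return (query, variables) for the per-type boolean predicates.

    The predicate name is part of the query text here, so each type
    ends up with its own query string.
    """
    predicate = "is_" + node_type  # assumed naming pattern
    query = """
{
  nodes(func: has(%s)) {
    uid
    xid
  }
}
""" % predicate
    return query, {}
```

With the first helper the query text stays identical across types, which is the convenience described above; with the second, the text has to be rebuilt for each type.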

Both should be equally performant, as they would generate equal-sized index posting lists.
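A toy model may make the posting-list point easier to see. The sketch below is not Dgraph's storage format; it simply treats an index as a map from (predicate, value) to a list of node uids, and the four sample nodes are made up for illustration.

```python
from collections import defaultdict

# Made-up sample data: (uid, node type).
nodes = [(1, "topic"), (2, "phrase"), (3, "topic"), (4, "topic")]


def index_type_predicate(nodes):
    """Approach 1: one `type` predicate, one posting list per value."""
    index = defaultdict(list)
    for uid, node_type in nodes:
        index[("type", node_type)].append(uid)
    return dict(index)


def index_flag_predicates(nodes):
    """Approach 2: one boolean predicate per type, indexed on true."""
    index = defaultdict(list)
    for uid, node_type in nodes:
        index[("is_" + node_type, True)].append(uid)
    return dict(index)


by_type = index_type_predicate(nodes)
by_flag = index_flag_predicates(nodes)

# The same uids end up in each list; only the key differs.
print(sorted(len(uids) for uids in by_type.values()))
print(sorted(len(uids) for uids in by_flag.values()))
```

Both print statements output `[1, 3]`: in this model the list for `type = "topic"` holds exactly the uids that the list for `is_topic = true` holds.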

Wouldn’t the second method be better for distributing edges equally over multiple machines in a Dgraph cluster? With the first approach you will always have a type predicate for every node, which will probably also increase the index size and slow down lookups on that “large” type predicate index.

That’s a good point. The second approach replaces one large predicate with many smaller predicates, which scales better when using a larger cluster.
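As a rough illustration of why many smaller predicates spread better, here is a small Python sketch. It assumes data is sharded by predicate, so a single predicate is never split across groups. The greedy placement in `assign` is a stand-in chosen for the example, not Dgraph's actual balancing logic, and the 60/40 split between topics and phrases is a hypothetical figure.

```python
def assign(predicate_sizes, num_groups):
    """Place each predicate, largest first, on the least-loaded group.

    A predicate is kept whole, so its edges all land on one group.
    Returns the total edge count per group.
    """
    loads = [0] * num_groups
    ordered = sorted(predicate_sizes.items(), key=lambda kv: -kv[1])
    for _name, size in ordered:
        target = loads.index(min(loads))
        loads[target] += size
    return loads


# Hypothetical edge counts for 100 million nodes (xid edges left out).
single = assign({"type": 100_000_000}, num_groups=2)
split = assign({"is_topic": 60_000_000, "is_phrase": 40_000_000}, num_groups=2)

print(single)
print(split)
```

In this model the single `type` predicate puts all 100 million edges on one group and leaves the other empty, while the two boolean predicates can sit on different groups.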

Agreed. The second one is better. The first approach causes issues for us: if you index on the type edge, it generates huge posting lists internally (topic → all nodes of type topic, phrase → all nodes of type phrase), which slows everything down.
