Question about schema design - string literal object vs node object

Elli_Schwarz · May 15, 2023, 5:40pm

I’m wondering how dgraph stores string literal values internally. Let’s say I have a Person type, of which I’ll have thousands of nodes. The Person type has a predicate “jobTitle”, and the object is a string literal value, essentially an enum of which there is one of 10 possible values. Does dgraph store each string separately, or do all instances of Person nodes point to the same string value? In other words, if I have a hundred Persons with the jobTitle “software engineer”, and another 100 with the jobTitle “systems administrator”, does dgraph store each of those strings 100 times or 1 time?

Would I see any space savings by having the predicate jobTitle point instead to a UID object, instead of a string, which represents one of the 10 values? In other words, if dgraph is actually storing the string literal “software engineer” 100 times in my above example where jobTitle is a string literal value, would I see a savings if jobTitle was instead an object UID with predicate “value” which is the string “software engineer” stored only one time?
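A minimal sketch of the two modelling options being compared, written as a DQL schema. The predicate names `Person.name`, `Person.jobTitle` and `JobTitle.value` are illustrative assumptions, not names from an actual schema.

```
# Option A (string literal on every Person node), shown commented out:
# <Person.jobTitle>: string @index(hash) .

# Option B: each job title is its own node and every Person holds a uid edge to it.
# @reverse allows walking from a job title back to the Persons that reference it.
<Person.name>: string @index(exact) .
<Person.jobTitle>: uid @reverse .
<JobTitle.value>: string @index(hash) @upsert .

type <Person> {
   Person.name
   Person.jobTitle
}

type <JobTitle> {
   JobTitle.value
}
```

Under option B, ten job titles would mean ten `JobTitle` nodes, regardless of how many Persons point at them.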

Thank you!

MichelDiz · May 15, 2023, 6:13pm

Hey Schwarz,

You’ve got some insightful questions, and I appreciate you reaching out! You’re correct in your understanding that Dgraph currently stores all entities as separate nodes, and each of these nodes contains its own values. In the case of your “Person” type and “jobTitle” predicate, Dgraph does indeed store each of these strings separately (I mean, on the node itself). We use indexing to quickly locate and access these values when needed. So yes, if you have a hundred “Persons” with the “jobTitle” of “software engineer”, Dgraph stores that string a hundred times, once on each node.

That said, we’re currently working on a new Type system that will change this logic. It’s a long-term project.

As for your second question, yes, using a UID object instead of a string for the “jobTitle” could indeed lead to some improvements, not only in storage but also in other areas such as indexing and query processing. The ability to point to a UID object could allow for more efficient queries and potentially a more flexible schema. This is something we’re looking to implement soon as part of the mentioned improvement in the Type System.

I hope this helps! If you have any other questions, don’t hesitate to ask!

Cheers.

MichelDiz · May 15, 2023, 6:44pm

By the way, nothing stops you from implementing a similar approach yourself. You can create a structure of nodes (perhaps a tree) and use this structure as a base for your queries. This can effectively achieve the desired effect. However, you need to align your queries in accordance with this structure. It’s quite straightforward using recursive queries.
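For a single level of indirection, a plain nested traversal may be enough. The sketch below assumes a hypothetical schema in which `Person.jobTitle` is a `uid @reverse` edge pointing at a node that carries an indexed string predicate `JobTitle.value`; none of these names come from a real schema.

```
{
   # Start at the single shared job-title node (hypothetical predicate names).
   engineers(func: eq(JobTitle.value, "software engineer")) {
      JobTitle.value
      # Follow the reverse edge to every Person pointing at this job title.
      ~Person.jobTitle {
         uid
         Person.name
      }
   }
}
```

The root `eq` needs an index on `JobTitle.value`, and the `~Person.jobTitle` traversal needs `@reverse` on the edge.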

Elli_Schwarz · May 16, 2023, 4:09pm

Michel, thank you very much for your prompt response. I’m wondering if you think query performance would be better if we were to directly query on the node UID vs. on the indexed string “software engineer”.

This gets to a related question. My understanding is that in DQL, you can’t specify your own unique node UIDs (unless you ingest the entire graph at once, in which case you could use the same “blank” nodes). In the above case, assuming there is a performance benefit to querying on the UID, if the UIDs were deterministic, for example, as in a pure RDF database, the subject or object would be a URL, so we’d query for the node “http://foo.org/jobTitle/softwareEngineer” instead of a string lookup on “software engineer”. This is especially the case since we basically have an enum of jobTitles, so it would be much more efficient for us to create our own UIDs for each of them, than to have to do a lookup each time we want to add an edge pointing to a jobTitle of “software engineer” or another job title. Even if we cache the UID values in our code, it would be much easier to use custom UIDs if possible, so in that case we could hard-code UIDs and even be consistent across multiple dgraph clusters (such as different test and production clusters). Is there a way to use a custom UID in DQL?

Thank you!!

MichelDiz · May 16, 2023, 4:25pm

Yes, in part. The discovery would be fast, but I think filters would be slow unless the values are indexed.

But you can reserve a range of UIDs and use them manually.
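A sketch of how that could look: lease a block of UIDs from Dgraph Zero first (for example with `curl "localhost:6080/assign?what=uids&num=10"`, assuming Zero's HTTP port is 6080), then write UIDs from the returned range explicitly in a mutation. The concrete UID values and the `JobTitle.value` predicate below are hypothetical.

```
{
   set {
      # Assumption: 0x2711 and 0x2712 fall inside a range already leased from Zero.
      <0x2711> <JobTitle.value> "software engineer" .
      <0x2711> <dgraph.type> "JobTitle" .
      <0x2712> <JobTitle.value> "systems administrator" .
      <0x2712> <dgraph.type> "JobTitle" .
   }
}
```

Such a node can then be addressed directly with `func: uid(0x2711)`. Other clusters would only share these values if the same range is leased and written there too.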

You have a misconception here. Blank nodes always create new nodes. A blank node is an identifier that lives only in the context of a transaction. Once the transaction is committed, the blank node is gone. You cannot use it as a source of truth.

No, UIDs are a central part of Dgraph’s design. There is a whole chain of logical dependencies that makes simulating the behavior of a triplestore impossible. What you can do is use external IDs, as in this example: External IDs and Upsert Block - DQL
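A minimal sketch of the external-ID pattern, assuming a hypothetical predicate `xid` declared as `string @index(exact) @upsert` and a hypothetical `JobTitle.value` predicate. The URL is the one from the question above.

```
upsert {
   query {
      # Look up the node by its external ID (hypothetical predicate `xid`).
      q(func: eq(xid, "http://foo.org/jobTitle/softwareEngineer")) {
         jt as uid
      }
   }
   mutation {
      set {
         uid(jt) <xid> "http://foo.org/jobTitle/softwareEngineer" .
         uid(jt) <JobTitle.value> "software engineer" .
      }
   }
}
```

If the query finds nothing, `uid(jt)` creates a new node; otherwise the existing node is reused. The Dgraph-assigned UID may differ between test and production clusters, but the `xid` value stays the same, so client code can key on it.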

But technically it’s impossible to do what you’re talking about. We would need to start the entire database from scratch and redesign it.

Elli_Schwarz · May 18, 2023, 4:53pm

Michel,
Thank you very much for your explanation. I have a follow-up question about how I would query such a data structure in DQL with a filter in a recurse query.

Explanation: We want to perform a recursive search to display a network graph. We currently perform a recursive query to do this, but we want to filter out certain paths. In this example we have three companies associated with projects, and projects associated with contracts. There are (for this example) hundreds of millions of projects but a small list of contracts, so based on your feedback above, the contracts are now pulled out as separate nodes and referred to indirectly by the projects. The same contract string is therefore no longer stored hundreds of times; instead, each project holds a pointer to the one node that carries that contract string. When we do this, the filtering we are used to performing does not work. We are providing a simplified example of our problem with a schema, upsert, and query to illustrate.

#--- Schema

# Types
type <Company> {
   company.name
   company.hasProject
}

type <Project> {
   project.name
   project.hasContract
}

type <Contract> {
   contract.name
}

# Predicates
<company.name>: string @index(hash) .
<contract.name>: string @index(hash) .
<project.name>: string @index(hash) .
<company.hasProject>: [uid] @reverse .
<project.hasContract>: uid .

Here’s our upsert:

#--- Upsert
upsert {
   query {
      # Look up each node by name; an empty result makes uid(var) create a new node.
      getCompanyAlpha(func: eq(company.name, "Alpha")) {
         compA as uid
      }
      getCompanyBravo(func: eq(company.name, "Bravo")) {
         compB as uid
      }
      getCompanyCharlie(func: eq(company.name, "Charlie")) {
         compC as uid
      }
      getProj1(func: eq(project.name, "Proj1")) {
         proj1 as uid
      }
      getProj2(func: eq(project.name, "Proj2")) {
         proj2 as uid
      }
      getProj3(func: eq(project.name, "Proj3")) {
         proj3 as uid
      }
      getContract1(func: eq(contract.name, "TheContract1")) {
         contract1 as uid
      }
      getContract2(func: eq(contract.name, "TheContract2")) {
         contract2 as uid
      }
   }
   mutation {
      set {
         uid(compA) <company.name> "Alpha" .
         uid(compB) <company.name> "Bravo" .
         uid(compC) <company.name> "Charlie" .
         uid(proj1) <project.name> "Proj1" .
         uid(proj2) <project.name> "Proj2" .
         uid(proj3) <project.name> "Proj3" .
         uid(contract1) <contract.name> "TheContract1" .
         uid(contract2) <contract.name> "TheContract2" .
         uid(compA) <company.hasProject> uid(proj1) .
         uid(compA) <company.hasProject> uid(proj2) .
         uid(compA) <company.hasProject> uid(proj3) .
         uid(compB) <company.hasProject> uid(proj1) .
         uid(compC) <company.hasProject> uid(proj3) .
         uid(proj1) <project.hasContract> uid(contract1) .
         uid(proj2) <project.hasContract> uid(contract1) .
         uid(proj3) <project.hasContract> uid(contract2) .
      }
   }
}

And here is the query that we are used to, but it doesn’t work here because of the extra level of indirection. If project.hasContract were a predicate between a project node and a contract string literal value, we’d know how to query it with the filter below. But now we have an extra level where we have a contract node with a string literal value, and I’m not sure how to filter on that.

So, in other words, I realize that the query below won’t work because hasProject doesn’t point to a node with a contract.name predicate and a string literal value… but I don’t understand how I can build a filter that can accomplish this now that we’re one predicate removed from the actual string.

#--- Query
{
   find(func: eq(company.name, "Alpha")) @recurse {
      company.name
      company.hasProject @filter(eq(contract.name, "TheContract2"))
      ~company.hasProject
      project.name
      project.hasContract
      contract.name
   }
}

Thank you very much for all of your help!

Elli_Schwarz · May 18, 2023, 7:14pm

I think we might have figured it out. The key is the uid_in function, which we have only now discovered. I’m trying to understand whether we also need that filter on the reverse edge; if we add it there, the query doesn’t return what we expect.

{
   var(func: eq(contract.name, "TheContract2")) {
      myId as uid
   }

   find(func: eq(company.name, "Alpha")) @recurse {
       company.name
       company.hasProject @filter(uid_in(project.hasContract, uid(myId)))
       ~company.hasProject
       project.name
       project.hasContract
       contract.name
   }
}

Is this the approach you’d recommend for this case?
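For comparison, here is a sketch of the same lookup written without @recurse, so that each filter is attached to one explicit level. It uses the schema and sample data from the earlier post; the block name `findExplicit` and the variable name `c` are arbitrary.

```
{
   var(func: eq(contract.name, "TheContract2")) {
      c as uid
   }

   findExplicit(func: eq(company.name, "Alpha")) {
      company.name
      # Keep only the projects whose contract edge points at the selected contract.
      company.hasProject @filter(uid_in(project.hasContract, uid(c))) {
         project.name
         project.hasContract {
            contract.name
         }
         # Reverse edge, left unfiltered: the other companies sharing this project.
         ~company.hasProject {
            company.name
         }
      }
   }
}
```

Written this way, the uid_in filter is a test on Project nodes. The nodes reached through ~company.hasProject are Company nodes, which have no project.hasContract edge, so repeating the same filter on the reverse edge would likely remove all of them.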

Thank you!
