Migrating away from generic "Type" edge in our app, plus modeling complex schema


(Jeff Hull) #1

Hi All,

This post may apply directly to the “stronger type system” feature in the 2018 Dgraph product roadmap.

My team has had a lot of success using Dgraph, and we have built a pretty complex web application on top of it. However, the approach we took to defining the schema will not scale based on the Dgraph team’s recommendations against a generic “type” edge per https://docs.dgraph.io/howto/#giving-nodes-a-type. I am looking for advice on how to migrate away from our current approach to one which can store millions or billions of nodes, while maintaining a rich “meta-schema” we have put in place to add context to the data. Here are the schema requirements of our app (schema in a general sense, not necessarily directly mapping to Dgraph’s schema features):

  1. A schema that is constantly changing, and is being created and modified by non-technical users
  2. The schema-level concept of “Classes”, “Properties”, and “Relations”. Each of these has a name and description.
  3. The content-level concept of “Entities”, each of which has a “Class”, “Property Values”, and “Relation Values”.
  4. Properties have a “type” such as number, string, date, boolean, image, and visualization.
  5. Classes have a set of predefined “Properties” and “Relations” their corresponding entities can have. Each such Property/Relation has a min count and a max count.
  6. For a given Class + Relation combo, the class of the “target entity” in source entity -> Relation -> target entity is constrained at the class level. For example, the class Business Requirement with the relation Depends On may have a valid target class of Key Value Driver, whereas the class Software System may also use the relation Depends On, but with a valid target class of Software Library.

Based on the Dgraph team’s advice to use “specific” predicates rather than generic ones (due to mutation aborts and sharding between machines), I think our entire model needs to be redesigned. The shortcoming seems to be:

We use way too many generic predicates. The predicates “VertexType”, “Class”, “Property”, “Relation”, “ExternalId”, “Namespace”, and others are used by ALL entities in the system. Thus as the system scales, we basically must perform all mutations in series, because any two transactions would very often conflict.
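To make the problem concrete, the generic model looks roughly like this (the predicate names are the ones listed above; the types and index choices are only illustrative):

VertexType: string @index(exact) .
Class: uid @reverse .
Property: uid @reverse .
Relation: uid @reverse .
ExternalId: string @index(exact) .
Namespace: string @index(exact) .

Every mutation in the system touches several of these predicates, so they become both a transaction-conflict hotspot and a sharding bottleneck.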

However, I am not sure how to retain our rich schema features using the current Dgraph schema features while also following the Dgraph team’s predicate suggestions we will need in order to scale.

I think in summary, we need the Dgraph schema to do for us what we have been doing ourselves at the vertex level. These would be:

  1. Attach scalar attributes to predicates themselves. Similar to facets, but at the schema level. softwareSystem is a class and must have a “name” and “description”. We would also use a “type” attribute to identify softwareSystem as a “class”
  2. We must be able to create relationships between schema items. For instance, “softwareSystem allows schema property type latestVersion, which has a min and max occurrence count (1 and 1)”. Or “softwareSystem allows relation type usesLibrary, which in turn can point to the class softwareLibrary, and has a min and max occurrence count (0 and unlimited)”.
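As a rough sketch, point 2 is the kind of thing we currently have to model ourselves as ordinary data. In RDF form it might look like this (all of the meta.* predicate names here are hypothetical):

_:softwareSystem <meta.type> "class" .
_:softwareSystem <meta.name> "softwareSystem" .
_:rule <meta.property> _:latestVersion .
_:rule <meta.minCount> "1" .
_:rule <meta.maxCount> "1" .
_:softwareSystem <meta.allowsProperty> _:rule .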

Hopefully this makes sense. We would be very interested to provide comments on the Dgraph team’s type system design and would be happy to provide more detail about what we have built.


(Pawan Rawal) #2

Hey @tamethecomplex

I am not sure if you know that you can use the IgnoreIndexConflict field to reduce the number of conflicts and hence aborts that you have. https://docs.dgraph.io/clients/#committing-the-transaction

So if you were to use this option, an abort would only happen if two concurrent transactions are modifying the same data. The same index being modified won’t cause any issues.

The idea is that wherever possible you should use a more fine-grained predicate. For example, for nodes which are of type Class, you could have Class.name and Class.description. This would help with sharding the data.
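So instead of one generic name predicate shared by every node, the schema could be split per type, something like this (the index choices here are just illustrative):

Class.name: string @index(term) .
Class.description: string @index(fulltext) .
Property.name: string @index(term) .
Property.description: string @index(fulltext) .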


These points are interesting. We were not thinking of adding meta information to the schema. I would like to understand your use-case more here.


(Wolf Dan) #3

This is actually what I’m doing in my Elixir app; a record looks something like this:

  vertex "user" do
    field(:email, :string, index: [:hash])
    field(:password, :password)
    field(:username, :string, index: [:hash, :fulltext])
    field(:main_image, :string)

    # User data
    field(:biography, :string)
    field(:gender, :enum)
    field(:birthday, :datetime)
    field(:location, :string)
    field(:languages, {:array, :enum})

    # Timestamps
    field(:created_at, :datetime)
    field(:updated_at, :datetime)
  end

When I query or mutate into Dgraph, the vertex module is translated into user.email, user.username, etc. (I don’t know if that’s possible in the language that you are using, but this is my approach in Elixir). I can tell you that it works really well, and it also gives you better control over your indexes and better schema control.
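For the example above, the generated Dgraph schema would look something like this (the index choices come straight from the vertex definition; the rest are plain scalar types, and I’ve left out the enum fields):

user.email: string @index(hash) .
user.password: password .
user.username: string @index(hash, fulltext) .
user.main_image: string .
user.biography: string .
user.birthday: dateTime .
user.location: string .
user.created_at: dateTime .
user.updated_at: dateTime .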


(Jeff Hull) #4

Hey @pawan,

A primary feature of our app is storing domain-specific knowledge. The app relies on a well-defined schema for the given domain, in order to guide the user in browsing existing content and adding new content. The intent behind our formal schema definition is similar to Web Ontology Language (OWL), but does not have as many features.

The app must know which classes are eligible to have which relations and properties, because this will drive what is presented in the UI. The user must have access to the name and description of classes, properties, and relations, in order to understand the type of information they are expected to contain.

I have been thinking about this more over the weekend, and I think we could accomplish most of what we want by maintaining “meta schema” information ourselves as data, and then mapping to dgraph predicates prior to building our queries. So we would have “class” data in the system describing a “Business Requirement” class, but under the hood, instances of that class would be retrieved using the edge “class.businessRequirement”, and we would do the same for relations and properties.
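A sketch of what that split might look like (all of the names here are hypothetical):

# meta-schema stored as ordinary data
meta.name: string @index(exact) .
meta.description: string .

# auto-generated, class-specific edge used to retrieve instances
class.businessRequirement: uid @reverse .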

However, if we do maintain this meta-schema layer, one big question I have is around search. If I want to search all names in the system, currently I would just search the “name” predicate. If we map from a meta-schema to the underlying dgraph schema before doing searches, we would need to use a string builder to search each individual “name” predicate. This would eventually turn into thousands of regexp() terms in the search query. Maybe in this case, the addition of a facet concept at the schema level would allow me to search all predicates with a given tag - “name”.
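For instance, a generated global name search would end up as one query block (or one OR-ed regexp() term) per class-specific name predicate, something like this (trigram indexes assumed on each):

{
  businessRequirements(func: regexp(businessRequirement.name, /fido/i)) {
    uid
    businessRequirement.name
  }
  softwareSystems(func: regexp(softwareSystem.name, /fido/i)) {
    uid
    softwareSystem.name
  }
  # ... repeated for every other class that has a name
}

and the list of blocks would grow with every class a user creates.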

Thanks for pointing me to this - I do remember the thread where this was brought up. This will definitely be helpful on widely-used predicates for which we don’t need to do an upsert.


(Jeff Hull) #5

@pawan, expanding further on this, I think richer schema features would allow Dgraph to satisfy use cases similar to relational databases in that there would be explicitly-defined relationships between “table” predicates and “column” predicates. This is really no different than what we have built in the application layer on top of Dgraph.

I guess the question becomes…does schema definition like this belong in the database layer or in the application layer? I personally don’t mind implementing it at the application layer, but I can see a set of built-in features to enable this rich schema definition being valuable to users.

Here is a diagram I created in an attempt to show conceptually how this richer schema might work.


(Pawan Rawal) #6

Do you usually perform regex searches across different types (I am assuming you have the concept of type of nodes)? I understand that constructing the query would become difficult if you have person.name and animal.name and we could possibly add a mechanism in the schema to make it easier.

We are in the process of supporting the GraphQL spec. GraphQL schemas can have types, and fields can have descriptions. Not sure if that would help?

Though a problem I see with the GraphQL schema is that all queries and mutations must be predefined (expected input types and output types). So it’s a lot of work.


(Jeff Hull) #7

Yes, exactly, we allow a global search on names. This is necessary for users that might not remember exactly what they are looking for but can recall a keyword.

I just did a little reading on the GraphQL schema language, but I would need to read more to grok how it might best suit use in Dgraph. Honestly I think I can boil down our current Dgraph schema needs to one very helpful feature: facets on schema items. So instead of writing this to search names:

regexp(animal.name, /fido/i) OR regexp(person.name, /fido/i)

I would write something like this:

schemaFacet(regexp(searchTag: name, /fido/i))

And my corresponding schema definition might look like this:

person.name: string @index(trigram)  @facets(searchTag=name, ...) .
animal.name: string @index(exact, fulltext)  @facets(searchTag=name, ...) .

This is my impression as well. It reminds me a bit of working with XML, where these really elaborate schemas are actually more burdensome than helpful.

One thing I like about Dgraph is that to me, you took a very practical approach to the query language. It reminds me of this thread on the cayley forum which is where I originally learned about Dgraph and ultimately led me to adopt it (https://discourse.cayley.io/t/modified-graphql-aimed-at-graph-dbs/485). From a developer productivity standpoint, I would personally be much more inclined to define my schema once (in the dgraph schema definition), and use some tool to auto-generate a GraphQL schema for me that others can use to grok my data model. Similar to a tool I already use to auto-generate Swagger API documentation: https://www.npmjs.com/package/typescript-rest-swagger… Writing swagger definition files seems like a suboptimal use of time when the computer can generate the definitions for me based on code I’ve already written.

Anyway, to summarize…from my perspective, the addition of a simple facet feature at the schema level would enable intelligent grouping of schema items for queries while also following the Dgraph team’s suggestion of avoiding a small number of widely-used predicates.


(Patrick Mualaba) #8

Hello, a very interesting and very important discussion :slight_smile:

Another approach could be to add a few useful, but currently missing, functions on edge keys (instead of edge values) to GraphQL+-.

The default “edge value” functions work like this:

me(func: eq(name@en, "Steven Spielberg")) @filter(has(director.film)) {
  name@en
  director.film
  initial_release_date
}

Why not add the following “edge key” functions:

me(func: eq(startsWith(person. as p), "Steven Spielberg")) @filter(has(startsWith(director. as d))) {
  p
  d
  initial_release_date
}


me(func: eq(endsWith(.name@en as n@en), "Steven Spielberg")) @filter(has(endsWith(.film as f))) {
  f
  name@en : n@en
  initial_release_date
}

me(func: eq(contains(name as n), "Steven Spielberg")) @filter(has(contains(director as d))) { 
  d
  name : n
  initial_release_date
}

me(func: eq(regexp(/^person/ as p), "Steven Spielberg")) @filter(has(contains(director as d))) { 
  d
  p
  initial_release_date
}

I think such functions would be very useful, and would probably be more performant than going the facets route?


(Pawan Rawal) #9

I understand the motivation behind this — that is performing OR queries across different predicates by grouping the predicates. Though if you already know which predicates to add the tag to in the schema, then the query should also be constant?


(Pawan Rawal) #10

If you think these will be helpful please add a Github issue and we will get to it.


(Patrick Mualaba) #11

I created a github issue:


(Patrick Mualaba) #12

Indeed, edge-key functions allow performing OR queries across namespaced predicates even when no facets are attached. But I don’t know which query would be less expensive: the OR query on namespaced predicates without facets “at runtime”, or the one with facets added “at schema design time”?


(Jeff Hull) #13

The “OR” queries are definitely one big benefit to attaching facets to predicates, but there are more. For example, I could also use facets as a shorthand to expand a subset of predicates in my results block. So let’s say I have a set of predicates related to “alerts”, such as “alertSent”, “alertMessage”, “alertSentTime”, and “alertCreatedTime”. Instead of writing:

{
  alertSent
  alertMessage
  alertSentTime
  alertCreatedTime
}

I could have attached the facet “category: alertPredicates” to each of these schema items. Then I can expand them in the result block like this:

{
  expand(facet(category: alertPredicates))
}

Another use case is if I wanted to return an edge as well as a name and description of that edge (to provide a detailed explanation of the edge contents in the query results themselves). If I’m retrieving a “business requirement”, I could write something like:

{
  humanReadablePredicateName: facetValue(businessRequirement, name)
  humanReadablePredicateDescription: facetValue(businessRequirement, description)
  businessRequirement
}

Actually, for us, these predicates will always need to be resolved at runtime. This is because new predicates will be created by users - eventually there will be thousands of predicates. The schema definitions will be auto-generated based on the users’ selections (such as whether the predicate is a string or a number), and run against the Dgraph database to create that predicate in the Dgraph schema. So corresponding queries on the “name” category will include a set of underlying predicates that exist only as application metadata, not anywhere in our code itself.
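For example, when a user defines a new string property “latestVersion” on the class softwareSystem, the application would generate a one-line schema alter like the following and run it against Dgraph (the predicate name and index choice here are illustrative, generated from the user’s selections):

softwareSystem.latestVersion: string @index(term) .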


(Jeff Hull) #14

Hi @pmualaba, this is an interesting take I have not thought about before. I like how you have aliased the predicates using “as” so that the underlying predicates will be expanded in the query block. Using the “key” approach, key being the predicate name itself, is also attractive because nothing new needs to be added to the Dgraph schema definition.

My main question though is what capabilities might facets provide that the “predicate name as key” approach wouldn’t? For one - you could attach many facets to a given predicate. In addition to “searchTag” for performing OR queries, you could attach meta-schema information like “human readable predicate name” and “detailed predicate description” to predicates which can be returned to a user to provide context to the data that predicate contains.

I’m not sure what approach the Dgraph team would take to implement this feature, but I was thinking what they might do is “pre-process” the query to expand out all the terms before actually running the query. So for the “OR” query example, whether using your “predicate name as key” approach or the “facet key: value” approach, either should expand to the same underlying query before being run against the database. In this case I would expect the performance of each query to be equivalent?


(Patrick Mualaba) #15

What you achieved with facets is indeed already remarkable! As far as I can see, the main difference between the two approaches (facets vs edge keys) would be that facets need more configuration upfront in order to anticipate the queries you will need in your user stories. The edge-keys approach allows for ad hoc queries without the need to change the underlying schema. But facets indeed have great value for many other use cases, as you have already pointed out.


(Pawan Rawal) #16

Sounds useful. Can you please create a Github issue (if not done already) and link it to this thread?


(Jeff Hull) #17

Thanks Pawan. I created github issue https://github.com/dgraph-io/dgraph/issues/2009. I also linked to the issue created by @pmualaba. I went ahead and created a separate issue because I didn’t want to conflate the two similarly-motivated but different feature requests.


(system) #18

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.