Migrating away from generic "Type" edge in our app, plus modeling complex schema


(Jeff Hull) #1

Hi All,

This post may apply directly to the “stronger type system” feature in the 2018 Dgraph product roadmap.

My team has had a lot of success using Dgraph, and we have built a pretty complex web application on top of it. However, the approach we took to defining the schema will not scale based on the Dgraph team’s recommendations against a generic “type” edge per https://docs.dgraph.io/howto/#giving-nodes-a-type. I am looking for advice on how to migrate away from our current approach to one which can store millions or billions of nodes, while maintaining a rich “meta-schema” we have put in place to add context to the data. Here are the schema requirements of our app (schema in a general sense, not necessarily directly mapping to Dgraph’s schema features):

  1. A schema that is constantly changing, and is being created and modified by non-technical users
  2. The schema-level concept of “Classes”, “Properties”, and “Relations”. Each of these has a name and description.
  3. The content-level concept of “Entities”, each of which has a “Class”, “Property Values”, and “Relation Values”.
  4. Properties have a “type” such as number, string, date, boolean, image, and visualization.
  5. Classes have a set of predefined “Properties” and “Relations” their corresponding entities can have. Each such Property/Relation has a min count and a max count.
  6. For a given Class + Relation combo, the class of the “target entity” in source entity -> Relation -> target entity is constrained at the class level. For example, the class Business Requirement with the relation Depends On may have a valid target class of Key Value Driver, whereas the class Software System may also use the relation Depends On, but with a valid target class of Software Library.

Based on the Dgraph team’s advice to use “specific” predicates rather than generic ones (due to mutation aborts and sharding between machines), I think our entire model needs to be redesigned. The shortcoming seems to be:

We use way too many generic predicates. The predicates “VertexType”, “Class”, “Property”, “Relation”, “ExternalId”, “Namespace”, and others are used by ALL entities in the system. Thus as the system scales, we basically must perform all mutations in series, because any two transactions would very often conflict.
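To make the problem concrete, the generic model looks roughly like this (the predicate names are the ones listed above; the types and index choices are only illustrative):

VertexType: string @index(exact) .
Class: uid @reverse .
Property: uid @reverse .
Relation: uid @reverse .
ExternalId: string @index(exact) .
Namespace: string @index(exact) .

Every mutation in the system touches several of these predicates, so they become both a transaction-conflict hotspot and a sharding bottleneck.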

However, I am not sure how to retain our rich schema features using the current Dgraph schema features while also following the Dgraph team’s predicate suggestions we will need in order to scale.

I think in summary, we need the Dgraph schema to do for us what we have been doing ourselves at the vertex level. These would be:

  1. Attach scalar attributes to predicates themselves. Similar to facets, but at the schema level. softwareSystem is a class and must have a “name” and “description”. We would also use a “type” attribute to identify softwareSystem as a “class”
  2. We must be able to create relationships between schema items. For instance, “softwareSystem allows schema property type latestVersion, which has a min and max occurrence count (1 and 1)”. Or “softwareSystem allows relation type usesLibrary, which in turn can point to the class softwareLibrary, and has a min and max occurrence count (0 and unlimited)”.
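As a rough sketch, point 2 is the kind of thing we currently have to model ourselves as ordinary data. In RDF form it might look like this (all of the meta.* predicate names here are hypothetical):

_:softwareSystem <meta.type> "class" .
_:softwareSystem <meta.name> "softwareSystem" .
_:rule <meta.property> _:latestVersion .
_:rule <meta.minCount> "1" .
_:rule <meta.maxCount> "1" .
_:softwareSystem <meta.allowsProperty> _:rule .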

Hopefully this makes sense. We would be very interested to provide comments on the Dgraph team’s type system design and would be happy to provide more detail about what we have built.


(Pawan Rawal) #2

Hey @tamethecomplex

I am not sure if you know that you can use the IgnoreIndexConflict field to reduce the number of conflicts and hence aborts that you have. https://docs.dgraph.io/clients/#committing-the-transaction

So if you were to use this option, an abort would only happen if two concurrent transactions are modifying the same data. The same index being modified won’t cause any issues.

The idea is that wherever possible you should use a more fine-grained predicate. For example, for nodes which are of type Class, you could have Class.name and Class.description. This would help with sharding the data.
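So instead of one generic name predicate shared by every node, the schema could be split per type, something like this (the index choices here are just illustrative):

Class.name: string @index(term) .
Class.description: string @index(fulltext) .
Property.name: string @index(term) .
Property.description: string @index(fulltext) .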


These points are interesting. We were not thinking of adding meta information to the schema. I would like to understand your use-case more here.


(Wolf Dan) #3

This is actually what I’m doing in my Elixir app; a record looks something like this:

  vertex "user" do
    field(:email, :string, index: [:hash])
    field(:password, :password)
    field(:username, :string, index: [:hash, :fulltext])
    field(:main_image, :string)

    # User data
    field(:biography, :string)
    field(:gender, :enum)
    field(:birthday, :datetime)
    field(:location, :string)
    field(:languages, {:array, :enum})

    # Timestamps
    field(:created_at, :datetime)
    field(:updated_at, :datetime)
  end

When I query or mutate into Dgraph, the vertex module is translated into user.email, user.username, etc. (I don’t know if that’s possible in the language that you are using, but this is my approach in Elixir). I can tell you that it works really well, and it also gives you better control over your indexes and better schema control.
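For the example above, the generated Dgraph schema would look something like this (the index choices come straight from the vertex definition; the rest are plain scalar types, and I’ve left out the enum fields):

user.email: string @index(hash) .
user.password: password .
user.username: string @index(hash, fulltext) .
user.main_image: string .
user.biography: string .
user.birthday: dateTime .
user.location: string .
user.created_at: dateTime .
user.updated_at: dateTime .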


(Jeff Hull) #4

Hey @pawan,

A primary feature of our app is storing domain-specific knowledge. The app relies on a well-defined schema for the given domain, in order to guide the user in browsing existing content and adding new content. The intent behind our formal schema definition is similar to Web Ontology Language (OWL), but does not have as many features.

The app must know which classes are eligible to have which relations and properties, because this will drive what is presented in the UI. The user must have access to the name and description of classes, properties, and relations, in order to understand the type of information they are expected to contain.

I have been thinking about this more over the weekend, and I think we could accomplish most of what we want by maintaining “meta schema” information ourselves as data, and then mapping to dgraph predicates prior to building our queries. So we would have “class” data in the system describing a “Business Requirement” class, but under the hood, instances of that class would be retrieved using the edge “class.businessRequirement”, and we would do the same for relations and properties.
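A sketch of what that split might look like (all of the names here are hypothetical):

# meta-schema stored as ordinary data
meta.name: string @index(exact) .
meta.description: string .

# auto-generated, class-specific edge used to retrieve instances
class.businessRequirement: uid @reverse .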

However, if we do maintain this meta-schema layer, one big question I have is around search. If I want to search all names in the system, currently I would just search the “name” predicate. If we map from a meta-schema to the underlying dgraph schema before doing searches, we would need to use a string builder to search each individual “name” predicate. This would eventually turn into thousands of regexp() terms in the search query. Maybe in this case, the addition of a facet concept at the schema level would allow me to search all predicates with a given tag - “name”.
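For instance, a generated global name search would end up as one query block (or one OR-ed regexp() term) per class-specific name predicate, something like this (trigram indexes assumed on each):

{
  businessRequirements(func: regexp(businessRequirement.name, /fido/i)) {
    uid
    businessRequirement.name
  }
  softwareSystems(func: regexp(softwareSystem.name, /fido/i)) {
    uid
    softwareSystem.name
  }
  # ... repeated for every other class that has a name
}

and the list of blocks would grow with every class a user creates.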

Thanks for pointing me to this - I do remember the thread where this was brought up. This will definitely be helpful on widely-used predicates for which we don’t need to do an upsert.


(Jeff Hull) #5

@pawan, expanding further on this, I think richer schema features would allow Dgraph to satisfy use cases similar to relational databases in that there would be explicitly-defined relationships between “table” predicates and “column” predicates. This is really no different than what we have built in the application layer on top of Dgraph.

I guess the question becomes…does schema definition like this belong in the database layer or in the application layer? I personally don’t mind implementing it at the application layer, but I can see a set of built-in features to enable this rich schema definition being valuable to users.

Here is a diagram I created in an attempt to show conceptually how this richer schema might work.


(Pawan Rawal) #6

Do you usually perform regex searches across different types (I am assuming you have the concept of type of nodes)? I understand that constructing the query would become difficult if you have person.name and animal.name and we could possibly add a mechanism in the schema to make it easier.

We are in the process of supporting the GraphQL spec. GraphQL schemas can have types, and fields can have descriptions. Not sure if that would help?

Though a problem I see with the GraphQL schema is that all queries and mutations must be predefined (expected input types and output types). So it’s a lot of work.


(Jeff Hull) #7

Yes, exactly, we allow a global search on names. This is necessary for users that might not remember exactly what they are looking for but can recall a keyword.

I just did a little reading on the GraphQL schema language, but I would need to read more to grok how it might best suit use in Dgraph. Honestly I think I can boil down our current Dgraph schema needs to one very helpful feature: facets on schema items. So instead of writing this to search names:

regexp(animal.name, /fido/i) OR regexp(person.name, /fido/i)

I would write something like this:

schemaFacet(regexp(searchTag: name, /fido/i))

And my corresponding schema definition might look like this:

person.name: string @index(trigram)  @facets(searchTag=name, ...) .
animal.name: string @index(exact, fulltext)  @facets(searchTag=name, ...) .

This is my impression as well. It reminds me a bit of working with XML, where these really elaborate schemas are actually more burdensome than helpful.

One thing I like about Dgraph is that to me, you took a very practical approach to the query language. It reminds me of this thread on the cayley forum which is where I originally learned about Dgraph and ultimately led me to adopt it (https://discourse.cayley.io/t/modified-graphql-aimed-at-graph-dbs/485). From a developer productivity standpoint, I would personally be much more inclined to define my schema once (in the dgraph schema definition), and use some tool to auto-generate a GraphQL schema for me that others can use to grok my data model. Similar to a tool I already use to auto-generate Swagger API documentation: https://www.npmjs.com/package/typescript-rest-swagger… Writing swagger definition files seems like a suboptimal use of time when the computer can generate the definitions for me based on code I’ve already written.

Anyway, to summarize…from my perspective, the addition of a simple facet feature at the schema level would enable intelligent grouping of schema items for queries while also following the Dgraph team’s suggestion of avoiding a small number of widely-used predicates.


(Patrick Mualaba) #8

Hello, a very interesting and very important discussion :slight_smile:

Another approach could be to add a few useful, but currently missing, functions on edge keys (instead of edge values) to GraphQL+-.

The default “edge value” functions work like this:

me(func: eq(name@en, "Steven Spielberg")) @filter(has(director.film)) {
  name@en
  director.film
  initial_release_date
}

Why not add the following “edge key” functions:

me(func: eq(startsWith(person. as p), "Steven Spielberg")) @filter(has(startsWith(director. as d))) {
  p
  d
  initial_release_date
}


me(func: eq(endsWith(.name@en as n@en), "Steven Spielberg")) @filter(has(endsWith(.film as f))) {
  f
  name@en : n@en
  initial_release_date
}

me(func: eq(contains(name as n), "Steven Spielberg")) @filter(has(contains(director as d))) { 
  d
  name : n
  initial_release_date
}

me(func: eq(regexp(/^person/ as p), "Steven Spielberg")) @filter(has(contains(director as d))) { 
  d
  p
  initial_release_date
}

I think such functions would be very useful, and would probably be more performant than going the facets route?


(Pawan Rawal) #9

I understand the motivation behind this — that is performing OR queries across different predicates by grouping the predicates. Though if you already know which predicates to add the tag to in the schema, then the query should also be constant?


(Pawan Rawal) #10

If you think these will be helpful please add a Github issue and we will get to it.


(Patrick Mualaba) #11

I created a github issue:


(Patrick Mualaba) #12

Indeed, edge-key functions allow performing OR queries across namespaced predicates even when no facets are attached. But I don’t know which query would be less expensive: the OR query on namespaced predicates without facets “at runtime”, or the one with facets added “at schema design time”?


(Jeff Hull) #13

The “OR” queries are definitely one big benefit to attaching facets to predicates, but there are more. For example, I could also use facets as a shorthand to expand a subset of predicates in my results block. So let’s say I have a set of predicates related to “alerts”, such as “alertSent”, “alertMessage”, “alertSentTime”, and “alertCreatedTime”. Instead of writing:

{
  alertSent
  alertMessage
  alertSentTime
  alertCreatedTime
}

I could have attached the facet “category: alertPredicates” to each of these schema items. Then I can expand them in the result block like this:

{
  expand(facet(category: alertPredicates))
}

Another use case is if I wanted to return an edge as well as a name and description of that edge (to provide a detailed explanation of the edge contents in the query results themselves). If I’m retrieving a “business requirement”, I could write something like:

{
  humanReadablePredicateName: facetValue(businessRequirement, name)
  humanReadablePredicateDescription: facetValue(businessRequirement, description)
  businessRequirement
}

Actually, for us, these predicates will always need to be resolved at runtime. This is because new predicates will be created by users - eventually there will be thousands of predicates. The schema definitions will be auto-generated based on the users’ selections (such as whether the predicate is a string or a number), and run against the Dgraph database to create that predicate in the Dgraph schema. So corresponding queries on the “name” category will include a set of underlying predicates that exist only as application metadata, not anywhere in our code itself.
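For example, when a user defines a new string property “latestVersion” on the class softwareSystem, the application would generate a one-line schema alter like the following and run it against Dgraph (the predicate name and index choice here are illustrative, generated from the user’s selections):

softwareSystem.latestVersion: string @index(term) .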


(Jeff Hull) #14

Hi @pmualaba, this is an interesting take I have not thought about before. I like how you have aliased the predicates using “as” so that the underlying predicates will be expanded in the query block. Using the “key” approach, key being the predicate name itself, is also attractive because nothing new needs to be added to the Dgraph schema definition.

My main question though is what capabilities might facets provide that the “predicate name as key” approach wouldn’t? For one - you could attach many facets to a given predicate. In addition to “searchTag” for performing OR queries, you could attach meta-schema information like “human readable predicate name” and “detailed predicate description” to predicates which can be returned to a user to provide context to the data that predicate contains.

I’m not sure what approach the Dgraph team would take to implement this feature, but I was thinking what they might do is “pre-process” the query to expand out all the terms before actually running the query. So for the “OR” query example, whether using your “predicate name as key” approach or the “facet key: value” approach, either should expand to the same underlying query before being run against the database. In this case I would expect the performance of each query to be equivalent?


(Patrick Mualaba) #15

What you achieved with facets is indeed already remarkable! As far as I can see, the main difference between the two approaches (facets vs edge keys) would be that facets need more configuration upfront in order to anticipate the queries you will need in your user stories. The edge-keys approach allows for ad hoc queries without the need to change the underlying schema. But facets indeed have great value for many other use cases, as you have already pointed out.


(Pawan Rawal) #16

Sounds useful. Can you please create a Github issue (if not done already) and link it to this thread?


(Jeff Hull) #17

Thanks Pawan. I created github issue https://github.com/dgraph-io/dgraph/issues/2009. I also linked to the issue created by @pmualaba. I went ahead and created a separate issue because I didn’t want to conflate the two similarly-motivated but different feature requests.


(system) #18

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.