Performance advice on graph/schema design

We have been using Dgraph for ~1.5 years now for a specific service in our infrastructure. With the release of Slash, we are looking to adopt Dgraph for other services as well. The initial switch to Dgraph mainly required us to “learn to think graph”, but the data was static most of the time and the workload was primarily read-heavy. Our other services involve higher-volume read/write operations (relative to our business), and we are hoping that the Dgraph community can share some experience on schema design for our purpose.

To understand our needs, let me provide a bit of context. We aggregate lifestyle data for fellow researchers during (clinical) trials. Each trial has its own requirements, but we often aggregate Fitbit, GPS, and some other custom data from our app. On average, we receive an update from Fitbit about 50-100 times a day per user (1 update equals 5 data endpoints we consume). GPS can vary from 100 to 2000 times a day (depending on the energy-saving settings of a user and project requirements). Other data from our app piggybacks on the GPS when possible. We are looking into reducing that number to match the Fitbit frequency. We run several analytics services on that data, which are also inserted into the database as a “data provider” (like Fitbit and GPS). This data is often visualized in an app for the users, in a dashboard for the researchers, and used as input for an intervention. At the moment, all data is stored in MongoDB as an event with a starting time, ending time, provider type, and user parameter attached to it. With our analytics services, we often only need a specific value in the nested JSON provided by a provider, hence we are in favor of fully “breaking” the current schemas of the provider data we consume and writing our own “universal” one, based on standard terminology where possible.

Our previous schema (for the original service) was written for GraphQL± and depends on localization and facet support, but for this service I think we can adopt GraphQL as-is. A high-level schema design is written below:

# I omit the ID field / predicate for brevity

type User {
    userId: String!
    # some other static information we might want to store

    # we will link to Provider nodes if there is data from this user for a specific provider
    hasProviders: [Provider]

    # same goes for the other fields below
    hasDates: [Date]
    hasCategories: [Category]
    hasData: [Data]
}

type Provider {
    # e.g. Fitbit
    name: String!
    hasData: [Data]
    hasDates: [Date]
}

type Date {
    # or preferably a datetime field
    date: String!
    hasData: [Data]
}

type Category {
    # a category is something like a data type that can be provider-agnostic, such as: activity, sleep, etc.
    name: String!
    hasData: [Data]
    hasProviders: [Provider]
}

type Data {
    name: String!        # the names will be based on existing naming schemas
    value: String!       # still need to determine how to deal with different types
    provider: Provider!  # the same data type can be offered by multiple providers
    startTime: String    # optional if intraday data
    endTime: String      # optional if intraday data
    # some other fields such as updated, created, etc. timestamps
}

For example, somewhere in a nested JSON blob (provider data) there is a “sleep_duration” key with an accompanying value in minutes. When parsing that data, we would check whether the user exists, the provider exists, the date node exists, and whether the data point itself already exists. Then we would perform an upsert-type command to populate the (new) sleep_duration data point.
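The check-then-insert flow described above could be sketched as a conditional DQL upsert block. The predicate names (`User.userId`, `Data.name`, etc.) and values below are illustrative, following the draft schema; a real mutation would also need to link the Provider and Date nodes:

```
upsert {
  query {
    # find the user, and check whether this exact data point already exists
    u as var(func: eq(User.userId, "user-123"))
    d as var(func: eq(Data.name, "sleep_duration"))
      @filter(eq(Data.startTime, "2020-08-01T00:00:00Z"))
  }

  mutation @if(eq(len(d), 0)) {
    set {
      _:newData <Data.name>      "sleep_duration" .
      _:newData <Data.value>     "420" .
      _:newData <Data.startTime> "2020-08-01T00:00:00Z" .
      _:newData <dgraph.type>    "Data" .
      uid(u)    <User.hasData>   _:newData .
    }
  }
}
```

The `@if(eq(len(d), 0))` condition makes the mutation run only when no matching data point was found, which matches the “only insert new data points” requirement.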

Doing the key-level parsing might sound tedious, but it allows for very short traversals when asking questions like: (1) For which providers do we have data within this time range? (2) Which data categories do we have on user X? (3) What is the average sleep duration today? The intent here is to make querying as fast as possible for our analytics services, since they do most of the read operations.
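As a sketch, questions (2) and (3) could be answered with short DQL queries like the following. Predicate names follow the draft schema above, and the average assumes `Data.value` were stored numerically, which the schema deliberately leaves undecided:

```
{
  # (2) Which data categories do we have on user X?
  categories(func: eq(User.userId, "user-123")) {
    User.hasCategories {
      Category.name
    }
  }

  # (3) Average sleep duration for one day
  # (assumes Data.value is numeric, which is still an open question)
  var(func: eq(Data.name, "sleep_duration"))
    @filter(eq(Data.startTime, "2020-08-01")) {
    v as Data.value
  }
  avgSleep() {
    avg(val(v))
  }
}
```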

Besides the upsert requirements, I am concerned about the number of edges that will be connected to some nodes. Depending on our level of ingestion detail, we could ingest about 3000 different data nodes (depending on whether we include intraday data) per user per day. Each of those nodes will connect to the User node for the duration of the project (6 to 12 months on average). Also, each Provider node will have all data nodes related to that provider connected to it. In a study with 250 people for 12 months, at 1000 data points per provider per day, that would result in 1000 × 365 × 250 ≈ 91.25M edges per provider node, and fewer per user of course. After a study of this size, we would end up with ~275M nodes and even more edges. This seems a bit excessive, mainly because the expected resources required would be high compared to a document-based database.

Our biggest question is how to determine the balance between shortest traversal path and performance. We could remove the Provider type and rely on it as a field in the Data type, but the same could be said for the User type (albeit with a smaller number of edges involved). If you follow that logic to its end, you are left with only the Data type and all the other types as fields, and we are back to a document-based schema (not what we want). The content of the Category type could also be added as a label field on the Data type (similar to an approach we already use) to remove some of its complexity.

I understand it is a big open-ended question with very specific requirements on our side, so feel free to ask for more examples, explanation, etc. We have several years of data we can use to test performance later on, but first we would like to get some insights from the theoretical point of view; how to best design the graph itself.

Hi Dgraph team, please answer these queries.

I would say, just design how you see it in your mind.

After all that’s the whole point of Dgraph.
You mimic graph structures as you see them in your mind, and alter the schema as you go. Unlike SQL or NoSQL, you are not tied down to the ghosts of your past.

Check out: “Loading close to 1M edges/sec into Dgraph” & “Did I hit 1B+ transactions w/ Dgraph?” for proof that it can handle anything you throw at it.


Hi @martwetzels,
Thanks for the detailed post.

I understand that you have already been serving this data with Dgraph via the GraphQL± API, and now you are switching to GraphQL as you move to Slash.

FYI, Slash is also going to support GraphQL± queries, in addition to GraphQL. See the docs here.

Apart from that, I wanted to have a look at your current GraphQL± schema along with the raw GraphQL± queries you run most often, to understand your requirements better. It would be great if you could share them here, or DM me if you don’t want them to be public.

Also, with your current GraphQL± schema and queries, are you getting the performance you desire, or do you think it could be improved by modelling the data in a different way?


Thank you for the quick replies!

@abhijit-kar I am intrigued to see how Dgraph will perform once we have everything in place. I think we will try out a few data sources and see as we go.

@abhimanyusinghgaur We are using Dgraph and GraphQL± for a different service that holds our behavior change programs. This is fairly straightforward, using the types Program (bound to a user) and Messages (unique ones) that can be linked to a Program. We use facets to determine when a person should receive a message. The schema we are describing in this post is new for us; we are currently storing this data in MongoDB but are looking to migrate to Dgraph. At the moment I cannot share any queries or experience on the matter at hand because we still need to implement it. Our mobile clients would most often pull a subset of the user’s (summary) data, but our research interface would query population-wide data. The analyses are (atm) user-based, but we query more granular data (like GPS) than with the mobile app.

If you are interested, I can DM you some common queries we run for the existing service on Dgraph? We are running into a few restrictions (probably gaps in our knowledge) for some “more advanced” queries that the docs haven’t been able to shed light on. At the moment we take care of the additional steps after pulling the data to a client, but I assume Dgraph can also do this with variable blocks.

Sure, please do that. I would be happy to help you in optimizing your queries, if possible. Also, please share the schema for your existing service, which will help me understand the data better.

And, re-reading your post again, I got some more insights.

The data type of Data.value will differ for different kinds of data. In GraphQL, the type system is very strict, unlike a document-based model where you can just dump any kind of data.
What I feel here is that you should have an interface Data which represents data as an abstract entity. Then, you can have different implementations of it, like SleepData, for which the value will be an Int, or GpsData, for which the value will be an object type consisting of latitude and longitude, etc. That way you will be able to answer such questions straight from the db.
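As a sketch, that interface approach could look like the following in Dgraph's GraphQL (type and field names are illustrative; Dgraph lets implementing types inherit the interface's fields rather than redeclare them):

```
interface Data {
  id: ID!
  name: String!
  provider: Provider!
  startTime: DateTime
  endTime: DateTime
}

type SleepData implements Data {
  # sleep duration in minutes
  value: Int!
}

type GpsData implements Data {
  value: GeoPoint!
}

type GeoPoint {
  latitude: Float!
  longitude: Float!
}
```

Queries against the `Data` interface then return all implementations, while each concrete type keeps a strongly typed `value`.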

Hi @martwetzels,

Thanks for joining the call with us today, always a pleasure talking to you. Please find below a summary of the call + action item.

  • Data mutation: a lot of data arrives every minute or so, and some of it may be redundant. The goal here is to insert only the data that is new. The way to achieve this, as we discussed, is basically an upsert where you first find out which data points don’t exist already and insert only those; this is possible and is an efficient approach.
  • Querying all the data points for a time range
  • Querying all the data points for a time range for a user
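Assuming the Data nodes carry an indexed timestamp predicate (as in the first schema approach below), the two range queries could be sketched in DQL as follows (predicate names are illustrative):

```
{
  # all data points in a time range
  range(func: ge(Data.startTime, "2020-08-01"))
    @filter(le(Data.startTime, "2020-08-31")) {
    Data.name
    Data.value
  }

  # the same range, restricted to a single user
  userRange(func: eq(User.userId, "user-123")) {
    User.hasData
      @filter(ge(Data.startTime, "2020-08-01") AND le(Data.startTime, "2020-08-31")) {
      Data.name
      Data.value
    }
  }
}
```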

Your system is such that a User remains active for about 1-2 years at most. After that, there won’t be any data coming in for that user. But there will be new users coming in, so the data linked to Providers is going to keep increasing.

So, you are concerned about having millions/billions of edges from a single Provider node to a lot of Data nodes, and how to traverse those efficiently. The most important thing here is to build a graph that can be traversed efficiently to answer the queries you run. For that, we discussed how you can design your schema. There were two major approaches:

  1. Have User, Provider and Data as types, and in Data nodes have a predicate to store the timestamp, and index it. Also, link each Data node to the corresponding User and Provider nodes. That way you will be able to answer both queries with minimal graph traversal with the help of the index on the timestamp. This is where you will need the between function, which will make your queries more efficient than in the current state. We have already forwarded this feature request internally and expressed your interest in it.

  2. Have User, Provider and Data as types, but this time don’t store the timestamp in Data. Instead, have separate types called Year, Month, Day, Hour, …, Second. Since we know a priori that a year has only 12 months, a Year node will have 12 edges, one to each Month node; the same reasoning applies to the other types like Month, Day, etc.
    Then, for a User, link to the data through the hierarchy Year → Month → … → Second.
    Finally, the data will have an edge back to the User and Provider.
    This way, you will need smaller indexes, and this should be more efficient for your queries than the previous approach.
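A minimal sketch of the second approach's time hierarchy in GraphQL (field names are illustrative, and the intermediate Hour/Minute/Second levels are elided, since they follow the same pattern):

```
type Year {
  value: Int!    # e.g. 2020
  months: [Month]
}

type Month {
  value: Int!    # 1-12
  days: [Day]
}

type Day {
  value: Int!    # 1-31
  data: [Data]   # deeper levels (Hour, Minute, Second) would nest the same way
}

type User {
  userId: String!
  years: [Year]  # entry point into this user's time hierarchy
}
```

A time-range query then becomes a walk down a bounded fan-out (at most 12 month edges per year, 31 day edges per month, and so on) instead of an index scan over one huge timestamp index.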

Agreed Action Item

  • You are going to try both of these approaches and run some benchmarks. Then you will ping us back with the results, and we can see whether things can be improved further.

Omar & Abhimanyu


Hi @martwetzels

I am sending this message to check how things are going so far. Did you have a chance to try both approaches we discussed last time?

It would be nice if you could keep us posted, so we can best support you.

All the Best,

Omar Ayoubi

Sorry to resurrect an old post, but I would love to hear the core team comment on this ↑. I’m trying to come up with a list of best practices to follow when designing my Dgraph schema, and perhaps I’ve been conditioned through years of SQL hell to be nervous about n+1 queries and performance issues, but should I really just go ahead and create whatever schema I can think of with reckless abandon, or are there caveats?

Based on this ↑ it would seem that the best practices I should follow are:

  • Worry less about the size of the database and more about how many edges your queries will need to traverse
  • In order to ensure best possible query performance:
    1. Design your schema in such a way that the number of edges your queries need to traverse is minimised
    2. Rely on indexes (using @search directive?) of relevant predicates to get performance out of queries
  • Generally, it’s preferable to have more granular types if the existence of those types reduces the number of edges a query needs to traverse …?

Am I missing anything?