We have been using DGraph for ~1.5 years now for a specific service in our infrastructure. With the release of Slash, we are looking to adopt DGraph for other services as well. The initial switch to DGraph mainly required us to “learn to think graph”; the data was static most of the time and the workload was primarily read-heavy. Our other services involve higher-volume read/write operations (relative to our business), and we are hoping the DGraph community can share some experience on schema design for our purpose.
Context
To understand our needs, I will provide a bit of context here. We aggregate lifestyle data for fellow researchers during (clinical) trials. Each trial has its own requirements, but we often aggregate Fitbit, GPS, and some other custom data from our app. On average, we receive an update from Fitbit about 50-100 times a day per user (1 update equals 5 data endpoints we consume). GPS can vary from 100 to 2,000 times a day (depending on the energy-saving settings of a user and the project requirements); other data from our app piggybacks on the GPS updates when possible, and we are looking into reducing that frequency to match Fitbit's. We also run several analytics services on that data, whose results are inserted into the database as a “data provider” (like Fitbit and GPS). This data is often visualized in an app for the users and a dashboard for the researchers, and used as input for an intervention.

At the moment, all data is stored in MongoDB as an event with a starting time, ending time, provider type, and user parameter attached to it. Our analytics services often only need a specific value in the nested JSON that a provider delivers, hence we are in favor of fully “breaking” the current schemas of the provider data we consume and writing our own “universal” one, based on schema.org terminology where possible.
Schema
Our previous schema (for the original service) was written in GraphQL+- and depends on localization and facet support, but for this service I think we can adopt GraphQL as-is. A high-level schema design is written below:
# I omit the ID field / predicate for brevity

type User {
  userId: String!
  # some other static information we might want to store
  # we will link to Provider nodes if there is data from this user for a specific provider
  hasProviders: [Provider]
  # same goes for the other fields below
  hasDates: [Date]
  hasCategories: [Category]
  hasData: [Data]
}

type Provider {
  # e.g. Fitbit
  name: String!
  hasData: [Data]
  hasDates: [Date]
}

type Date {
  # or a datetime field, preferably
  date: String!
  hasData: [Data]
}

type Category {
  # a category is a data type that can be provider-agnostic, such as: activity, sleep, etc.
  name: String!
  hasData: [Data]
  hasProviders: [Provider]
}

type Data {
  name: String!       # the names will be based on existing naming schemas
  value: String!      # still need to determine how to deal with different types
  provider: Provider! # the same data type can be offered by multiple providers
  startTime: String   # optional if intraday data
  endTime: String     # optional if intraday data
  # some other fields such as updated, created, etc. timestamps
}
As an example, somewhere in a nested JSON (provider data) there is a “sleep_duration” key with an accompanying value in minutes. When parsing that data, we would check whether the user exists, the provider exists, the date object exists, and whether the data point itself already exists. Then we would perform an upsert-type command to populate the (new) sleep_duration data point.
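To make that flow concrete, here is a rough sketch against the GraphQL API Dgraph auto-generates for such a schema (query/mutation names like queryData and addData are the generated defaults; the filter assumes @search/@id directives that the high-level draft above omits, and all literal values are placeholders):

# 1) check whether this data point already exists
# (in practice the filter would also constrain the user, provider, and date)
query SleepDurationExists {
  queryData(filter: { name: { eq: "sleep_duration" } }) {
    name
    value
    startTime
  }
}

# 2) if not, add it and link it to the existing nodes
mutation AddSleepDuration {
  addData(input: [{
    name: "sleep_duration"
    value: "432"                 # minutes, stored as a string for now
    provider: { name: "Fitbit" } # with @id on Provider.name this should reference the existing node
    startTime: "2020-09-01T22:40:00Z"
    endTime: "2020-09-02T06:52:00Z"
  }]) {
    data { name value }
  }
}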
Doing this key-level parsing might sound tedious, but it allows for very short traversals when asking questions like: (1) for which providers do we have data within this time range? (2) which data categories do we have on user X? (3) what is the average sleep duration today? The intent here is to make querying as fast as possible for our analytics services, since they do most of the read operations.
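For illustration, question (2) then becomes a one-hop traversal from the user (again using the auto-generated queryUser name and assuming an index on userId; the id value is a placeholder):

query CategoriesForUser {
  queryUser(filter: { userId: { eq: "user-123" } }) {
    userId
    hasCategories {
      name
    }
  }
}

Question (3) would similarly start at today's Date node and fan out over hasData filtered on name, with the averaging done in the analytics service for now, since value is still a string in this draft.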
Besides the upsert requirements, I am concerned about the number of edges that will be connected to some nodes. Depending on our level of ingestion detail, we could ingest about 3,000 data nodes per user per day (depending on whether we include intraday data). Each of those nodes will stay connected to the User node for the duration of the project (6 to 12 months on average). Also, each Provider node will have all data nodes related to that provider connected to it. In a study with 250 people for 12 months, at 1,000 nodes per provider per user per day, that would result in 1,000 × 365 × 250 = ~91.25M edges per Provider node (and fewer per User node, of course). After a study of this size, we would end up with ~275M nodes and even more edges. This seems a bit excessive, mainly because the expected resource requirements would be high in relation to a document-based database.
Our biggest question is how to determine the balance between the shortest traversal path and performance. We could remove the Provider type and rely on it as a field on the Data type, but the same could be said for the User type (albeit with far fewer edges involved there). If you follow that logic to its end, you end up with only the Data type and all the other types as fields, and we are back to a document-based schema (not what we want). The content of the Category type could also be added as a label field on the Data type (similar to an approach we already use) to remove some of its complexity.
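For reference, a flattened variant of the Data type could look like the sketch below (the @search directives and DateTime fields are assumptions on top of the draft above, not something we use today):

type Data {
  name: String! @search(by: [hash])      # e.g. "sleep_duration"
  value: String!
  provider: String! @search(by: [hash])  # "fitbit", "gps", ... as a plain indexed field
  category: String @search(by: [hash])   # the Category content as a label
  startTime: DateTime @search
  endTime: DateTime @search
  # taken to the extreme, the user would become a plain userId field here as well
}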
I understand this is a big, open-ended question with very specific requirements on our side, so feel free to ask for more examples, explanations, etc. We have several years of data we can use to test performance later on, but first we would like to get some insights from a theoretical point of view: how best to design the graph itself.