We are currently in the process of evaluating dgraph as a possible reporting solution for a new product we are developing. With some basic tests it appears that dgraph performs roughly the same as a few other databases that we have used in the past. I have an idea as to how I can increase the performance in dgraph, but I wanted to know if this solution is something that is acceptable within dgraph or if there is possibly a better way of doing this. Apologies in advance for the verbiage I’m using. I’m still trying to learn the new terminology.
This is essentially what our base test was for all the databases we were considering.
We have 2 entities that matter, a ticket and a tag. For testing purposes I created duplicate datasets in each database consisting of 100,000 tickets, 1,000 tags, and between 0 and 10 associations per ticket to the tags. I’ve pseudo-coded the structures below:
Ticket {
  id         string
  is_ticket  string
  created_at datetime
  tags       []Tag
}
Tag {
  id      string
  is_tag  string
  name    string
}
In dgraph I’m storing this in such a way that each ticket is its own node. It has an edge that points to n tags. What I’m interested in getting is a list of tags, and a count for each tag that is how many tickets have that tag. Here is the query I used:
{
  q(func: has(is_tag)) {
    name
    count(~tag @filter(gt(created_at, "2017-01-01T00:00:00Z") AND lt(created_at, "2017-01-31T00:00:00Z")))
  }
}
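For reference, this is roughly the schema fragment that query seems to assume (predicate names taken from the query; the exact list/index syntax varies by Dgraph version). The `~tag` traversal only works once the `tag` predicate is declared with `@reverse`:

```
# Assumed schema fragment -- names taken from the query above.
is_tag: string .                     # presence flag used by has(is_tag)
created_at: datetime @index(hour) .  # index only needed if you filter on it at the root
tag: [uid] @reverse .                # @reverse is required for the ~tag traversal
name: string .
```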
This gives me exactly what I’m looking for, but I’m calling its efficiency into question. Every tag is pulled, and then every ticket connected to any of those tags is pulled and evaluated against the filter, which feels very inefficient. If I’m understanding how the database handles this, the number of evaluations is roughly the number of ticket–tag associations (in the worst case, the number of tags * the number of tickets).
There is another entity, called organization, which we will have to add in the future and which I think I can use to increase efficiency. The new set of structures would look something like this:
Ticket {
  id              string
  is_ticket       string
  organization_id string
  created_at      datetime
  tags            []Tag
}
Tag {
  id              string
  is_tag          string
  organization_id string
  name            string
}
Organization {
  id   string
  name string
}
So our average query would be something to the effect of: grab all tickets for a particular organization, then grab their tags, then return the aggregates. This still puts a ton of on-the-fly filtering on the database, which I think may slow it down.
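A sketch of what that might look like as a single query using variable blocks (this assumes a `~organization` reverse edge from tickets to the organization and the same `tag` predicate as above; `0x1` stands in for the organization’s uid):

```
{
  var(func: uid(0x1)) {   # the organization in question
    ~organization @filter(gt(created_at, "2017-01-01T00:00:00Z") AND lt(created_at, "2017-01-31T00:00:00Z")) {
      matched as uid      # the org's tickets in the date window
      t as tag            # the tags on those tickets
    }
  }

  q(func: uid(t)) {
    name
    count(~tag @filter(uid(matched)))
  }
}
```

This narrows the starting set to one organization’s tickets, but the date filter is still evaluated once per ticket.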
The actual questions:

- Is there a more efficient way of doing the query above?
- Would it be better to add edges to the organization node that pre-group the ticket data? The organization structure with this grouping might look something like this:
Organization {
  id           string
  name         string
  ticket_group TicketGroup
}
TicketGroup {
  date    date
  tickets []Ticket
}
This way, to get the same data set, we would find the organization, include the ticket_group edges that pertain to the date(s) we want, and then run the actual query over the tickets associated with those groups.
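As a sketch (hypothetical predicate names, with one TicketGroup node per day hanging off the organization; `0x1` again stands in for the organization’s uid):

```
{
  var(func: uid(0x1)) {   # the organization
    ticket_group @filter(ge(date, "2017-01-01") AND lt(date, "2017-01-31")) {
      tickets {
        matched as uid    # tickets pre-grouped under the matching days
        t as tag
      }
    }
  }

  q(func: uid(t)) {
    name
    count(~tag @filter(uid(matched)))
  }
}
```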
Some back-of-the-napkin math tells me that if we have X tickets per day and we are querying for Y days of data, then we would reduce the number of date evaluations from X * Y (one per ticket in the window) down to Y (one per group).
If you need any other information or if I didn’t make sense at any (or all) point during this post let me know, and I’ll try to clarify it.