Optimizing search functionality

We are currently in the process of evaluating dgraph as a possible reporting solution for a new product we are developing. With some basic tests it appears that dgraph performs roughly the same as a few other databases that we have used in the past. I have an idea as to how I can increase the performance in dgraph, but I wanted to know if this solution is something that is acceptable within dgraph or if there is possibly a better way of doing this. Apologies in advance for the verbiage I’m using. I’m still trying to learn the new terminology.

This is essentially what our base test was for all the databases we were considering.

We have 2 entities that matter, a ticket and a tag. For testing purposes I created duplicate datasets in each database consisting of 100,000 tickets, 1,000 tags, and between 0 and 10 associations per ticket to the tags. I’ve pseudo-coded the structures below:

Ticket {
  id string
  is_ticket string
  created_at datetime
  tags []Tag
}

Tag {
  id string
  is_tag string
  name string
}

In dgraph I’m storing this in such a way that each ticket is its own node. It has an edge that points to n tags. What I’m interested in getting is a list of tags, and a count for each tag that is how many tickets have that tag. Here is the query I used:

{
  q(func: has(is_tag)) {
      name
      count(~tag @filter(gt(created_at, "2017-01-01T00:00:00Z") AND lt(created_at, "2017-01-31T00:00:00Z")))
  }
}

This gives me exactly what I’m looking for, but I’m calling its efficiency into question. Every tag is being pulled, and then every ticket connected to any of those tags is pulled and evaluated against the filter, which feels very inefficient. If I’m understanding how the database handles this, the number of evaluations it’s doing is equal to the number of tags * the number of tickets.
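For reference, the reverse edge ~tag and the date filter in the query above only work if the schema supports them. A minimal schema along these lines (my assumption — the actual setup isn't shown, and note the queries use the predicate name tag rather than the tags field from the pseudocode) would be:

# tag must be declared @reverse so ~tag can traverse from a tag back to its tickets
tag: [uid] @reverse .
created_at: dateTime .
is_ticket: string .
is_tag: string .
name: string .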

There is another entity which we will have to add in the future called organization that I think I may be able to increase the efficiency with. The new set of structures would look something like this:

Ticket {
  id string
  is_ticket string
  organization_id string
  created_at datetime
  tags []Tag
}

Tag {
  id string
  is_tag string
  organization_id string
  name string
}

Organization {
  id string
  name string
}

So our average query would be something to the effect of: grab all tickets for a particular organization, then grab its tags, then return the aggregates. This still puts a ton of on-the-fly filtering onto the database, which I think may slow it down.
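As a rough sketch (the organization value "org-123" is made up for illustration), that average query could keep the same shape as the one above, just scoped to one organization on both sides:

{
  q(func: has(is_tag)) @filter(eq(organization_id, "org-123")) {
      name
      count(~tag @filter(eq(organization_id, "org-123") AND gt(created_at, "2017-01-01T00:00:00Z") AND lt(created_at, "2017-01-31T00:00:00Z")))
  }
}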

The actual questions

  1. Is there a more efficient way of doing the query above?

  2. Would it be better to add edges to the organization node which pre-group the ticket data? The organization structure with this grouping may look something like this:

Organization {
  id string
  name string
  ticket_group TicketGroup
}

TicketGroup {
  date date
  tickets []Ticket
}

This way, to get the same data set, we would find the organization, then include the ticket_group edges that pertain to the date(s) we want, and then use the tickets associated with those groups for the actual query.
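A sketch of that traversal (assuming the name predicate is indexed so it can be used at the root, and using a made-up organization name):

{
  q(func: eq(name, "Acme")) {
      ticket_group @filter(ge(date, "2017-01-01") AND le(date, "2017-01-31")) {
          tickets {
              tag {
                  name
              }
          }
      }
  }
}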

Some back-of-the-napkin math tells me that if we have X tickets per day and we are querying for Y days of data, then we would reduce the number of evaluations from X * Y down to Y.

If you need any other information or if I didn’t make sense at any (or all) point during this post let me know, and I’ll try to clarify it.

So the number of evaluations is less than num_tags * num_tickets. We get a list of all tickets corresponding to all tags and then filter it. So imagine ~tag returning a list of all tickets corresponding to all tags, with the filter evaluated for each of those. An optimization that I think Dgraph could do is to add a between operator which can filter a value within a range. It would be interesting to see how that impacts the performance.

{
  q(func: has(is_tag)) {
      name
      count(~tag @filter(between(created_at, "2017-01-01T00:00:00Z", "2017-01-31T00:00:00Z")))
  }
}

Adding an organization to reduce the number of tickets, and hence tags, would be helpful for sure, as fewer tickets and tags would be fetched and evaluated against. You could even do it in a single query using variables.
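For example, a single query along these lines should work (a sketch, not tested here — "org-123" is a placeholder, and c as math(1) relies on variable propagation: each matching ticket contributes 1 down its tag edge, so each tag node accumulates its ticket count):

{
  var(func: has(is_ticket)) @filter(eq(organization_id, "org-123") AND gt(created_at, "2017-01-01T00:00:00Z") AND lt(created_at, "2017-01-31T00:00:00Z")) {
      tag {
          c as math(1)
      }
  }

  q(func: uid(c), orderdesc: val(c)) {
      name
      ticket_count : val(c)
  }
}

This only evaluates the filter once per matching ticket, rather than once per tag-ticket pair.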

Thank you for the information.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.