Complex Aggregate Sort Query


(Erik Miller) #1

I have a specific query that I need to run.

Conceptually I have two node types, programs (movies, show, episodes) and timeslots (the time and channel a program is airing)

I need to return a list of programs, that includes the first airing of it timeslot, and also allow for sorting of the entire result set using data in the returned timeslot.

So far I have this simple query…

{q(func: has(program_id)) {
  uid
  program_id
  timeslots (first: 1, orderasc: gmt_date_time){
    timeslot_id
    gmt_date_time
    }
  }
} 

This correctly gives me a list of programs, and only the earliest airing of the program. Now what I need to do is determine a way to sort the result set by the gmt_date_time.


(Daniel Mai) #2

You can use multiple query blocks to accomplish this. See the examples from the sorting docs for an idea of how to structure it.


(Erik Miller) #3

I’m still trying to grok how to perform this query. Here’s what I have now, which runs but is returning data in the wrong order.

{
  pids as var(func: has(program_id)) {
    uid
    program_id
    timeslots (first: 1, orderasc: gmt_date_time){
      gdt as gmt_date_time
    }
  }

  q(func: uid(pids), orderasc: gdt) {
    program_id
    timeslots (first: 1, orderasc: gmt_date_time){
      expand(_all_)
    }
  }
}

Am I going about it the right way? I’m not sure if there is an implicit association between the vars pids and gdt. For example when using the var function, multiple pid, gmt tuples are returned. Does the following query recognize which pid is matched on and implicitly use the correct gdt?

I also thought maybe there was a way to expose the data as a variable inside the query and sort on it

{
  q(func: has(program_id), orderasc: val(gdt)) {
    uid
    program_id
    timeslots (first: 1, orderasc: gmt_date_time){
      gdt as gmt_date_time
    }
  }
}

This errors out with “Variable: [gdt] used before definition.” Which I understand, but maybe there is another way to make something similar work.


(Erik Miller) #4

I think I have it figured out…

{
  pids as var(func: has(program_id)) {  #get all programs
    uid
    #get the earliest timeslot uids, and the gmt_date_time
    timeslots (first: 1, orderasc: gmt_date_time){ 
      ts_gdt as gmt_date_time
    }
    min_gdt as min(val(ts_gdt))
  }

  q(func: uid(pids), orderasc: val(min_gdt)) {
    program_id
    timeslots (first: 1, orderasc: gmt_date_time){ 
      gmt_date_time
    }
  }
}

I went back to the Tour of Dgraph tutorial to reread how variables are use and realized the difference between query and value variables

I’m not sure how performant it is at scale, I’ll be loading more data and testing. Thanks for the help.


(Erik Miller) #5

Thie above query seems to function as expected, but performance is poor. 70sec with ~1million timeslots and 50k programs. How can I optimize it to perform better? Both elasricsearch and neo4j can handle this query in a few seconds.


(Michel Conrado (Support Engineer)) #6

Maybe two things, don’t use sort in the var block and add a indexed predicate like “has_ program_id: true”. You gonna have a better perf cuz it’s indexed - has a smaller cut than the has function. Bool index is faster.

pids as var(func: eq(has_program_id, true)) {


(Erik Miller) #7

That helped quite a bit. My current query is…

{
  pids as var(func: eq(program_node, true)) {  #get all programs
    # #get the earliest timeslot uids, and the gmt_date_time
    timeslots{ 
      ts_gdt as gmt_date_time
    }
    min_gdt as min(val(ts_gdt))
  }  
      
  q(func: uid(pids), orderasc: val(min_gdt), first: 100, offset: 200) {
    program_id
    timeslots (first: 1, orderasc: gmt_date_time){ 
      gmt_date_time
    }
  }
}

Which takes about 6 seconds with a similarly sized dataset, which is only 1/20th the expected size.

I’m running everything in docker now, what would be be the best way to get more performance?
Is there an explain mechanism?


(Michel Conrado (Support Engineer)) #8

What are your specs?


(Erik Miller) #9

I’m running docker on a EC2 instance, t2.xlarge. So 16G of RAM and 4 vcpus. With a single zero, alpha, and ratel. Also my current schema is…

<channel_day_id>: string @index(hash) .
<channel_id>: int @index(int) .
<channel_node>: bool @index(bool) .
<gmt_date_time>: datetime @index(hour) .
<program_id>: int @index(int) .
<program_node>: bool @index(bool) .
<timeslot_id>: string @index(hash) .
<timeslot_node>: bool @index(bool) .
<title_id>: int @index(int) .
<title_node>: bool @index(bool) .
<tms_id>: string @index(exact) .
<tmsprogram_node>: bool @index(bool) .
<version_id>: int @index(int) .
<channel>: uid .
<channel_day>: uid .
<channel_days>: uid .
<program>: uid .
<timeslots>: uid .
<title>: uid .
<tmsprogram>: uid .
<tmsprograms>: uid .
<version>: uid .
<versions>: uid .

(Erik Miller) #10

I just ran it with twice the data in the hopes performance wouldn’t jump too much but it basically doubled to 14 seconds. I appreciate the help, but I don’t think dgraph is the ideal system for this type of query. I’ll be investigating other technologies since it doesn’t seem to be great for my use case.


(Michel Conrado (Support Engineer)) #11

humm, can you compare with any other that you have already used? or other like SQL or even Neo4J. With the same data.

Also, how you are running this query? Ratel UI? A client? terminal (cURL)?

Dgraph team is working on some significant redesigns. Soon maybe this might help.

But tell me, do you need this availability of millions of data constantly?

Cheers.


(Erik Miller) #12

I’m looking for a replacement to elasticsearch, due to its inabilty to handle many to many joins. This particular query is modeled as a parent child relationship in ES and many users expect similar functionality in any new system. It runs against the entire dataset in about 5 seconds. The neo4j poc I created ran the much smaller dataset I’m using for dgraph in 5 seconds. I’ve been running everything in the ratel UI. I can certainly work around the limitations by querying more sorted.timeslots in the outer branch and allowing duplicate programs in.the result set, then doing additional data manipulations on the client side but I can also do that with other db systems. To answer your question, I do need the availability of millions of nodes, more like a billion in total, and the ability to join on them. If I don’t have the right query, I would appreciate more input.


(Michel Conrado (Support Engineer)) #13

I’ve heard weeks ago, from some users comparisons of Dgraph against elasticsearch. And was good ones. As I don’t use it, I can’t say about it. Also I’ve heard that Vespa (from Yahoo guys) are better than elasticsearch.

As elasticsearch and Dgraph have distinct proposals I do not know if it is fair to compare them directly. But the users seemed to me very satisfied with Dgraph.

I did not get this phrase, did you force the Dgraph more than the neo4j and hoped Dgraph was better in this hypothesis? or is that phrase truncated?

Ratel UI isn’t meant to get that amount of data. Just to explore, test, plan it. You must use the clients tho.

A short time ago a user started doing Dgraph benchmarks. See what you think.