Filtering by multiple child types slow

beepbeep · September 8, 2021, 4:33pm

What I want to do

Retrieve paginated results from a large dataset (~10M nodes generated). Effectively filtering out the root query using children predicates.

What I did

Below is a cut down version of the schema for what I’m trying to do

Basic Schema

type Restaurant {
    id: ID!
    name:String @search(by: [fulltext, trigram, exact])
    description: String @search(by: [fulltext, trigram])
    externalID: String @search(by: [fulltext, term, trigram, exact])
    safetyRating: String @search(by: [fulltext, term, trigram, exact])

    cuisineTypes: [Cuisine]
    equipments: [String] @search(by: [fulltext, term, trigram, exact])
    services: [String] @search(by: [fulltext, term, trigram, exact])

    address:Address
}

type Address {
    id: ID!
    address1: String @search(by: [fulltext, trigram])
    address2: String
    suite: String
    city: City
    postalCode: String
}

type City {
    id: ID!
    name: String! @search(by: [fulltext, trigram, exact])
    state: State!
}

type State {
    id: ID!
    code: String! @search(by: [fulltext, trigram, exact])
    name: String! @search(by: [fulltext, trigram, exact])
    country: Country!
}

type Country {
    id: ID!
    code: String! @search(by: [fulltext, trigram, exact])
    name: String! @search(by: [fulltext, trigram, exact])
}

type Cuisine {
    id: ID!
    name: String! @search(by: [fulltext, trigram, exact])
}

What I’m hoping to do is use a varied list of filters to produce a paginated set of Restaurants for our users going through a restaurant list (think something similar to yelp)

I’ve tried this kind of query but it’s incredibly slow and times out. In this example I’m looking for a list of all restaurants that serve a given set of cuisines within a given state. I use @cascade here because I realized that any filtering on a child predicate only happens after the root query is executed, which is why I think the query is really slow since it tries to pull the entire data set first.

Attempted DQL query

query TestQuery($offset: int, $state: string, $cuisines: string){
    var(func: type(State)) @filter(eq(State.name, $state)) {
        stateIDs as uid
    }

    var(func: type(Cuisines)) @filter(anyofterms(Cuisine.name, $cuisines)) {
        cuisineIDs as uid
    }

    FilteredRestaurants(func: type(Restaurant), first:10, offset: $offset) @cascade{
        uid
        Restaurant.name
        Restaurant.cuisineTypes @filter(uid(cuisineIDs)) {
            Cuisine.name
        }
        Restaurant.address {
            Address.city{
                Address.address1
                City.state @filter(uid(stateIDs)) {
                    State.name
                }
            }
        }
    }
}

What can I do to optimize this? I’ve thought about using reverse edges but there’s more than 1 child predicate here that are on the same “level” within the parent node, so I’m not sure how to do this at all.

Dgraph metadata

Dgraph Version v21.03.0-76-ged09b8cc1

Valdanito · September 9, 2021, 2:26am

You can try to change the filter order, or remove the type filter

    var(func: eq(State.name, $state)) @filter(type(State)) {
        stateIDs as uid
    }

    var(func: anyofterms(Cuisine.name, $cuisines)) @filter(type(Cuisines)) {
        cuisineIDs as uid
    }

query TestQuery($offset: int, $state: string, $cuisines: string){
      FilteredRestaurants(func: type(Restaurant), first:10, offset: $offset) @cascade{
        uid
        Restaurant.name
        Restaurant.cuisineTypes @filter(anyofterms(Cuisine.name, $cuisines)) {
            Cuisine.name
        }
        Restaurant.address {
            Address.city{
                Address.address1
                City.state @filter(eq(State.name, $state)) {
                    State.name
                }
            }
        }
    }
}

beepbeep · September 9, 2021, 2:10pm

I tried swapping the filter order and also removing the type filter in the var blocks like you suggested, it still timed out.

I think the biggest time sink is in the main query itself at the root query level, which is trying to pull every single Restaurant first before applying the Cuisine and State filter, and that initial pull takes forever.

iluminae · September 10, 2021, 5:21am

Yea I feel like you have a couple challenges here:

@cascade makes your pagination apply last, as it seems you may know already
the (all restaurants)-[cuisineTypes]->(cuisine) can probably be more easily accessed by changing it to (cuisine)<-[cuisineTypes]-(restaurants) utilizing a reverse relationship there.
the bigger issue may be the number of hops between state and restaurant. You could work your way back from state->cities->restaurants and use that other result with a uid() or a uid_in() filter… But it still may be too many results from that restaurants in state bit. For something like this in one of my use cases I added what would be an edge from restaurant to state in your case to easily filter stuff like this that would otherwise require too many hops/results to filter.

With an edge from restaurant to state the query becomes really easy to process.

beepbeep · September 10, 2021, 2:39pm

Yeah haha, it’s definitely a challenge for a beginner like me

So is there no way to utilize @cascade with a large dataset without pre-filtering everything first? If I apply filters, only the results that have all the filters should show up right? (i.e. if I apply state and cuisine filters, only results that get validated for both should show up, effectively acting as a cascade for the filtered fields?)
I’m not entirely sure how that (cuisine)<-[cuisineTypes]-(restaurants) query would look when used with other filters. I know how to do the single reverse query but not how to use multiple of them at the same time, especially when constructing a larger query using even more filters.
For example I have a query below, and I’ve pulled the reverse uids of the Restaurants, but I have no idea how to apply them to the final query and essentially merge the two lists of uids since there might be duplicates in the two lists or one that belongs in one set but not the other. I don’t think uid_in combines the two sets and only give me the intersection either.

Attempt at reverse query

	var(func: alloftext(Cusine.name, $cuisine)) {
		cuisineFilteredRestaurants as ~Restaurant.cuisineTypes
	}

	var(func: alloftext(State.name, $state)) {
		~City.state {
			~Address.city {
				stateFilteredRestaurants as ~Restaurants.address
			}
		}
	}

	FilteredRestaurants(func: type(Restaurants), first:$page, offset:$offset)) @filter(?????){
		uid
		Restaurants.name
		Restaurants.cuisineTypes 
		Restaurant.address{
			Address.city {
				City.state {
					State.name
				}
			}
		}
	}
}```

Hmm so I guess for anything you know you want to filter on, you always want to create a direct edge where possible? It would be easier to filter but it would still be the same large dataset no?

iluminae · September 10, 2021, 4:14pm

Here let me free-hand an example of what I meant by 2 and 3: (note: i did not run this)

query TestQuery($offset: int, $state: string, $cuisines: string){
    statevar as var(func: type(State)) @filter(eq(name,$state))
    // this starts at cuisines, which will be only a few, then follows the reverse edge to every restaurant that serves it.
    q(func: type(Cuisines)) @filter(anyofterms(Cuisine.name, $cuisines)) {
        Cuisine.name
        ~Restaurant.cuisineTypes @filter(uid_in(stateEdge,statevar)) (first:10, offset: $offset) {
          //this is actually the restaurants level
          uid
          Restaurant.name
        }
    }
}

Assumptions:

you have a new edge called stateEdge that goes from restaurant to state.
you have a reverse edge index on Restaurant.cuisineTypes

Using the uid_in() filter allows you to not need @cascade. I have a lot of frustrations with how they changed @cascade in this version of the database but yes, it basically removes 100% of your pagination until the result is fully gathered and ready to send, then it applies your pagination. Do not use it on datasets where the pagination makes the query of a runnable complexity.

edit: Another small note: you may have too many levels here for what you are trying to store. Like, a restaurant will always have 1 address and will be the only restaurant at that address right? maybe you just put that on the restaurant node then. Then have an edge to city. Maybe then you wouldnt need the edge directly to state because city wont have that many results after already being filtered by state. You would have then the ability to do as above but:

var(func: type(State)) @filter(eq(name,$state)) { citiesVar as ~City.state }
...
~Restaurant.cuisineTypes @filter(uid_in(citiy,citiesVar)) (first:10, offset: $offset) {

edit edit: oh… um I just realized your initial schema is in GraphQL but you are performing DQL in your queries, is that intended? or did you mean to use GraphQL to query this?

beepbeep · September 10, 2021, 6:34pm

Oh you’ve given me a good idea of passing uids into the next query. Take a look and tell me what you think (assuming that I link the state directly to the restaurant).

#filter by cuisine types
var(func: anyofterms(Cusine.name, $cuisine)) {
	cuisineFilteredRestaurants as ~Restaurant.cuisineTypes
}

#filter down the cuisine filtered ones by state
stateFilteredRestaurants as var(func: eq(Restaurant.state, $state)) @filter(uid_in(cuisineFilteredRestaurants))

#any number of additional filters here
someOtherFilteredRestaurant as var(func: eq(somethingElse, $somethingElse)) @filter(uid_in(finalFilteredRestaurants))

FilteredRestaurants(func: uid_in(finalFilteredRestaurants), first:$page, offset:$offset)){
	...
}

The end goal is to be able to produce a filtered list of restaurants regardless of the types and permutations of filters used, so I feel this way gives me a way to generate queries dynamically since I can just insert more filter blocks and pass down the filtered uids down through each block.

Re: Too many levels:

Yeah there is a lot of levels, but something I want to do in the future is to do analytics and suggestions based on a variety of factors, one of which is location. So what I had in mind is for example doing a reverse lookup on cities, having it on its own tier would allow me to do an easier reverse lookup then having a city filter on the master restaurant list, which could be very slow as I need to pull the full list first and then filter it down. I don’t know if what I said makes sense.

Re: GraphQL schema:

Does it make a huge difference? I started in GraphQL since that’s what was easier to understand to begin the work, but as I learnt more DQL I began implementing more DQL schema from the preconstructed one that was based on GraphQL. I just assumed GraphQL was something for quick searches on the front end and DQL was for more detailed queries on the backend.

amaster507 · September 11, 2021, 2:13pm

This is axactly what I do, but I do it with GraphQL on the client side with multiple queries so I can honor auth rules.

There are a few differences:

Don’t run DQL on client side as it is like giving full db control to the client
DQL doesn’t honor auth rules
GraphQL gets rewritten into DQL under the hood of Dgraph, so everything you can do in GraphQL can be done in DQL but the opposite is not true.
You are correct that GraphQL is simpler to understand and use and how you approached it would be my recommendation.
Think of GraphQL as the auto-generated safe API and DQL as the backend direct db language. So just like you wouldn’t want to let clients run SQL from the clients bypassing all controls in the API the same would be true for DQL vs. GraphQL

May I suggest a schema change for handling addresses that might help. Here is how I do it:

A POI links to an address and an address has both a city and state linked to it. Cities then link to states and states to cities. I also do counties and countries following this same concept. Every relationship has an inverse defined so reverse queries can be anywhere to make things easier to “lookup”. I use tags instead of cousines but similar concept.

So I can do the following:

find all cities in a state
find all addresses in a state
find all addresses in a city
find all POIs given a tag
combine any of these given any AND, OR, NOT logic

beepbeep · September 13, 2021, 3:21pm

Ah ok thanks for going over how GraphQL and DQL differs. Since all GraphQL gets rewritten into DQL, it shouldn’t matter which schema I start with then right? Since it just gets translated over and any additional indexes or features I would need I can modify the generated DQL schema?

So if I’m reading what you wrote correctly, you’re essentially proposing that I link all the separate nodes I care about together. That should be doable, I would need to create a new field for the types and then try to figure out a mutation that lets me upsert those fields. This sounds like it would be easier to do in DQL? Not entirely sure how the mutation would look just yet.

Wait so would I want a states in cities and cities in states? Wouldn’t I just want one of them and the other can be handled by a reverse connection? Or is it advised to avoid reverse where possible especially if you know you’ll be looking up that reverse edge a lot and just make it into a real edge?

amaster507 · September 13, 2021, 8:56pm

GraphQL schema will generate a DQL schema but it does lock it down after that to what is editable in the DQL schema or else it could break the API. I suggest designing from GraphQL schema point of view if you know you will be using that API.

Yes for sure!

GraphQL does not use the @reverse power under the hood in DQL. Rather it depends on having complimentary inverse relationships that are kept in sync by the API. (But this sync could be broken by DQL if done manually, these can be fixed if they get out of sync, search this forum)

Again, this is just me, but I recommend just using inverse relationships if you are going to use the GraphQL layer. That way you already have the relationships there when you might need/want them later.

beepbeep · September 13, 2021, 9:15pm

Could you elaborate on this? What do you mean by lock it down to what is editable?

For example I know after you generate a DQL schema, you can go into it and add reverse indices to the uid predicates, which would not be reflected back in the GraphQL schema. But if you were to add term indices to string predicates, it would be reflected back in the GraphQL schema.

So did you mean that GraphQL schema will generate a DQL schema, but it won’t work backwards unless it’s something that’s editable from the GraphQL schema?

Tbh I’ll probably be using both. I’m currently building out a DQL backend in go to be able to perform more complex queries and having a GraphQL built in apollo on the frontend to enable some very basic and quick user queries to be built out. I don’t know if this model is completely correct but it seems to generally be how these kinds of systems are designed.

amaster507 · September 13, 2021, 10:00pm

It won’t work backwards at all. Editing the DQL schema will have no effect at all on the GraphQL schema.

No, it would not.

beepbeep · September 14, 2021, 8:23pm

Ah ok, thanks for helping out with this, it’s really helped a lot with making my queries better. I’ll play around with the schema to see if it improves my query times. Right now it takes around 3 seconds for my filtered queries but deconstructing the schema and flattening it might improve it based on what’s been mentioned in this thread.

amaster507 · September 14, 2021, 10:18pm

Dgraph works really good with wide data unlike traditional rdbms that have limitations on joins.

Topic		Replies	Views
Pagination on child object that belongs to various parent object types Dgraph	3	562	August 18, 2021
How to filter parent nodes on child nodes attributes Dgraph	1	1333	January 11, 2020
Filter by child node predicate in GraphQL GraphQL dgraph	1	774	October 29, 2021
Filter on child types GraphQL kind:question	11	1444	September 7, 2021
Query based type node and filter its nested value GraphQL	3	625	July 15, 2020

Filtering by multiple child types slow

What I want to do

What I did

Dgraph metadata

Related Topics