Filters in GraphQL

This post comes out of a brief discussion I had with @akhiltak yesterday.

TL;DR This post outlines one extension we can make to GraphQL to support additional queries.

Right now our implementation of GraphQL can support simple queries which require a lookup using uid or xid. To be a general purpose database, we need to support many different types of queries. In this post we will go through the different types of queries that we need to support and how we can represent them in our query language.

The most common type of query is querying by a particular attribute, i.e., checking whether a given attribute has a particular value.

Example 1

Find me the actor named “Angelina Jolie”. This is representable in GraphQL as follows:

{
    film.actor(type.object.name.en: "Angelina Jolie") {
        film.actor.film {
            film.performance.film {
                type.object.name.en
            }
        }
    }
}

A corresponding Cypher query for this would be:

MATCH (a:Actor) -[:ActorFilm]-> (p:Performance) -[:PerformanceFilm]-> (f:Film)
WHERE a.name = "Angelina Jolie"
RETURN f.name

Example 2

A somewhat more complex example would be finding all musical dramas where the actor was Angelina Jolie.

This is also representable in GraphQL as:

{
    film.actor(type.object.name.en: "Angelina Jolie") {
        film.actor.film {
            film.performance.film {
                film.film.genre (type.object.name.en: "Musical Drama")
                type.object.name.en
            }
        }
    }
}

A corresponding Cypher query for this would be:

MATCH (a:Actor) -[:ActorFilm]-> (p:Performance) -[:PerformanceFilm]-> (f:Film) -[:FilmGenre]-> (g:Genre)
WHERE a.name = "Angelina Jolie" AND g.name = "Musical Drama"
RETURN f.name

A Gremlin query for the same will be:

g.V().has("name", "Angelina Jolie")
   .out("actorfilm").as("f").out("filmgenre").has("name", "Musical Drama")
   .select("f").by("name")

Example 3

All actors whose name starts with Angelina. This query cannot be represented in the GraphQL grammar. The following is a proposed extension.

{
    film.actor(type.object.name.en: { op: "starts_with", value:"Angelina" }) {
        _uid_
        type.object.name.en
    }
}

A corresponding Cypher query for this would be:

MATCH (a:Actor)
WHERE a.name STARTS WITH "Angelina"
RETURN a._uid_, a.name

Or a Gremlin query:

g.V().hasLabel("actor").has("name", startsWith("Angelina")).as("a")
   .select("a").by("name")

This extension to GraphQL allows us to add many different operators such as comparison operators (>, <, <=, >= etc.), string operators (starts_with, ends_with, substring, etc.) and geospatial operators.
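
To make the { op, value } idea concrete, here is a minimal Python sketch of how such an argument could be evaluated against a stored attribute value. It is only an illustration, not our implementation: the OPERATORS table, the matches function, and every operator name other than starts_with, ends_with, and substring are assumptions made up for this example.

```python
import operator

# Hypothetical operator table. Only starts_with, ends_with and substring are
# named in the post; the comparison operator names are assumed.
OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
    "ge": operator.ge,
    "le": operator.le,
    "starts_with": lambda stored, arg: stored.startswith(arg),
    "ends_with": lambda stored, arg: stored.endswith(arg),
    "substring": lambda stored, arg: arg in stored,
}


def matches(stored, arg):
    """Return True if the stored attribute value satisfies the filter argument.

    A bare value keeps the existing meaning (exact match). A dict of the form
    {"op": ..., "value": ...} selects a named operator instead.
    """
    if isinstance(arg, dict):
        return OPERATORS[arg["op"]](stored, arg["value"])
    return stored == arg


# Sample names taken from the examples in this thread.
actors = ["Angelina Jolie", "Barack Obama", "Brad Pitt"]
prefix_hits = [a for a in actors
               if matches(a, {"op": "starts_with", "value": "Angelina"})]
exact_hits = [a for a in actors if matches(a, "Brad Pitt")]
```

Keeping the bare-value case as an exact match means existing queries would not change meaning; only arguments that carry an explicit op take the new path.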

Something to Consider

However, once we consider the other operators a general purpose language needs, such as union, disjunction, and grouping, it becomes considerably harder to represent them in GraphQL, whereas they are simpler in Cypher or Gremlin. Moreover, those languages are considerably easier for the end user to understand than GraphQL (Cypher is just a variant of SQL, and Gremlin is just like LINQ). Should we reconsider making GraphQL the first query language that we support?

Regardless, for now, how does the extension described above sound?


Hey @kostub,

Nice summary – I like how you’ve explained your proposal with examples. Makes it very easy to follow.

So, example 1 and 2. My ideas were around using an operator in the var value, like so:

{
    film.actor(type.object.name.en: "=Angelina Jolie") {
        film.actor.film {
            film.performance.film {
                type.object.name.en
            }
        }
    }
}

Note the = sign in the value, which means the value must be exactly equal to "Angelina Jolie".

The second example would be the same:

{
    film.actor(type.object.name.en: "=Angelina Jolie") {
        film.actor.film {
            film.performance.film {
                film.film.genre (type.object.name.en: "=Musical Drama")
                type.object.name.en
            }
        }
    }
}

The third example gets interesting, and here’s my proposal:

{
    film.actor(type.object.name.en: "~Angelina") {
        _uid_
        type.object.name.en
    }
}

So in other words, we keep the variable’s key-value relationship and use simplified expressions to do basic searches. I doubt we’ll support regular expressions ever, but here’re some of the things that we could do:

Operators:

Operator   Meaning
--------   -------
=          Exact phrase match
~          Single term match
|          Multiple term union match
&          Multiple term intersection match

For example:
type.object.name.en: "Angelina|Jessica"
type.object.name.en: "Barack&Obama"
type.object.name.en: "~Obama"
type.object.name.en: "=Barack Obama"

In fact, the = might be optional here: by default, we would only do an exact phrase match.

Regarding starts with, that’s a trickier operator, because then we’ll have to index the position of the terms along with the terms themselves. So, we should only do this if needed later.


Regarding GraphQL vs. other languages: as I’ve mentioned elsewhere, GraphQL supports many things which are a lot more complicated than Gremlin or Cypher. The latter languages return lists of things, while the former allows returning an entire subgraph. You can convert a subgraph to lists, but not vice versa. In addition, GraphQL supports types, schemas, introspection, etc., which can make interacting with the database feel like querying for a document, which is very powerful. You can read a bit more in the spec, and possibly in other topics on the same subject here on Discuss.

We’ll most likely support at least Gremlin, but closer to v1.0; or once we have GraphQL nailed down.

I am working on filters using index. Imagine a GraphQL query like

friend(name: John) {
  ...
}

Here we want to apply the filter name=John. However, there might be other kinds of arguments. Here are some possibilities and I wonder what you all (@minions) think.

  1. Look for = instead of : symbol. If so, this is a filter.
  2. Have a list of reserved arguments. If not in list, then it is treated as a filter. If attribute is not indexed, no filtering is applied and the argument is ignored.
  3. Require the argument name to satisfy some constraint, e.g., have a filter prefix or suffix.
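
As a rough illustration of the reserved-arguments option, here is a small Python sketch. The RESERVED set, the classify function, and the nickname attribute are hypothetical names invented for this example; only first appears as a non-filter argument in this thread.

```python
# Hypothetical list of reserved (non-filter) argument names.
RESERVED = {"first"}


def classify(args, indexed):
    """Split query arguments into reserved arguments, filters, and ignored ones.

    Any argument that is not reserved is treated as a filter. If its attribute
    has no index, no filtering is applied and the argument is ignored.
    """
    reserved = {k: v for k, v in args.items() if k in RESERVED}
    filters = {k: v for k, v in args.items()
               if k not in RESERVED and k in indexed}
    ignored = [k for k in args if k not in RESERVED and k not in indexed]
    return reserved, filters, ignored


# friend(first: 10, name: "John", nickname: "J") with an index on name only.
reserved, filters, ignored = classify(
    {"first": 10, "name": "John", "nickname": "J"}, indexed={"name"}
)
```

One consequence worth noting: silently ignoring a filter on an unindexed attribute returns more results than the user asked for, so returning the ignored list (or an error) may be preferable.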

How about: <attribute>.operation(<values>)

Like:

{
 me(_uid_: 1) {
  friends(first: 10, name.equals("John")) {
   relatives(name.contains("alice", "bob", ....))
  }
 }
}

We could have a list of operators, like:

  • equals: exact match
  • contains: contains some of the specified strings
    etc.

This would enable more customization I think. What are your thoughts about such syntax?

Update: since we allow dots in the predicates, we could have some other operator like ‘@’.

I would prefer something like what @ashwin95r or @kostub proposed, i.e. having the operation as a kind of keyword instead of it being part of the value. It’s more verbose, agreed, but it would avoid problems in situations where characters like |, =, ~, and & are part of the value.

I think with the given example, @ashwin95r’s proposal looks good. But once you see what predicates really look like, e.g. type.object.name.en, it becomes harder to tell whether the suffix .equals is part of the predicate or an instruction. Also, GraphQL expects key:val pairs within the brackets, and using named operators would move away from that.

By definition, they shouldn’t be part of the value. Our tokenizer should focus on alphanumeric terms.

Note that using mathematical operators is exactly how Google’s Go Datastore API works.
https://cloud.google.com/appengine/docs/go/datastore/query-restrictions


Could it be better to keep to the argument syntax in the GraphQL spec?

How about the following?

{
 me(_uid_: 1) {
  friends(first: 10, name.equals: "John") {
   relatives(name.hasOneOf: "alice,bob") {
   }
  }
 }
}

Or totally ignore arguments and have our own “language” within the brackets.

{
 me(_uid_: 1) {
  friends((first 10) (equals name "John")) {
   relatives((or (contains name "John") (contains name "Tom"))) {
   }
  }
 }
}

I don’t think it’s worth ditching GraphQL altogether, just for this. GraphQL has a lot of attraction for us, both regarding the usability of the database and the applicability to a wider audience. No one likes to learn yet another language.

My vote is to keep the key, val pair, and just use operators, either in the key or the value. For example, you could also do something like:

"type.object.name.en =": "Angelina Jolie"
"type.object.name.en ~": "jolie"
"type.object.name.en &=": ["angelina", "brad"]
"type.object.name.en |=": ["angelina", "brad"]
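
Here is a minimal Python sketch of how keys like these could be split and evaluated. The split_key and matches functions are hypothetical, and the semantics (= is a whole-value match, ~ is a single-term match, &= needs all terms, |= needs any term, all case-insensitive) are my reading of the operators, not a settled design.

```python
def split_key(key):
    """Split 'predicate op' into (predicate, op). A bare predicate means '='."""
    predicate, sep, op = key.rpartition(" ")
    if not sep:
        return key, "="
    return predicate, op


def terms(text):
    """Lower-cased, whitespace-separated terms of a stored value."""
    return set(text.lower().split())


def matches(stored, op, value):
    """Evaluate one key operator against a stored string value."""
    if op == "=":
        return stored.lower() == value.lower()
    if op == "~":
        return value.lower() in terms(stored)
    if op == "&=":
        return all(v.lower() in terms(stored) for v in value)
    if op == "|=":
        return any(v.lower() in terms(stored) for v in value)
    raise ValueError(f"unknown operator: {op!r}")


predicate, op = split_key("type.object.name.en &=")
```

Because the operator is separated from the predicate by a space rather than a dot, predicates that contain dots stay unambiguous.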

How about this

{
 me(_uid_: 1) {
  friends(first: 10, Filter: ("name =", "John")) {
   relatives(Filter: ("name ~", "alice", "bob", ....), Order: ("name"))
  }
 }
}

The key would be something like Filter or Order; the first argument in the value would describe the filter, and the rest would be the values it looks for. This would retain the (key, val) pair constraint.


Just replacing what’s inside brackets with our own “mini language” is probably not considered ditching GraphQL?

My personal preference: Keep things simple and just apply constraints to argName or argValue.

Back to an old topic: I do think that GraphQL is less versatile, which is not always a bad thing. It is good for a lot of common uses and a great query language to begin with. (Its many other features, like fragments, type checking, etc., do not add much operationally, in the sense that they don’t give you the power to do more things.)


Haha… Bingo! @ashwin95r, we arrived at the same idea independently.

Also, for good or bad, we’re using GraphQL, and we aim to keep our implementation within its spec as much as we can. I don’t think this is the time to switch to another language. We will support Gremlin as we get close to v1.0.


So we are going for Filter: ("name =", "John")? If that’s the case, gql needs some work. If confirmed, I can proceed to work on that?

My advice would be to stay away from the Filter and brackets. If we do simple key-vals, they’d fit right into our current implementation and would also be in-sync with what GraphQL supports.

After some discussion on Slack, here’s my updated recommendation.

OR: type.object.name.en: "term1 | term2 | term3"
AND: type.object.name.en: "term1 & term2 & term3"
MIX: type.object.name.en: "(term1 & term2) | term3"
SINGLE TERM: type.object.name.en: "term1"
EXACT PHRASE: type.object.name.en: "term1 term2"

Exact phrase is without any operator in between. So, we consider term1 term2 as one term including the space.

So, the query might look like this:

{
  me(_uid_: 0x01) {
    friends(type.object.name.en: "john | snow", first: 10) {
      relatives(type.object.name.en: "rob | sansa | arya")
    }
  }
}

Honestly, this looks like it gives us everything we need. What do you guys think?

P.S. Note that our index tokenizer would have removed all the special characters from the terms and lower cased everything.
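
To check that this is easy to parse, here is a small Python sketch of a tokenizer, parser, and evaluator for these value strings. Everything in it is an assumption for illustration: the function names are made up, & is assumed to bind tighter than |, and a phrase is assumed to match when its terms appear next to each other in the tokenized value.

```python
import re


def tokenize(text):
    """Lower-case the text and keep only alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lex(query):
    """Split a query into '(', ')', '&', '|' and phrases (lists of terms)."""
    out = []
    for piece in re.findall(r"[()&|]|[^()&|]+", query):
        if piece in ("(", ")", "&", "|"):
            out.append(piece)
        else:
            words = tokenize(piece)
            if words:
                out.append(words)
    return out


def parse(tokens):
    """Recursive-descent parse into nested tuples; '&' binds tighter than '|'."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        node = parse_and()
        while peek() == "|":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_atom()
        while peek() == "&":
            take()
            node = ("and", node, parse_atom())
        return node

    def parse_atom():
        tok = peek()
        if tok == "(":
            take()
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        if isinstance(tok, list):
            return ("phrase", take())
        raise ValueError(f"unexpected token: {tok!r}")

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return tree


def contains_phrase(terms, phrase):
    """True if the phrase's terms appear consecutively in terms."""
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


def evaluate(node, terms):
    """Evaluate a parsed query tree against the tokenized stored value."""
    if node[0] == "phrase":
        return contains_phrase(terms, node[1])
    left, right = evaluate(node[1], terms), evaluate(node[2], terms)
    return (left and right) if node[0] == "and" else (left or right)


def matches(query, value):
    return evaluate(parse(lex(query)), tokenize(value))
```

With this sketch, matches("john | snow", "Jon Snow") is true because the snow term matches, while matches("angelina jolie", "Jolie, Angelina") is false because the two terms are not adjacent in that order.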


I like this. Seems much easier to parse than nested () or {}.

I’d prefer a solution that is more general purpose than just for string matching for a few operators. For me the whole reason this discussion came up was to be able to support geo-spatial queries in GraphQL.

The reasons I don’t like the current approaches:

  1. Supporting numeric comparisons

For example, suppose we want something like sci-fi movies released after 2003. It would be odd to encode the operator name either as part of the value or as part of the key, i.e.

"releaseYear >" : 2003
 releaseYear : >2003

To me these are both weird choices of syntax.
I’d prefer something along the lines of what @ashwin95r proposed such as (very gremlin like)

Filter: releaseYear.gt(2003)

or more just like a regular programming language:

releaseYear > 2003

  2. Not all operators have a single-character representation.

I gave the example of starts_with just as an illustration. For geospatial queries, I need to support operators such as near, geowithin etc.

  3. For the particular case of string matching, exact match should be the default. Term matching seems to me more like an added feature for the particular use case of text search. It should be a separate operator (e.g. contains). For the exact match case, we should not deviate from the GraphQL syntax. The currently understood GraphQL syntax of

name : "John"

means an exact match in GraphQL and we shouldn’t change the meaning of it or prefix it with ‘=’ operators on either the key or the value.

Hmm… The problem with the Gremlin-like syntax is that it’s tricky to tell the predicate name from the operator. name and releaseYear are simplistic examples. When using Freebase data, predicate names are a lot longer, and they all contain dots and such.

I think if we need to handle all 3 of these diverse cases, then maybe we need something along the lines of a dedicated filter operation. Here is a complex example which handles all 3 cases.

{
  me(_uid_: 0x01) {
    friends (first: 10) {
      filter {  // intersection between 2 conditions defined within.
        type.object.name.en (anyof: ["john", "snow"])
        born.on (ge: 1990, lt: 2000)
      }
      filter {  // results are union with above filter.
        child {
          _count_ (gt: 2)
        }
      }
      relatives {
        filter {
          home.geolocation (near: {lat: 12.43, lon: -53.211, rad: 10k})
        }
      }
    }
  }
}

@mrjn I like this syntax. Does the above implicitly do an AND within the same filter and an OR between two filter clauses? I.e., would I interpret your query as:

(((name is any of john or snow) AND (born between 1990 and 2000)) OR (has more than 2 children)) AND (relatives home is near 10km of the given location)

So to put this syntax in context of the above examples, would my second example (finding all musical dramas where the actor was Angelina Jolie.) be rewritten as below?

{
    film.actor {
        filter {
            type.object.name.en (eq: "Angelina Jolie")
        }
        film.actor.film {
            film.performance.film {
                film.film.genre {
                    filter {
                        type.object.name.en (eq: "Musical Drama")
                    }
                }
                type.object.name.en
            }
        }
    }
}

Or am I misunderstanding the syntax?

This part is correct. Also, we only take the first 10 such results. But relatives is just an edge out from the results of friends. So, you pick the first 10 results with the above filter, then find their relatives who live within 10km of the given geolocation.
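
The combination rule can be sketched in a few lines of Python. This is only an illustration of the semantics as I understand them: the passes and select functions and the sample friend records are made up for the example.

```python
def passes(node, filters):
    """OR across filter blocks, AND across the conditions inside one block."""
    if not filters:
        return True  # no filter blocks means nothing is filtered out
    return any(all(cond(node) for cond in block) for block in filters)


def select(nodes, filters, first=None):
    """Apply the filters first, then keep only the first N results."""
    kept = [n for n in nodes if passes(n, filters)]
    return kept if first is None else kept[:first]


# Made-up records standing in for the friends edge.
friends = [
    {"name": "john", "born": 1995, "children": 0},
    {"name": "snow", "born": 1985, "children": 3},
    {"name": "arya", "born": 1992, "children": 1},
]

filters = [
    # filter block 1: name is any of john/snow AND born in [1990, 2000)
    [lambda n: n["name"] in ("john", "snow"),
     lambda n: 1990 <= n["born"] < 2000],
    # filter block 2: more than 2 children
    [lambda n: n["children"] > 2],
]

names = [n["name"] for n in select(friends, filters, first=10)]
```

Here john passes the first block and snow passes the second, so both are kept, while arya fails both blocks. The relatives filter would then be applied separately to the edges going out of these results.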

I think yours is a bit tricky, because you aren’t starting with any node, but directly with string matching. It might be something like this:

{
  filter {
    type.object.name.en (eq: "Angelina Jolie")
  }
  film.actor.film {
    film.performance.film {
      film.film.genre {
        filter {
          type.object.name.en (eq: "Musical Drama")
        }
        type.object.name.en
      }
    }
  }
}

Note that casing would be ignored for string matches. Also, I think we can still use the query syntax I proposed above for string matching.

"(angelina & brad) | (jolie & pitt)"

I think this syntax is pretty powerful. This can also be used to do “angelina jolie”, where these two otherwise separate terms would be considered as one term, because they don’t have an operator in between.

Using

filter {
 Some.condition
}

would violate the GraphQL spec’s requirement that the response should have the same pattern as the query. So that may not work well for us.


If that is the case, then we cannot easily express the query in example 2 (finding all musical dramas where the actor was Angelina Jolie).

Because in that query, film.genre is just an edge, and we just select all edges which have the name “Musical Drama”. But this does not affect the list of films returned (many of them will just not return the genre edge).

We need a way to limit the top level results based on the values in related vertices. Both Cypher and Gremlin support that.
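
The difference between the two behaviours can be shown with a short Python sketch. The film records and both function names are placeholders invented for this illustration.

```python
# Placeholder data: two films with their genre edges.
films = [
    {"name": "Film A", "genres": ["Musical Drama", "Romance"]},
    {"name": "Film B", "genres": ["Action"]},
]


def filter_edges(films, genre):
    """Edge-level filter: every film is kept, non-matching genre edges are dropped."""
    return [
        {"name": f["name"], "genres": [g for g in f["genres"] if g == genre]}
        for f in films
    ]


def filter_parents(films, genre):
    """Pruning filter: a film is kept only if at least one genre edge matches."""
    return [f for f in filter_edges(films, genre) if f["genres"]]


edge_only = [f["name"] for f in filter_edges(films, "Musical Drama")]
pruned = [f["name"] for f in filter_parents(films, "Musical Drama")]
```

With the edge-level filter both films come back and Film B simply has no genre edge; only the pruning version answers the question in example 2.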