[RFC] TF-IDF scoring for fulltext search in Dgraph

Motivation

Dgraph already supports fulltext search via the DQL function anyoftext, but it does not currently expose a score to the user. It would be very useful for anyone building search functionality on top of Dgraph to have some metric to measure relevance of each result.

Instead of vanilla TF-IDF, we will use a variant called BM25, currently employed by Lucene (and, by extension, Solr and ElasticSearch), which addresses the shortcomings of TF-IDF.

User Impact

The user would be able to access the score of a result like so:

{
  movie(func:alloftext(name@en, "the dog which barks")) {
    name@en
    score: dgraph.score
  }
}

The proposed predicate dgraph.score would be consistent with the existing reserved predicate dgraph.type.

Ideally, the user should also be able to filter and sort by dgraph.score.

Architecture

  • Dgraph already utilizes Bleve for tokenizing, multi-language stemming and stop-word removal.
  • Computing TF-IDF itself would require some additional indexing.

Assume the following documents:

  • doc1: “Rain, rain, go away!”
  • doc2: “She hates rain.”

The values we need are:

  1. N: how many documents are there in total?
    • Dgraph’s count index already does this
    • docCount = 2
  2. n: for a term t, how many documents contain t?
    • Dgraph’s current fulltext index can give us the list of documents with t, we can just use the length of it
    • [away:1 go:1 hate:1 rain:2 she:1]
  3. freq: for a term t and document d, how many times does t occur in d?
    • This needs a new index.
    • [doc1:[away:1 go:1 rain:2] doc2:[hate:1 rain:1 she:1]]
  4. dl: for a document d, how many terms are present in d?
    • This needs a new index.
    • [doc1:4 doc2:3]
  5. avgDl: the average dl over all documents.
    • This needs a new index. We can use avgDl = totalDl / docCount.
    • totalDl = 7

Further Reading

5 Likes
Doc 1: "Rain, rain, go away!"

Tokenize Doc 1 for full text gives

rain->Doc1[weight=2] away->Doc1[weight=1] go->Doc1[weight=1]

This is for TF

For IDF, we can use the length of the posting list for the index key for a particular term

Would this also support scoring for multiple fulltext searches in the same level?

{
  movie(func:type(movie)) @filter(alloftext(name@en, "the dog which barks") OR alloftext(description, "the dog which barks")) {
    name@en
    score: dgraph.score
  }
}

@amaster507 you’re right, a global dgraph.score within a query block would not support multiple alloftext queries. Perhaps this should be made available as a Facet instead.

It is possible for multiple filters to be on the same field as well, so even a facet would have limitations…

ps: some guy called Chewxy wrote a bunch of highly performant libraries for these:

2 Likes

@amaster507 here’s a solution I’ve worked out. This is a sample of what a DQL query would look like, compare this with an ElasticSearch Boolean query:

{
  # must(Princess): must match, contributes to score
  var(func: alloftext(name@en, "Princess")) {
    scores1 as score()
  }

  # should(Bride): may or may not match, contributes to score
  var(func: alloftext(name@en, "Bride")) {
    scores2 as score()
  }

  # filter(The): must match, doesn't contribute to score
  # must_not(Terminator) must not match, doesn't contribute to score
  var(func: uid(scores1, scores2)) @filter(alloftext("The") AND NOT anyoftext("Terminator")) {
    scores as math(max(scores1, 0) + max(scores2, 0))
  }

  # sort and display the results
  search(func: uid(scores), orderdesc: val(scores)) {
    name@en
    roles: val(scores)
  }
}

TL;DR:

  • If you use fulltext search as a func, it will give you a score
  • If you use fulltext search as a @filter, it will not give a score
  • You can use value variables to combine / filter / sort scores.
1 Like