[RFC] TF-IDF scoring for fulltext search in Dgraph

ajeet · May 10, 2021, 11:31am

Motivation

Dgraph already supports fulltext search via the DQL functions alloftext and anyoftext, but it does not currently expose a relevance score to the user. Anyone building search functionality on top of Dgraph would benefit from a metric that measures the relevance of each result.

Instead of vanilla TF-IDF, we will use a variant called BM25, currently employed by Lucene (and, by extension, Solr and Elasticsearch). BM25 addresses two shortcomings of TF-IDF: it saturates the contribution of repeated terms, and it normalizes for document length.
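
For reference, a commonly cited form of BM25 is sketched below. The symbol names are chosen here for illustration: N is the total number of documents, n_t the number of documents containing term t, freq_{t,d} the number of occurrences of t in document d, dl_d the length of d in terms, and avgDl the average document length. k_1 and b are tuning parameters (1.2 and 0.75 are typical defaults). The idf shown is the non-negative variant used by Lucene; constant factors differ between implementations, so treat this as a sketch rather than the exact formula Dgraph would adopt.

```latex
\mathrm{score}(d, q) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{\mathit{freq}_{t,d} \cdot (k_1 + 1)}
       {\mathit{freq}_{t,d} + k_1 \cdot \left(1 - b + b \cdot \frac{\mathit{dl}_d}{\mathit{avgDl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```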

User Impact

The user would be able to access the score of a result like so:

{
  movie(func:alloftext(name@en, "the dog which barks")) {
    name@en
    score: dgraph.score
  }
}

The proposed predicate dgraph.score would be consistent with the existing reserved predicate dgraph.type.

Ideally, the user should also be able to filter and sort by dgraph.score.

Architecture

Dgraph already utilizes Bleve for tokenizing, multi-language stemming and stop-word removal.
Computing TF-IDF itself would require some additional indexing.

Assume the following documents:

doc1: “Rain, rain, go away!”
doc2: “She hates rain.”

The values we need are:

N: how many documents are there in total?
- Dgraph’s count index already does this
- docCount = 2
n: for a term t, how many documents contain t?
- Dgraph’s current fulltext index already gives us the list of documents containing t; n is simply the length of that list
- [away:1 go:1 hate:1 rain:2 she:1]
freq: for a term t and document d, how many times does t occur in d?
- This needs a new index.
- [doc1:[away:1 go:1 rain:2] doc2:[hate:1 rain:1 she:1]]
dl: for a document d, how many terms (counting repeats, after stop-word removal) are present in d?
- This needs a new index.
- [doc1:4 doc2:3]
avgDl: the average dl over all documents.
- This needs a new index. We can use avgDl = totalDl / docCount.
- totalDl = 7, so avgDl = 7 / 2 = 3.5
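
To make these values concrete, here is a minimal Python sketch that derives them from the two example documents and plugs them into a BM25 formula. It assumes the documents are already tokenized and stemmed (in Dgraph that would be Bleve's job); the hard-coded token lists, the `bm25` helper, and the defaults k1 = 1.2 and b = 0.75 are illustrative assumptions, not part of the proposal.

```python
import math
from collections import Counter

# Assumption: tokens as they would look after tokenizing, stemming and
# stop-word removal, matching the example above.
docs = {
    "doc1": ["rain", "rain", "go", "away"],
    "doc2": ["she", "hate", "rain"],
}

# freq: per-document term counts (needs a new index).
freq = {d: Counter(terms) for d, terms in docs.items()}

# dl: per-document length, counting repeated terms (needs a new index).
dl = {d: len(terms) for d, terms in docs.items()}

# N (docCount), totalDl and avgDl.
doc_count = len(docs)
total_dl = sum(dl.values())
avg_dl = total_dl / doc_count

# n: number of documents containing each term (length of the posting list).
n = Counter(t for counts in freq.values() for t in counts)


def bm25(term, doc, k1=1.2, b=0.75):
    """BM25 score of a single term for a single document (illustrative)."""
    f = freq[doc][term]  # Counter returns 0 for absent terms
    idf = math.log(1 + (doc_count - n[term] + 0.5) / (n[term] + 0.5))
    norm = f + k1 * (1 - b + b * dl[doc] / avg_dl)
    return idf * f * (k1 + 1) / norm


for d in docs:
    print(d, bm25("rain", d))
```

With these inputs, doc1 scores higher than doc2 for "rain": it contains the term twice, and that outweighs the penalty for being the longer document.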

{
  # must(Princess): must match, contributes to score
  var(func: alloftext(name@en, "Princess")) {
    scores1 as score()
  }

  # should(Bride): may or may not match, contributes to score
  var(func: alloftext(name@en, "Bride")) {
    scores2 as score()
  }

  # filter(The): must match, doesn't contribute to score
  # must_not(Terminator): must not match, doesn't contribute to score
  var(func: uid(scores1, scores2)) @filter(alloftext(name@en, "The") AND NOT anyoftext(name@en, "Terminator")) {
    scores as math(max(scores1, 0) + max(scores2, 0))
  }

  # sort and display the results
  search(func: uid(scores), orderdesc: val(scores)) {
    name@en
    score: val(scores)
  }
}

TL;DR:

If you use fulltext search as a func, it will give you a score.
If you use fulltext search as a @filter, it will not give you a score.
You can use value variables to combine / filter / sort scores.

