Feature request: full text search with tf-idf Scoring

Moved from GitHub dgraph/3211

Posted by pjebs:

Bleve supports tf-idf Scoring.

It would be great if there was a function that could filter based on scoring by relevance. The function would accept an argument where we set our own threshold value.

1 Like

pjebs commented :

Bleve also supports fuzzy search which would be great.

srfrog commented :

Bleve also supports fuzzy search which would be great.

We evaluated Bleve’s fuzzy search but it didn’t fit with our data model so we made our own. The master branch has fuzzy matching, it will be available in a future release.

niravmehta commented :

We want to build an auto-complete type of search and need to search across multiple predicates for a match, and then show results based on some ranking.

How can we perform full text search in a given set of predicates? And return results based on some scoring?

quanghaisoft commented :

How can we perform full text search in a given set of predicates? And return results based on some scoring?

MichelDiz commented :

This feature does not currently exist. There is not much to do. But I have an idea that can help you. But you will have to use Facets.

The idea is to map the text by your application and then create a word structure. So you can track its usage. The only problem is that if you need the words to be unique in the system, you will need to manually add the number of repetitions. Inside the facets by the Map_text_body edge. Otherwise you would create several nodes of equal words. You can do it, but I think it would be kind of messy.

Using this structure you can create a system that analyzes words. You can identify texts that use a specific word thus identifying similarities. Let’s say several texts contain the word “Dgraph”. You can search for the word “Dgraph” and use @ reverse to identify which texts use it and also to collect values in facets.

An auto-complete system would be easily built. For you would just collect the mapped words in “Map_text_body”. And you can also rank using values on facets.

Here is a map of the structure itself. (Remember, this is just an idea for now)

You have the node with the text body and you have an edge that will contain the words mapped to nodes. Each word is a node.

Below I explain a little better.

The Schema

Map_text_body: [uid] @reverse . #pay attention, this [uid] syntax is just for master or Dgraph 1.1
# Map_text_body: uid @reverse . #Use this for oldest versions.
name: string @index(exact) .
Text_body: string .

Mutation

{
  set {
    _:Blank <name> "My Text Node" .
    _:Blank <Text_body> "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ..." .

   #The mapping
    _:Blank <Map_text_body> _:Sir (times=1) .
    _:Blank <Map_text_body> _:heart (times=1) .
    _:Blank <Map_text_body> _:in_ (times=2) .
    _:Blank <Map_text_body> _:there (times=1) .
    _:Blank <Map_text_body> _:was (times=1).
    _:Blank <Map_text_body> _:my (times=1).
    
   #Creating the mapped words
    _:Sir 		<name> "Sir" .
    _:heart		<name> "heart" .
    _:in_ 		<name> "in" .
    _:there 	        <name> "there" .
    _:was 		<name> "was" .
    _:my 		<name> "my" .

    # Please add more words based on the "Text_body"
  }
}

Query

{
  q(func: eq(name,"My Text Node")) {
    uid
    Text_body
    Map_text_body @facets(orderdesc: times) {
      uid
      name
    }
  }
}

Response

{
  "data": {
    "q": [
      {
        "uid": "0x2",
        "Text_body": "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ...",
        "Map_text_body": [
          {
            "uid": "0x5",
            "name": "in",
            "Map_text_body|times": 2
          },
          {
            "uid": "0x3",
            "name": "Sir",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x4",
            "name": "heart",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x6",
            "name": "there",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x7",
            "name": "was",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x8",
            "name": "my",
            "Map_text_body|times": 1
          }
        ]
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 15859,
      "processing_ns": 1857051,
      "encoding_ns": 598937
    },
    "txn": {
      "start_ts": 25
    }
  }
}

Using Facets you can do various functions in Dgraph. Like assign Facet values ​​to a variable, Sorting, Aggregation, Filtering and so on.

BTW

You will soon be able to use Upsert Block, this will make this process faster and more automatic.

But of course, you will have to map and create the block (write the query) manually or by some process that you have created in your application.

e.g:

 upsert {
      query {
		#Check if the text already exist
        var(func: eq(name,"My Text Node")) {
          v_Text as uid
        }

		#Check if any words already exist
        var(func: has(type_mapped_word)) @filter(eq(name,"Sir")) {
          v_Sir as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"heart")) {
          v_heart as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"in")) {
          v_in as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"there")) {
          v_there as uid
        }
    # You're free to define the form that will categorize the type.
    # Here I am using "type_mapped_word".

      }
      mutation {
          set {
           uid(v_Text) <name> "My Text Node" .
           uid(v_Text) <Text_body> "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ..." .
           uid(v_Text) <Map_text_body>  uid(v_Sir) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_heart) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_in) (times=2) .
           uid(v_Text) <Map_text_body> uid(v_there) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_was) (times=1).
           uid(v_Text) <Map_text_body> uid(v_my) (times=1).
    
           uid(v_Sir) 		<name> "Sir" .
           uid(v_Sir)) 		<type_mapped_word> "" .

           uid(v_heart)		<name> "heart" .
           uid(v_heart)		<type_mapped_word> "" .

           uid(v_in) 		<name> "in" .
           uid(v_in) 		<type_mapped_word> "" .

           uid(v_there) 	<name> "there" .
           uid(v_there) 	<type_mapped_word> "" .

           uid(v_was) 		<name> "was" .
           uid(v_was) 		<type_mapped_word> "" .

           uid(v_my) 		<name> "my" .
           uid(v_my) 		<type_mapped_word> "" .
          }
      }
    }

niravmehta commented :

Interesting approach. If I understand this right, it’s basically maintaining our own index of search words. But using Facets will allow tf-idf scoring.

What we’re doing at the moment is to add a predicate - “search_index” to each node, and concatenate all searchable values into single field. Then searching within that index predicate.

This works as a simple search for now, but of course it does not do any scoring.

Of course, we’d love for this to be built-in :wink: Just the ability to search across multiple indexed predicates would work.

MichelDiz commented :

Yes, exactly, it is your own index of search words with tf-idf scoring.

I think it would be great to implement this as built-in. Since we use KV indexing, it would be easy to implement this. But it does require some work and thought.

@campoy , what do you think of this feature? Can you define exp, status, and priority flags?

campoy commented :

Thanks for the feature request, it does sound like something we could definitely add.

I’ll be working on our roadmap once 1.1 is released (late July) and will take this feature into account then.

We’d really love if the scoring was somehow able to be used to sort. I feel like it’d be useful for match, allofterms, and alloftext queries, though I imagine the implementation would be radically different between them. This seems like it’d be a pretty substantial feature request, but it would be incredibly useful!

6 Likes

I’d also find this very useful - could you possibly expose the score as some kind of metadata in the query result?

Having built a search engine powered by Slash GraphQL, I too concur that this is a very nice feature to have. I wouldn’t have to pull date off Slash and then process them locally on the machine.

Having said that, I would like to put this feature on the list for processing. And given my past experiences with tf-idf, I don’t mind having a go integrating this feature, but I would like to see where it fits in our roadmap first.

4 Likes