Is it possible to fuzzy search a tokenized string?

What I want to do

I would like to be able to use the fuzzy func to match tokens in a long string. For instance, suppose I have pred with “I like to eat spaghetti” value, and search text “spaghti”.

What I did

  • I could use a term index, and anyofterms(pred, "spaghti"). That would not match my example, because “spaghti” does not directly match any of the term tokens
  • I could use a trigram index, and fuzzy(pred, "spaghti", n). But since the fuzzy match is done against the entire string, the only way to get a match would be to use a prohibitively expensive levenshtein distance n
  • I could use a fulltext index, and anyoftext(pred, "spaghti"). That would not match my example, because stemming will not get from “spaghti” to the “spaghetti” token
  • I could use a trigram index and a regexp, but that won’t work because “spaghti” isn’t an exact substring

What I need is a way to do a fuzzy search against the tokenized string, because the levenshtein distance from “spaghti” to “spaghetti” is 2 which would be reasonable to compute. Is there any way to do this today? Which index(es) would I need, and which function(s) would I use?

I’m working around this limitation today by combining a fulltext index/anyoftext query with a trigram index/regexp query with a substring, but that doesn’t handle the case where the query closely matches one of the terms, but has a typo that makes the regexp match not exact.

1 Like

A trigram index is how regular expressions work on the backend. I believe the entire string should be trigram-ed (yes, I coined that term), not just the word. Have you tried just indexing it with regex (same thing as backwards-compantible trigram), and using match to search for "spaghti"?

https://dgraph.io/tour/search/3/

https://dgraph.io/docs/tutorial-7/

UPDATE - I was thinking backwards… per the link, I believe you need Levenshtein Distance with match.

J

Thanks for the reply! The links do highlight the two options available, regex and Levenshtein, via a trigram index. The problem is, neither of them will fully work for this use case.

If the triple stored simply spaghetti, then match(pred, "spaghti", 2) would work fine with a levenshtein distance of 2.

The problem is that the predicate stores i like to eat spaghetti. We would need match(pred, "spaghetti", 16) if I’m counting right, which in addition to our target predicate would also match any other string of length 16 or shorter by replacing every letter of "spaghti" and then adding 9 new letters.

What we need here is Levenshtein distance with tokenization. That would allow us to break down i like to eat spaghetti into ["i", "like", "to", "eat", "spaghetti"], and then a levenshtein distance of 2 would match the token "spaghetti" and match our target predicate.

Is there any current way to combine the tokenization of a fulltext index/query, with the fuzzy levenshtein matching of a trigram index with fuzzy query? This combination of tokenization with fuzzy search is what bleve, elastic, and other fulltext indices use for standard fulltext search against longer strings.

I think this is a Feature Request to Dgraph. Maybe someone from Dgraph can help with this?

In the meantime, you could split the words on the front end, and either combine a custom search in custom DQL, or somehow combine two queries in graphql (not sure if you can do that). I realize this is not exactly the same thing…

J

This really sounds like a cool feature request :slight_smile:

2 Likes

Thanks! I’ll be able to get most of what I need with regexp for now. I’m hoping to eventually drop an ES cache in favor of searching directly against dgraph, and I’ll keep an eye out for this feature. Along with the tf-idf scoring that’s already in the pipeline, it should allow me to do so once its ready.

Is there another process for feature requests I should go through?

I love this idea too, and it would be really useful in the things I’ll be developing.

Can I second this as a feature request? It might also be worth considering ignoring punctuation in the tokenization, and also potentially accents on letters too (e.g. ‘é’ would resolve to ‘e’ in the tokenization process, if it doesn’t already).

we have a new tokenizer which I wrote. Will be looking to integrate that soon (didn’t last quarter because I was busy with Dgraph Day)