Strategy for partial text search

jonas_levin · January 17, 2022, 3:30pm

I have created a schema for a Profile and want to create efficant indexes for partial text search. My goal is to return relevant profiles through a search on the username, firstname, or lastname.

For example with the query string jo, only Nodes should be returned which username, firstname, or lastname start with jo. Later maybe rank them on a score that takes into consideration the number of times the query is included.

After my research I think this is only possible through the trigram index but based on this answer it needs a regex of atleast 3 characters. Does that mean that a query string with only 2 characters would not work with the trigram index?
Or is there another strategy that would return a similar result, maybe just check if the username, firstname, or lastname contain the query characters?

My basic schema:

type Profile {
  id // will store reference to id from MongoDB
  username
  firstname
  lastname
  phone
  picture
}

id: string @index(exact) .
username: string @index(trigram)
firstname: string @index(trigram)
lastname: string @index(trigram)

amaster507 · January 17, 2022, 6:46pm

That is the current limitation. You cannot search for jo in a regex, but you should be able to search for jo. because that has a third character in it. Quite a bad limitation, but probably one of the more complicated functions in dgraph if I had to guess.

Also let it be known that regex searches are really slow in Dgraph compared to any other filter functions.

jonas_levin · January 17, 2022, 7:50pm

Yea regex searches are mostly slower. MongoDB’s regex search is also quite slow. I tried Neo4j before trying DGraph now and Neo4j had the STARTS WITH operator which was very useful for my case.
I probably wont try to mess with regex and just try to build a search with the term argument

amaster507 · January 17, 2022, 8:38pm

This was probably one of the biggest challenges in development for us. How do we do “global searches” that we are use to doing with MySQL like operator and wild cards across dynamically concatenated fields.

SELECT
  *
FROM
  user
  LEFT JOIN address ON ...
WHERE
  CONCAT_WS(user.username, user.first_name, user.last_name, address.city, address.state) LIKE `%jo%`

It is IMPOSSIBLE to do this in Dgraph.

jdgamble555 · January 17, 2022, 11:25pm

So that is impossible, and perhaps evokes a feature requests to concat fields, and run match on those results.

That being said, I don’t see why you couldn’t just run a match on each field returned as a var, and join the results.

It will probably be very slow, but still possible.

It would be cool if dql had an async opperation to run them at the same time, then join those results.

Either way, I would think it is still possible, but I do not dare claim to be a dql expert.

@amaster507 Thoughts?

J

amaster507 · January 18, 2022, 12:14am

You can’t do the %jo% pattern match. Regex must be three characters. So the limitation is multifaceted and deceptive here.

jdgamble555 · January 18, 2022, 1:39am

Sorry, I wasn’t responding to the original question, but your SQL equivalent. Obviously two letters can’t work with trigrams. The match would work otherwise though, correct?

J

amaster507 · January 18, 2022, 1:50am

it might look something like DQL:

query {
  block1 as var(func: regexp(User.username, /foo/i))
  block2 as var(func: regexp(User.first_name, /foo/i))
  block3 as var(func: regexp(User.last_name, /foo/i))
  var(func: regexp(City.name)) {
    City.hasAddresses {
      block4 as Address.forUser
    }
  }
  searchUser(func: uid(block1, block2, block3, block4)) {
    id: uid
    username: User.username
    first_name: User.first_name
    # more fields/edges
  }
}

But again, very slow. I removed most places where I allowed regexp indexes in my app by the user just because how poorly they performed.

But where this fails is when you want to search for full names like: “oo ba” should match user with first name “foo” and last name “bar”, but you simply can’t do that without then splitting every string on spaces and do even more blocks, that gets really complex really quick and is not worth the effort for the poor query performance in the end.

But to give Dgraph credit,

That is what the above Dgraph query does, just not the same equivalence of concatenating a string and searching across that concatenated string. Even for the same size of data in MySQL to Dgraph, MySQL handled this without any noticable query performance degredation while Dgraph was very poor performance with any regexp query from my experience. I use it now only for administration and restrict if from users.

jdgamble555 · January 18, 2022, 2:22am

Thanks for the example!

Yeah, that is what I have been talking about where you can’t do fuzzy phrase searches. I don’t even know if what you just described is possible, as it would guarantee the individual words, but not necessarily in the desired order.

I wonder if it is possible to speed this up to equal or better performance of the SQL equivalent.

J

amaster507 · January 18, 2022, 2:29am

exactly and no way to do ranking of results like I did before with rdbms.

Topic		Replies	Views
Strange regexp behavior Users	7	472	April 9, 2019
Dgraph search query returns inconsistent results after upgrade to dgraph:v1.1.0 Users	1	437	October 18, 2019
Question about String indexing GraphQL	4	459	July 2, 2020
@regexp vs @trigram Dgraph Cloud graphql , kind:question , slash-graphql	7	1036	June 9, 2021
What is the equivalent of allofterms for a trigram search? Dgraph kind:question	4	366	August 1, 2021

Strategy for partial text search

My basic schema:

Related topics