Fulltext search across multiple fields

amaster507 · May 28, 2021, 4:01pm

Experience Report for Feature Request

What you wanted to do

Fulltext search across multiple fileds

What you actually did

Nothing, yet, as it is not possible.

Why that wasn’t great, with examples

Alternatively we are contemplating creating duplicate data fields that hold all the strings concatenated together, but then we break the single source of truth and would rely on other mechanisms to keep this field in sync with the others.

Any external references to support your case

Use to being able to do logic within a filter in MySQL do filter on a concatenated field.

amaster507 · May 28, 2021, 7:00pm

I will expand on this to maybe start another think tank sessions after just spending 3 hours on the phone trying to work through this.

Context

In our knowledge graph we access control data to public, users, groups, and group access rights. This means that for each piece of data we have an access control list defining who can access the data (user/group) and then for each user in a group they have rights that define what group data they are allowed to access. Here is part of our schema for this without any auth rules/directives for some simplification:

GraphQL Schema

type ACL {
  id: ID!
  grants: Contact
}
type Contact {
  id: ID!
  isPublic: Boolena
  access: [ACL]
  firstName: String
  lastName: String
  organization: String
  addresses: [Address]
  phones: [Phones]
  notes: [Notes]
  isUser: User
  isGroup: Group
  hasMemberships: [Member]
}
type User {
  id: ID!
  username: String
  isContact: Contact
}
type Group {
  id: ID
  isContact: Contact
  members: [Member]
}
type Member {
  id: ID
  group: Group
  member: Contact
  accessRights: [AccessRight] #enum not shown here
}
type Address {
  id: ID!
  isPublic: Boolean
  access: [ACL]
  line1: String
  city: String # just a string for simplicity here
  state: String
  forContacts: [Contact]
}
type Phone {
  id: ID!
  isPublic: Boolean
  access: [ACL]
  number: String
  forContacts: [Contact]
}
type Note {
  id: ID!
  isPublic: Boolean
  date: DateTime
  content: String
  forContact: Contact
}

This schema works beautifully with auth rules to query down to the data that any specific user is able to access. But now we want to enable a “global search” for a specific data type. Given a search string, find all contacts that best match that search string looking at the contact’s name, address, phone number, or note content. (We have identified in our real schema 12 nested fields right now we want to use).

Normally the best way to do this would build an index across all of these fields and search that index looking for the best matches, this would mean we could run a query at a single field something like: Contact.indexedData that is a string with all the terms that is related to that Contact and could easily get results, but that doesn’t work to honor access control rules for the index. Talking through this we would have to have multiple indexes for every contact and then access control those indexes for which users, and groups with access rights would be allowed to see the index. And even worse these indexes would have to be rebuilt/updated on a recurring schedule. I calculated some data and we would end up with roughly 4.8 Billion nodes of just indexes that contain a lot of duplicate data.

So since we can’t use any kind of prebuilt index, we need to run a query (or multiple queries) to find the Contacts using a search term across multiple nested fields. Then we need to find which contacts appears the most in these queries and create a ordered hashmap and slice that to get the top 5-10 results and query those again to display them from the search.

This sounds (and is becoming) very complicated for something that should be a really fast and light global search across all contacts. Started down this rabbit whole with some pseudo code and logically concluded that we would need to run 12*(words in search string > 2 characters) to be able to do a partial word match. One instance is searching for “Home Jack”, should return contacts whose organization name is “Home Improvement” (full text search) or contacts having an address with the city of “Jacksonville” (regex partial word match). In the business logic layer we will then take these queries and flatten them down to a list of contact ids. With these lists of contact ids we will then find the ones that appear the most often across all the queries. Then with those top 5-10 ids, we will run another query to fetch the actual data to display.

The problem I am looking at now is the efficiency of running 12-48+ queries in a single query operation for just a “simple” search. I understand these queries will run in parallel, but I am concerned thinking about the load it will place on the database.

I have not implemented this solution yet to even try to see how it might perform, I guess my next step is just to handwrite these 12-48 queries that might get generated for a single search and see how it performs or if it comes crashing down.

If anyone has any ideas how to efficiently create such a “global search” across multiple nested fields of a type that honors access control rules, please chime in. I feel like I have been tasked with this simple search form that in essence requires a custom in house silicon valley sized search application. Dgraph, provides a lot of great features, but this is something where I am just stuck racking my head thinking there isn’t a good solution.

If I was in a sql system I would form a query that just got every field and then build a temp table to perform a search over, but then I would still have to handle auth rules somehow.

I think I am stuck on this with either a solution that does not search over very many fields and only fields at the root level or a seemingly costly search that might get the results wanted but not within the timeframe needed.

Juri · November 6, 2021, 1:17am

No new update on this? this feature is really important

pshaddel · November 6, 2021, 8:33am

Hello @amaster507 .
We had the same issue and even with running these queries at the same time it was not so efficient. We gave it a try and used elastic search experimentally and the performance on full text search massively improved and at the same time we were able to search all fields we want.
Also I should allude to complexities that might come to you, like syncing data of elastic search with dgraph.

Another point which I should highlight is that when you add nodes to elastic search your query time decreases. Imagine you have 1M users and you want to do full-text search, If you have one node then the query time is 10s. You can decrease that to 5s by adding an exact similar node. As far as I’m concerned this is not possible with dgraph because they shard predicates.

Juri · November 7, 2021, 12:37am

try out typesense, it’s a cheaper open source competitor to elasticsearch

jdgamble555 · November 7, 2021, 5:03am

I would definitely stay away from elastic search, as there are many versions, forks, and many hidden costs. Algolia is better, but still can be costly.

If you must use an external search, I would look into Typesense or MeiliSearch. With your dataset you might even look into exporting to Google Big Query, which can be fast.

However, I don’t see why you can’t do a custom query in DQL. First run your rules inputting the fields you want to search, then query your 3 different fields using vars. You would only be searching for each results, not the entire thing. Of course, that is an “AND” query. I know you probably want every possible option.

I am not understanding your worry for performing a search. If you can perform a regular full-text search, I don’t see why you would have issues with a few extra fields (indexes help, not burden).

In other words, you shouldn’t have any more worries than you normally do while dealing with any operation in a database that large in the first place.

Just my opinion, but I am definitely a programmer not a server guy.

J

amaster507 · November 7, 2021, 12:04pm

What we ended up building instead of a global search is a very advance UI giving end users the ability to do deep searches of data in specific fields. We then kept a “global search” that is simple but not as powerful for finding data from the most common fields.

For instance instead of using the “global search” to search contacts by address->city/state we put this specific option in the filter builder and reserve the “global search” for the common fields like firstName, lastName, orgName. Users can build these custom filters save them and then toggle multiple custom filters on and off together. An idea I want to take this even further is the ability to add modifiers to saved filters like placeholders, so a user creates a saved filter for a specific string across multiple fields joined with different logic can then use a unique search string for each use of that custom filter instead of needing to duplicate the filter and change the value in every field.

dtrckd · January 11, 2023, 2:51pm

Hi there,

I don’t see this feature mentioned in the Dgraph’s roadmap Feature Requests - PM RoadMap · GitHub

Is there any chance of this feature will be considered ? This seems like something that could greatly improve the relevance and greatness of the Dgraph search function.

Topic		Replies	Views
RFC: Allow multiple unique fields in graphql schema GraphQL	11	2712	March 22, 2021
Composite @id fields GraphQL kind:enhancement , status:accepted , ticket:created	9	2765	February 2, 2022
Multiconditional query GraphQL kind:question	5	463	October 21, 2020
GraphQL: Connected filter on non-scalar list element fields GraphQL	10	1768	April 21, 2021
There can be only one input field named * GraphQL docs , status:accepted , kind:bug , ticket:created	8	3242	December 10, 2020