I will expand on this to maybe start another think tank sessions after just spending 3 hours on the phone trying to work through this.
Context
In our knowledge graph we access control data to public, users, groups, and group access rights. This means that for each piece of data we have an access control list defining who can access the data (user/group) and then for each user in a group they have rights that define what group data they are allowed to access. Here is part of our schema for this without any auth rules/directives for some simplification:
GraphQL Schema
type ACL {
id: ID!
grants: Contact
}
type Contact {
id: ID!
isPublic: Boolena
access: [ACL]
firstName: String
lastName: String
organization: String
addresses: [Address]
phones: [Phones]
notes: [Notes]
isUser: User
isGroup: Group
hasMemberships: [Member]
}
type User {
id: ID!
username: String
isContact: Contact
}
type Group {
id: ID
isContact: Contact
members: [Member]
}
type Member {
id: ID
group: Group
member: Contact
accessRights: [AccessRight] #enum not shown here
}
type Address {
id: ID!
isPublic: Boolean
access: [ACL]
line1: String
city: String # just a string for simplicity here
state: String
forContacts: [Contact]
}
type Phone {
id: ID!
isPublic: Boolean
access: [ACL]
number: String
forContacts: [Contact]
}
type Note {
id: ID!
isPublic: Boolean
date: DateTime
content: String
forContact: Contact
}
This schema works beautifully with auth rules to query down to the data that any specific user is able to access. But now we want to enable a “global search” for a specific data type. Given a search string, find all contacts that best match that search string looking at the contact’s name, address, phone number, or note content. (We have identified in our real schema 12 nested fields right now we want to use).
Normally the best way to do this would build an index across all of these fields and search that index looking for the best matches, this would mean we could run a query at a single field something like: Contact.indexedData
that is a string with all the terms that is related to that Contact and could easily get results, but that doesn’t work to honor access control rules for the index. Talking through this we would have to have multiple indexes for every contact and then access control those indexes for which users, and groups with access rights would be allowed to see the index. And even worse these indexes would have to be rebuilt/updated on a recurring schedule. I calculated some data and we would end up with roughly 4.8 Billion nodes of just indexes that contain a lot of duplicate data.
So since we can’t use any kind of prebuilt index, we need to run a query (or multiple queries) to find the Contacts using a search term across multiple nested fields. Then we need to find which contacts appears the most in these queries and create a ordered hashmap and slice that to get the top 5-10 results and query those again to display them from the search.
This sounds (and is becoming) very complicated for something that should be a really fast and light global search across all contacts. Started down this rabbit whole with some pseudo code and logically concluded that we would need to run 12*(words in search string > 2 characters) to be able to do a partial word match. One instance is searching for “Home Jack”, should return contacts whose organization name is “Home Improvement” (full text search) or contacts having an address with the city of “Jacksonville” (regex partial word match). In the business logic layer we will then take these queries and flatten them down to a list of contact ids. With these lists of contact ids we will then find the ones that appear the most often across all the queries. Then with those top 5-10 ids, we will run another query to fetch the actual data to display.
The problem I am looking at now is the efficiency of running 12-48+ queries in a single query operation for just a “simple” search. I understand these queries will run in parallel, but I am concerned thinking about the load it will place on the database.
I have not implemented this solution yet to even try to see how it might perform, I guess my next step is just to handwrite these 12-48 queries that might get generated for a single search and see how it performs or if it comes crashing down.
If anyone has any ideas how to efficiently create such a “global search” across multiple nested fields of a type that honors access control rules, please chime in. I feel like I have been tasked with this simple search form that in essence requires a custom in house silicon valley sized search application. Dgraph, provides a lot of great features, but this is something where I am just stuck racking my head thinking there isn’t a good solution.
If I was in a sql system I would form a query that just got every field and then build a temp table to perform a search over, but then I would still have to handle auth rules somehow.
I think I am stuck on this with either a solution that does not search over very many fields and only fields at the root level or a seemingly costly search that might get the results wanted but not within the timeframe needed.