As we implement filtering and sorting, we might add more entries to task.Query.
Initially, task.Query contained little more than attr and uids (as ints). Recently, we added terms, which is for exact match. I’m looking to add partial match and sorting. Those would require different keys for RocksDB lookups, possibly with different prefixes. We can keep adding more fields to task.Query, but that doesn’t seem like the most elegant design.
I wonder if task.Query can instead contain one prefix and one suffix array. They would replace attr, uids, and terms, and would also work for the other kinds of keys I am trying to add. (We no longer keep task.Query in subgraph, so no worries about getting rid of attr.) The lookup keys would be prefix + suffix, for each suffix in the array, as you’d expect. There can be multiple variants of createTaskQuery that create the right task.Query.
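A minimal sketch of this proposal, assuming a simplified struct (the field and method names here are hypothetical, not the actual task.Query definition):

```go
package main

import "fmt"

// Query is a hypothetical sketch of the proposed shape: one prefix plus a
// suffix array, replacing attr, uids, and terms.
type Query struct {
	Prefix   []byte
	Suffixes [][]byte
}

// keys builds the RocksDB lookup keys: prefix + suffix, for each suffix.
func (q Query) keys() [][]byte {
	out := make([][]byte, 0, len(q.Suffixes))
	for _, s := range q.Suffixes {
		k := append(append([]byte{}, q.Prefix...), s...)
		out = append(out, k)
	}
	return out
}

func main() {
	q := Query{
		Prefix:   []byte("name|"),
		Suffixes: [][]byte{[]byte("hello"), []byte("world")},
	}
	for _, k := range q.keys() {
		fmt.Println(string(k)) // name|hello, then name|world
	}
}
```

A variant of createTaskQuery per index type would just pick the right prefix and suffixes.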
Another alternative is to keep prefix, attr and one suffix array and lookup keys would be prefix + attr + suffix. This might be a safer change as there might be some code that needs attr. Let me check.
So, regarding the terms, they could potentially be located on a different server, served by a different RAFT group. For example, if we have multiple geo-location predicates, they’d still get indexed in the same _loc_ predicate. Thus, the queries to these geo-location predicates won’t necessarily be served by the same server.
I could imagine the same issue in string indexing as well. We might have name and address predicates all indexing to the same _term_ predicate. So we can no longer assume that the same server will be serving both the terms and the names.
Does that make sense? How would task.Query look given this? Technically, all you need is attr in the query, and then either a uid list, or a list of suffix keys. I think the prefix isn’t required. I don’t think we need to convert a uid list to suffix keys. We can use suffix keys if needed, when we are not dealing with uids.
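A sketch of the query shape this suggests, assuming a simplified struct with hypothetical names: just the attribute, plus either a uid list or a list of suffix keys, and no stored prefix.

```go
package main

import "fmt"

// Query sketch for this suggestion (hypothetical names): attr plus either a
// uid list or a list of suffix keys; no prefix field is needed.
type Query struct {
	Attr     string
	UIDs     []uint64 // set for uid-based lookups
	Suffixes []string // set for index lookups, when not dealing with uids
}

// keyFor builds a lookup key from attr and one suffix; the attr alone is
// enough to disambiguate, so no extra prefix is stored in the query.
func keyFor(attr, suffix string) string {
	return attr + "|" + suffix
}

func main() {
	q := Query{Attr: "name", Suffixes: []string{"barack", "obama"}}
	for _, s := range q.Suffixes {
		fmt.Println(keyFor(q.Attr, s))
	}
}
```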
For the generic indexing system, we might not be able to assume that. We need to have it so that each worker would generate the corresponding indexing data, and then shoot it off to be written to the right group, depending upon the generated attribute.
For generic indexing, predicate will be _term_ or _loc_ and the indexing data will be located on whichever machine is handling that predicate. As a worker processes a mutation, it will need to shoot off additional RPC calls to update the index.
The code will need to be modified to support this.
Do we want to support non-generic indexing? If so, we will need something to distinguish between the different kinds of indexing data, e.g., between exact and partial match. Don’t we need something like a prefix in task.Query?
Actually, by default that would happen anyway. If the server is handling that particular group which is corresponding to that indexing predicate, it would automatically just run it on the same machine.
I don’t see the point of prefix. I think attribute is what is needed.
I might repeat stuff that you know. But please bear with me. Just want to make sure that our terminology is the same.
By “non-generic” indexing, I mean indexing restricted to one predicate. For example, if I want to index the “nickname” predicate, then the key for the posting list will contain “nickname”. Is this consistent with your interpretation of a “non-generic” index?
Fix an indexed predicate. There might be multiple indices. There can be exact match and there can be partial match. Say we have a value like “hello world”. The exact match index will receive “hello world” → UID while the partial match index will receive “hello” → UID and “world” → UID. Now, there can be another value which is just “hello”. In that case, both exact match and partial index will receive “hello” → UID but the keys have to be different to distinguish between the different indices.
For the exact match index, the key might be “:predicate|hello” and for the partial match index, the key might be “$predicate|hello”. That is why I am asking whether we need to store “:” or “$” in task.Query.
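The key scheme from this example can be sketched as follows (the “:”/“$” markers are the hypothetical convention described above, not an existing format):

```go
package main

import "fmt"

// Hypothetical key scheme from the example above: ':' marks the exact-match
// index and '$' the partial-match index, so the same token "hello" maps to
// two distinct keys depending on which index it belongs to.
func exactKey(pred, token string) string   { return ":" + pred + "|" + token }
func partialKey(pred, token string) string { return "$" + pred + "|" + token }

func main() {
	// Value "hello world": one exact-match entry for the whole value,
	// one partial-match entry per word.
	fmt.Println(exactKey("name", "hello world"))
	fmt.Println(partialKey("name", "hello"))
	fmt.Println(partialKey("name", "world"))

	// Value "hello": both indices receive the token "hello", but the
	// keys still differ, which is the point of the marker.
	fmt.Println(exactKey("name", "hello"))
	fmt.Println(partialKey("name", "hello"))
}
```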
Are you suggesting that we just treat “$predicate” and “:predicate” as predicates in task.Query?
I think this notion of exact and partial matches with $ and : is a bit complicated. I think we don’t really need that. Each tokenizer should just keep things simple, and spit out an attribute, and the suffix key. So, something like [Barack Obama] would generate → [Barack] and [Obama], and if someone needs [Barack Obama] back, we intersect the two lists and get the results back. I think this is sufficient until we hear otherwise from some client.
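The tokenize-then-intersect idea can be sketched as below; the tokenizer here is a stand-in (simple word splitting), and the posting lists are assumed to be sorted uid slices.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a stand-in for a simple tokenizer: it emits one token per
// word, so "Barack Obama" generates "barack" and "obama".
func tokenize(value string) []string {
	return strings.Fields(strings.ToLower(value))
}

// intersect returns the uids present in both sorted posting lists; this is
// how a query for "Barack Obama" gets its results back from the two
// per-token lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	fmt.Println(tokenize("Barack Obama")) // [barack obama]
	barack := []uint64{1, 3, 7}
	obama := []uint64{3, 5, 7}
	fmt.Println(intersect(barack, obama)) // uids matching both tokens
}
```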
For Geo filtering, how does the filtering logic work? Is it just a simple lookup of a few index entries? If so, we can keep task.Query simple. Otherwise, if worker has to do something more complicated in order to reduce the amount of data transferred back, then I would prefer the above design.
Adding what we discussed on hangouts for others’ benefit:
We’ll keep the current structure as is and rename terms to tokens. Geo queries will do the conversion to the index keys and additional filtering on the server side. This will keep the workers simple and have them just do index lookups for uids.
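A sketch of the agreed shape, assuming a simplified struct (field names and the example token format are hypothetical): the current structure kept as is, with terms renamed to tokens.

```go
package main

import "fmt"

// Query sketch of the agreed design: same structure as before, with terms
// renamed to tokens. Geo queries convert their filters into index tokens
// before building the Query and do any extra filtering server-side, so
// workers only ever perform plain index lookups for uids.
type Query struct {
	Attr   string
	UIDs   []uint64
	Tokens []string // formerly terms
}

func main() {
	// A geo query would pre-compute its index tokens up front (the token
	// format here is made up for illustration), then issue an ordinary
	// token lookup like any other index query.
	q := Query{Attr: "loc", Tokens: []string{"cellA", "cellB"}}
	fmt.Println(q.Attr, len(q.Tokens))
}
```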