I have nodes (words) with a label string attribute.
I’d like to search for any duplicate word in my database.
For instance, let’s say my code has added 2 nodes with the same label “cat”.
What query would group word nodes by their label, count each group, and return only the groups whose count is greater than 1?
I tried queries like the following one, but of course I cannot use a variable on a groupby applied to a scalar attribute:
{
  var(func: has(label)) @groupby(label) {
    c as count(uid)
  }
  me(func: uid(c)) @filter(gt(val(c), 1)) {
    label
    ct: val(c)
  }
}
Assuming you are using groupby the correct way, you would need to provide a sample mutation so that I can understand your line of reasoning.
The only way to search for something is via search functions and the appropriate indexes.
This query does not make much sense.
I’m assuming the predicate “label” holds the word “cat”, so you will never be able to use “@filter(gt” here.
The correct query would be this, but I know it is not the one you want:
{
  var(func: has(label)) @filter(eq(label, "cat")) @groupby(label) {
    # PS: for now this does not work in Dgraph. You can't group by the label value, only by UID links.
    c as count(uid)
  }
  me(func: uid(c)) {
    label
    ct: val(c)
  }
}
Well, simply because you are grouping by label, the supposedly duplicate nodes will already be in the very same group. You do not need anything else: if a group has more than one node, you have found the duplicates.
Now, if you want to search by group and then find duplicates on another predicate, adding a new filter to the query block would suffice. But in that case you would be looking for duplicates (according to your criteria) within the same group.
There is no way for Dgraph to return something like “duplicate”, that is, an answer like “True” for “we have a duplicate”. Dgraph assumes that you are entering the data correctly. You have to use filters and check this in your application.
And then programmatically process the result by identifying counts greater than 1, but that would not be efficient at all when 2 million words are returned. I’d like them to be filtered out inside Dgraph, so that the result contains only the labels whose count is greater than 1.
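For reference, the application-side filtering discussed here can be quite small. The following is only a minimal Python sketch. It assumes the groupby result has already been parsed from JSON into a list of groups, each carrying the "label" value and a "count" key. That shape is an assumption about the response and should be checked against what your Dgraph version actually returns. The name duplicate_labels and the sample data are hypothetical.

```python
from typing import Dict, List

# Hypothetical sample data. The {"label": ..., "count": ...} shape is an
# assumption about the parsed "@groupby" list, not real database output.
sample_groups = [
    {"label": "cat", "count": 2},
    {"label": "dog", "count": 1},
    {"label": "bird", "count": 3},
]


def duplicate_labels(groups: List[dict], min_count: int = 2) -> Dict[str, int]:
    """Return {label: count} for every group whose count is at least min_count."""
    return {
        group["label"]: group["count"]
        for group in groups
        if group["count"] >= min_count
    }


print(duplicate_labels(sample_groups))
```

This still transfers every group to the client, so it does not address the efficiency concern. It only shows that the filtering step itself is a single pass over the result.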
Please show an example mutation for your structure.
That’s not my example.
Inside a groupby block, only aggregations are allowed and count may only be applied to uid
You cannot use groupby in a normal block. It is usually used in a var block, as you were already doing.
More about Groupby: Get started with Dgraph
Briefly, as mentioned, this is not possible. This usage is not what groupby is intended for. By the way, “count” works on the UID, not on “label” or any other attribute.
Without an example mutation of your structure I have no way to help. Your query could be structured in many ways, for example:
{
  var(func: has(label)) {
    label @groupby(label) {
      c as count(uid)
    }
  }
  me_noFilters(func: uid(c)) {
    label
    ct: val(c)
  }
}
And then my code that looks for an existing node misses the existing label “cat”, so I add the word “cat” again:
{
  set {
    _:word1 <label> "cat" .
  }
}
The problem is that I now have 2 "cat"s in my database.
So I’ve developed some unit/integration tests that check whether any words in my database are duplicated.
If so, I’ll then be able to fix my code and remove or merge the duplicates in the database.
So for now I have the “groupby” query that returns ALL words with their count, but I would really prefer to return only the words with a count greater than 1.
In Dgraph you can’t do that. There are no specific conditions for nodes themselves, only for indexed values within a node. You have to do this kind of processing in your application.
PS. @groupby does not group by value, only by UID. So technically it is impossible for you to get the “cat” nodes back in the same group, because they are not connected via a UID.
Indeed it works if I know which word is duplicated and query for it.
But if I have millions of words, if I understand what you’re saying, I’ll have to get them all with their count and iterate over them to find those which have a count greater than 1.
It’s too bad Dgraph has no way to filter duplicates.
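If iterating over everything on the client is the only option, one way to keep each response small is to fetch the words page by page and count the labels as they arrive. The sketch below is a hedged illustration, not a tested Dgraph integration. fetch_page is a hypothetical callable you would write yourself, for example around a paginated query using the first and after arguments. The sketch assumes it returns rows shaped like {"uid": ..., "label": ...} in ascending uid order. fake_fetch_page only simulates that behaviour in memory.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, str]


def count_duplicates_paged(
    fetch_page: Callable[[str, int], List[Row]],
    page_size: int = 1000,
) -> Dict[str, int]:
    """Count labels page by page and return only those seen more than once.

    fetch_page(after_uid, page_size) is a hypothetical callable that must
    return up to page_size rows whose uid is greater than after_uid.
    """
    counts: Counter = Counter()
    after = "0x0"
    while True:
        rows = fetch_page(after, page_size)
        if not rows:
            break
        counts.update(row["label"] for row in rows)
        # Resume after the last uid of this page (assumes ascending uid order).
        after = rows[-1]["uid"]
    return {label: n for label, n in counts.items() if n > 1}


def fake_fetch_page(after: str, page_size: int) -> List[Row]:
    """In-memory stand-in for a real paginated query (illustration only)."""
    data = [
        {"uid": "0x1", "label": "cat"},
        {"uid": "0x2", "label": "dog"},
        {"uid": "0x3", "label": "cat"},
    ]
    later = [row for row in data if int(row["uid"], 16) > int(after, 16)]
    return later[:page_size]


print(count_duplicates_paged(fake_fetch_page, page_size=2))
```

The counter still holds one entry per distinct label, so memory grows with the vocabulary, but no single response has to carry millions of rows.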
You can try to use Get started with Dgraph and run a background task that searches for duplicates.
Also, before inserting any new tags into the DB, you could use “upserts” (Get started with Dgraph). That is the only way out: solve the problem at the beginning, not afterwards.
Creating a background task for that in Dgraph would take a lot of machine resources and is therefore infeasible.
I do not know which words could be duplicated, so I cannot search for them based on their string.
I already use upserts, but sometimes you have bugs, like a supposedly safe mutation where you forgot to remove a sub-sub-field of the serialized struct, which causes nodes to be created. I cannot use upsert on all predicates for performance reasons and I cannot avoid all bugs, so I need tests that check the database coherence.
Dgraph can generate labels and their count, so it could theoretically filter on that groupby result. It’s basically a loop to remove elements based on their count. I don’t understand the difficulty here; I just understand it’s not implemented or just not the way Dgraph works.
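For the coherence test itself, here is a minimal Python sketch of what such a check could look like on the application side. It assumes the word nodes have already been fetched as rows shaped like {"uid": ..., "label": ...}. That shape and both function names are hypothetical. Keeping the uids per label, rather than just a count, gives the test output what is needed to remove or merge the duplicates afterwards.

```python
from collections import defaultdict
from typing import Dict, List


def uids_by_duplicate_label(rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each label that occurs more than once to the uids carrying it."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row["uid"])
    return {label: uids for label, uids in by_label.items() if len(uids) > 1}


def check_no_duplicate_labels(rows: List[Dict[str, str]]) -> None:
    """Raise AssertionError listing the offending uids if any label repeats."""
    duplicates = uids_by_duplicate_label(rows)
    assert not duplicates, f"duplicate labels found: {duplicates}"


# Hypothetical rows, as they might look after the buggy second insert of "cat".
rows = [
    {"uid": "0x1", "label": "cat"},
    {"uid": "0x2", "label": "dog"},
    {"uid": "0x3", "label": "cat"},
]
print(uids_by_duplicate_label(rows))
```

In a test suite, check_no_duplicate_labels(rows) would fail with the list of uids to merge, which is the "loop to remove elements based on their count" done outside the database.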