How to find duplicate nodes based on a scalar attribute value?

Hello,

I have nodes (words) with a label string attribute.
I’d like to search for any duplicate word in my database.

For instance, let’s say my code has added 2 nodes with the same label “cat”.
What query would group word nodes by their label and return only the labels whose count is greater than 1?

I tried queries like the following one, but of course I cannot use a variable on a groupby applied to a scalar attribute:

{
  var(func: has(label)) @groupby(label) {
    c as count(uid)
  }
  me(func: uid(c)) @filter(gt(val(c), 1)) {
    label
    ct: val(c)
  }
}

You cannot use @groupby for “string attribute”.

Assuming you are using @groupby the correct way, you would need to provide a sample mutation so that I can understand your line of reasoning.

The only way to search for something is via search functions and correct indexing as desired.

This query does not make much sense.
I’m assuming the predicate “label” holds the word “cat”, so you will never be able to apply “@filter(gt” to it.
The correct form would be this, though I know it is not what you want:

{
  var(func: has(label)) @filter(eq(label, "cat")) @groupby(label) {
    # PS. For now that's wrong in Dgraph. You can't groupby label value, just UIDs linking
    c as count(uid)
  }
  me(func: uid(c)) {
    label
    ct: val(c)
  }
}

Well, since you are grouping by label, the supposedly duplicate nodes will already be in the very same group. You do not need anything else: if a group has more than one node, you have found the duplicates.

Now, if you want to group first and then find duplicates on another predicate, adding a new filter to the query block would suffice. But in that case you would be looking for duplicates (according to your criteria) within the same group.

Dgraph has no way to return something like “True” to mean “we have a duplicate”. Dgraph assumes that you are entering the data correctly. You have to use filters and check this in your application.

Cheers.

Your example query returns an error:

Vars can be assigned only when grouped by UID attribute

And that’s exactly my problem: I cannot assign the grouped counts to a variable, and thus I cannot filter results based on the count values.

I could just use the first query block like this:

{
  wordCount(func: has(label)) @groupby(label) {
    count(uid)
  }
}

And then programmatically process the result by identifying counts greater than 1, but that would not be efficient at all once 2 million words are returned. I’d like them to be filtered out on the Dgraph side, so that the result contains only labels whose count is greater than 1.
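The client-side filtering described above can be sketched as follows. This is a minimal Python example (Python is an arbitrary choice here, not something used in the thread) that takes the JSON response of the `wordCount` query and keeps only the labels whose count exceeds 1. It assumes the groups are nested under an `@groupby` key, with one object per group holding the `label` value and a `count` field; please verify that shape against what your Dgraph version actually returns. The name `find_duplicate_labels` is hypothetical.

```python
import json


def find_duplicate_labels(response_json, block="wordCount", min_count=2):
    """Return {label: count} for groups holding at least min_count nodes.

    Assumed response shape (verify against your Dgraph version):
    {"data": {"wordCount": [{"@groupby": [{"label": "cat", "count": 2}]}]}}
    """
    payload = json.loads(response_json)
    # Some clients hand back only the contents of "data"; accept both forms.
    data = payload.get("data", payload)
    duplicates = {}
    for entry in data.get(block, []):
        for group in entry.get("@groupby", []):
            if group.get("count", 0) >= min_count:
                duplicates[group["label"]] = group["count"]
    return duplicates


# Illustrative sample built from the "cat"/"dog" example in this thread.
sample = json.dumps({
    "data": {
        "wordCount": [
            {"@groupby": [
                {"label": "cat", "count": 2},
                {"label": "dog", "count": 1},
            ]}
        ]
    }
})
print(find_duplicate_labels(sample))  # {'cat': 2}
```

This is a single linear pass over the groups, so the cost is dominated by transferring and parsing the response rather than by the filtering itself.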

Please, show a mutation example of your structure.

That’s not my example.

Inside a groupby block, only aggregations are allowed and count may only be applied to uid

You cannot use @groupby in a normal block; it usually goes in a var block, as it was already being used.
More about Groupby: https://docs.dgraph.io/query-language/#groupby

Briefly, as mentioned, this is not possible. This usage is not within the purpose of @groupby. By the way, “count” operates on the UID, not on “label” or any other attribute.

Without an example mutation showing your structure I have no way to help. Your query could be structured in many ways, for example:

{
  var(func: has(label)) {
    label @groupby(label) {
      c as count(uid)
    }
  }

  me_noFilters(func: uid(c)) {
    label
    ct: val(c)
  }
}

Sorry for the misunderstanding, I thought you were suggesting a query.

I’m not sure what you need exactly as a mutation example, as my question arises with very basic situations:

Suppose I first create the following nodes

{
 set {
    _:word1 <label> "cat" .
    _:word2 <label> "dog" .
 }
}

And then my code that looks for an existing node misses the existing label “cat”, so I add the “cat” word again:

{
 set {
    _:word1 <label> "cat" .
 }
}

The problem is that I now have 2 "cat"s in my database.

So I’ve developed some unit/integration tests that check whether any of the words in my database have duplicates.
If so, I’ll then be able to fix my code and remove or merge the duplicates in the database.

So for now I have the “groupby” request that returns ALL words with their count, but I would really prefer to return only words with count greater than 1.

So, in this case:

# Schema
# label: string @index(exact) .
{
  duplicate(func: has(label)) @filter(eq(label, "cat")) {
    uid
    label
  }
}

{
  "data": {
    "duplicate": [
      {
        "uid": "0x2711",
        "label": "cat"
      },
      {
        "uid": "0x2713",
        "label": "cat"
      }
    ]
  }
}
In Dgraph you can’t do that. There are no conditions on nodes themselves, only on indexed values within a node. You have to handle this kind of procedure in your application.
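One way to do this in the application is to fetch `uid` and `label` for every node (the same shape as the response above, but without the `eq` filter) and group the UIDs by label. Here is a minimal Python sketch, assuming the nodes arrive as a list of objects with `uid` and `label` keys. The helper name `duplicate_uids` and the `0x2712` entry are made up for illustration.

```python
from collections import defaultdict


def duplicate_uids(nodes):
    """Map each label that occurs more than once to the UIDs carrying it."""
    uids_by_label = defaultdict(list)
    for node in nodes:
        uids_by_label[node["label"]].append(node["uid"])
    # Keep only the labels shared by two or more nodes.
    return {label: uids for label, uids in uids_by_label.items() if len(uids) > 1}


# The two "cat" nodes come from the response above; "dog" is illustrative.
nodes = [
    {"uid": "0x2711", "label": "cat"},
    {"uid": "0x2712", "label": "dog"},
    {"uid": "0x2713", "label": "cat"},
]
print(duplicate_uids(nodes))  # {'cat': ['0x2711', '0x2713']}
```

Having the UIDs at hand, rather than just a count, is what you would need to remove or merge the duplicates afterwards.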

Just a reference below.

Ref About indexing: https://docs.dgraph.io/query-language/#indexing

Count works only on edges, never on the nodes in a query.

Count index

For predicates with the @count Dgraph indexes the number of edges out of each node. This enables fast queries of the form:

{
  q(func: gt(count(pred), threshold)) {
    ...
  }
}

PS. @groupby does not group by value, only by UID. So technically it is impossible for you to get the “cat” nodes in the same group, because they are not connected via a UID.

https://docs.dgraph.io/query-language/#groupby

Thanks for your answer.

Indeed it works if I know which word is duplicated and query for it.

But if I have millions of words, if I understand what you’re saying, I’ll have to get them all with their count and iterate over them to find those which have a count greater than 1.
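If fetching everything in one response is the concern, the iteration can at least be done page by page. The sketch below is in Python and assumes the nodes can be retrieved in pages (for instance with Dgraph's `first` and `offset` pagination arguments). `fetch_page` is a hypothetical callable standing in for whatever client call you use, and `fake_fetch` only simulates it in memory.

```python
from collections import Counter


def count_labels_in_pages(fetch_page, page_size=1000):
    """Count labels page by page and return those seen more than once.

    fetch_page(offset, limit) is a hypothetical callable that returns a
    list of {"label": ...} objects, or an empty list when exhausted.
    """
    counts = Counter()
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        counts.update(node["label"] for node in page)
        offset += len(page)
    return {label: n for label, n in counts.items() if n > 1}


# In-memory stand-in for a real paginated query, for illustration only.
fake_db = [{"label": w} for w in ["cat", "dog", "cat", "bird", "dog", "cat"]]


def fake_fetch(offset, limit):
    return fake_db[offset:offset + limit]


print(count_labels_in_pages(fake_fetch, page_size=2))
```

Only one page of nodes is held at a time; the counter still keeps one entry per distinct label, so memory grows with the vocabulary rather than with the total number of nodes.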

It’s too bad Dgraph has no way to filter duplicates.

You can try to use https://docs.dgraph.io/query-language/#regular-expressions in a background task that searches for duplicates.

Also, before inserting any new tags into the DB, you could use “upserts” https://docs.dgraph.io/howto/#upsert-procedure. That is the only way out: solve the problem at the beginning and not afterwards.

Creating a background task for that in Dgraph would take a lot of machine resources, and is therefore infeasible.

I do not know which words could be duplicated, so I cannot search them based on their string.

I already use upserts, but sometimes you have bugs, like a supposedly safe mutation where you forgot to remove a sub-sub-field of the serialized struct, which causes nodes to be created. I cannot use upserts on all predicates for performance reasons and I cannot avoid all bugs, so I need tests that check the database coherence.

Dgraph can generate labels and their count, so it could theoretically filter on that groupby result. It’s basically a loop to remove elements based on their count. I don’t understand the difficulty here; I just understand it’s not implemented or just not the way Dgraph works.
