Deduplication based on multiple field values

Dear friends:
Problem: I am currently facing a difficult problem. I need to de-duplicate data based on 3 fields and delete the erroneous records that were created earlier due to a program bug.
schema:

		trade_card_source: uid @reverse .
		trade_time: int @index(int) .
		trade_serial: string @index(hash) .
		trade_status: int .
		type Trade {
			trade_card_source: Account
			trade_time:	int
			trade_serial: string
			trade_status: int
		}
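
For context, here is a minimal illustration of what a duplicate looks like under this schema. The Account node and the Account_number predicate are assumptions (the Account type is not shown above), and the literal values are placeholders; the point is that the two Trade nodes share the same trade_card_source, trade_time, and trade_serial:

{
  set {
    _:acct <Account_number> "ACC-001" .
    _:acct <dgraph.type> "Account" .

    _:t1 <dgraph.type> "Trade" .
    _:t1 <trade_card_source> _:acct .
    _:t1 <trade_time> "1600000000" .
    _:t1 <trade_serial> "SERIAL-001" .
    _:t1 <trade_status> "1" .

    _:t2 <dgraph.type> "Trade" .
    _:t2 <trade_card_source> _:acct .
    _:t2 <trade_time> "1600000000" .
    _:t2 <trade_serial> "SERIAL-001" .
    _:t2 <trade_status> "1" .
  }
}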

My solution is to use Dgraph’s @groupby to group by the three fields and then de-duplicate based on the result, but I have not found a way to control the result set within each group. The upsert statement is as follows:

upsert {
  query {
    var(func: type(Trade)) @groupby(trade_card_source, trade_time, trade_serial) {
      u as count(uid)
    }
  }

  mutation @if(gt(len(u), 1)) {
    delete {
      uid(u) * * .
    }
  }
}

Description: group by the edge trade_card_source and the predicates trade_time and trade_serial, count the number of nodes in each group, and delete the nodes in any group that contains more than one. I cannot get this to work at present. Also, how do I exclude the first node in each group from the deletion?

Hi @gainday,
There is a solution with an example on data merge here. The idea is to create a new merged node and then delete the duplicates. Please review this approach.

@anand Thank you very much for your help, but my question is different from the duplicate-data scenario you provided. In my scenario I need to confirm uniqueness based on the combination of three fields (one edge, trade_card_source, and two predicates, trade_time and trade_serial). My solution was to use @groupby(trade_card_source, trade_time, trade_serial) so that nodes sharing the same three values fall into the same group, then filter out the groups with more than one node and delete the excess nodes. For example, if a group has 5 nodes, I need to delete 4 of them. So far I have not figured out how to control the result set of each group after grouping in Dgraph. Do you have a good way to do this?

Hi @gainday,
As per the documentation on @groupby:
" If the `groupby` is applied to a `uid` predicate, the resulting aggregations can be saved in a variable (mapping the grouped UIDs to aggregate values) and used elsewhere in the query to extract information other than the grouped or aggregated edges."

Since the groupby clause you are using contains a mix of scalar and uid predicates, we cannot use the resultant count(uid) for extracting additional information (or next actions like deletes).
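
To illustrate the case the documentation does allow, here is a rough, untested sketch using the Trade schema above: @groupby applied to the single uid predicate trade_card_source, with the aggregated count saved in a variable and read back on the grouped Account nodes. It cannot express the three-field uniqueness on its own; it only shows the supported pattern:

{
  var(func: type(Trade)) @groupby(trade_card_source) {
    # a maps each grouped Account UID to the number of Trade nodes pointing at it
    a as count(uid)
  }

  tradesPerCard(func: uid(a), orderdesc: val(a)) {
    uid
    tradeCount: val(a)
  }
}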

Yep, groupby won’t work on scalar values. But you can try other approaches.
Here are some examples (just the query; the upsert block stays the same).

This first one is for the case where your duplicates have typos and you want to find them by those typos.
Also, to avoid overhead, I would traverse from the source entity. I would never run a query as broad as (func: type(Trade)).

{
  var(func: has(Account_number)) {
    SOURCE as <~trade_card_source>
  }

  toDelete(func: uid(SOURCE))
  @filter(NOT has(trade_serial) OR NOT has(trade_status)) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }

  q(func: uid(SOURCE))
  @filter(has(trade_serial) AND has(trade_status) AND has(trade_time)) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }
}

Please ignore the @normalize part. It is just visual sugar that I added.

If you just want to find duplicates:

{
  var(func: has(Account_number)) {
    SOURCE as <~trade_card_source>
  }

  SE as selectOne(func: uid(SOURCE), first: 1) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }

  toDelete(func: uid(SOURCE)) @filter(NOT uid(SE)) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }
}
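
For completeness, a rough, untested sketch of how this second query could be wired into an upsert block. The named blocks are turned into bodyless var blocks, since the upsert only needs the UIDs, and D (a name made up here) captures the toDelete set:

upsert {
  query {
    var(func: has(Account_number)) {
      SOURCE as <~trade_card_source>
    }

    # keep one node, mark everything else for deletion
    SE as var(func: uid(SOURCE), first: 1)
    D as var(func: uid(SOURCE)) @filter(NOT uid(SE))
  }

  mutation @if(gt(len(D), 0)) {
    delete {
      uid(D) * * .
    }
  }
}

Note that first: 1 keeps a single node across the whole SOURCE set, not one node per (trade_card_source, trade_time, trade_serial) combination, which is the limitation discussed below.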

Thank you very much!

Hi, @MichelDiz:
The solution you provided only judges the uniqueness of the data based on trade_card_source, but our data needs the combination of trade_card_source, trade_time, and trade_serial as the primary key to determine whether a record is duplicated. Does Dgraph have a solution for this? Thank you.

Do you mean looping through all the entities’ sources, checking whether they are duplicated, and applying some rules to clean up? For now, we don’t have anything similar to loops in DQL. Maybe the new Lambda feature would help there.

BTW, you should fix this bug in your application. You should not rely on database functions to repair your data.

That was my concern. You should run controlled queries, not overly wide/broad ones. In general, with what you are doing, you are probably planning to run it all the time as a background task, and that would overload the cluster depending on the size of the dataset and the resources you have available.

I would do it on demand and per user. But a fix would be better.
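
To sketch the on-demand idea (untested, and every literal here is a placeholder supplied by the application): for one known (trade_card_source, trade_time, trade_serial) combination, keep whichever matching node Dgraph returns first and delete the rest:

upsert {
  query {
    # all Trade nodes with this exact serial, time, and card (placeholder values)
    ALL as var(func: eq(trade_serial, "SERIAL-001"))
      @filter(eq(trade_time, 1600000000) AND uid_in(trade_card_source, 0xf1))

    # keep one, treat the remainder as duplicates
    KEEP as var(func: uid(ALL), first: 1)
    DUPS as var(func: uid(ALL)) @filter(NOT uid(KEEP))
  }

  mutation @if(gt(len(DUPS), 0)) {
    delete {
      uid(DUPS) * * .
    }
  }
}

The application would run this once per combination it suspects is duplicated, rather than scanning every Trade node in a background job.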

I agree with your idea, thank you very much for your help. :blush:
