Deduplication based on multiple field values

Dear friends:
Problem: I am currently facing a difficult problem. I need to de-duplicate data based on 3 fields and delete the erroneous records that were created earlier due to a program bug.
schema:

		trade_card_source: uid @reverse .
		trade_time: int @index(int) .
		trade_serial: string @index(hash) .
		trade_status: int .
		type Trade {
			trade_card_source: Account
			trade_time:	int
			trade_serial: string
			trade_status: int
		}
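
For context, here is a minimal illustration of what a duplicate looks like under this schema. The Account node and the Account_number predicate are assumptions (the Account type is not shown above), and the literal values are placeholders; the point is that the two Trade nodes share the same trade_card_source, trade_time, and trade_serial:

{
  set {
    _:acct <Account_number> "ACC-001" .
    _:acct <dgraph.type> "Account" .

    _:t1 <dgraph.type> "Trade" .
    _:t1 <trade_card_source> _:acct .
    _:t1 <trade_time> "1600000000" .
    _:t1 <trade_serial> "SERIAL-001" .
    _:t1 <trade_status> "1" .

    _:t2 <dgraph.type> "Trade" .
    _:t2 <trade_card_source> _:acct .
    _:t2 <trade_time> "1600000000" .
    _:t2 <trade_serial> "SERIAL-001" .
    _:t2 <trade_status> "1" .
  }
}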

My solution is to use Dgraph’s @groupby to group by the three fields and then de-duplicate based on the result, but I have not found a way to control the result set within each group. The upsert statement is as follows:

upsert {
  query {
    var(func: type(Trade)) @groupby(trade_card_source, trade_time, trade_serial) {
      u as count(uid)
    }
  }

  mutation @if(gt(len(u), 1)) {
    delete {
      uid(u) * * .
    }
  }
}

Description: group by the edge trade_card_source and the predicates trade_time and trade_serial, count the number of nodes in each group, and delete the nodes in any group that contains more than one. I cannot get this to work at present. Also, how do I exclude the first node in each group from the deletion?

Hi @gainday,
There is a solution with an example on data merge here. The idea is to create a new merged node and then delete the duplicates. Please review this approach.

@anand Thank you very much for your help, but my question is different from the duplicate-data scenario you provided. In my scenario I need to confirm uniqueness based on the combination of three fields (one edge, trade_card_source, and two predicates, trade_time and trade_serial). My solution was to use @groupby(trade_card_source, trade_time, trade_serial) so that nodes sharing the same three values fall into the same group, then filter out the groups with more than one node and delete the excess nodes. For example, if a group has 5 nodes, I need to delete 4 of them. So far I have not figured out how to control the result set of each group after grouping in Dgraph. Do you have a good way to do this?

Hi @gainday,
As per the documentation on @groupby:
" If the `groupby` is applied to a `uid` predicate, the resulting aggregations can be saved in a variable (mapping the grouped UIDs to aggregate values) and used elsewhere in the query to extract information other than the grouped or aggregated edges."

Since the groupby clause you are using contains a mix of scalar and uid predicates, we cannot use the resultant count(uid) for extracting additional information (or next actions like deletes).
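
To illustrate the case the documentation does allow, here is a rough, untested sketch using the Trade schema above: @groupby applied to the single uid predicate trade_card_source, with the aggregated count saved in a variable and read back on the grouped Account nodes. It cannot express the three-field uniqueness on its own; it only shows the supported pattern:

{
  var(func: type(Trade)) @groupby(trade_card_source) {
    # a maps each grouped Account UID to the number of Trade nodes pointing at it
    a as count(uid)
  }

  tradesPerCard(func: uid(a), orderdesc: val(a)) {
    uid
    tradeCount: val(a)
  }
}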

Yep, groupby won’t work on scalar values. But you can try other approaches.
Here are some examples (just the query; the upsert block stays the same).

This first one is for the case where your duplicates have typos and you want to find them by those typos.
Also, to avoid overhead, I would traverse from the source entity. I would never run a query as broad as (func: type(Trade)).

{
  var(func: has(Account_number)) {
    SOURCE as <~trade_card_source>
  }

  toDelete(func: uid(SOURCE))
  @filter(NOT has(trade_serial) OR NOT has(trade_status)) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }

  q(func: uid(SOURCE))
  @filter(has(trade_serial) AND has(trade_status) AND has(trade_time)) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }
}

Please ignore the @normalize part. It is just visual sugar that I added.

If you just want to find duplicates:

{
  var(func: has(Account_number)) {
    SOURCE as <~trade_card_source>
  }

  SE as selectOne(func: uid(SOURCE), first: 1) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }

  toDelete(func: uid(SOURCE)) @filter(NOT uid(SE)) @normalize {
    UID : uid
    trade_card_source { trade_card_source: uid }
    trade_time : trade_time
    trade_serial : trade_serial
  }
}
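
For completeness, a rough, untested sketch of how this second query could be wired into an upsert block. The named blocks are turned into bodyless var blocks, since the upsert only needs the UIDs, and D (a name made up here) captures the toDelete set:

upsert {
  query {
    var(func: has(Account_number)) {
      SOURCE as <~trade_card_source>
    }

    # keep one node, mark everything else for deletion
    SE as var(func: uid(SOURCE), first: 1)
    D as var(func: uid(SOURCE)) @filter(NOT uid(SE))
  }

  mutation @if(gt(len(D), 0)) {
    delete {
      uid(D) * * .
    }
  }
}

Note that first: 1 keeps a single node across the whole SOURCE set, not one node per (trade_card_source, trade_time, trade_serial) combination, which is the limitation discussed below.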

Thank you very much!

Hi, @MichelDiz:
The solution you provided only judges the uniqueness of the data based on trade_card_source, but our data needs the combination of trade_card_source, trade_time, and trade_serial as the primary key to determine whether a record is duplicated. Does Dgraph have a solution for this? Thank you.

Do you mean looping through all the entities’ sources, checking whether they are duplicated, and applying some rules to clean up? For now, we don’t have anything similar to loops in DQL. Maybe the new Lambda feature would help there.

BTW, you should fix this bug in your application. You should not rely on database functions to repair your data.

That was my concern. You should run controlled queries, not overly wide/broad ones. In general, with what you are doing, you are probably planning to run it all the time as a background task, and that would overload the cluster depending on the size of the dataset and the resources you have available.

I would do it on demand and per user. But a fix would be better.
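
To sketch the on-demand idea (untested, and every literal here is a placeholder supplied by the application): for one known (trade_card_source, trade_time, trade_serial) combination, keep whichever matching node Dgraph returns first and delete the rest:

upsert {
  query {
    # all Trade nodes with this exact serial, time, and card (placeholder values)
    ALL as var(func: eq(trade_serial, "SERIAL-001"))
      @filter(eq(trade_time, 1600000000) AND uid_in(trade_card_source, 0xf1))

    # keep one, treat the remainder as duplicates
    KEEP as var(func: uid(ALL), first: 1)
    DUPS as var(func: uid(ALL)) @filter(NOT uid(KEEP))
  }

  mutation @if(gt(len(DUPS), 0)) {
    delete {
      uid(DUPS) * * .
    }
  }
}

The application would run this once per combination it suspects is duplicated, rather than scanning every Trade node in a background job.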

I agree with your idea, thank you very much for your help. :blush:
