Duplicate nodes with Live Loader and upsertPredicate

MattH · March 18, 2022, 7:14pm

What I want to do

Upsert data into Dgraph with Live Loader’s upsertPredicate option, against my GraphQL schema. I’m getting duplicate nodes despite having the xid set.

What I did

Here’s the Type definition:

type Country {
    id: ID!
    xid: String! @id @dgraph(pred: "xid")
    country_code: String! @id @search(by: [exact])
    country_name: String @search(by: [exact])
}

Here’s the sample .json file that I’m working with:

[
        {"dgraph.type":"Country", "uid":"_:MEX", "xid":"MEX", "Country.country_code":"MEX", "Country.country_name":"Mexico"},
        {"dgraph.type":"Country", "uid":"_:ARG", "xid":"ARG", "Country.country_code":"ARG", "Country.country_name":"Argentina"}
]

Here’s the command that I’m running:

dgraph live --files /data/1/data/dgraph/sample-data/json/countries1.json --alpha localhost:9080 --zero <zeroIP>:5080 --format json --cwd /data/1/data/dgraph --log_dir /data/1/logs/dgraph/ --upsertPredicate "xid"

If I run that twice, I’m expecting to see no new data inserted on the second run. Instead, I’m getting duplicate nodes:


curl --location --request POST 'http://localhost:8080/graphql' --header 'Content-Type: application/graphql' -d '
query MyQuery {
  queryCountry {
  id
  xid
  country_code
  country_name
  }
}
' | jq

{
  "data": {
    "queryCountry": [
      {
        "id": "0x41b556a",
        "xid": "MEX",
        "country_code": "MEX",
        "country_name": "Mexico"
      },
      {
        "id": "0x41b556b",
        "xid": "ARG",
        "country_code": "ARG",
        "country_name": "Argentina"
      },
      {
        "id": "0x44f36ac",
        "xid": "ARG",
        "country_code": "ARG",
        "country_name": "Argentina"
      },
      {
        "id": "0x44f36ad",
        "xid": "MEX",
        "country_code": "MEX",
        "country_name": "Mexico"
      }
    ]
  } ......

What am I missing here? Thanks in advance for any guidance.

Dgraph metadata

dgraph version

Dgraph version : v21.03.1
Dgraph codename : rocket-1
Dgraph SHA-256 : a00b73d583a720aa787171e43b4cb4dbbf75b38e522f66c9943ab2f0263007fe
Commit SHA-1 : ea1cb5f35
Commit timestamp : 2021-06-17 20:38:11 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true

MattH · March 18, 2022, 8:21pm

I realized that I was missing the hash index on the xid predicate.

So, I’ve updated my GraphQL schema to:

type Country {
    id: ID!
    xid: String! @search(by: [hash]) @dgraph(pred: "xid")
    country_code: String! @id @search(by: [exact])
    country_name: String @search(by: [exact])
}

Unfortunately, I’m still seeing the same behavior…

MattH · March 21, 2022, 2:10pm

A few more details. I’m now testing with .rdf to see if that makes any difference over .json.

The files look like this:

countries1.rdf

<_:MEX> <Country.country_name> "Mexico" .
<_:MEX> <Country.country_code> "MEX" .
<_:MEX> <dgraph.type> "Country" .
<_:MEX> <xid> "MEX" . 

<_:ARG> <Country.country_name> "Argentina" .
<_:ARG> <Country.country_code> "ARG" .
<_:ARG> <dgraph.type> "Country" .
<_:ARG> <xid> "ARG" .

countries2.rdf

<_:MEX> <Country.country_name> "Mexico2" .
<_:MEX> <Country.country_code> "MEX" .
<_:MEX> <dgraph.type> "Country" .
<_:MEX> <xid> "MEX" . 

<_:ARG> <Country.country_name> "Argentina2" .
<_:ARG> <Country.country_code> "ARG" .
<_:ARG> <dgraph.type> "Country" .
<_:ARG> <xid> "ARG" .

If I run this file using the Live loader command:

dgraph live --files /data/1/data/dgraph/sample-data/json/countries1.rdf --alpha localhost:9080 --zero <alphaIP>:5080 --format=rdf --upsertPredicate "xid"

This will give me two nodes as expected.

If I then run the second file using:

dgraph live --files /data/1/data/dgraph/sample-data/json/countries2.rdf --alpha localhost:9080 --zero <alphaIP>:5080 --format=rdf --upsertPredicate "xid"

I get two new nodes with the same xid but different ids.


"queryCountry": [
      {
        "xid": "MEX",
        "id": "0x9187ee1",
        "country_code": "MEX",
        "country_name": "Mexico"
      },
      {
        "xid": "ARG",
        "id": "0x9187ee2",
        "country_code": "ARG",
        "country_name": "Argentina"
      },
      {
        "xid": "MEX",
        "id": "0x94952e3",
        "country_code": "MEX",
        "country_name": "Mexico2"
      },
      {
        "xid": "ARG",
        "id": "0x94952e4",
        "country_code": "ARG",
        "country_name": "Argentina2"
      }
    ]
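A quick way to confirm duplicates in a response like the one above is to group the returned ids by xid. This is only a minimal sketch: the function name find_duplicate_xids is made up for illustration, and it assumes the GraphQL response has been saved as JSON text with the same shape as the output shown here.

```python
import json
from collections import defaultdict

def find_duplicate_xids(response_text, query_name="queryCountry"):
    """Return {xid: [ids...]} for every xid that appears on more than one node."""
    nodes = json.loads(response_text)["data"][query_name]
    ids_by_xid = defaultdict(list)
    for node in nodes:
        ids_by_xid[node["xid"]].append(node["id"])
    # Keep only the xids that map to two or more distinct nodes.
    return {xid: ids for xid, ids in ids_by_xid.items() if len(ids) > 1}

# Trimmed copy of the response from the thread (xid and id fields only).
sample = """
{"data": {"queryCountry": [
  {"xid": "MEX", "id": "0x9187ee1"},
  {"xid": "ARG", "id": "0x9187ee2"},
  {"xid": "MEX", "id": "0x94952e3"},
  {"xid": "ARG", "id": "0x94952e4"}
]}}
"""

duplicates = find_duplicate_xids(sample)
for xid, ids in duplicates.items():
    print(xid, ids)
```

An empty result after the second load would mean the upsert matched the existing nodes.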

If I put all four RDF transactions in the same file, I will only get two nodes. I don’t know if Live is expected to process the .rdf file in order, but the nodes will NOT have the ‘2’ in the name, so I don’t believe the upsert is actually working.

The GraphQL schema looks like this:

type Country {
    id: ID!
    xid: String! @search(by: [hash]) @dgraph(pred: "xid")
    country_code: String! @id @search(by: [exact])
    country_name: String @search(by: [exact])
}

Any help would be appreciated. Thank you in advance.

MattH · March 21, 2022, 5:49pm

FWIW, I tested with xidmap and it seems to work as expected.

The question with xidmap, though, is that we’re running a cluster. I wanted to be able to use upsertPredicate to distribute the load to the other nodes without having to worry about the extra store. I’m not sure how that would work with xidmap. Would the xidmap need to be distributed to all nodes in the cluster, or just be on the server invoking Dgraph Live?

Thanks in advance for any help! @MichelDiz @porsche

porsche · March 21, 2022, 6:28pm

Below are two ways to ingest data into Dgraph:

1. Live uploader – used to feed data to the graph as it comes
2. Bulk uploader – used to bootstrap your graph with data offline

Firstly…

We tried #1, then moved to #2.
In both cases, though, predicates will be moved around the alpha nodes unless you disable it.
We disabled predicate movement, after which the cluster has been in a stable condition.
We deployed our cluster in AKS.

MattH · March 21, 2022, 6:51pm

Thanks for the response @porsche

We need to load batches of data while keeping the cluster available, so we will need to use Live Loader.

upsertPredicate doesn’t seem to be acting as expected though, so looks like I may need to resort to the xidMap.

I was/am hoping somebody could just tell me if I’m doing something wrong in my implementation, but since it works with xidmap, I don’t think I am. I’m wondering if my issues are related to only using a GraphQL schema and not DQL.

For example, I’m not sure why the xid predicate needs to be exposed with the directive @dgraph(pred: "xid"). I don’t know what Live is doing behind the scenes, but it also seems to somehow bypass my “@id” unique constraint on Country.country_code.

Simha_Srivatsa · May 26, 2023, 6:28pm

@MattH Hi, did you find the solution for this? I went through the same thing and I’m now considering using xidmap.

MattH · May 26, 2023, 8:16pm

Hi @Simha_Srivatsa

Yes, I did eventually get upserts working with Live Loader and the GraphQL schema. I didn’t want to use the xidmap file because I really didn’t want to have to worry about preserving and distributing that file across the cluster. I never understood why the upsertPredicate didn’t seem to work per the documentation.

I’m no longer working with Dgraph and it’s no longer being used on that project, but if I recall correctly, I had to set one of the ids explicitly to _:xid, so using my examples above, one of the ids for Mexico had to be set to the value: _:MEX

I don’t think that’s the way it’s supposed to have to work, but it worked. Remind me to check this for you next week when I’m back in the office and I’ll see if I can get more details for you.

Raphael · June 15, 2023, 7:54pm

@Simha_Srivatsa, Let me know if you have issues with Live Loader and the usage of upsert predicate.
We might have an incorrect implementation, or incorrect documentation, about how the value is saved in the predicate. It should not have the “_:” prefix. I will check the bug report history to see whether we have one on that, but I would appreciate your feedback while I’m testing the behavior too …

Raphael · June 15, 2023, 8:57pm

@Simha_Srivatsa
I have verified the behavior. The main point is the following:

When using the upsertPredicate option (-U) in dgraph live, you should let Dgraph handle the predicate. To be clear, you have to create the predicate in the schema with an index, but let dgraph live loader populate it.
If you have the RDF
<_:obj1> <name> "obj1-v1" .
and load it with
dgraph live -f test.rdf -U xid
then dgraph live will create a node with the predicate xid = "_:obj1" (we can argue that we could have removed the '_:', but that is how it works) and with the predicate name, of course.
You should not set the value of xid predicate in your RDF file.
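
For an existing file that already carries xid triples, one option is to drop those lines before loading, so that dgraph live is the only writer of the upsert predicate. The helper below is a minimal sketch, not part of Dgraph: strip_predicate is a hypothetical name, and it assumes one triple per line with the predicate written as <xid>, as in the countries1.rdf example earlier in the thread.

```python
import re

def strip_predicate(rdf_text, predicate="xid"):
    """Remove every triple whose predicate is <predicate>; keep all other lines."""
    # Matches: optional indent, subject token, whitespace, then <predicate>.
    pattern = re.compile(r"^\s*\S+\s+<" + re.escape(predicate) + r">\s")
    kept = [line for line in rdf_text.splitlines() if not pattern.match(line)]
    return "\n".join(kept) + "\n"

rdf = (
    '<_:MEX> <Country.country_name> "Mexico" .\n'
    '<_:MEX> <Country.country_code> "MEX" .\n'
    '<_:MEX> <dgraph.type> "Country" .\n'
    '<_:MEX> <xid> "MEX" .\n'
)

cleaned = strip_predicate(rdf)
print(cleaned, end="")
```

The blank node name (_:MEX here) is left untouched, since per the explanation above that name is what ends up in the xid predicate.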

If you re-run dgraph live to save the same data or to modify just the name, Dgraph will correctly use an upsert operation and check for a node having xid = "_:obj1". In our case it will lead to an update.
If the behavior is not satisfying for you, please share details about your use case.
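
To make the behavior described above easier to reason about, here is a toy in-memory model. It is not Dgraph code and not how Live Loader is implemented: ToyGraph and toy_live_load are invented for illustration, and the model simply assumes that the loader looks up a node by upsert predicate = blank node name, creates it if missing, and then applies the triples. It also assumes that a triple in the data writing to the same predicate overwrites the loader's value, which would be one possible explanation for the duplicates reported at the top of the thread.

```python
class ToyGraph:
    """Tiny stand-in for a graph store: uid -> {predicate: value}."""

    def __init__(self):
        self.nodes = {}
        self.next_uid = 1

    def find_by(self, predicate, value):
        for uid, preds in self.nodes.items():
            if preds.get(predicate) == value:
                return uid
        return None

def toy_live_load(graph, triples, upsert_predicate="xid"):
    """Apply (blank_node, predicate, value) triples with upsert-by-blank-node-name."""
    uid_for = {}  # blank node name -> uid, valid for this run only
    for blank, predicate, value in triples:
        if blank not in uid_for:
            uid = graph.find_by(upsert_predicate, blank)
            if uid is None:
                # No match: create the node and let the "loader" set the xid.
                uid = graph.next_uid
                graph.next_uid += 1
                graph.nodes[uid] = {upsert_predicate: blank}
            uid_for[blank] = uid
        # Assumption: data triples overwrite whatever is stored, xid included.
        graph.nodes[uid_for[blank]][predicate] = value

# Case 1: the data never touches xid, so the second run finds the node.
g1 = ToyGraph()
toy_live_load(g1, [("_:obj1", "name", "obj1-v1")])
toy_live_load(g1, [("_:obj1", "name", "obj1-v2")])
print(len(g1.nodes))  # 1

# Case 2: the data sets xid itself, so the lookup for "_:MEX" misses.
g2 = ToyGraph()
rows = [("_:MEX", "xid", "MEX"), ("_:MEX", "name", "Mexico")]
toy_live_load(g2, rows)
toy_live_load(g2, rows)
print(len(g2.nodes))  # 2
```

In this model, case 2 duplicates because the stored xid ("MEX") no longer equals the value the loader searches for ("_:MEX").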
