How to update a large amount of data in Dgraph every day

I have built a knowledge graph from the communication information collected by a specific app. Every node in Dgraph has a specific user_id. Every day the app collects many new user_id relationships that need to be added to Dgraph, but many of those user_id nodes already exist in Dgraph. My current practice is to query Dgraph for each node in the new data by its user_id to get its uid, and then upsert the node. Is there a more efficient way to mutate the data? I want to know how to improve the efficiency of updating the data, or whether there is some configuration that would let me use dgraph live to improve efficiency.


Hi, thank you for the question. Let me find someone to help you.

Can you elaborate on what upsert procedure you are doing?
In my view, an upsert block or a bulk upsert should work fine for you.

upsert {
  query {
    v as var(func: eq(user_id, "phy_37531"))
    p as var(func: eq(user_id, "mat_456"))
  }

  mutation {
    set {
      uid(v) <name> "thmoas" .
      uid(v) <user_id> "phy_37531" .
      uid(p) <name> "lily" .
      uid(p) <user_id> "mat_456" .
      uid(v) <communicate> [uid(p)] .
    }
  }
}

This is what I do in my procedure. But this way I can’t use dgraph live directly, and I have to query every node that should be updated. I also don’t know how to batch the procedure, and I don’t use a synchronization procedure. Is there a more efficient way to update Dgraph? Or should I design a synchronous, batched upsert process to replace dgraph live?

This syntax is wrong.
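Presumably the issue is the bracketed object in the last triple: an N-Quad takes a single object, not a list. A corrected sketch of the same block (assuming user_id has a hash or exact index so eq() works; if a variable matches nothing, uid() creates a new node):

upsert {
  query {
    v as var(func: eq(user_id, "phy_37531"))
    p as var(func: eq(user_id, "mat_456"))
  }

  mutation {
    set {
      uid(v) <name> "thmoas" .
      uid(v) <user_id> "phy_37531" .
      uid(p) <name> "lily" .
      uid(p) <user_id> "mat_456" .
      uid(v) <communicate> uid(p) .
    }
  }
}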

You can, by doing the following.

1 - First, try populating your cluster with the Bulk Loader; it is faster than the Live loader (a command sketch for steps 1 and 2 follows this list).

2 - Use the Live loader or the clients to insert continuous data.

3 - After the Bulk or Live loader run, you can perform upsert queries to link the entities (this is a common approach in graph databases).
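As a rough command sketch for steps 1 and 2 (file names and ports are placeholders, and flag spellings can vary between Dgraph versions, so check dgraph bulk --help and dgraph live --help):

# Step 1: one-time initial population of an empty cluster
dgraph bulk --files initial_data.rdf.gz --schema schema.txt --zero localhost:5080

# Step 2: daily / continuous additions
dgraph live --files new_data.rdf.gz --schema schema.txt --alpha localhost:9080 --zero localhost:5080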


But if I use the Live loader directly in step 2 and a node with the user_id “phy_37531” already exists, I will get two nodes with that user_id. How can I merge nodes that have the same user_id when using the Live loader?

In this case, you use a client that runs the upsert block.
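One thing that helps here (a sketch, not an official recipe): a single upsert request can carry many users at once, so the client can batch, say, a few hundred relationships per call instead of issuing one query per node. The third user_id below is made up for illustration:

upsert {
  query {
    a as var(func: eq(user_id, "phy_37531"))
    b as var(func: eq(user_id, "mat_456"))
    c as var(func: eq(user_id, "bio_789"))
  }

  mutation {
    set {
      uid(a) <user_id> "phy_37531" .
      uid(b) <user_id> "mat_456" .
      uid(c) <user_id> "bio_789" .
      uid(a) <communicate> uid(b) .
      uid(a) <communicate> uid(c) .
    }
  }
}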

You can use Bulk or Live loading only for data that doesn’t exist in the DB yet. If both pieces of data are already in the DB, you go to step 3. Do you always use blank nodes? If so, Bulk and Live have a special flag, “-x”, that stores the uid mapping. It can be useful for subsequent insertions.

For more details, run:

dgraph live -h | grep xid

I don’t know what your datasets look like, but imagine the following.

You brought your data from some other DB, from some source that doesn’t follow Dgraph standards (this can be useful even for CSV files). One way to link such entities is to use some value that relates them: an “id” (a foreign key or similar) or some unique value that exists on the entities and serves to identify them.

Simple schema sample

<friend>: [uid] .
<linkto>: string @index(hash) .
<name>: string @index(exact) .

Dataset sample

{
   "set": [
      {
         "name": "User 1",
         "linkto": "User 2"
      },
      {
         "name": "User 2",
         "linkto": "User 3"
      },
      {
         "name": "User 3",
         "linkto": "User 2"
      },
      {
         "name": "User 4",
         "linkto": "User 2"
      },
      {
         "name": "User 5",
         "linkto": "User 2"
      },
      {
         "name": "User 6",
         "linkto": "User 2"
      },
      {
         "name": "User 7",
         "linkto": "User 2"
      }
   ]
}

The upsert block to link them

You have to run this upsert repeatedly until there are no links left to process.

How do you know when the links are over? Easy: if the upsert response contains the “vars” field, there are still links to process; if it only contains the “uids” field, it’s over. (A plain check query is also sketched right after the upsert block below.)

upsert {
  query {
    v0 as var(func: has(linkto), first:1) { # Never remove the "first" param.
      LK as linkto
    }
    LINK as var(func: eq(name, val(LK)))
  }

  mutation {
    set {
      uid(v0) <friend> uid(LINK) .
      uid(LINK) <friend> uid(v0) .
    }
    delete {
      uid(v0) <linkto> * .
    }
  }
}
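If you prefer an explicit check over inspecting the upsert response, a plain query like this (just a sketch) returns an empty result once no node has a linkto left, which means the linking loop is done:

{
  remaining(func: has(linkto), first: 1) {
    uid
  }
}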

After the linking, the query

{
  q(func: has(name)) {
    name
    linkto  # just to check if this value exists
    friend {
      name
      linkto
    }
  }
}

The result

So you can see the data with the relations.

 {
  "data": {
    "q": [
      {
        "name": "User 7",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 1",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 2",
        "friend": [
          {
            "name": "User 7"
          },
          {
            "name": "User 1"
          },
          {
            "name": "User 3"
          },
          {
            "name": "User 4"
          },
          {
            "name": "User 5"
          },
          {
            "name": "User 6"
          }
        ]
      },
      {
        "name": "User 3",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 4",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 5",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 6",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      }
    ]
  }
}

Our data comes from Elasticsearch, and we update the link information in Dgraph every day at a fixed time. The daily incremental data is mainly relationships between users, and most of those users already exist in Dgraph. From the above, I need to use upsert to apply the daily new data, but the update speed is limited. Is there another way, besides upsert, to load incremental relationship data when most of the entities already exist in Dgraph?

I think the Live loader is not suitable for this case.

But the need for incremental data loads is very common in real-world scenarios. Is there a good way to handle this situation, where a large amount of data is applied on top of an existing Dgraph, while keeping the writes efficient?

If you have a unique ID for the entities in Elasticsearch, you could convert that information into a blank node. That way you could generate a UID mapping per blank node (see dgraph live -h | grep xid). That would theoretically be the fastest way.

So if you have a unique ID in ES, just transform it into the “uid” key.

{
  "set": {
    "uid": "_:ES_IDHere",
    "somePredicate": "some new data"
  }
}

(This won’t work without the -x mapping file.)

The mapped uids for each blank node will be recorded in a file, and you can reuse it whenever needed, with the Live loader or the Bulk loader.
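For example (file names, paths, and ports here are placeholders; check dgraph live -h | grep xid for the exact flag, typically -x/--xidmap), reusing the same mapping directory on every run keeps a blank node like _:ES_IDHere pointing at the same uid across daily loads:

# Day 1: writes the xid -> uid mapping into ./xidmap
dgraph live --files day1.json --xidmap ./xidmap --alpha localhost:9080 --zero localhost:5080

# Day 2: same mapping directory, so the same blank nodes resolve to the same uids
dgraph live --files day2.json --xidmap ./xidmap --alpha localhost:9080 --zero localhost:5080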

BTW, I almost forgot. Here goes a tip.

You can take unique decimal numbers and use them in any data insertion approach.

For example: you have IDs that are numbers only (no letters), say 4399 and 275, whose data you want to “merge” or update, without using upsert or anything like it.

So the decimal numbers will always correspond to the same hex UIDs (you can check with any decimal-to-hexadecimal converter):

4399 to hex is 0x112F
275 to hex is 0x113

So if you use

{
  "set": [
    {
      "uid": "4399",
      "somePredicate": "some new data edit 3"
    },
    {
      "uid": "275",
      "somePredicate": "some new data edit 3"
    }
  ]
}

(Don’t use the “_:” prefix or any letters in the uid value, so the parser knows what to do with it.)

They will always be mutated to 0x112F and 0x113, respectively. This approach is even faster, since it doesn’t rely on anything else.
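To double-check, a query over those two uids (a quick sketch; uid() accepts hex values, and per the post above the decimal forms map to them) should show the mutated predicate on both nodes:

{
  q(func: uid(0x112F, 0x113)) {
    uid
    somePredicate
  }
}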


Is the uid globally unique? If so, how will this work with uints as UIDs?

Yes, it is.

The uid is a uint64 and can be represented in string format.

For example, the number 249 is 0xf9 in hex:

{
  q(func: uid(249)) {
    uid
  }
}

You can use any method to convert numbers to hex and then prepend 0x. For example, 3500 (check it with a decimal-to-hexadecimal converter) is DAC in hex, so it is uid(0xDAC) or uid(3500).

I must have misunderstood the docs. I thought the uid was an unsigned int. Is it stored as a 64-bit unsigned integer?

Which part?

It is a uint64, an unsigned integer 64 bits wide. I’m not sure how it is really stored internally, whether it is like an address or something. But what is the issue?

If it were a uint, then the probability of collisions would be too high to generate a unique int from a hash of the UUID.

Ahh, you mean unique in the sense that a UUID is. In that case it doesn’t apply to Dgraph: a UUID is totally different from Dgraph’s UID, and it would be hard to stay fast with UUIDs.

I was thinking of the bulk import scenario at the start of this thread. Imagine that we are bulk importing from ES daily. Some of the entities in ES are new and some are mutations of previously imported entities. In order to not create duplicates of mutated entities, we must set uid, so one suggestion was to use a hash of the id in ES. Another option is to propagate the dgraph generated id back to ES which does not scale very well. What is your solution for this scenario? ES is the ground truth for some of the entities in the graph and we can’t alter that.

From what I understand, you cannot set the dgraph uid manually. The dgraph zero leases out blocks of uids at a time, so if you try to hash an id from another system and insert it as your dgraph ‘uid’ you’ll likely start to run into errors about trying to insert non-leased uids.

At the moment, Dgraph is a bit painful to use for workloads like this (constant syncing from an upstream system), since if you want the highest throughput possible, you’ll need to maintain a map somewhere of esDocId => Dgraph UID… which obviously becomes its own challenge when you have a distributed pool of writer processes. :)
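For reference, the upsert-based alternative discussed earlier in this thread avoids that external map, at the cost of throughput: store the ES document id on each node as an indexed predicate and key every daily mutation on it. A rough sketch (the predicate name es_doc_id and the value doc-42 are made up); the @upsert directive makes concurrent upserts on that predicate conflict instead of silently creating duplicates:

<es_doc_id>: string @index(hash) @upsert .

upsert {
  query {
    d as var(func: eq(es_doc_id, "doc-42"))
  }

  mutation {
    set {
      uid(d) <es_doc_id> "doc-42" .
      uid(d) <somePredicate> "refreshed value" .
    }
  }
}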