How to update a large amount of data in Dgraph every day

mutation

(yeahvip) #1

I have built a knowledge graph using communication information collected by a specific app. Every node in Dgraph has a specific user_id. Every day the app collects many new relationships between user_ids that need to be added to Dgraph, but many of those user_id nodes already exist in Dgraph. My current practice is to query Dgraph by user_id for every node in the new data to get its uid, and then upsert the node. Is there a more efficient way to mutate the data? I want to know how to improve the efficiency of these updates, or whether there is some configuration that lets me use dgraph live to improve efficiency.


(Shekar Mantha) #2

Hi, thank you for the question. Let me find someone to help you.


(Michel Conrado) #3

Can you elaborate on what upsert procedure you are doing?
To me, the upsert block and bulk upserts should work fine for your case.

https://docs.dgraph.io/mutations/#upsert-block


(yeahvip) #4
upsert {
  query {
    v as var(func: eq(user_id, "phy_37531"))
    p as var(func: eq(user_id, "mat_456"))
  }

  mutation {
    set {
      uid(v) <name> "thmoas" .
      uid(v) <user_id> "phy_37531" .
      uid(p) <name> "lily" .
      uid(p) <user_id> "mat_456" .
      uid(v) <communicate> [uid(p)] .
    }
  }
}

This is what I do in my procedure. But this way I can’t use dgraph live directly, and I need to query every node that should be updated. I don’t know how to batch this procedure, and I’m not using a synchronized process. Is there a more efficient way to update Dgraph? Or should I design a synchronous, batched upsert process to replace dgraph live?


(Michel Conrado) #5

This syntax is wrong.

You can make this more efficient by doing the following (see the command sketch after the list).

1 - First populate your cluster using the Bulk Loader. It is faster than the Live Loader.

2 - Use the Live Loader or clients to insert continuous data.

3 - After the Bulk or Live Loader run, you can perform upsert queries to link entities. (This is a common approach in graph databases.)
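
As a rough command sketch for steps 1 and 2 (the file names here are placeholders, and the flags can vary between Dgraph versions, so check dgraph bulk --help and dgraph live --help):

# 1 - Initial load with the Bulk Loader (run against Dgraph Zero, before the Alphas start serving data).
dgraph bulk -f initial_data.rdf.gz -s schema.txt -z localhost:5080

# 2 - Continuous daily loads with the Live Loader against the running cluster.
dgraph live -f daily_increment.rdf.gz -s schema.txt -a localhost:9080 -z localhost:5080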


(yeahvip) #6

But if I use the Live Loader in step 2 directly and a node with user_id “phy_37531” already exists, I will end up with two nodes that have the user_id “phy_37531”. How can I merge nodes that have the same user_id when using the Live Loader?


(Michel Conrado) #7

In this case, you use a client that runs the upsert block.
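
Roughly something like this with the Python client (a sketch only; it assumes pydgraph v2+, the address is a placeholder, and the edge line is written without the brackets, i.e. uid(v) <communicate> uid(p) .):

import pydgraph

# Hypothetical connection details; adjust to your cluster.
stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

def upsert_pair(src_id, src_name, dst_id, dst_name):
    """Upsert both user_id nodes and the edge between them in a single request."""
    query = f"""
    query {{
      v as var(func: eq(user_id, "{src_id}"))
      p as var(func: eq(user_id, "{dst_id}"))
    }}"""
    nquads = f"""
    uid(v) <user_id> "{src_id}" .
    uid(v) <name> "{src_name}" .
    uid(p) <user_id> "{dst_id}" .
    uid(p) <name> "{dst_name}" .
    uid(v) <communicate> uid(p) .
    """
    txn = client.txn()
    try:
        mu = txn.create_mutation(set_nquads=nquads)
        req = txn.create_request(query=query, mutations=[mu], commit_now=True)
        txn.do_request(req)
    finally:
        txn.discard()  # no-op if the commit already succeeded

upsert_pair("phy_37531", "thmoas", "mat_456", "lily")
stub.close()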

You can use the Bulk or Live Loader only for data that doesn’t exist in the DB yet. If the data is already in the DB, you go to step 3. Do you always use blank nodes? If so, Bulk and Live have a special flag, “-x”, that stores the uid mapping. It can be useful for later insertions.

For more details, run:

dgraph live -h | grep xid

I don’t know what your datasets look like, but imagine the following.

You brought your data from some other DB or from a source that doesn’t follow Dgraph conventions (this applies even to CSV files). One way to link such entities is to use some value that relates them, like an “id” (a foreign key or something similar), or any unique value that exists on the entities and serves to identify them.

Simple schema sample

<friend>: [uid] .
<linkto>: string @index(hash) .
<name>: string @index(exact) .

Dataset sample

{
   "set": [
      {
         "name": "User 1",
         "linkto": "User 2"
      },
      {
         "name": "User 2",
         "linkto": "User 3"
      },
      {
         "name": "User 3",
         "linkto": "User 2"
      },
      {
         "name": "User 4",
         "linkto": "User 2"
      },
      {
         "name": "User 5",
         "linkto": "User 2"
      },
      {
         "name": "User 6",
         "linkto": "User 2"
      },
      {
         "name": "User 7",
         "linkto": "User 2"
      }
   ]
}

The upsert block to link them

You have to run this upsert repeatedly, one link at a time, until the links are over (see the loop sketch after the upsert block).

How do you know the links are over? Easy: if the upsert response has the “vars” field, there are still links. If it has only the “uids” field, it’s over.

upsert {
  query {
    v0 as var(func: has(linkto), first: 1) {  # Never remove the "first" param.
      LK as linkto
    }
    LINK as var(func: eq(name, val(LK)))
  }

  mutation {
    set {
      uid(v0) <friend> uid(LINK) .
      uid(LINK) <friend> uid(v0) .
    }
    delete {
      uid(v0) <linkto> * .
    }
  }
}
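
If you want to script the repetition, here is a rough sketch with the Python client (pydgraph v2+ assumed; the address is a placeholder, and the pending_links helper is my own. Instead of inspecting the upsert response for the “vars” field, it simply asks Dgraph whether any node still has linkto, so it does not depend on the exact response format of your version):

import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

UPSERT_QUERY = """
query {
  v0 as var(func: has(linkto), first: 1) {
    LK as linkto
  }
  LINK as var(func: eq(name, val(LK)))
}
"""

SET_NQUADS = """
uid(v0) <friend> uid(LINK) .
uid(LINK) <friend> uid(v0) .
"""

DEL_NQUADS = """
uid(v0) <linkto> * .
"""

def pending_links():
    """Return True while at least one node still has a linkto value."""
    resp = client.txn(read_only=True).query('{ q(func: has(linkto), first: 1) { uid } }')
    return len(json.loads(resp.json)["q"]) > 0

while pending_links():
    txn = client.txn()
    try:
        mu = txn.create_mutation(set_nquads=SET_NQUADS, del_nquads=DEL_NQUADS)
        req = txn.create_request(query=UPSERT_QUERY, mutations=[mu], commit_now=True)
        txn.do_request(req)
    finally:
        txn.discard()  # no-op if the commit already succeeded

stub.close()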

The query after linking

{
  q(func: has(name)) {
    name
    linkto  # just to check if this value exists
    friend {
      name
      linkto
    }
  }
}

The result

So you can see the data with the relations.

 {
  "data": {
    "q": [
      {
        "name": "User 7",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 1",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 2",
        "friend": [
          {
            "name": "User 7"
          },
          {
            "name": "User 1"
          },
          {
            "name": "User 3"
          },
          {
            "name": "User 4"
          },
          {
            "name": "User 5"
          },
          {
            "name": "User 6"
          }
        ]
      },
      {
        "name": "User 3",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 4",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 5",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      },
      {
        "name": "User 6",
        "friend": [
          {
            "name": "User 2"
          }
        ]
      }
    ]
  }
}

(yeahvip) #8

Our data comes from Elasticsearch, and we update the link information in Dgraph every day at a fixed time. The daily incremental data is mainly relationships between users, and most of the users already exist in Dgraph. From the above, I need to use upsert to apply the daily new data, but the update speed is limited. Besides “upsert”, is there another way to load incremental relationship data when most of the entities already exist in Dgraph?


#9

I think the Live Loader is not suited for this case.


(yeahvip) #10

But the need for incremental data loading is very common in real-world scenarios. Is there a good way to handle this situation, where a large amount of data is applied on top of an existing Dgraph, while keeping the loading efficient?


(Michel Conrado) #11

If in Elasticsearch you have a unique ID for the entities, you could convert that ID into a blank node. That way you could generate the uid mapping by blank node (see dgraph live -h | grep xid). Theoretically, that would be the fastest way.

So if you have a unique ID in ES, just turn it into the “uid” key.

{
	"set": {
		"uid": "_:ES_IDHere", # it won't work without the -x file
		"somePredicate": "some new data"
	}
}

The mapped uid for each blank node will be recorded in a file, and you can always reuse it as needed, with the Live Loader or the Bulk Loader.
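
In command form, the idea is roughly this (file names are placeholders; check dgraph live --help for the exact flag names in your version):

# First load: store the blank-node-to-uid mapping in the "xids" directory.
dgraph live -f day1.rdf.gz -s schema.txt -a localhost:9080 -z localhost:5080 -x xids

# Later daily increments: reuse the same mapping so _:ES_IDHere resolves to the same uid again.
dgraph live -f day2.rdf.gz -a localhost:9080 -z localhost:5080 -x xids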


(Michel Conrado) #12

BTW, I almost forgot. Here goes a tip.

You can use unique decimal numbers as uids and use them in any data insertion approach.

e.g.:
You have IDs made of numbers only (no letters). Say you have the ids 4399 and 275 whose data you want to “merge” or update, without using upsert or anything else.

The decimal numbers will always map to the same hex uids.

https://www.rapidtables.com/convert/number/decimal-to-hex.html

4399 to hex is 0x112F
275 to hex is 0x113

So if you use

{
	"set":  [
	{
		"uid": "4399", # Don't use _: or letters - so the parser knows what to do
		"somePredicate": "some new data edit 3"
	},
	{
		"uid": "275",
		"somePredicate": "some new data edit 3"
	}
]
}

They will always be mutated to 0x112F and 0x113, respectively. This approach is even faster, since it doesn’t rely on anything else.
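
As a quick sanity check, you can query those uids back directly to confirm where the decimal ids landed:

{
  check(func: uid(0x112f, 0x113)) {
    uid
    somePredicate
  }
}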