How to merge records for the same social account across multiple batches of data in Dgraph

yeahvip · January 18, 2023, 8:49am

In my use case, I receive multiple batches of social records through multiple channels. How should I aggregate the records that belong to the same social account across batches? I know about upsert, but it is too slow for large amounts of data.

matthewmcneely · January 18, 2023, 8:00pm

@yeahvip Are you planning to use the live loader? If so, check out the -xidmap flag. This would work for initial and subsequent loads, but not if you’ve already got social account IDs in your graph.

One approach I’ve used in the past with success is to pull existing IDs from the graph in the batch loader code. When examining a record to add, check to see if the external ID already has a Dgraph ID. If so, associate the existing ID in the exported RDF or JSON, otherwise assign it a blank uid: _:<external id>
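
A minimal sketch of that lookup step, assuming the existing external-ID-to-uid pairs have already been pulled from the graph into a plain dictionary. The predicate names `xid` and `name`, the function names, and the record shape are hypothetical and only serve the illustration:

```python
import json
from typing import Dict, Iterable, List, Tuple


def subject_for(external_id: str, known_uids: Dict[str, str]) -> str:
    """Return <uid> if the external ID is already in the graph, else a blank node."""
    uid = known_uids.get(external_id)
    if uid is not None:
        return f"<{uid}>"
    # Assumption: the external ID is already a valid blank node label.
    # IDs with spaces or unusual characters would need sanitizing first.
    return f"_:{external_id}"


def to_nquads(records: Iterable[Tuple[str, str]], known_uids: Dict[str, str]) -> List[str]:
    """Turn (external_id, name) records into RDF N-Quad lines for a loader."""
    lines = []
    for external_id, name in records:
        subject = subject_for(external_id, known_uids)
        # json.dumps is used here only to quote and escape the string literals.
        lines.append(f"{subject} <xid> {json.dumps(external_id)} .")
        lines.append(f"{subject} <name> {json.dumps(name)} .")
    return lines


# Hypothetical data: acct_1001 is already in the graph, acct_1002 is new.
known = {"acct_1001": "0x2a"}
batch = [("acct_1001", "Alice"), ("acct_1002", "Bob")]
for line in to_nquads(batch, known):
    print(line)
```

Records whose external ID is already known are written against the existing uid, so loading them updates the existing node instead of creating a new one.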

yeahvip · February 7, 2023, 3:27am

Hello. With the approach you mentioned, we would need to maintain the external ID dictionary all the time. Our tests show that it works by looking up the uid through the mapping between external ID and uid when data is loaded. If the volume of incoming data is large, the dictionary lookup for the new data becomes the bottleneck. In addition, if the same data arrives in two batches, the uid is regenerated because of the blank node, so the external ID has to be queried again. Will future versions of Dgraph support custom uids rather than automatically generated ones?

matthewmcneely · February 7, 2023, 4:29pm

Ah, that’s a different issue perhaps. Are you aware of the @id directive in Dgraph? https://dgraph.io/docs/graphql/schema/ids/#the-id-directive Maybe this is more to the point.

yeahvip · February 9, 2023, 2:33am

What you mentioned is a GraphQL feature. Our system is built almost entirely on DQL. Is there an equivalent mechanism in DQL?

matthewmcneely · February 9, 2023, 2:40am

DQL doesn’t have the @id directive, but upsert operations are supported: https://dgraph.io/docs/mutations/upsert-block/
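
As a rough illustration of what such an upsert looks like, the sketch below builds the text of a DQL upsert block for a single account. The predicate names `xid` and `name` and the helper `build_upsert` are assumptions made for the example, and sending the text to Dgraph through a client is left out:

```python
import json


def build_upsert(xid: str, name: str) -> str:
    """Build a DQL upsert block that reuses the node with this xid or creates it."""
    # json.dumps is used here only to quote and escape the string literals.
    xid_lit = json.dumps(xid)
    name_lit = json.dumps(name)
    # Assumption: the schema indexes the predicate so that eq() can be used,
    # for example: xid: string @index(exact) .
    return (
        "upsert {\n"
        "  query {\n"
        f"    q(func: eq(xid, {xid_lit})) {{\n"
        "      v as uid\n"
        "    }\n"
        "  }\n"
        "  mutation {\n"
        "    set {\n"
        f"      uid(v) <xid> {xid_lit} .\n"
        f"      uid(v) <name> {name_lit} .\n"
        "    }\n"
        "  }\n"
        "}\n"
    )


print(build_upsert("acct_1001", "Alice"))
```

If the query finds a node with the given `xid`, `uid(v)` refers to that node; if it finds none, the mutation creates a new node. The same account arriving in a later batch therefore lands on the same uid.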

yeahvip · February 13, 2023, 1:46am

The speed of upsert is unacceptable with millions of RDF triples, and when upsert is used, dgraph live and dgraph bulk can’t be used.

matthewmcneely · February 13, 2023, 7:33pm

Right, so I think your best option is to always use the -xidmap flag in bulk/live loading.

MichelDiz · February 13, 2023, 8:13pm

Can I ask how you arrived at this conclusion? How did you test it? I am assuming you tested it properly before calling it unacceptable.

In my opinion it depends on the situation. Upserts run concurrently, each in its own goroutine. Depending on how you build your upsert query, it will probably be faster than the live loader (though not the bulk loader) for this job, because it runs concurrently inside the database.

As for live and bulk not being usable with upsert: not precisely. The live loader has a flag called “upsertPredicate” (see dgraph/dgraph/cmd/live/run.go at f893f96f389218fe26bc638828d7fa57c61afec8 in dgraph-io/dgraph on GitHub), and it creates upserts based on the XID value.

I haven’t done a comparison yet, but mapping is certainly faster. If you use XIDs, you necessarily have to use upsert; if you don’t, I would recommend xidmap. The names can be confusing: xidmap (which maps blank nodes to UIDs) is one thing, and XIDs and external XIDs are another. They are not necessarily the same thing.
For XIDs, see external-ids-upsert-block.
