Upload this CSV file – timings with or without uid mapping

The attached CSV file has population figures for every country, gender and year. It has ca. 20,000 rows and 4 columns, which works out to roughly 100,000 N-Quads.

population.csv (702.1 KB)

3 questions:
a) Coming from a Python environment, which method is best to get this CSV data inserted fast?

Here’s a benchmark with a similar dataset for Pandas bulk insert to Postgres:

b) How long would it take (roughly) to insert this without any checks?

c) If each row had a UID (treating the columns as properties and the UID as the node), how much additional time (% penalty) could such a check introduce?

I'm trying to devise a way to upload such CSV files fast, ideally many of them concurrently, with some safety checks … Perhaps someone has experience with this?
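For question a), one common approach from Python is to pre-format the CSV into N-Quads and hand the result to the live loader. Here is a minimal sketch of that conversion step; the column names (`country`, `gender`, `year`, `population`) are assumptions based on the description above, so adjust them to the real header:

```python
import csv
import io

def csv_to_nquads(csv_text):
    """Convert CSV rows to N-Quad lines, one blank node per row.

    Column names are taken from the CSV header; each row becomes a
    blank-node subject with one triple per column.
    """
    nquads = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        subject = f"_:row{i}"
        for predicate, value in row.items():
            # Escape double quotes so the literal stays valid N-Quad syntax.
            literal = value.replace('"', '\\"')
            nquads.append(f'{subject} <{predicate}> "{literal}" .')
    return "\n".join(nquads)

sample = "country,gender,year,population\nSweden,female,2020,5200000\n"
print(csv_to_nquads(sample))
```

Writing the output to a `.rdf` file and feeding it to `dgraph live` avoids doing 20,000 round-trips from Python.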


100k N-Quads would be very fast to insert, probably not even a large enough dataset to benchmark meaningfully. The live loader would load a pre-formatted version of this in maybe 1-2 seconds (a big SWAG, but you get the point). Obviously it depends on how your Dgraph cluster is provisioned, but assuming an appropriately sized system it will be super quick.

Using upserts to idempotently insert each of the 100k records against an existing node (xid -> uid translation) would add some overhead, but reading 100k strings out of Dgraph is very fast, so it would probably cost only another second or so. Again, the numbers here are so small that you would not get consistent execution times.
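To illustrate the xid -> uid idea, here is a hedged sketch of how one might build a DQL upsert (query plus set-mutation) per record in Python. The predicate name `xid` and the helper `build_upsert` are hypothetical; a real implementation would send the pair through pydgraph as a query-plus-mutation request, and a full create-if-missing upsert would also need conditional (`@if`) mutation branches, which this sketch omits:

```python
def build_upsert(xid, props):
    """Build a DQL upsert query and set-mutation for one record.

    The query binds the uid of the node whose `xid` predicate matches,
    and the N-Quads reference it via uid(v). This covers only the
    update path; create-if-missing needs conditional mutations.
    """
    query = f'{{ q(func: eq(xid, "{xid}")) {{ v as uid }} }}'
    set_nquads = [f'uid(v) <xid> "{xid}" .']
    for pred, val in props.items():
        set_nquads.append(f'uid(v) <{pred}> "{val}" .')
    return query, "\n".join(set_nquads)

query, nquads = build_upsert("SE-2020-female", {"population": "5200000"})
print(query)
print(nquads)
```

Each upsert costs one indexed lookup on `xid`, which is where the percentage penalty in question c) would come from.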

But it all depends on the shape of your data and which indexes are built as you insert each record (a trigram index on long strings, for example, can increase the amount of data actually written by a lot).
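As a hypothetical illustration of that index cost, compare two predicates in a Dgraph schema; the predicate names are made up, but a trigram index on a long string predicate writes many index entries per value, while an int index is cheap:

```
# hypothetical schema fragment: trigram index on a long string is
# much heavier at write time than an int index
description: string @index(trigram) .
population:  int    @index(int) .
```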

Sorry, not a real answer other than ‘probably pretty fast’

Thank you for the pointers and examples, that’s super helpful :+1: