Upload this CSV file – timings with or without uid mapping

The attached CSV file has population figures for every country, gender and year. It has roughly 20 000 rows and 4 columns, which works out to something like 100 000 N-Quads.

population.csv (702.1 KB)

3 questions:
a) Coming from a Python environment, which method is best to get this CSV data inserted fast?

For comparison, here’s a benchmark with a similar dataset for a Pandas bulk insert into Postgres:

b) How long would it take (roughly) to insert this without any checks?

c) If each row had a UID (treating the columns as properties and the UID as the node), how much additional time (% penalty) would checking for that UID introduce?

I’m trying to devise a way to upload such CSV files quickly, ideally many of them concurrently, with some safety checks … Perhaps someone has experience with this?

Thanks!

100k N-Quads would load very fast, probably not even enough data to benchmark meaningfully. The live loader would insert a pre-formatted version of this in maybe 1–2 s (a rough guess, but you get the point). Obviously it matters what your Dgraph instance is provisioned with, but assuming an appropriately sized system it will be super quick.
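For reference, a minimal sketch of what “pre-formatted” could look like: convert the CSV to N-Quads with Python’s csv module and feed the file to the live loader. The column names and the blank-node scheme below are assumptions, not taken from your actual file:

```python
import csv

# Assumed column names; adjust to the real header of population.csv.
COLUMNS = ["country", "gender", "year", "population"]

with open("population.csv", newline="") as src, open("population.rdf", "w") as dst:
    reader = csv.DictReader(src)
    for i, row in enumerate(reader):
        # One blank node per CSV row, one triple per column.
        subject = f"_:row{i}"
        for col in COLUMNS:
            dst.write(f'{subject} <{col}> "{row[col]}" .\n')

# Then load it, e.g.:
#   dgraph live -f population.rdf -a localhost:9080 -z localhost:5080
# (flag names vary a bit between Dgraph versions)
```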

Using upserts to insert each of the 100k things idempotently onto the right node (xid -> uid translation) would add some overhead, but reading 100k strings out of Dgraph is really fast, so it would probably only add another second or so. Again, the numbers here are so small that you would not get a consistent execution time.
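If you go the upsert route from Python, a rough sketch of the xid -> uid translation per row with pydgraph might look like this. The `xid` predicate name, the client address and the example values are assumptions, and `xid` would need an index for the lookup:

```python
import pydgraph

# Assumes a running Dgraph Alpha at this address and an indexed `xid` predicate.
stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

def upsert_row(xid, props):
    """Insert/update one CSV row idempotently, keyed on an external id."""
    txn = client.txn()
    try:
        # Look up the node with this external id; uid(v) in the mutation
        # resolves to the existing uid if found, or creates a new node.
        query = f'{{ q(func: eq(xid, "{xid}")) {{ v as uid }} }}'
        nquads = f'uid(v) <xid> "{xid}" .\n'
        nquads += "".join(f'uid(v) <{k}> "{v}" .\n' for k, v in props.items())
        mutation = txn.create_mutation(set_nquads=nquads)
        request = txn.create_request(query=query, mutations=[mutation], commit_now=True)
        txn.do_request(request)
    finally:
        txn.discard()

# Hypothetical row values, purely for illustration.
upsert_row("SWE-2020-F", {"country": "Sweden", "gender": "F",
                          "year": "2020", "population": "5222000"})
```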

But it all depends on the shape of your data and what indices are being built as each thing is inserted (for example, a trigram index on long strings can increase the amount of data you are actually writing by a lot).
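To make the index point concrete, the index choice is just part of the schema you set before loading. A sketch with pydgraph, with predicate names assumed from your CSV description:

```python
import pydgraph

client = pydgraph.DgraphClient(pydgraph.DgraphClientStub("localhost:9080"))

# Hypothetical schema for the CSV columns: a hash index on `xid` is all the
# upsert lookup above needs; adding e.g. @index(trigram) to a long string
# predicate would grow the index data written per insert considerably.
schema = """
xid: string @index(hash) .
country: string @index(exact) .
gender: string @index(exact) .
year: string .
population: string .
"""
client.alter(pydgraph.Operation(schema=schema))
```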

Sorry, not a real answer other than ‘probably pretty fast’

Thank you for the pointers and examples, that’s super helpful :+1: