Dgraph Enhancement Proposal: bulk + live loader?


#1

DESCRIPTION
In the spirit of PEPs and JEPs, I would like to get feedback on an idea I’ve been considering and may implement (we have some free resources at the moment).

I’m not sure of the validity or the exact challenges of this proposal, so any feedback would be appreciated (wink, wink, @mrjn, @dmai). Specifically, it looks like some improvements have been made to the live loader, but I don’t know the details, so I don’t know whether they make this proposal obsolete.

My use case is cybersecurity: I have many data sets I would like to “link” together (think of IP addresses as nodes, with DNS query nodes linking them), and I want to be able to import more data at high speed later on, so I can’t use the bulk loader after the first data set. I have a 6-node Dgraph cluster with 384GB of RAM and 96 cores total.

I know this isn’t a typical OLTP workload, but I’m still interested in testing Dgraph for our purposes.

PROPOSAL
The short version: add a “fast path” to the Dgraph Zero/Alpha endpoint that would directly import the bulk loader’s output via DB.Load().

This avoids Raft consensus (if I’m understanding things correctly, this is the current bottleneck for live loading) and brings us closer to the theoretical performance limit, which is the disk’s sequential write speed.

DETAILS

  1. User would use the bulk loader to generate a Dgraph-compatible Badger instance, just like today.
  2. User would call the /assign endpoint to allocate the right amount of UIDs.
  3. User would send the Badger data (maybe via DB.Backup()?), broken up by predicate, to the zero/alpha that is assigned that predicate.
  4. The alpha/zero would use DB.Load() to load the Badger data.
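
Step 3 above can be sketched roughly as follows. This is only an illustration of "broken up by predicate", not how Dgraph does it: the `KV` type, the `tablets` map (predicate to group ID), and `partitionByGroup` are all hypothetical names, and real bulk loader output encodes the predicate inside a binary key rather than a plain field.

```go
package main

import (
	"fmt"
	"sort"
)

// KV is a hypothetical stand-in for one key-value pair produced by the bulk
// loader. The predicate is a plain field here purely for readability.
type KV struct {
	Predicate string
	Key       []byte
	Value     []byte
}

// partitionByGroup buckets KVs by the group that serves their predicate.
// tablets maps predicate -> group ID (an assumed lookup table). KVs whose
// predicate has no assignment are returned separately so the caller can
// decide what to do with them.
func partitionByGroup(kvs []KV, tablets map[string]uint32) (map[uint32][]KV, []KV) {
	byGroup := make(map[uint32][]KV)
	var unassigned []KV
	for _, kv := range kvs {
		gid, ok := tablets[kv.Predicate]
		if !ok {
			unassigned = append(unassigned, kv)
			continue
		}
		byGroup[gid] = append(byGroup[gid], kv)
	}
	return byGroup, unassigned
}

func main() {
	tablets := map[string]uint32{"ip.address": 1, "dns.query": 2}
	kvs := []KV{
		{Predicate: "ip.address", Key: []byte("k1"), Value: []byte("10.0.0.1")},
		{Predicate: "dns.query", Key: []byte("k2"), Value: []byte("example.com")},
		{Predicate: "ip.address", Key: []byte("k3"), Value: []byte("10.0.0.2")},
		{Predicate: "new.pred", Key: []byte("k4"), Value: []byte("x")},
	}

	byGroup, unassigned := partitionByGroup(kvs, tablets)

	// Print groups in a stable order.
	gids := make([]int, 0, len(byGroup))
	for gid := range byGroup {
		gids = append(gids, int(gid))
	}
	sort.Ints(gids)
	for _, gid := range gids {
		fmt.Printf("group %d: %d KVs\n", gid, len(byGroup[uint32(gid)]))
	}
	fmt.Printf("unassigned: %d KVs\n", len(unassigned))
}
```

Predicates that have no owner yet (like `new.pred` here) would need an assignment before their data could be shipped anywhere, which is one more thing the fast path would have to handle.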

ASSUMPTIONS / UNKNOWNS

  1. The predicate might have to be locked to concurrent mutations (unless the Zero/Alpha does this already?).
  2. Default grpc maximum message size is 4MB, so it might be worth changing that so data can be sent in larger chunks.
  3. I don’t know whether the Badger data would have to go through the Zero leader before going to the actual Alpha. I’m on a 10Gb network, so that’s not a big issue for me, but it might be for others.
  4. Having to make sure that the uids specified in the RDF/JSON are valid might be a performance issue?

INSPIRATION
Clickhouse supports importing data using a variety of different formats, including a documented binary format. I’ve used this to great effect to generate the binary data “outside” of the database and then importing it, and the approach scaled very well.

QUESTIONS

  • Is there a way to bulk-verify that a UID is valid, besides a query like ip.addresses(func: uid(0x579683, 0x5af1c7))? I’m presuming a query like this hits the Bloom / Cuckoo filter, so it’s pretty fast?
  • Has this idea been discounted already, or subsumed by the new improved live loader?
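
For reference, the kind of query mentioned in the first question could be generated for a whole batch of UIDs along these lines. This is just string building: `uidQuery`, the block name `q`, and the `ip.address` predicate are illustrative assumptions, and nothing here is sent to a server.

```go
package main

import (
	"fmt"
	"strings"
)

// uidQuery builds a query that looks up a batch of UIDs in one round trip
// and asks for the uid plus one predicate (pred) on each node.
func uidQuery(uids []uint64, pred string) string {
	parts := make([]string, len(uids))
	for i, u := range uids {
		parts[i] = fmt.Sprintf("%#x", u) // hex form, e.g. 0x579683
	}
	return fmt.Sprintf("{ q(func: uid(%s)) { uid %s } }", strings.Join(parts, ", "), pred)
}

func main() {
	fmt.Println(uidQuery([]uint64{0x579683, 0x5af1c7}, "ip.address"))
}
```

Whether a response that contains only `uid` actually proves a node exists is not something this thread establishes, which is why the sketch also selects a predicate the nodes are expected to have.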

Thanks ahead of time for any tips, especially for any pointers in the code!


(Manish R Jain) #2

Not a bad idea. Something I’ve thought about in the past, but there’re some big challenges there which make it hard to implement. FWIW, we don’t need to change how things get proposed, i.e. Raft mechanisms remain the same. Also, Zero does not see data, only metadata (uids, txns).

The problem with such an approach is the merging of existing data with the new data, in particular handling of indices. In the live path, the data and corresponding index updates go together as one transaction. In a bulk-style path, where we pre-calculate all the posting lists (as they’d be stored in Badger), both the data and the corresponding indices, the data and indices would be applied under different transactions, which could cause correctness issues.
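
To make the concern concrete, here is a toy model (not Dgraph code, and with no real transactions) of what a reader could observe if data and index entries are applied in separate steps. All type and function names are made up for the illustration.

```go
package main

import "fmt"

// store is a toy model: data maps uid -> value, index maps value -> uids.
type store struct {
	data  map[uint64]string
	index map[string][]uint64
}

func newStore() *store {
	return &store{data: map[uint64]string{}, index: map[string][]uint64{}}
}

// applyTogether models the live path: value and index entry land in one step.
func (s *store) applyTogether(uid uint64, val string) {
	s.data[uid] = val
	s.index[val] = append(s.index[val], uid)
}

// applyData and applyIndex model a bulk-style path where the two are
// written as separate steps.
func (s *store) applyData(uid uint64, val string)  { s.data[uid] = val }
func (s *store) applyIndex(val string, uid uint64) { s.index[val] = append(s.index[val], uid) }

// consistent reports whether every stored value can be found via the index.
func (s *store) consistent() bool {
	for uid, val := range s.data {
		found := false
		for _, u := range s.index[val] {
			if u == uid {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	live := newStore()
	live.applyTogether(1, "10.0.0.1")
	fmt.Println("live path consistent:", live.consistent())

	bulk := newStore()
	bulk.applyData(1, "10.0.0.1")
	// A reader arriving here sees the value but cannot find it via the index.
	fmt.Println("bulk path between steps:", bulk.consistent())
	bulk.applyIndex("10.0.0.1", 1)
	fmt.Println("bulk path after both steps:", bulk.consistent())
}
```

The window between the two steps is where an index-based query would miss data that is already present, which is one way to read the correctness issue described above.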

But some mechanism may still be possible. Needs more careful thought.


#3

Thank you for the response, @mrjn.

As I’m sure you know, it’s quite common to disable/remove indices before performing bulk-load operations in other databases (I’m thinking of PostgreSQL here).

Perhaps a similar operation can be used here? Or, better yet, the bulk + live loader could remember the schema’s indices before bulk load and restore them afterwards.
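
A rough sketch of the "remember and restore" idea, modelling the schema as a plain map from predicate to its index tokenizers. `Schema`, `stripIndices`, and `restoreIndices` are hypothetical; in a real cluster these would be schema alterations, and restoring an index would presumably trigger a rebuild, which is not modelled here.

```go
package main

import "fmt"

// Schema is a simplified model: predicate -> list of index tokenizers.
// An empty list means the predicate is not indexed.
type Schema map[string][]string

// stripIndices returns a copy of s with all indices removed, plus the
// removed indices so they can be restored after the bulk load.
func stripIndices(s Schema) (stripped, saved Schema) {
	stripped, saved = Schema{}, Schema{}
	for pred, toks := range s {
		stripped[pred] = nil
		if len(toks) > 0 {
			saved[pred] = append([]string(nil), toks...) // defensive copy
		}
	}
	return stripped, saved
}

// restoreIndices returns a copy of s with the saved indices put back.
func restoreIndices(s, saved Schema) Schema {
	out := Schema{}
	for pred, toks := range s {
		out[pred] = append([]string(nil), toks...)
	}
	for pred, toks := range saved {
		out[pred] = append([]string(nil), toks...)
	}
	return out
}

func main() {
	schema := Schema{"ip.address": {"exact"}, "dns.query": {"term", "trigram"}, "seen.count": nil}

	stripped, saved := stripIndices(schema)
	fmt.Println("indexed predicates during load:", countIndexed(stripped))

	restored := restoreIndices(stripped, saved)
	fmt.Println("indexed predicates after restore:", countIndexed(restored))
}

// countIndexed counts predicates that have at least one tokenizer.
func countIndexed(s Schema) int {
	n := 0
	for _, toks := range s {
		if len(toks) > 0 {
			n++
		}
	}
	return n
}
```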