DESCRIPTION
In the spirit of PEPs and JEPs, I would like to get feedback on something I’ve been thinking about and am considering implementing (we have some free resources at the moment).
I’m not sure of the validity of this proposal or the exact challenges involved, so any feedback would be appreciated (wink, wink, @mrjn, @dmai). Specifically, it looks like some improvements have been made to the live loader, but I don’t know the details, so I don’t know whether this proposal is already moot.
My use case is cybersecurity: I have many data sets I would like to “link” together (think IP addresses as nodes, and DNS query nodes linking IP addresses), but I want to be able to import data at high speed later on (so I can’t use the bulk loader after the first data set). I have a 6-node Dgraph cluster with 384GB of RAM and 96 cores total.
I know this isn’t a typical OLTP workload, but I’m still interested in testing Dgraph for our purposes.
PROPOSAL
The short version: add a “fast path” endpoint to Dgraph Zero/Alpha that would directly import the bulk loader’s output via `DB.Load()`.
This avoids Raft consensus (which, if I’m understanding things correctly, is the current bottleneck for live loading) and brings us closer to the theoretical performance limit of sequential disk write speed.
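To make the mechanism concrete, here is a minimal local sketch of the two Badger primitives the fast path would lean on: `DB.Backup()` to stream the bulk loader’s output and `DB.Load()` to replay that stream on the receiving side. The directory paths and the in-process pipe are purely illustrative assumptions; in the actual proposal the two ends would sit on opposite sides of the new Zero/Alpha endpoint.

```go
// Minimal local sketch of the Badger primitives the fast path would rely on:
// stream Backup() from the bulk loader's output and feed it into Load() on a
// target instance. Paths and the in-process pipe are illustrative only.
package main

import (
	"io"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Bulk loader output (e.g. out/0/p) opened read-only as the source.
	src, err := badger.Open(badger.DefaultOptions("out/0/p").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	// Stand-in for the Alpha's Badger instance that would receive the data.
	dst, err := badger.Open(badger.DefaultOptions("/tmp/alpha-p"))
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	pr, pw := io.Pipe()
	go func() {
		// since=0 requests a full backup stream.
		_, err := src.Backup(pw, 0)
		pw.CloseWithError(err)
	}()

	// Load replays the backup stream; 256 is the max number of pending writes.
	if err := dst.Load(pr, 256); err != nil {
		log.Fatal(err)
	}
}
```

Whether Dgraph can accept a raw `Load()` like this without going through its own transaction/Raft machinery is exactly the open question.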
DETAILS
- User would use the bulk loader to generate a Dgraph-compatible Badger instance, just like today.
- User would call the `/assign` endpoint to allocate the right number of UIDs (see the sketch after this list).
- User would send the Badger data (maybe via `DB.Backup()`?), broken up by predicate, to the Zero/Alpha that is assigned that predicate.
- The Alpha/Zero would use `DB.Load()` to load the Badger data.
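For the second step, this is roughly the call I have in mind against Zero’s HTTP port (6080 by default). The response field names (`startId`/`endId`) are my assumption from the docs and worth double-checking against the Zero version in use.

```go
// Hypothetical sketch of step 2: ask Zero's HTTP /assign endpoint to reserve a
// block of UIDs before shipping the pre-built Badger data.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type assignResp struct {
	StartID string `json:"startId"` // first UID in the reserved range (assumed field name)
	EndID   string `json:"endId"`   // last UID in the reserved range (assumed field name)
}

func main() {
	const numUIDs = 1_000_000 // however many UIDs the bulk-loaded data set needs

	resp, err := http.Get(fmt.Sprintf("http://localhost:6080/assign?what=uids&num=%d", numUIDs))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var r assignResp
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reserved UIDs %s..%s\n", r.StartID, r.EndID)
}
```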
ASSUMPTIONS / UNKNOWNS
- The predicate might have to be locked against concurrent mutations (unless Zero/Alpha does this already?).
- The default gRPC maximum message size is 4MB, so it might be worth raising it so data can be sent in larger chunks (see the sketch after this list).
- I don’t understand whether the Badger data would have to go through the Zero leader before reaching the actual Alpha. I’m on a 10Gb network, so that’s not a big issue for me, but it might be for others.
- Having to make sure that the UIDs specified in the RDF/JSON are valid might be a performance issue?
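Regarding the gRPC message size, this is the kind of configuration I mean on both ends; the 256MB figure and the addresses are arbitrary assumptions for illustration, not recommendations.

```go
// Rough sketch of raising the gRPC message size limits on both ends so the
// backup stream can be chunked more coarsely than the 4MB default allows.
package main

import (
	"log"

	"google.golang.org/grpc"
)

const maxMsgSize = 256 << 20 // 256MB instead of the 4MB default (arbitrary choice)

func main() {
	// Server side (what the receiving Alpha/Zero endpoint would configure).
	_ = grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgSize),
		grpc.MaxSendMsgSize(maxMsgSize),
	)

	// Client side (what the tool shipping the Badger data would configure).
	conn, err := grpc.Dial("alpha1:9080",
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```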
INSPIRATION
ClickHouse supports importing data in a variety of formats, including a documented binary format. I’ve used this to great effect: generating the binary data “outside” the database and then importing it scaled very well.
QUESTIONS
- Is there a way to bulk-verify that a UID is valid, besides a query like `ip.addresses(func: uid(0x579683, 0x5af1c7))`? I’m presuming a query like this hits the Bloom/Cuckoo filters, so it’s pretty fast? (There is a sketch of the kind of check I mean after this list.)
- Has this idea been discounted already, or has it been subsumed by the new, improved live loader?
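For context on the first question, below is the kind of check I currently know how to do, written against the dgo client. The connection address, the `ip.address` predicate name, and the UIDs are assumptions lifted from the example above; if there is a cheaper bulk-verification API I would love to hear about it.

```go
// A hedged sketch of the only bulk "does this UID have data" check I know of:
// query a list of UIDs and filter on a known predicate to see which UIDs come
// back with actual data behind them.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Filtering on has(ip.address) so only UIDs that actually store that
	// predicate are returned; bare uid(...) queries can echo back UIDs that
	// hold no data. The predicate name is an assumption.
	const q = `{
	  check(func: uid(0x579683, 0x5af1c7)) @filter(has(ip.address)) {
	    uid
	  }
	}`

	resp, err := dg.NewReadOnlyTxn().Query(context.Background(), q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Json))
}
```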
Thanks ahead of time for any tips, especially for any pointers in the code!