[Feature Request] In-memory mode

Moved from GitHub dgraph/4813

Posted by marvin-hansen:

TL;DR:
Add in-memory mode to boost performance and use in-memory (predicate) compression to pack as much data as possible in the given memory. That smokes a lot of benchmarks and helps a lot with graph OLAP.

Experience Report

Right now, we get about 6 - 10 ms latency with a 5-node Dgraph cluster. That’s nice, but only because we load balance across all Alphas and the current graph is small enough to fit completely in memory. Specifically, we run a count query against every predicate during a quiet hour to fetch them all, so they are already in memory when needed. That works very well: latency is very low and it’s relatively easy to script.
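For context, such a warm-up step could look roughly like this. This is only a sketch: the endpoint, port, and the `name` predicate are placeholder assumptions, and the `Content-Type` header varies between Dgraph versions (newer releases accept `application/dql`).

```shell
# Warm the cache during a quiet hour: a count query over a predicate
# forces its postings to be read, pulling them into memory.
curl -s -H "Content-Type: application/dql" localhost:8080/query -d '{
  warmup(func: has(name)) {
    count(uid)
  }
}'
```

Repeating this for each predicate (scripted over the schema) keeps the working set hot before traffic arrives.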

What you wanted to do

For our unified graph, we actually wanted one single DB for operations and analytics in one cluster.
But we had some trouble with Dgraph and some hard questions about scalability that are still unanswered, so we moved forward.

Why that wasn’t great, with examples

We simply could not figure out whether Dgraph will sustain its performance beyond a few billion predicates. There are no case studies of very large deployments, which suggests there are most likely none.

What you actually did

We decided to factor out analytics and integrate a secondary in-memory graph that scales all the way up to 500 billion ops per second. Since that alternative already ships with all the important graph-analytics algorithms, it made sense.

What would be the greatest possible solution?

Please consider adding a persistent in-memory mode to Dgraph.
Persistent in-memory means the entire graph is in memory AND synced to disk, so you get both the speed of querying everything from memory AND the safety of having all data on disk in case a node goes down or needs a restart, say for an update.

Preserve “correctness” and transactional safety but improve performance with an option to hold the entire graph in memory and sync memory to disk. A bit like Redis-graph, but actually useful.

I know one USP of Dgraph is exactly that you can manage graphs far bigger than available memory, because it keeps only the keys in memory and stores the actual values on disk. Sometimes, though, you either don’t have a graph that big or, equally possible, you have a lot of memory, so you can hold the entire graph in memory. Because memory prices have been falling for years, you get so much cheap memory on GCP that in-memory analytics becomes a no-brainer.

We are actually moving all analytics to a persistent in-memory graph because it’s actually 30% cheaper than an equivalent GPU-accelerated column-based store, but equivalent in performance on most tasks. In terms of system complexity, in-memory is way simpler to handle and scales much more easily, because allocating memory or adding more high-mem nodes can be done with auto-scaling in Kubernetes.

As a rule of thumb, you need twice the memory relative to the graph size, plus roughly 20% for the system. If a graph is about 150 GB in size, you need at least 300 GB of memory plus some 50 GB reserved for the system, so 350 GB in total. With predicate compression, however, it is certainly feasible to pack twice the graph size into the same memory, and that would be a huge additional gain on top of the performance improvement. Obviously, memory usage is high, but we are fine with that.
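The sizing arithmetic above can be sketched as a small helper. This is purely illustrative (not part of Dgraph); the factor-of-two headroom and the fixed system reserve are the rule-of-thumb numbers from this post.

```python
def required_memory_gb(graph_size_gb, headroom_factor=2.0, system_reserve_gb=50):
    """Rule-of-thumb memory for an in-memory graph deployment:
    twice the graph size, plus a fixed reserve for the system."""
    return graph_size_gb * headroom_factor + system_reserve_gb

# Example from the post: a 150 GB graph needs 2 * 150 + 50 = 350 GB.
print(required_memory_gb(150))  # -> 350.0
```

With predicate compression halving the in-memory footprint, the same 350 GB budget could hold a graph of roughly twice the size.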

Any external references to support your case

Trillion Triple Benchmark:
https://dzone.com/articles/got-a-minute-check-out-anzographs-record-shatterin

danielmai commented:

An in-memory option is available today. By default, Dgraph mmaps LSM tree tables. You can set this to in-memory by setting the flag --badger.tables=ram.
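For reference, that flag would be passed when starting the Alpha process. This is a sketch: the Zero address is a placeholder, and the exact flag syntax varies between Dgraph versions (newer releases group Badger options into a superflag).

```shell
# Start a Dgraph Alpha with Badger's LSM tables held in RAM instead of mmap.
# This changes how tables are loaded for reads; it does not by itself
# change what is persisted to disk.
dgraph alpha --badger.tables=ram --zero=localhost:5080
```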


marvin-hansen commented:

@danielmai Hang on, does the flag --badger.tables=ram sync memory to disk?

I was talking about holding the graph in memory AND syncing memory to disk so you get the speed from memory AND keep your data safe on disk. Otherwise, it’s only volatile and, I guess, nobody wants that for serious data…

Hi. Just picking this thread back up. Is the in-memory option available? Can you point me to the relevant Dgraph docs? While Dgraph uses Badger, which has an in-memory mode, I am not sure how to configure it via Dgraph.

Also, just a question to add to this. Can I use Dgraph in-memory as a replacement for redis? So that I can do graph queries in memory? Thanks.

When running in-memory and distributed, does that add failover? Since it doesn’t (appear to) sync to disk in that case, is just running multiple nodes a way of combating data loss for long-running (weeks) jobs?