Moved from GitHub dgraph/4678
Posted by marvin-hansen:
When working on large datasets, queries & analytics become an apparent pain point.
What you wanted to do
On another graph DB, query performance bogged down quite early. On Dgraph, the performance was roughly 10x better, but still not exactly great, mainly due to CPU-bound operations.
What you actually did
Data sampling. Essentially, I sampled a smaller dataset, did my queries & data pre-processing against it, and once the pipeline reached the end of the line, ran it on the full dataset.
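The workflow can be sketched roughly like this (the function names, sampling fraction, and placeholder pipeline are illustrative assumptions, not from my actual setup):

```python
import random

def sample_dataset(records, fraction=0.01, seed=42):
    """Draw a uniform random sample so the pipeline can be iterated on quickly."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

def run_pipeline(records):
    """Placeholder for the queries & data pre-processing steps."""
    return len(records)

full_data = list(range(1_000_000))

# develop and debug against the small sample...
sample = sample_dataset(full_data)
result = run_pipeline(sample)

# ...and only once the pipeline is final, run it on the full dataset
final = run_pipeline(full_data)
```

The point is that every change to the pipeline is cheap to test on the sample, while the expensive full-dataset run happens only at the very end.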
Why that wasn’t great, with examples
Data sampling comes with multiple issues:
- You can sample for distribution similarity or for variance similarity, but not both.
- On complex data, you lose a lot of data resolution, so your sample isn’t very representative.
- You aren’t escaping the re-run pain; you are just kicking it further down the road.
- Sampling time data isn’t particularly useful due to shifting properties, so you end up with sliding-window sampling on a sub-set, but that means you are back to square one.
- On graph data, sub-graphs usually represent only a sub-set of the main graph, but most useful graph algorithms need the complete graph to deliver useful results. That, however, usually takes a couple of hours.
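The sliding-window workaround mentioned for time data can be sketched as follows (the window and step sizes are illustrative assumptions):

```python
from collections import deque

def sliding_windows(stream, window_size=100, step=50):
    """Yield overlapping windows over a time-ordered stream, so each
    sample reflects the most recent (shifting) properties of the data."""
    window = deque(maxlen=window_size)
    for i, item in enumerate(stream, start=1):
        window.append(item)
        # emit a window once it is full, then again every `step` items
        if i >= window_size and (i - window_size) % step == 0:
            yield list(window)

windows = list(sliding_windows(range(300)))
```

Each window is itself just another sub-set that the pipeline has to be re-run on, which is exactly the "back to square one" problem above.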
What would be a truly great way to solve this?
GPU-accelerated queries & analytic functions, as these are up to 1,000X faster than CPU-based queries.
BrytlytDB crunches 1.1 billion rows (500 GB) in between 0.005(!) and 0.188 seconds on a 5-node IBM cluster equipped with 20 Nvidia P100 GPUs.
MapD (now OmniSciDB) completes the same task slightly slower, but still under 1 second, with 8 Pascal Titan X cards.
Both are about 250X faster than Postgres, which means GPU-accelerated queries deliver very real performance gains. To the best of my knowledge, this is enough speedup to run complex queries and analytic tasks on a full dataset and a complete graph.
However, Blazegraph, one of the very few available GPU-accelerated graph databases, was acquired by Amazon and is said to be the foundation of Amazon Neptune.
For graphs, a multi-GPU solution is about 700-1800X faster than CPU on analytics and, on average, 156X faster than the non-GPU version on selected queries. This means you can realize a roughly 150X performance improvement for an existing graph database just by adding a GPU. With the reasonably priced T4 GPU offered by all leading cloud providers, affordable hardware is certainly available. Considering the expected performance bump with the upcoming GeForce 30X series, even consumer GPUs should deliver massive acceleration.
These kinds of massive performance gains would be an excellent addition to the Enterprise edition.
Any external references to support your case
MapD / OmniSciDB
Graphs on GPUs
Blazegraph GPU accelerated Graph DB