A colleague and I have been experimenting with Dgraph and have had some bad experiences. I am hoping someone can tell us what we're doing wrong, or suggest ways to avoid shooting ourselves in the foot as we no doubt have been doing.
The goal was to load some image data into a single-node Alpha setup running on k8s, using the definition from the docs:
https://github.com/dgraph-io/dgraph/blob/master/contrib/config/kubernetes/dgraph-single.yaml
Our worker nodes were originally i2.xlarge, but we were clearly CPU-bound, so I changed them to c5d.2xlarge. Alpha's data dir was stored on local SSD, and I adjusted the LRU cache to be about half of RAM (8192 MB).
Here is our schema:
<bytes>: int @index(int) .
<s3etag>: string @index(hash) .
<s3key>: string @index(hash, trigram) .
<updatedAt>: datetime @index(hour) .
<_type>: string @index(hash) .
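(For reference, here is a minimal sketch of how this schema could be applied from the Go client (dgo), in case that helps with reproducing. The import paths vary by dgo version, and the dgraph-public:9080 address is an assumption based on the single-node manifest, not something from our actual setup.)

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo"            // import path varies by dgo version
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Connect to the lone Alpha over gRPC.
	conn, err := grpc.Dial("dgraph-public:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Apply the schema above in a single Alter call.
	op := &api.Operation{Schema: `
		<bytes>: int @index(int) .
		<s3etag>: string @index(hash) .
		<s3key>: string @index(hash, trigram) .
		<updatedAt>: datetime @index(hour) .
		<_type>: string @index(hash) .
	`}
	if err := dg.Alter(context.Background(), op); err != nil {
		log.Fatalf("schema alter failed: %v", err)
	}
}
```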
Our importer is relatively simple and can be viewed at:
We ran our importer from another machine in the same VPC/subnet as the k8s cluster. The importer machine was not CPU- or memory-bound while the import was running.
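In lieu of the link, here is a rough sketch of the batching pattern the importer follows using the Go client. This is not the actual code: the channel-based record source and the example field values are just illustrative.

```go
package main

import (
	"context"
	"encoding/json"

	"github.com/dgraph-io/dgo"            // same version caveat as the schema sketch
	"github.com/dgraph-io/dgo/protos/api"
)

// loadBatches buffers incoming records into batches of 1000 and commits each
// batch in its own transaction with CommitNow set.
func loadBatches(ctx context.Context, dg *dgo.Dgraph, records <-chan map[string]interface{}) error {
	const batchSize = 1000
	batch := make([]map[string]interface{}, 0, batchSize)

	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		payload, err := json.Marshal(batch)
		if err != nil {
			return err
		}
		txn := dg.NewTxn()
		defer txn.Discard(ctx) // no-op if the commit below succeeded
		if _, err := txn.Mutate(ctx, &api.Mutation{SetJson: payload, CommitNow: true}); err != nil {
			return err
		}
		batch = batch[:0]
		return nil
	}

	for rec := range records {
		batch = append(batch, rec) // e.g. {"_type": "Image", "s3key": ..., "s3etag": ..., "bytes": ...}
		if len(batch) == batchSize {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush() // commit any trailing partial batch
}
```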
Our incredibly basic test query, run via Ratel, is:
{
  images(func: eq(_type, "Image"), first: 10) {
    s3key
  }
}
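As an aside, the latency figure in the bullets below was observed through Ratel. To rule out browser/Ratel overhead we could also time the same query directly against the Alpha's gRPC endpoint, roughly like this (a sketch again; the address is an assumption):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("dgraph-public:9080", grpc.WithInsecure()) // address is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	const q = `{
	  images(func: eq(_type, "Image"), first: 10) {
	    s3key
	  }
	}`

	ctx := context.Background()
	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	start := time.Now()
	resp, err := txn.Query(ctx, q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("wall clock: %v\n", time.Since(start))
	fmt.Printf("server-reported latency: %+v\n", resp.Latency)
	fmt.Printf("result: %s\n", resp.Json)
}
```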
Bullet points of our experience:
- Typically we were seeing approximate load speeds of 1-2 batches/sec, where each batch is 1000 records. Take this as a ballpark number; we did not go to any lengths to get accurate measurements.
- We were able to load about 8.5G of on-disk data, or ~2.2M nodes and 0 edges, before the importer would disconnect or error out (see the retry sketch after the log excerpt below).
- Our test query returned successfully, but with ~18 seconds of latency.
- During the load we had a lot of difficulty applying schema changes while the importer was running. We generally tried to do it ahead of time, but would occasionally do it during the load intentionally to see what would happen. At those times the schema changes seemed to hang, at least from Ratel ('kubectl logs' indicated some succeeded, while others never seemed to apply).
- We had at least one instance where Dgraph bogged down the k8s worker node to the point where the k8s node readiness checks began reporting "not ready" and ssh became mostly unresponsive. It was ultimately killed by the OOM killer.
- Subsequently we set a 14Gi memory limit on the Alpha container and got much further than before (without the limit the max we hit was ~1.2M nodes; with the limit ~2.2M nodes were created), but things blew up when Alpha lost its connection to Zero. The errors we saw were:
W0304 16:46:59.809346 1 groups.go:751] No membership update for 10s. Closing connection to Zero.
E0304 16:47:22.745482 1 pool.go:206] Echo error from dgraph-0.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = DeadlineExceeded desc = context deadline exceeded
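One thing we could presumably improve on the client side is retrying a failed batch instead of giving up on the first error. Here is a sketch of the retry-with-backoff wrapper we are considering adding around each batch commit; the attempt count and delays are arbitrary placeholders, not anything we have validated.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/dgraph-io/dgo"
	"github.com/dgraph-io/dgo/protos/api"
)

// mutateWithRetry commits one JSON batch, retrying with simple exponential
// backoff when the commit fails (e.g. aborted transactions or the kind of
// connection errors shown above).
func mutateWithRetry(ctx context.Context, dg *dgo.Dgraph, payload []byte) error {
	const maxAttempts = 5
	delay := 500 * time.Millisecond

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		txn := dg.NewTxn()
		_, err = txn.Mutate(ctx, &api.Mutation{SetJson: payload, CommitNow: true})
		_ = txn.Discard(ctx) // safe to call after a CommitNow mutation
		if err == nil {
			return nil
		}
		log.Printf("batch attempt %d/%d failed: %v", attempt, maxAttempts, err)
		time.Sleep(delay)
		delay *= 2
	}
	return err
}
```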
All this said, our approach surely had some naive aspects and was suboptimal. Was the "1 server" model inappropriate for basic testing? What would be a more reasonable test? Why was our experience so unexpectedly bad?
TIA!