Rough times kicking the tires with 1 server and a simple importer

A colleague and I have been experimenting with dgraph and have had some bad experiences. I am hoping someone can tell us what we’re doing wrong or suggest ways we could avoid shooting ourselves in the foot as we no doubt have been doing.

The goal was to load some image data into a 1-node alpha setup running on k8s using the definition from the docs:

Our worker nodes were originally i2.xlarge instances, but we were clearly CPU-bound, so I changed to c5d.2xlarge. Alpha’s data dir was stored on local SSD, and I adjusted the LRU cache to about half of RAM (8192 MB).

Here is our schema:

<bytes>: int @index(int) .
<s3etag>: string @index(hash) .
<s3key>: string @index(hash, trigram) .
<updatedAt>: datetime @index(hour) .
<_type>: string @index(hash) .

Our importer is relatively simple and can be viewed at:

We ran our importer from another machine in the same vpc/subnet as the k8s cluster. The importer machine was not cpu or memory bound while the importer was running.

Our incredibly basic test query, run via ratel, is:

  images(func: eq(_type, "Image"), first: 10) {

Bullet points of our experience:

  • Typically we saw load speeds of roughly 1-2 batches/sec, where each batch is 1000 records. Take this as a ballpark figure; we did not go to any lengths to get accurate measurements.
  • We were able to load about 8.5G of on-disk data, or ~2.2M nodes and 0 edges before the importer would disconnect / error.
  • Our test query returned successfully, but with ~18 seconds of latency.
  • During the load we had a lot of difficulty applying schema changes while the importer was running. We generally tried to do them ahead of time, but would occasionally apply one mid-load intentionally to see what would happen. At those times the schema changes seemed to hang, at least from Ratel (‘kubectl logs’ indicated some succeeded, while others never seemed to apply).
  • We had at least one instance where dgraph caused the k8s worker node to bog down to the point where k8s node readiness checks began reporting “not ready” and ssh became mostly unresponsive. It was ultimately killed by the OOM killer.
  • Subsequently we set a 14Gi memory limit on the alpha container and got much further than before (without the limit the max we hit was ~1.2M nodes; with the limit, ~2.2M nodes were created), but things exploded when alpha lost its connection to zero. The error we saw was:

W0304 16:46:59.809346 1 groups.go:751] No membership update for 10s. Closing connection to Zero.
E0304 16:47:22.745482 1 pool.go:206] Echo error from dgraph-0.dgraph.default.svc.cluster.local:5080. Err: rpc error: code = DeadlineExceeded desc = context deadline exceeded

All this said, surely our approach had some naiveties and was suboptimal. Was the “1 server” model inappropriate for basic testing? What would be a more reasonable test? Why was our experience so unexpectedly bad?


I’d do a few things:

  1. If you’re just starting out, it’s better to run Dgraph directly on a server instead of on k8s. The operational complexity of k8s compounds the learning challenge of a new DB. You can run it on your laptop, or on a desktop, or just get an instance from AWS, and run both Zero and Alpha there on the same server. You’d get direct access to the logs as they roll in, and the learning curve would be a lot smoother.

  2. Dgraph is built around concurrency and persistence. Persistence adds latency because we need to write to the Raft WAL, and a commit then causes a disk write to the actual DB. So, instead of doing write txns serially, run about 10 of them at a time (each with a batch size of 1000); if there are aborts due to conflicts, you can reduce the concurrency. Start with 10 and back off if you see aborts, rather than starting with just 1 pending write txn at a time. The live loader has squeezed out 120K records per second with my latest changes: Optimize XidtoUID map used by live and bulk loader (#2998) · dgraph-io/dgraph@b718993 · GitHub

  3. What version are you on? We’re releasing v1.0.12 today, so try with that.

  4. If you don’t care too much about resilience while starting out, you can put the w directories on /tmp. That speeds things up a lot, again because persistence can be a bottleneck (though in your case, the bottleneck is definitely the lack of concurrency).
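Point 2 can be sketched in Python. This is a hypothetical outline of the pattern only: `commit_batch` here is a stand-in for opening a transaction, mutating ~1000 records, and committing with a real Dgraph client (e.g. pydgraph or dgo); the simulated abort rate and names are illustrative.

```python
import concurrent.futures
import random

class TxnAborted(Exception):
    """Stand-in for the abort error a real Dgraph client raises on txn conflict."""

def commit_batch(batch):
    # Stand-in for: open txn, mutate up to 1000 records, commit.
    # A small random abort rate simulates occasional write conflicts.
    if random.random() < 0.05:
        raise TxnAborted()
    return len(batch)

def load(records, batch_size=1000, concurrency=10, max_retries=5):
    """Commit batches with up to `concurrency` write txns in flight at once."""
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]

    def commit_with_retry(batch):
        # Aborted txns are expected under contention; just retry the batch.
        for _ in range(max_retries):
            try:
                return commit_batch(batch)
            except TxnAborted:
                continue
        raise TxnAborted("batch kept aborting; consider lowering concurrency")

    total = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        for n in pool.map(commit_with_retry, batches):
            total += n
    return total

print(load(list(range(10_000))))  # 10 batches of 1000, 10 txns in flight
```

If aborts keep happening even with retries, dial `concurrency` down rather than dropping straight to serial writes.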
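Points 1 and 4 together might look like the following. This is a hedged sketch of a single-box setup, not an official recipe: flag names are from the v1.0.x CLI (-w for the Raft WAL directory, -p for postings, --lru_mb for the cache), and in v1.0.x the server subcommand was `dgraph server` (later renamed `dgraph alpha`).

```shell
# One Zero and one Alpha on a single machine, with both Raft WAL
# directories placed on /tmp per point 4 (faster, less durable).

dgraph zero --my=localhost:5080 -w /tmp/zw &

dgraph server --my=localhost:7080 --zero=localhost:5080 \
    -p ./p -w /tmp/w --lru_mb=8192 &
```

With everything on one box you can watch both sets of logs directly instead of going through `kubectl logs`.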

> You can run it on your laptop, or on a desktop, or just get an instance from AWS, and run both Zero and Alpha there on the same server.

I’d be interested in some kind of “hobby deploy” setup that could be trivially deployed to a single cheap instance on heroku/digitalocean/aws without any ops chops required (that could get us non-devops folk experimenting freely until a hosted solution is available). I’m keen to try out dgraph but I’m a little short on time, it would be nice to jump straight to writing mutations and queries for a side project or two.

I typically use Docker Compose for testing purposes, running on my desktop. So much so that we now have a compose tool that spits out a docker-compose.yml file based on how many instances you need to run. I find dealing with Docker a lot simpler than k8s.
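For reference, a generated file for a single Zero + server + Ratel looks roughly like this. This is a hypothetical sketch, not the tool’s exact output; the image tag, ports, and lru_mb value are illustrative.

```yaml
version: "3.2"
services:
  zero:
    image: dgraph/dgraph:v1.0.12
    command: dgraph zero --my=zero:5080
    ports:
      - 5080:5080
      - 6080:6080
  server:
    image: dgraph/dgraph:v1.0.12
    command: dgraph server --my=server:7080 --lru_mb=2048 --zero=zero:5080
    ports:
      - 8080:8080
      - 9080:9080
  ratel:
    image: dgraph/dgraph:v1.0.12
    command: dgraph-ratel
    ports:
      - 8000:8000
```

`docker-compose up` then gives you Ratel on localhost:8000 and the gRPC endpoint on localhost:9080, with everything torn down by a single `docker-compose down`.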

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.