Docker Swarm setup

I’ve managed to get a cluster going with Docker Swarm. It’s set up and operating really well. I have a question about setting up my AWS endpoints to leverage the cluster: should I create an ELB and distribute requests to all three instances, or should I point my main entry at the master?

I’m not sure how Dgraph distributes itself.

Also, regarding dgraph-ratel: I noticed that there’s no constraint in the docker-compose.yaml pinning it to any particular server. I’m wondering how to expose it at a consistent endpoint.

Thanks.

There is no master. If multiple nodes are running in the same RAFT group, you can write to and read from any of them, so yes, a load balancer should work well.
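For example (a sketch assuming Dgraph’s default HTTP port 8080 and its /query endpoint; the node address is a hypothetical placeholder), the same query can be sent to any of the servers:

```
# Any replica in the RAFT group can serve both reads and writes.
# <node-ip> is any one of the three instances (placeholder).
curl -X POST http://<node-ip>:8080/query \
  -d '{ q(func: uid(0x1)) { uid } }'
```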

What is the value of the --replicas flag that you are using to run Zero, and how many Dgraph servers are you using? I can try to explain based on that.

You could add a constraint and keep it on only one of the nodes, or you could run it on all of them; it depends on what you are looking for.
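If you want Ratel on a fixed node, a placement constraint like the one used for Zero would work. A minimal sketch (the hostname is a hypothetical placeholder):

```yaml
  ratel:
    image: dgraph/dgraph:latest
    ports:
      - 8000:8000
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          # hypothetical hostname; pins Ratel to a single node
          - node.hostname == manager-1
    command: dgraph-ratel
```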

I’m using the docker-compose.yaml file listed in the Deploy section of the site: https://docs.dgraph.io/deploy/#using-docker-swarm — the same exact setup, just different hostnames. So basically 3 replicas.

Ok great, I’ll adjust the load balancer to point to all three instances. :slight_smile: Very cool.

Thanks Pawan!

I’d suggest using the dgraph/dgraph:nightly image, as it has some fixes with regard to replicas.

Thanks Pawan. I’m not comfortable using nightly in production right now so we’ll stay on latest until the next version is available.

I’m reviewing the docker-compose file that is listed on the Deploy section. Here’s my version of it:

version: "3"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:latest
    volumes:
      - data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    command: dgraph zero --my=zero:5080 --replicas 3
  server_1:
    image: dgraph/dgraph:latest
    hostname: "server_1"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    command: dgraph server --my=server_1:7080 --memory_mb=17192 --zero=zero:5080
  server_2:
    image: dgraph/dgraph:latest
    hostname: "server_2"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-2
    command: dgraph server --my=server_2:7081 --memory_mb=17192 --zero=zero:5080 -o 1
  server_3:
    image: dgraph/dgraph:latest
    hostname: "server_3"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-3
    command: dgraph server --my=server_3:7082 --memory_mb=17192 --zero=zero:5080 -o 2
  ratel:
    image: dgraph/dgraph:latest
    hostname: "ratel"
    ports:
      - 8000:8000
    networks:
      - dgraph
    command: dgraph-ratel
volumes:
  data-volume:

I have an ELB set up to spread requests across each instance. I need to configure health checks from the balancer. I believe the health check is available at http://address:8080/health. Only AP-GRAPH-1 responds to that. It looks like the second and third servers are running on 8082 and 8083.

Am I taking the right approach here?

Nightly has bug fixes on top of the last release and should be safe to use. It is also recommended because some bugs related to replicas were fixed. We will do a release early next week.

Yeah, the second and third servers are running on ports 8081 and 8082, as can be seen from the config. Docker Swarm doesn’t allow multiple services to publish the same port, unfortunately. Kubernetes does a better job at this.
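Given the published ports above, each server’s health endpoint can be checked individually (a sketch using the hostnames from the compose file; note that Swarm’s routing mesh may also make a published port reachable through any node):

```
curl http://AP-GRAPH-1:8080/health   # server_1
curl http://AP-GRAPH-2:8081/health   # server_2 (port offset -o 1)
curl http://AP-GRAPH-3:8082/health   # server_3 (port offset -o 2)
```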

Hmmmm I’ve pointed my ELB to the following:

AP-GRAPH1 -> GET ip:8080/health
AP-GRAPH2 -> GET ip:8081/health
AP-GRAPH3 -> GET ip:8082/health

Only AP-GRAPH-1 responds with an OK. The other two replicas do not. Is this normal?

I’m going to update to the nightly right now.

Also, a side note: when I initialize the cluster with nightly, the services don’t seem to come online. They just remain at 0/1 no matter how long I wait. Maybe the nightly build requires a different configuration?

memory_mb was changed to lru_mb in master. I can check on the health check tomorrow.

I just changed memory_mb and am having the same issue. Here’s my compose file for reference:

version: "3"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:nightly
    volumes:
      - data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    restart: on-failure
    command: dgraph zero --my=zero:5080 --replicas 3
  server_1:
    image: dgraph/dgraph:nightly
    hostname: "server_1"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    restart: on-failure
    command: dgraph server --my=server_1:7080 --lru_mb=17192 --zero=zero:5080
  server_2:
    image: dgraph/dgraph:nightly
    hostname: "server_2"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-2
    restart: on-failure
    command: dgraph server --my=server_2:7081 --lru_mb=17192 --zero=zero:5080 -o 1
  server_3:
    image: dgraph/dgraph:nightly
    hostname: "server_3"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-3
    command: dgraph server --my=server_3:7082 --lru_mb=17192 --zero=zero:5080 -o 2
  ratel:
    image: dgraph/dgraph:nightly
    hostname: "ratel"
    ports:
      - 8000:8000
    networks:
      - dgraph
    restart: on-failure
    command: dgraph-ratel
volumes:
  data-volume:

There is no image named dgraph/dgraph:nightly. The image is dgraph/dgraph:master.

You can inspect the error message by running docker stack ps dgraph --no-trunc.

I changed nightly to master and removed the restart: on-failure line. The Docker Swarm manager automatically handles the restart for you if a container goes down, so it should not be part of the config file. The restart key is supported by docker-compose, which is different from Docker Swarm. Here is the updated file.

version: "3"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:master
    volumes:
      - data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    command: dgraph zero --my=zero:5080 --replicas 3
  server_1:
    image: dgraph/dgraph:master
    hostname: "server_1"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    command: dgraph server --my=server_1:7080 --lru_mb=17192 --zero=zero:5080
  server_2:
    image: dgraph/dgraph:master
    hostname: "server_2"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-2
    command: dgraph server --my=server_2:7081 --lru_mb=17192 --zero=zero:5080 -o 1
  server_3:
    image: dgraph/dgraph:master
    hostname: "server_3"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-3
    command: dgraph server --my=server_3:7082 --lru_mb=17192 --zero=zero:5080 -o 2
  ratel:
    image: dgraph/dgraph:master
    hostname: "ratel"
    ports:
      - 8000:8000
    networks:
      - dgraph
    command: dgraph-ratel
volumes:
  data-volume:

You can see all the running services using docker service ls and you can see logs from a service using something like docker logs -f <service_name>.

Ok, that worked beautifully. BTW, the command for logs is docker service logs <service_name>.
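For reference, the inspection commands that came up in this thread (assuming the stack was deployed under the name dgraph):

```
docker service ls                       # list services and their replica counts
docker service logs -f dgraph_server_1  # follow logs for one service
docker stack ps dgraph --no-trunc       # full error messages for failed tasks
```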

I’m still wondering about setting up a load balancer. Because Docker Swarm forces each service onto a different published port, I’m not sure a typical AWS ELB can point to them individually. I can set up a target group with a port specific to the service, but the ELB itself doesn’t seem to translate the incoming port 9080 to the specific target port.

I suspect using another proxy method would be the solution but I wanted to ask in case I’m setting the ELB up incorrectly.
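One possible approach (a hedged sketch, not something from the Dgraph docs) is to run a TCP proxy such as HAProxy in front of the cluster, mapping a single client-facing port to the per-node ports:

```
# HAProxy sketch: one frontend port fanning out to the per-node gRPC ports.
frontend dgraph_grpc
    bind *:9080
    mode tcp
    default_backend dgraph_servers

backend dgraph_servers
    mode tcp
    balance roundrobin
    server s1 AP-GRAPH-1:9080 check
    server s2 AP-GRAPH-2:9081 check
    server s3 AP-GRAPH-3:9082 check
```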

Aside from that just a couple points about mistakes on the Deploy section of the website:

  1. In the Ports used by different nodes section, the server address is listed as 9090, but everywhere else in the same doc it’s 9080.
  2. In the Docker section, it says to list containers with docker-machine ps; it should be docker-machine ls.
  3. I don’t believe the Deploy section mentions the suggestion to set memory to half the machine’s capacity. It’s listed elsewhere in the docs, but it would be good for new users to see it right in Deploy. As a matter of fact, explaining why the LRU memory setting is important might be helpful.

Thanks

Hmmm also noticed that dgraph_server_2 and dgraph_server_3 are throwing errors similar to this:

dgraph_server_2.1.u4kina8qwx3s@AP-GRAPH-2    | 2018/04/16 16:16:24 pool.go:158: Echo error from zero:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
dgraph_server_2.1.u4kina8qwx3s@AP-GRAPH-2    | 2018/04/16 16:16:27 groups.go:105: Error while connecting with group zero: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

I don’t think the latest branch was doing this. Are we missing a detail in the configuration you supplied?

Hi, just bumping up this request. Still can’t seem to get the other two nodes stable.

Those are transient errors. I just tried with dgraph/dgraph:latest, since we released v1.0.5, and the cluster seems to work fine.

What makes you believe that the other two nodes are not stable? Are they not responding to queries/mutations?

Regarding the ELB, maybe have a look at Docker for Swarm. From the link I can see

Elastic Load Balancers (ELBs) are set up to help with routing traffic to your swarm.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.