Issue with Swarm and second replica failing

I’m having a problem where the second instance of my Dgraph server connects but then subsequently fails. I’m using Docker Swarm on AWS (so these are EC2 instances):

version: "3"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:latest
    volumes:
      - data-volume:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    command: dgraph zero --my=zero:5080 --replicas 2
  server_1:
    image: dgraph/dgraph:latest
    hostname: "server_1"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
    command: dgraph server --my=server_1:7080 --lru_mb=17192 --zero=zero:5080
  server_2:
    image: dgraph/dgraph:latest
    hostname: "server_2"
    volumes:
      - data-volume:/dgraph
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-2
    command: dgraph server --my=server_2:7081 --lru_mb=17192 --zero=zero:5080 -o 1
  ratel:
    image: dgraph/dgraph:latest
    hostname: "ratel"
    ports:
      - 8000:8000
    networks:
      - dgraph
    command: dgraph-ratel
    deploy:
      placement:
        constraints:
          - node.hostname == AP-GRAPH-1
volumes:
  data-volume:
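
For context, I deploy the stack from the Swarm manager roughly like this (the file name and the stack name dgraph are just what I happen to use) and then check where the tasks landed:

docker stack deploy -c docker-compose.yml dgraph
docker service ls                     # each service should eventually show 1/1 replicas
docker service ps dgraph_server_2     # shows which node server_2 was scheduled on and whether it keeps restarting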

The errors I’m seeing inside the Docker container of server_2 are:

[centos@AP-GRAPH-2 ~]$ docker logs 312fb793bc2a
2018/06/02 15:54:37 groups.go:88: Current Raft Id: 0
2018/06/02 15:54:37 worker.go:99: Worker listening at address: [::]:7081
2018/06/02 15:54:37 gRPC server started.  Listening on port 9081
2018/06/02 15:54:37 HTTP server started.  Listening on port 8081
2018/06/02 15:54:57 pool.go:158: Echo error from zero:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/06/02 15:54:57 pool.go:108: == CONNECT ==> Setting zero:5080
2018/06/02 15:54:57 groups.go:105: Error while connecting with group zero: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/06/02 15:55:17 groups.go:105: Error while connecting with group zero: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/06/02 15:55:17 pool.go:158: Echo error from zero:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
2018/06/02 15:55:17 pool.go:158: Echo error from zero:5080. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure

Really not sure what’s going on. Note that, to rule out a security group issue, I’ve allowed ALL ports from anywhere to access the boxes (for now at least).
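
In case it helps, this is roughly how I’ve been trying to sanity-check service discovery and connectivity to Zero from inside the server_2 container (assuming getent and nc are actually available in the dgraph/dgraph image; the container ID is the one from the logs above):

docker exec -it 312fb793bc2a sh -c 'getent hosts zero'   # does the overlay network's DNS resolve the zero service?
docker exec -it 312fb793bc2a sh -c 'nc -zv zero 5080'    # can we open a TCP connection to Zero's gRPC port?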

Note that I’ve tried dgraph:master as well to see if it made a difference. Didn’t seem like it.

This might be due to the data volume causing conflicts in Badger’s directory locking: in your file, zero and server_1 both mount data-volume at /dgraph and are both constrained to AP-GRAPH-1, so they end up sharing the same Badger directories on that host.

Try the file below instead. Also, note that you should use either 1 or 3 replicas; with 2 replicas a Raft group needs both members to form a majority, so losing either one can easily cause issues in the cluster.

# This file can be used to set up a Dgraph cluster with 3 Dgraph servers and 1 Zero node on a
# Docker Swarm with replication.
# It expects three virtual machines (host1, host2 and host3 in the docs example) to be part of
# the swarm.

# In the full example from the docs, data is persisted to a docker volume called data-volume on
# each machine and a placement constraint pins each Dgraph server to a particular host; I've
# left both out of this copy so the shared volume can be ruled out first.
# Run `docker stack deploy -c docker-compose-multi.yml <stack-name>` on the Swarm leader to start the cluster.

version: "3.2"
networks:
  dgraph:
services:
  zero:
    image: dgraph/dgraph:latest
    ports:
      - 5080:5080
      - 6080:6080
    networks:
      - dgraph
    command: dgraph zero --my=zero:5080 --replicas 3
  server_1:
    image: dgraph/dgraph:latest
    hostname: "server_1"
    ports:
      - 8080:8080
      - 9080:9080
    networks:
      - dgraph
    command: dgraph server --my=server_1:7080 --lru_mb=2048 --zero=zero:5080
  server_2:
    image: dgraph/dgraph:latest
    hostname: "server_2"
    ports:
      - 8081:8081
      - 9081:9081
    networks:
      - dgraph
    command: dgraph server --my=server_2:7081 --lru_mb=2048 --zero=zero:5080 -o 1
  server_3:
    image: dgraph/dgraph:latest
    hostname: "server_3"
    ports:
      - 8082:8082
      - 9082:9082
    networks:
      - dgraph
    command: dgraph server --my=server_3:7082 --lru_mb=2048 --zero=zero:5080 -o 2
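
A minimal way to bring this up and check that all three servers have actually registered with Zero (the stack name dgraph is arbitrary, and the last line assumes Zero's HTTP /state endpoint is reachable on the published port 6080):

docker stack deploy -c docker-compose-multi.yml dgraph
docker service ls                  # each service should show 1/1 replicas once the images are pulled
curl -s localhost:6080/state       # Zero reports the cluster state, including which servers joined which group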

Update: I reckon the data-volume issue I’m seeing might just be because I’m running this on a single computer, while you are running it on multiple servers in AWS. There’s not enough information in the logs to diagnose why one of the servers crashed, so try setting --replicas to 3.


David, what type of instances are you using, and which AMI?

We are attempting to use c5d.2xlarge instances and mount the NVMe drive at /mnt/dgraph, but we are having permission issues with Dgraph.

We are using the latest AWS AMI: Amazon Linux 2 LTS Candidate 2 AMI (HVM), SSD Volume Type
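
In case it’s useful, this is roughly how we are preparing the NVMe drive on each instance before starting the containers (the device name /dev/nvme1n1 and the ext4 filesystem are specific to our setup, and the chmod is only a blunt temporary step to rule permissions out while debugging):

sudo mkfs -t ext4 /dev/nvme1n1        # format the instance-store NVMe drive
sudo mkdir -p /mnt/dgraph
sudo mount /dev/nvme1n1 /mnt/dgraph
sudo chmod -R 777 /mnt/dgraph         # temporary: rule out ownership/permission errors for the dgraph process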

Update: pay attention to your Docker and Docker Compose versions. A similar issue ("Docker-Compose issue transient failure") was solved just by upgrading Docker Compose.

It is probably unrelated, but it is worth checking.
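
For reference, the versions can be checked like this (the pip line is just one possible upgrade path; it depends on how Docker Compose was installed):

docker version --format '{{.Server.Version}}'
docker-compose version
sudo pip install --upgrade docker-compose   # only if docker-compose was originally installed via pip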
