Why does the alpha node still try to connect to its own 5080 port when we have specified a zero node?

Report a Dgraph Bug

What version of Dgraph are you using?

Dgraph Version
$ dgraph version
 
Dgraph version   : v20.11.0-g98f68280e
Dgraph codename  : unnamed-mod
Dgraph SHA-256   : 231757777e8b15164caff0a8655b84c1b2f75616c404841171266cb6047c1cf9
Commit SHA-1     : 98f68280e
Commit timestamp : 2021-03-29 21:34:28 +0530
Branch           : master
Go version       : go1.15.9
jemalloc enabled : true

Have you tried reproducing the issue with the latest release?

Only tried the latest commit.

What is the hardware spec (RAM, OS)?

CentOS 7.6

Steps to reproduce the issue (command/config used to run Dgraph).

Use machine A as the zero node, and another machine B as the alpha node, with A as its zero node.
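
For reference, a minimal sketch of that two-machine setup, using the addresses and flags that appear later in this thread:

# on machine A (192.168.3.2): run the zero node
dgraph zero --bindall=true --my 192.168.3.2:5080 --raft idx=1

# on machine B (192.168.3.4): run the alpha node, pointing --zero at machine A
dgraph alpha --bindall=true --zero=192.168.3.2:5080 -v 5 \
  --my=192.168.3.4:7080 --security whitelist=192.168.3.1/16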

Expected behaviour and actual result.


Experience Report for Feature Request

Note: Feature requests are judged based on user experience and modeled on Go Experience Reports. These reports should focus on the problems: they should not focus on, and need not propose, solutions.

What you wanted to do

What you actually did

Why that wasn’t great, with examples

I think when we specify a zero node, it should not use its default zero address. Am I right?

Any external references to support your case

And here is the zero log:

Did you run dgraph alpha with the --zero flag pointing at machine A? If this is not set, dgraph alpha will use localhost:5080 by default.
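
For example, a minimal sketch (addresses taken from the rest of this thread):

# without --zero, the alpha dials localhost:5080 by default
dgraph alpha --my=192.168.3.4:7080

# with --zero, the alpha dials the given Zero address instead
dgraph alpha --my=192.168.3.4:7080 --zero=192.168.3.2:5080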

@joaquin
Of course, yes; it is at the bottom of my first screenshot. I also used the --bindall flag and the whitelist flag, and the firewall has been turned off.
I have looked at the source code, and it seems like the gRPC connection goes wrong. I don't know why the zero instance and the alpha instance can only run on the same machine.
Machine A runs the zero instance; its IP is 192.168.3.2 and the start command is: dgraph zero --bindall=true --my 192.168.3.2:5080 --raft idx=1
Machine B runs the alpha instance; its IP is 192.168.3.4 and the start command is: dgraph alpha --bindall=true --zero=192.168.3.2:5080 -v 5 --my=192.168.3.4:7080 --security whitelist=192.168.3.1/16
The clearer log is:

There are too many errors when deploying a Dgraph cluster… I'm going to give up. :sweat:
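
One way to rule out basic network problems between the two machines is to verify that machine B can actually reach machine A's Zero ports; a quick sketch, assuming the addresses above and that nc and curl are available on machine B:

# from machine B (192.168.3.4), check TCP reachability of Zero's gRPC port
nc -vz 192.168.3.2 5080

# and query Zero's HTTP state endpoint on port 6080
curl -s 192.168.3.2:6080/state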

Hello @Ro0tk1t. I investigated with a similar setup using Vagrant, and I was not able to reproduce the issue. This is how I ran the test below:

Setup two machines

I used a Vagrantfile configuration that looks like this:

@hosts = {"zero"=>"192.168.3.2", "alpha"=>"192.168.3.4"}

Vagrant.configure("2") do |config|
  @hosts.each do |hostname, ipaddr|
    config.vm.define hostname do |node|
      node.vm.box = "generic/centos7"
      node.vm.hostname = "#{hostname}"
      node.vm.network "private_network", ip: ipaddr
      node.vm.synced_folder ".", "/vagrant"
    end
  end
end

Then I bring up the two systems. I had previously built the dgraph binary on my host and copied it into the local directory; I was using commit 32f1f5893. Then I launch the systems, run dgraph on the zero machine, and then run dgraph on the alpha machine.

# launch two vm systems
vagrant up

Setup and run dgraph zero

# log into zero and run commands
vagrant ssh zero

sudo firewall-cmd --zone=public --permanent --add-port=5080/tcp
sudo firewall-cmd --zone=public --permanent --add-port=6080/tcp
sudo firewall-cmd --reload

sudo cp /vagrant/dgraph /usr/local/bin
dgraph zero --bindall=true --my 192.168.3.2:5080 --raft idx=1 &
logout

Setup and run dgraph alpha

# log into alpha and run commands
vagrant ssh alpha

sudo firewall-cmd --zone=public --permanent --add-port=7080/tcp
sudo firewall-cmd --zone=public --permanent --add-port=8080/tcp
sudo firewall-cmd --zone=public --permanent --add-port=9080/tcp
sudo firewall-cmd --reload

sudo cp /vagrant/dgraph /usr/local/bin
dgraph alpha --bindall=true --zero=192.168.3.2:5080 \
  -v 5 --my=192.168.3.4:7080 --security whitelist=192.168.3.1/16

Health checks

On the host system, I can also check the status of the two services:

$ curl -s 192.168.3.2:6080/state | jq '.zeros."1"'
{
  "id": "1",
  "groupId": 0,
  "addr": "192.168.3.2:5080",
  "leader": true,
  "amDead": false,
  "lastUpdate": "0",
  "learner": false,
  "clusterInfoOnly": false,
  "forceGroupId": false
}

and

$ curl -s 192.168.3.4:8080/health | jq '.[0]'
{
  "instance": "alpha",
  "address": "192.168.3.4:7080",
  "status": "healthy",
  "group": "1",
  "version": "v20.11.0-rc1-506-g32f1f5893",
  "uptime": 820,
  "lastEcho": 1617424155,
  "ongoing": [
    "opRollup"
  ],
  "ee_features": [
    "backup_restore",
    "cdc"
  ],
  "max_assigned": 2
}

@joaquin
Em… I solved this by deleting the files it generated during the last run, but it looks like a new problem has appeared. I posted it in a new place:
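
For anyone hitting the same thing, a sketch of that cleanup, assuming the default data directories (p and w for the alpha, zw for the zero) in the working directory where each process was started:

# on the zero machine: stop dgraph zero, then remove its WAL directory
rm -rf zw

# on the alpha machine: stop dgraph alpha, then remove its postings and WAL directories
rm -rf p w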